Originally from University of Wisconsin-Madison CS/ECE 752.
Modified for ECS 201A, Winter 2024.
Due on 1/29 1:59 pm (PST): See Submission for details
You should submit your report in pairs. Make sure to start early and post any questions you might have on Piazza. The standard late assignment policy applies.
Use classroom: assignment 2 to create an assignment. You will be asked to join or create an assignment. If your teammate has already created one, join theirs instead of creating a new one. Otherwise, create your assignment and ask your teammate to join it.
In this assignment, you are going to:
This homework is based on exercise 3.6 of CA:AQA 3rd edition (the former textbook for this course) and was developed in part by Jason Lowe-Power et al., then modernized by Matt Sinclair and Jason Lowe-Power.
For this assignment we are going to use DAXPY as our workload. The DAXPY loop (double precision aX + Y) is an often used operation in programs that work with matrices and vectors. The following code implements DAXPY in C++.
#include <cstdio>
#include <random>
int main()
{
    const int N = 131072;
    double X[N], Y[N], alpha = 0.5;
    std::random_device rd; std::mt19937 gen(rd());
    std::uniform_real_distribution<> dis(1, 2);
    for (int i = 0; i < N; ++i)
    {
        X[i] = dis(gen);
        Y[i] = dis(gen);
    }
    // Start of daxpy loop
    for (int i = 0; i < N; ++i)
    {
        Y[i] = alpha * X[i] + Y[i];
    }
    // End of daxpy loop
    double sum = 0;
    for (int i = 0; i < N; ++i)
    {
        sum += Y[i];
    }
    printf("%lf\n", sum);
    return 0;
}
You can find the definitions for the workload objects in gem5 under workloads/daxpy_workloads.py. In this assignment, we will only be using DAXPYWorkload. In order to create an object of DAXPYWorkload you just need to call its constructor (__init__) function.
In this assignment we are going to measure the impact of different pipeline latencies on the overall performance of the system. As we discussed in class, it is important to consider measured performance as a product of both software and hardware. In this spirit, it is useful to get a picture of the instruction mix in our workload, and you might find this information useful in later steps of your analysis. As part of this assignment, you will only modify the CPU model. The models for the board, cache hierarchy, and memory will remain constant across your experiments.
Boards: components/boards.py. You will only be using HW2RISCVBoard in this assignment.
Processors: components/processors.py. There are a few classes defined in this file. However, the main classes (models) you will need to use are HW2TimingSimpleCPU and HW2MinorCPU.
Cache hierarchies: components/cache_hierarchies.py. You will only use HW2MESITwoLevelCache in this assignment.
Memories: components/memories.py. You will only use HW2DDR4_2400_8x8 in this assignment.
Clock frequency: 4 GHz for all of your simulations.
In your role as a computer architect, it’s crucial to focus on the code segments that put the most strain on the specific hardware component you’re targeting. There are usually three segments to a program: (a) initialization, (b) computation, and (c) verification. As you might have guessed, segment (b) is the important section that you need to study. This segment is called the region of interest (ROI). In the DAXPY code above, our ROI is the DAXPY loop.
// Start of daxpy loop
for (int i = 0; i < N; ++i)
{
    Y[i] = alpha * X[i] + Y[i];
}
// End of daxpy loop
In gem5, you can annotate this region with gem5-specific instructions.
In workloads/daxpy/daxpy.cpp, the code is annotated as:
#ifdef GEM5
    m5_work_begin(0,0);
#endif
    // Start of daxpy loop
    for (int i = 0; i < N; ++i)
    {
        Y[i] = alpha * X[i] + Y[i];
    }
    // End of daxpy loop
#ifdef GEM5
    m5_work_end(0,0);
#endif
To compile this program, you need to include the gem5/m5ops.h header file.
In the stats file generated after the simulation, you will only have statistics within the defined ROI.
For this assignment, the code is already compiled.
You can find the binary daxpy-gem5 and its assembly daxpy-gem5-asm in the workloads/daxpy/ directory.
If you want to manually compile this program, follow these instructions:
# Prefix of the RISC-V cross toolchain.
CROSS_COMPILE=riscv-linux-gnu-

all: daxpy-gem5 daxpy-gem5-asm

clean:
	rm -f daxpy-gem5 daxpy-gem5-asm

# Statically linked binary with the m5 annotations enabled (-DGEM5).
daxpy-gem5:
	$(CROSS_COMPILE)g++ daxpy.cpp -o daxpy-gem5 -static -O2 -I$(GEM5_ROOT)/include -DGEM5 -L$(GEM5_ROOT)/util/m5/build/riscv/out -lm5

# Annotated assembly listing (-S stops before assembling and linking).
daxpy-gem5-asm:
	$(CROSS_COMPILE)g++ daxpy.cpp -o daxpy-gem5-asm -static -O2 -I$(GEM5_ROOT)/include -DGEM5 -L$(GEM5_ROOT)/util/m5/build/riscv/out -lm5 -S -fverbose-asm
You need to include the gem5/m5ops.h header file. More on including and linking gem5’s m5 library can be found in the gem5 m5ops documentation.
If you want to disassemble the compiled binary and inspect its instructions, you can use a tool like objdump.
objdump -d daxpy-gem5 # or riscv-linux-gnu-objdump -d daxpy-gem5
In your configuration scripts, make sure to import exit_event_handler using the command below.
from workloads.roi_manager import exit_event_handler
You will have to pass exit_event_handler as a keyword argument named on_exit_event when creating a simulator object. Use the template below to create a simulator object.
simulator = Simulator(board={name of your board},
                      full_system=False,
                      on_exit_event=exit_event_handler)
Complete the following steps and answer the questions for your report. Collect data from your simulation runs and use simulator statistics to answer the questions. Use clear reasoning and visualization to drive your conclusions. You are allowed to submit your reports in pairs and in PDF format.
Before starting simulation and analysis, you should be able to identify the ROI of a program. In the annotated source, the ROI is the code between m5_work_begin and m5_work_end.
Before running any simulations, try to answer these questions. Try to make an educated guess.
If instructions are grouped into integer, floating point, and memory instructions, do you think each category would constitute equal parts of a program?
Recommended Reading: I recommend you read up on these concept(s): arithmetic intensity, roofline model.
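As a starting point for the recommended reading, the sketch below estimates the arithmetic intensity of one DAXPY iteration and plugs it into the roofline formula. It is a minimal illustration under stated assumptions: the fused multiply-add is counted as two floating point operations, every access is assumed to reach memory (no cache reuse), and the machine parameters in the usage lines are hypothetical values chosen only to exercise the formula.

```python
BYTES_PER_DOUBLE = 8


def daxpy_arithmetic_intensity() -> float:
    """FLOPs per byte of memory traffic for Y[i] = alpha * X[i] + Y[i]."""
    flops = 2                            # one multiply and one add
    bytes_moved = 3 * BYTES_PER_DOUBLE   # load X[i], load Y[i], store Y[i]
    return flops / bytes_moved


def roofline(peak_flops: float, peak_bandwidth: float, intensity: float) -> float:
    """Attainable FLOP/s: the lower of the compute roof and the memory roof."""
    return min(peak_flops, peak_bandwidth * intensity)


if __name__ == "__main__":
    ai = daxpy_arithmetic_intensity()
    print(f"DAXPY arithmetic intensity: {ai:.4f} FLOP/byte")
    # Hypothetical peak compute (FLOP/s) and memory bandwidth (bytes/s).
    attainable = roofline(peak_flops=8e9, peak_bandwidth=19.2e9, intensity=ai)
    print(f"Attainable performance: {attainable:.3e} FLOP/s")
```

A low intensity means the memory roof, not the compute roof, bounds performance in this simplified model.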
TimingSimpleCPU is an internal CPU model in gem5’s code base that executes each non-memory instruction in a single cycle.
This CPU model is a useful tool for extracting information on the instruction mix of a program.
You can find the definition of HW2TimingSimpleCPU, which is based on TimingSimpleCPU, in components/processors.py.
Write a configuration script that will simulate the execution of DAXPYWorkload on HW2TimingSimpleCPU.
Make sure to track the simulation outputs for later use. In the statistics output look for committedInstType.
This statistic represents a distribution of different operation classes executed by the processor.
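One way to turn the committedInstType distribution into an instruction mix is sketched below. It assumes each relevant line of the statistics file contains `committedInstType::<class>` followed by an integer count, and that a `total` entry may be present. The exact stat-name prefix and column layout depend on your gem5 version and configuration, so check them against your own output before relying on this.

```python
import re

# Match only the suffix of the stat name, because the prefix (board, core, ...)
# varies between configurations. Assumed layout: name, then an integer count.
_INST_TYPE = re.compile(r"committedInstType::(\S+)\s+(\d+)")


def instruction_mix(stats_text: str) -> dict:
    """Return {operation class: fraction of committed instructions}."""
    counts = {}
    for line in stats_text.splitlines():
        match = _INST_TYPE.search(line)
        if match is None:
            continue
        op_class, count = match.group(1), int(match.group(2))
        if op_class == "total":
            continue  # skip the summary entry so it is not double counted
        # Sums counts if the same class appears more than once in the text.
        counts[op_class] = counts.get(op_class, 0) + count
    total = sum(counts.values())
    if total == 0:
        return {}
    # Drop classes that never occurred to keep the result readable.
    return {name: n / total for name, n in counts.items() if n > 0}


if __name__ == "__main__":
    # m5out/stats.txt is gem5's default output location; adjust if you changed it.
    with open("m5out/stats.txt") as stats_file:
        for name, share in sorted(instruction_mix(stats_file.read()).items()):
            print(f"{name:20s} {share:6.2%}")
```

Running the same function on the outputs of both workloads gives two mixes that can be plotted side by side.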
In your report, answer the same questions after simulation, supported with data.
Use HelloWorldWorkload from workloads/hello_world_workload.py as a second program to compare instruction mixes.
A complete set of simulation data for this step should include two configurations (one for DAXPYWorkload and one for HelloWorldWorkload).
For this step, write a configuration script that allows you to simulate DAXPYWorkload with HW2MinorCPU.
Make sure to understand how to instantiate an instance of HW2MinorCPU.
NOTE: Although you can call its constructor function (__init__) without passing any input arguments, you will need to set those values for your experimentation.
Please make sure to read the documentation for HW2MinorCPU and understand what each of the input arguments to __init__ means.
MinorCPU is one of gem5’s internal CPU models that models an in-order pipelined CPU.
HW2MinorCPU is based on MinorCPU.
The default pool of functional units for MinorCPU includes two integer units and one floating point and SIMD unit.
Modify your configuration script to allow for changing issue latency and floating point operation latency.
For your reference, issue latency is the number of cycles between injecting two consecutive instructions into the pipeline.
An issue latency of 4 cycles means that an instruction is injected into the pipeline every 4 cycles.
On the other hand, floating point operation latency refers to the number of cycles it takes to complete the execution of a floating point instruction.
In this step, measure your simulated performance for different combinations of these two latencies.
For simplicity’s sake, start with an initial value of 4 cycles for issue latency and an initial value of 2 cycles for floating point operation latency.
Moreover, assume you can trade issue latency for floating point operation latency.
In addition, assume that the product of issue latency and floating point operation latency will always remain constant at 8.
For your simulations, evaluate the performance of the configurations shown below.
| # | issue latency (cycles) | floating point operation latency (cycles) |
|---|---|---|
| 1 | 4 | 2 |
| 2 | 2 | 4 |
| 3 | 8 | 1 |
NOTE: Make sure to keep track of your simulation outputs for all of your simulation runs for your later analyses.
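One way to make the latencies configurable is to read them from the command line, so that a single script covers every row of the table. The sketch below shows only the argument handling. The flag names `--issue-latency` and `--fp-latency` are hypothetical choices, and the parsed values still need to be forwarded to whichever arguments the HW2MinorCPU documentation describes.

```python
import argparse


def parse_args(argv=None):
    """Parse the two latencies for one simulation run (flag names are illustrative)."""
    parser = argparse.ArgumentParser(
        description="Run DAXPY on an in-order CPU with configurable latencies."
    )
    parser.add_argument("--issue-latency", type=int, default=4,
                        help="cycles between issuing two consecutive instructions")
    parser.add_argument("--fp-latency", type=int, default=2,
                        help="cycles to complete a floating point operation")
    args = parser.parse_args(argv)
    if args.issue_latency < 1 or args.fp_latency < 1:
        parser.error("latencies must be positive integers")
    return args


if __name__ == "__main__":
    args = parse_args()
    print(f"issue latency = {args.issue_latency}, fp latency = {args.fp_latency}")
    # Build the processor, board, and simulator here using the parsed values.
```

Writing each run's output to a separate directory named after the two values makes the later comparison easier.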
In your report, answer the following questions after simulation supported with data. A complete set of simulation data for this step should include three configurations (three possible combinations of issue latency and floating point operation latency).
For this step, modify your configuration script to allow for changing integer operation latency and floating point operation latency.
Let’s assume our processor has a very fast decode stage that can issue both integer and floating point instructions in 1 cycle.
Next, let’s focus on integer operation latency and floating point operation latency.
Let’s assume an initial value of 6 cycles for integer operation latency and an initial value of 12 cycles for floating point operation latency.
For your experimentation, suppose you can only reduce one of these latencies by a factor of 2.
This means that you can build a processor with an integer operation latency of 3 cycles and a floating point operation latency of 12 cycles, or a processor with an integer operation latency of 6 cycles and a floating point operation latency of 6 cycles.
For your experimentation, simulate the baseline case and the two possible improved cases.
Here is a table showing all possible combinations of the latencies that you need to experiment with.
| # | integer issue latency (cycles) | integer operation latency (cycles) | floating point issue latency (cycles) | floating point operation latency (cycles) |
|---|---|---|---|---|
| 1 | 1 | 6 | 1 | 12 |
| 2 | 1 | 3 | 1 | 12 |
| 3 | 1 | 6 | 1 | 6 |
In your report answer the following questions.
Hints:
Look at the assembly for the DAXPY loop below (you can also find the complete assembly for it under workloads/daxpy/daxpy-gem5-asm). Can you point out some dependencies between the instructions? Do you think only looking at the instruction mix gathered from Step I provides enough information to apply Amdahl’s law?
.L35:
# daxpy.cpp:27: Y[i] = alpha * X[i] + Y[i];
fld fa4,0(a5) # MEM[(double *)_56], MEM[(double *)_56]
fld fa5,0(s2) # MEM[(double *)_49], MEM[(double *)_49]
# daxpy.cpp:25: for (int i = 0; i < N; ++i)
addi a5,a5,8 #, ivtmp.133, ivtmp.133
addi s2,s2,8 #, ivtmp.132, ivtmp.132
# daxpy.cpp:27: Y[i] = alpha * X[i] + Y[i];
fmadd.d fa5,fa5,fa3,fa4 # _5, MEM[(double *)_49], tmp181, MEM[(double *)_56]
# daxpy.cpp:27: Y[i] = alpha * X[i] + Y[i];
fsd fa5,-8(a5) # _5, MEM[(double *)_56]
# daxpy.cpp:25: for (int i = 0; i < N; ++i)
bne s1,a5,.L35 #, _14, ivtmp.133,
NOTE: Make sure to keep the simulation output for all of your simulation runs for your later analyses.
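For reference when reasoning about the hints, Amdahl’s law can be written as a small helper. This is a generic sketch of the formula rather than a model of HW2MinorCPU. Note that `fraction` is the fraction of execution time affected by the improvement, which is not necessarily the fraction of instructions, and any value you pass in should come from your own measurements.

```python
def amdahl_speedup(fraction: float, factor: float) -> float:
    """Overall speedup when `fraction` of execution time is sped up by `factor`.

    speedup = 1 / ((1 - fraction) + fraction / factor)
    """
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0 and 1")
    if factor <= 0.0:
        raise ValueError("factor must be positive")
    return 1.0 / ((1.0 - fraction) + fraction / factor)


if __name__ == "__main__":
    # Illustrative sweep only: how the bound changes when a latency is halved.
    for fraction in (0.1, 0.25, 0.5, 0.75):
        print(f"fraction={fraction:.2f} -> speedup={amdahl_speedup(fraction, 2.0):.3f}")
```

Comparing this prediction with the speedup you measure in simulation is one way to check whether the time fractions you assumed were reasonable.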
As mentioned before, you are allowed to submit your assignments in pairs and in PDF format. You should submit your report on Gradescope. In your report, answer the questions presented in Analysis and simulation, Analysis and simulation: Step 0, Analysis and simulation: Step I, Analysis and simulation: Step II, and Analysis and simulation: Step III. Use clear reasoning and visualization to drive your conclusions. Submit all your code through your assignment repository. Please make sure to include code/scripts for the following.
Instruction.md: should include instructions on how to run your simulations.
Like your submission, your grade is split into two parts.
You are required to work on this assignment in teams. You are only allowed to share your scripts and code with your teammate(s). You may discuss high-level concepts with others in the class, but all the work must be completed by your team and your team only.
Remember, DO NOT POST YOUR CODE PUBLICLY ON GITHUB! Any code found on GitHub that is not the base template you are given will be reported to SJA. If you want to sidestep this problem entirely, don’t create a public fork and instead create a private repository to store your work.