You should submit your report in pairs. Make sure to start early and post any questions you might have on Piazza. The standard late assignment policy applies.
Use classroom: assignment 4 to create an assignment. You will be asked to join or create an assignment. If your teammate has already created one, join theirs instead of creating a new one; otherwise, create the assignment and ask your teammate to join it.
In this assignment, you'll be investigating the performance impact of different cache architectures and different algorithm designs on matrix multiplication. To that end, you are going to implement several different techniques for multiplying matrices.
All the ready-made workloads discussed here take one input argument: you must pass `matrix_size`, which describes the size of matrices `A`, `B`, and `C`, to the constructor of a workload (`__init__`). When running a workload on native hardware (e.g., your host machine), pass the same single argument on the command line after the name of your binary.
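For example, assuming a native binary named `mm-ijk-native` (the name follows the make targets shown later in this assignment), a run with 128×128 matrices would look like:

```
./mm-ijk-native 128
```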
NOTE: for all parts of this assignment, assume $C = A \times B$.
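Concretely, each element of `C` is the dot product of a row of `A` and a column of `B`:

$$C_{ij} = \sum_{k=0}^{n-1} A_{ik} B_{kj}, \qquad n = \texttt{matrix\_size}$$

The three nested loops in the implementations below compute exactly these sums.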
You will implement different matrix multiplication techniques to see the effect of the software implementation on overall system performance. Below is a short description of the starter code.
In programming, matrices are stored as two-dimensional arrays; that is, a matrix is stored as an array of arrays. There are two orders in which a matrix can be stored this way: row major (rows are contiguous in memory) and column major (columns are contiguous).
Below is a simple implementation of matrix multiplication, given in `workloads/matmul/ijk_multiply.h`.
For this assignment, assume that the `A`, `B`, and `C` matrices are all stored in row-major order, i.e., matrices are indexed as `A[row number][column number]`.
The starter code first iterates over the rows of `A` (`i`), then over the columns of `B` (`j`), and finally over the elements in the selected row and column (`k`).
```cpp
void multiply(double **A, double **B, double **C, int size)
{
    // Assumes C has been zero-initialized by the caller.
    for (int i = 0; i < size; i++) {         // rows of A
        for (int j = 0; j < size; j++) {     // columns of B
            for (int k = 0; k < size; k++) { // dot product of row i and column j
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```
The animation below shows the order of operations during the multiplication.
You can import this workload into your configuration script from `workloads/matmul_workload.py` as `IJKMatMulWorkload`.
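A minimal sketch of the import and construction in a configuration script (the keyword-argument name is taken from the constructor description above):

```python
from workloads.matmul_workload import IJKMatMulWorkload

# matrix_size describes the size of the square matrices A, B, and C.
workload = IJKMatMulWorkload(matrix_size=128)
```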
I have provided a ready-made binary for this implementation. However, if you need to compile your own binary, you can run the command below in `workloads/matmul`:

```
make mm-ijk-gem5
```
CAUTION: the above command generates a binary that can only be run with gem5. Running this binary on your host will result in errors.
As you can see in the animation, the previous implementation of the matrix multiplication does not expose much locality in its accesses to matrices `A` and `B`; rather, it exposes locality when accessing matrix `C`.
We can reorder the nested loops and still get the correct answer: since each element of `C` is simply a sum of products, the order in which those products are accumulated does not matter. In fact, all six permutations of the three nested loops produce the same `C` matrix. Moreover, reordering the for loops does not change the complexity of the algorithm itself. It does, however, change the memory access pattern, which in turn can increase or decrease the cache hit rate.
For this step, consider the slightly different implementation of matrix multiplication below, which you can also find in `workloads/matmul/ikj_multiply.h`:
```cpp
void multiply(double **A, double **B, double **C, int size)
{
    for (int i = 0; i < size; i++) {
        for (int k = 0; k < size; k++) {
            for (int j = 0; j < size; j++) {
                C[i][j] += A[i][k] * B[k][j];
            }
        }
    }
}
```
As you can see, the `j` and `k` loops have been interchanged relative to the previous implementation; with `j` now innermost, the accesses to `B[k][j]` and `C[i][j]` walk along rows, which matches the row-major layout. The animation below shows the order of operations during the multiplication.
You can import this workload into your configuration script from `workloads/matmul_workload.py` as `IKJMatMulWorkload`.
I have provided a ready-made binary for this implementation. However, if you need to compile your own binary, you can run the command below in `workloads/matmul`:

```
make mm-ikj-gem5
```
CAUTION: the above command generates a binary that can only be run with gem5. Running this binary on your host will result in errors.
You can improve the cache behavior of matrix multiplication by using a blocked algorithm. In this algorithm, rather than streaming through all of the inputs, you operate on one block at a time. Look at this short article to learn more about the blocking technique used to increase locality in matrix multiplication.
Similar to loop interchange, there are multiple ways you can choose to block the accesses to your matrices in the matrix multiplication algorithm. One example, in which `k` and `j` are blocked and `i` is streamed, is sketched below.
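Here is a minimal sketch of what such a blocked kernel could look like, assuming `size` is a multiple of `block_size`; the starter code's actual function signature and naming may differ:

```cpp
// Sketch only: assumes size is a multiple of block_size and that C is
// zero-initialized. The k and j loops are blocked; i is streamed.
void multiply(double **A, double **B, double **C, int size, int block_size)
{
    for (int kk = 0; kk < size; kk += block_size) {      // block of k
        for (int jj = 0; jj < size; jj += block_size) {  // block of j
            for (int i = 0; i < size; i++) {             // stream over i
                for (int k = kk; k < kk + block_size; k++) {
                    for (int j = jj; j < jj + block_size; j++) {
                        C[i][j] += A[i][k] * B[k][j];
                    }
                }
            }
        }
    }
}
```

The idea is that one `block_size × block_size` tile of `B` stays in the cache and is reused while all rows of `A` stream past it, instead of being evicted between uses.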
For this assignment, you need to implement three different variants of the blocking scheme:

1. Implement the first variant in `workloads/matmul/block_ij_multiply.h`. After implementing it, you can build your binary by running `make mm-block-ij-gem5` in `workloads/matmul`. After building your binary, you can import it as a workload into your configuration script from `workloads/matmul_workload.py` as `BlockIJMatMulWorkload`.
2. Implement the second variant in `workloads/matmul/block_ik_multiply.h`. After implementing it, you can build your binary by running `make mm-block-ik-gem5` in `workloads/matmul`. After building your binary, you can import it as a workload into your configuration script from `workloads/matmul_workload.py` as `BlockIKMatMulWorkload`.
3. Implement the third variant in `workloads/matmul/block_kj_multiply.h`. After implementing it, you can build your binary by running `make mm-block-kj-gem5` in `workloads/matmul`. After building your binary, you can import it as a workload into your configuration script from `workloads/matmul_workload.py` as `BlockKJMatMulWorkload`.
You will have to pass `matrix_size` (the size of matrices `A`, `B`, and `C`) and `block_size` (the size of the block in your blocking scheme) to the constructor of each of the workloads you have implemented. If you run a workload on native hardware (e.g., your host machine), pass the same two arguments, in the same order, on the command line after the name of your binary.
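For example, a blocked workload might be constructed like this in a configuration script (the keyword-argument names are assumed from the description above):

```python
from workloads.matmul_workload import BlockIJMatMulWorkload

workload = BlockIJMatMulWorkload(matrix_size=128, block_size=8)
```

and, assuming the native binary name follows the make targets, run like this on your host:

```
./mm-block-ij-native 128 8
```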
NOTE: you can use the command below in `workloads/matmul` to create the binaries for all the workloads discussed:

```
make all-gem5
```

CAUTION: the above command generates binaries that can only be run with gem5. Running these binaries on your host will result in errors.
To test your blocked implementation you may want to run it on your local machine (or on the codespace you’re using). To do this, see Running on native hardware below.
In this assignment, we will investigate the effect of caching on overall performance. We have already looked at how the software implementation can improve cache behavior in matrix multiplication. On the hardware side, we will use different cache hierarchies to see the effect of cache size and latency on performance.
Under the `components` directory, you will find modules that define the different models you should use in your configuration scripts:

- `HW4RISCVBoard` in `components/boards.py`.
- `HW4O3CPU` in `components/processors.py`.
- Cache hierarchies in `components/cache_hierarchies.py`.
You can find three models for your cache hierarchy. They all have a 128 KiB L2 cache and the same L1I cache; however, they differ in their L1D cache design. A short description of each L1D design is given below:

- 48 KiB L1D cache with higher latency.
- 32 KiB L1D cache with medium latency.
- 16 KiB L1D cache with lower latency.

Make sure you understand their similarities and differences.
For memory, use `HW4DDR4` in `components/memories.py`. Use a clock frequency of 2 GHz for all of your simulations.

In your configuration scripts, make sure to import `exit_event_handler` using the line below:
```python
from workloads.roi_manager import exit_event_handler
```
You will have to pass `exit_event_handler` as a keyword argument named `on_exit_event` when creating a `Simulator` object. Use the template below to create a simulator object:

```python
simulator = Simulator(board={name of your board}, full_system=False, on_exit_event=exit_event_handler)
```
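Putting the pieces together, a complete configuration script might be organized roughly as follows. This is only a sketch: the exact constructor parameters of the HW4 components, and where the clock frequency is set, may differ; consult the files in `components` for the real signatures.

```python
from gem5.simulate.simulator import Simulator

from components.boards import HW4RISCVBoard
from components.processors import HW4O3CPU
from components.cache_hierarchies import HW4SmallCacheHierarchy
from components.memories import HW4DDR4
from workloads.matmul_workload import BlockIJMatMulWorkload
from workloads.roi_manager import exit_event_handler

# Assumption: the board takes the clock frequency and the other
# components as constructor arguments.
board = HW4RISCVBoard(
    clk_freq="2GHz",
    processor=HW4O3CPU(),
    memory=HW4DDR4(),
    cache_hierarchy=HW4SmallCacheHierarchy(),
)

# Assumption: the board exposes a set_workload method for these workloads.
board.set_workload(BlockIJMatMulWorkload(matrix_size=128, block_size=8))

simulator = Simulator(
    board=board,
    full_system=False,
    on_exit_event=exit_event_handler,
)
simulator.run()
```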
As part of your analysis and simulation, you will have to run your workloads on both real hardware and gem5. Before running any workloads, let's take a look at our working set size: the number of bytes that have to be moved between the processor and the memory subsystem.
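To illustrate the units involved (this is not the full answer to the questions below): a single $n \times n$ matrix of doubles occupies $n^2 \times 8$ bytes, so one $128 \times 128$ matrix takes $131072$ bytes, or 128 KiB.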
In your report, answer the following questions:

1. Calculate the working set size of matrix multiplication as a function of `matrix_size` and the size of a double, `double_size`. Plug in 128 as `matrix_size` and 8 as `double_size` and calculate the working set size.
2. Calculate the working set size of blocked matrix multiplication as a function of `matrix_size`, `block_size`, and `double_size`. Plug in 128 as `matrix_size`, 8 as `block_size`, and 8 as `double_size`, and calculate the working set size.

For your simulation, create a configuration script that allows you to run any of the workloads with any of the cache hierarchies. For this step:
- Run all the blocking workloads with `HW4SmallCacheHierarchy`.
- Run `BlockIJMatMulWorkload` with all the cache hierarchies.

In your report, answer the following questions:

1. With `HW4SmallCacheHierarchy`, which blocking scheme performs the best? Why?
2. With `BlockIJMatMulWorkload`, which cache hierarchy performs the best? Why?

Use a `matrix_size` of 128 and a `block_size` of 8 for all your simulations.
Hints: in your answers, use cache-related statistics such as hit rate. Moreover, think about how average memory access time (AMAT) could help you explain your results. I also recommend thinking about how the access pattern could influence memory access time.
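As a reminder, a common formulation of AMAT is

$$\text{AMAT} = \text{hit time} + \text{miss rate} \times \text{miss penalty},$$

where the miss penalty itself depends on the lower levels of the hierarchy.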
For this step, you will run 6 configurations in total.
In this step, you will use your conclusions from the last step to predict the combination of blocking scheme and cache hierarchy that performs best. In your report, answer the following questions.
Now, run the matrix multiply workloads on some real hardware (not gem5). You can run them on the CSIF machines, your laptop, or the codespaces computer. To build all your binaries for your native hardware (e.g., your host), use the command below in `workloads/matmul`:

```
make all-native
```
You can also build them separately using the commands below:

```
make mm-block-ij-native
make mm-block-ik-native
make mm-block-kj-native
```
Before running your workloads on real hardware, answer the following questions in your report (`lscpu` or `/proc/cpuinfo` and Google should help).

When running on native hardware, I recommend using a `matrix_size` of at least 256.
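For example, assuming the binary naming from the make targets above, a native run might look like:

```
./mm-block-ij-native 256 16
```

Here 256 is the `matrix_size` and 16 is a hypothetical `block_size`; pick the values you actually want to study.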
After running the workloads on real hardware, answer the following question in your report.
As mentioned before, you are allowed to submit your assignment in pairs, and it must be in PDF format. You should submit your report on Gradescope. In your report, answer the questions presented in Analysis and simulation: Step I, Step II, Step III, and Step IV.
Use clear reasoning and visualization to drive your conclusions. Submit all your code through your assignment repository. Please make sure to include code/scripts for the following:

- `Instruction.md`: should include instructions on how to run your simulations.
- Your implementations of the blocking schemes in `workloads/matmul`.

Like your submission, your grade is split into two parts.
Reproducibility Package (50 points):

a. Instructions and automation to run simulations for the different sections and dump statistics (10 points):
   - Instructions (5 points)
   - Automation (5 points)

b. Configuration scripts and source code (40 points):
   - 5 points for the configuration script(s) used to run your simulations as described in Analysis and simulation: Step II
   - 10 points for implementing each of the 3 blocking schemes as described in Blocked matrix multiplication
   - 5 points for the scripts used to run workloads on native hardware as described in Analysis and simulation: Step IV
Report (50 points): 5 points for each question presented in Analysis and simulation: Step I, Analysis and simulation: Step II, Analysis and simulation: Step III, and Analysis and simulation: Step IV.
You are required to work on this assignment in teams. You are only allowed to share your scripts and code with your teammate(s). You may discuss high-level concepts with others in the class, but all the work must be completed by your team and your team only.
Remember, DO NOT POST YOUR CODE PUBLICLY ON GITHUB! Any code found on GitHub that is not the base template you are given will be reported to SJA. If you want to sidestep this problem entirely, don’t create a public fork and instead create a private repository to store your work.