Due on 2/16 1:59 pm (PST): See Submission for details
You should submit your report in pairs. Make sure to start early and post any questions you might have on Piazza. The standard late assignment policy applies.
Use classroom: assignment 3 to create an assignment. You will be asked to join/create an assignment. If your teammate has already created an assignment, please join their assignment instead of creating one assignment. Otherwise, create your assignment and ask your teammate to join the assignment.
In this assignment, you’ll be investigating the performance impacts of different out-of-order core designs on a set of RISC-V benchmarks. The goals of this assignment are:
In this assignment, you are going to use four workloads for the evaluation of the performance of your systems. Read more about each workload and how to use those workloads for your experiments below.
You are going to use the same matrix multiplication program from assignment 1 as the first workload.
Please note that matrix size is hardcoded for this assignment.
You do not have to pass matrix size as an input argument to the __init__
function for your workload.
Below you can find the C++ implementation of the matrix multiplication workload.
#include <iostream>
#include "matrix.h"
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
void multiply(double* A, double* B, double* C, int size)
{
for (int i = 0; i < size; i++) {
for (int k = 0; k < size; k++) {
for (int j = 0; j < size; j++) {
C[i * size + j] += A[i * size + k] * B[k * size + j];
}
}
}
}
int main()
{
std::cout << "Beginning matrix multiply ..." << std::endl;
#ifdef GEM5
m5_work_begin(0,0);
#endif
multiply(A, B, C, SIZE);
#ifdef GEM5
m5_work_end(0,0);
#endif
std::cout << "Finished matrix multiply." << std::endl;
return 0;
}
Take a look at workloads/matmul_workload.py
to learn more about instantiating a matrix multiplication workload.
BFS is an algorithm for traversing a graph in breadth order.
The algorithm traverses vertices in a graph in an order corresponding to each vertex’s relative depth to a specific vertex.
I.e. the algorithm starts traversing vertices in the graph starting from a specific vertex called root
.
During the traversal, the algorithm makes sure to visit vertices that are only 1 edge/hop
away from root
before it visits those vertices that are 2 edges/hops or more
away.
The BFS algorithm is one of the building blocks of graph analytics workloads.
Graph analytics workloads have common use cases such as modeling relationships and processes in physical, biological, social, and information systems.
Beamer et al. discuss the BFS algorithm and how it could be parallelized in their paper.
Below you can find a basic C++ implementation of the algorithm discussed in the paper.
#include <iostream>
#include <vector>
#include "graph.h"
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
int main()
{
std::vector<int> frontier;
std::vector<int> next;
frontier.clear();
next.clear();
frontier.push_back(0);
std::cout << "Beginning BFS ..." << std::endl;
#ifdef GEM5
m5_work_begin(0,0);
#endif
while (!frontier.empty()) {
for (auto vertex: frontier) {
int start = columns[vertex];
int end = columns[vertex + 1];
for (int i = start; i < end; i++) {
int neighbor = edges[i];
if (visited[neighbor] == 0) {
visited[neighbor] = 1;
next.push_back(neighbor);
}
}
}
frontier = next;
next.clear();
}
#ifdef GEM5
m5_work_end(0,0);
#endif
std::cout << "Finished BFS." << std::endl;
return 0;
}
Take a look at workloads/bfs_workload.py
to learn more about instantiating a BFS workload.
Bubble sort is the starter algorithm for sorting arrays. This program sorts an array by replacing each element with the smallest element to its right (if sorted in an ascending order). Below you can find the C++ implementation for bubble sort.
#include <iostream>
#include "array.h"
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
int main()
{
std::cout << "Beginning bubble sort ... " << std::endl;
#ifdef GEM5
m5_work_begin(0,0);
#endif
for (int i = 0; i < ARRAY_SIZE - 1; i++) {
for (int j = i + 1; j < ARRAY_SIZE; j++) {
if (data[i] > data[j]) {
int temp = data[i];
data[i] = data[j];
data[j] = temp;
}
}
}
#ifdef GEM5
m5_work_end(0,0);
#endif
std::cout << "Finished bubble sort." << std::endl;
return 0;
}
The N-Queens problem is a classic computer science problem. It involves placing N chess queens on an N×N chessboard so that no two queens threaten each other. This means that no two queens can share the same row, column, or diagonal. The most common method to solve the N-Queens problem is to use backtracking. Below you can find the C++ implementation for the same:
#include <bits/stdc++.h>
#ifdef GEM5
#include "gem5/m5ops.h"
#endif
void printSolution(int **board, int N) {
for (int i = 0; i < N; i++) {
for (int j = 0; j < N; j++)
printf(" %d ", board[i][j]);
printf("\n");
}
}
bool isSafe(int **board, int row, int col, int N) {
for (int i = 0; i < col; i++)
if (board[row][i])
return false;
for (int i=row, j=col; i>=0 && j>=0; i--, j--)
if (board[i][j])
return false;
for (int i=row, j=col; j>=0 && i<N; i++, j--)
if (board[i][j])
return false;
return true;
}
bool solveNQUtil(int **board, int col, int N) {
if (col >= N)
return true;
for (int i = 0; i < N; i++) {
if (isSafe(board, i, col, N)) {
board[i][col] = 1;
if (solveNQUtil(board, col + 1, N))
return true;
board[i][col] = 0;
}
}
return false;
}
bool solveNQ(int **board, int N) {
if (solveNQUtil(board, 0, N) == false) {
printf("Solution does not exist");
return false;
}
printSolution(board, N);
return true;
}
int main(int argc, char *argv[]) {
// the user needs to input the size of the chessboard
if (argc == 2) {
int size = atoi(argv[1]);
// initialize the board
int **board = new int*[size];
for (int i = 0 ; i < size; i++)
board[i] = new int[size];
for (int i = 0 ; i < size ; i++) {
for (int j = 0 ; j < size ; j++) {
board[i][j] = 0;
}
}
#ifdef GEM5
m5_work_begin(0,0);
#endif
solveNQ(board, size);
#ifdef GEM5
m5_work_end(0,0);
#endif
}
else {
printf("N-Queens program. Usage \n $ ./queens <chess-board-size>\n");
}
return 0;
}
In this assignment, you are asked to design your own out-of-order processor
models for your experiments.
In regards to the rest of the components in your system:
HW3RISCVBoard
as your main board
for your computer system.
You can find the model for the board in components/boards.py
.HW3MESICache
as your cache hierarchy for your computer system.
You can find its model in components/cache_hierarchies.py
.HW3DDR4
as your memory
in your computer system.
You can find its model in components/memories.py
.2 GHz
as you clock frequency clk_freq
in your system.For your processor, you are going to use HW3O3CPU
to model a high-performance and efficient processor core.
HW3O3CPU
is based on O3CPU
which is an internal model of gem5.
Read up on the O3CPU
in gem5’s documentation.
Below, you can find details about HWO3CPU
and its parameters.
For the purposes of this assignment, you need to configure a processor with the same width in all stages.
In the constructor of HW3O3CPU
this attribute is named as width
.
This is the number of entries in the reorder buffer (ROB).
This will constrain the number of instructions that can be “in flight” in the processor.
In the constructor of HW3O3CPU
this attribute is named as rob_size
.
This is the number of registers in the physical register file. A processor renames architecture registers to physical registers to resolve false dependencies. It also tracks true dependencies (read-after-write) in the register file. To learn more about register renaming, read up on Tomasulo’s algorithm.
HW3O3CPU
has two physical register files.
One register file for integer registers and one for floating point registers.
In the constructor of HW3O3CPU
, num_int_regs
refers to the number of integer physical registers and num_fp_regs
refers to the number of floating point physical registers.
NOTE: Both register files must be larger than the 32 entries or gem5 will hang. This is because the number of physical registers must be bigger than or equal to the number of logical registers. RISC-V ISA defines 32 logical registers.
In this assignment, you are required to design your own high-performance and efficient cores.
You need to add two core designs to components/processors.py
.
In components/processors.py
, create a model based on HW3O3CPU
and name it HW3BigCore
.
This core will be your high-performance core for the assignment.
Create another model in components/processors.py
and name it HW3LittleCore
.
This core will be your efficient core for the assignment.
You can use the information available on the internet on the different core micro-architectures to configure your HW3BigCore
and HW3LittleCore
.
As a starting point, take a look at WikiChip and AnandTech.
Your instructor has loosely modeled their HW3BigCore
on Intel Sunny Cove and their HW3LittleCore
on Intel Gracemont.
NOTE: You are not required to match the specifications of any core.
Use information online as a guideline for your design.
You might find it impossible to match the specifications found online.
E.g. the base model HW3O3CPU
assumes the same width for all the stages of the pipeline which is not usually the case with modern processor designs.
Tips and TODOs: When designing your cores and caches, I recommend taking note of the following:
HW3LittleCore
.
Remember that in computer design, there are almost always diminishing returns.
A beefy HW3LittleCore
will result in a HW3BigCore
that is not much more performant than HW3LittleCore
.
I recommend dividing the values you see on specifications by half for HW3LittleCore
.width
below 4 would result in some instability.
However, you might want to set width
to 3 or less for HW3LittleCore
.
I recommend starting with very small numbers for the rest of the parameters, especially rob_size
, and gradually increasing them to get around this issue.Before running any simulations answer the following questions in your report.
What will be the average speed-up of HW3BigCore
over HW3LittleCore
? Can you predict an upper bound using your pipeline parameters?
Hint: You can predict an upper bound for the speed-up using Amdahl’s law with optimistic values.
Do you think all the workloads will experience the same speed-up between HW3BigCore
and HW3LittleCore
?
Now that you have completed the design process of HW3BigCore
and HW3LittleCore
, let’s compare their performances using our four workloads as benchmarks.
Simulate each workload with each core.
For each workload compare the performance of HW3BigCore
with the performance of HW3LittleCore
.
Answer the following questions in your report. Use relevant and correct reasoning in your answer. Use simulation data with proper simulation data to strengthen your reasoning.
HW3BigCore
over HW3LittleCore
for each workload?HW3BigCore
compared to HW3LittleCore
over all the workloads?
CAUTION: Make sure to use the correct mean to report average IPC improvement..c
and .s
files may be useful) and speculate the algorithm characteristics that influence the IPC difference between HW3BigCore
and HW3LittleCore
. What characteristics do applications have that lead to low performance improvement and what characteristics lead to high performance improvement?HW3BigCore
? What is unique about this workload?Hints: Take a look at the assembly code of the ROI for inspiration.
In this step, you are tasked with finding a middle ground between HW3BigCore
and HW3LittleCore
.
This core needs to perform as closely as possible to HW3BigCore
while using as little resources as HW3LittleCore
.
We will refer to this core as HW3MediumCore
.
To pick our sweet spot for the design of HW3MediumCore
, we need to develop a methodology.
First, we need to define a cost function for increasing hardware resources.
We will use the area as the cost of fabricating new hardware.
Not all parameters of the pipeline have the same effect on the area.
E.g. Increasing the width of the pipeline has a quadratic effect on the area of the hardware while increasing register file entries has a linear effect.
Moreover, the cost of increasing two of these resources at the same time should be bigger than the sum of increasing each resource.
We will use an equation like below to score the area of a pipeline using the 4 parameters of width
, rob_size
, num_int_regs
, and num_fp_regs
.
You can also get the area score for a pipeline design by calling the method get_area_score
on the processor.
Now that we have our cost function, let’s devise a method for measuring our gains. In your report answer the following question.
Now that we have devised functions to measure costs and gains, configure 4 middle ground designs for the pipeline. Not all of these designs will have the “best” cost-gain tradeoff. In your report include the following figure.
Answer this question in your report.
Based on your experiments and insights, can you answer the following question?
As mentioned before, you are allowed to submit your assignments in pairs and in PDF format. You should submit your report on gradescope. In your report answer the questions presented in Analysis and simulation, Analysis and simulation: Step I, Analysis and simulation: Step II, and, Analysis and simulation: Step III. Use clear reasoning and visualization to drive your conclusions. Submit all your code through your assignment repository. Please make sure to include code/scripts for the following.
Instruction.md
: should include instructions on how to run your simulations.components/processors.py
.
There should be 6 core definitions in components/processor.py
.
They should include one design for HW3BigCore
, one design for HW3LittleCore
, and four designs for HW3MediumCore
.
You can add numbers to designs for HW3MediumCore
to distinguish their design.
E.g. HW3MediumCore0
, HW3MediumCore1
, HW3MediumCore2
, and HW3MediumCore3
.Like your submission, your grade is split into two parts.
Reproducibility Package (50 points): a. Instruction and automation to run simulations for different sections and dump statistics (20 points) - Instructions (5 points) - Automation (5 points) b. Configuration scripts and correct simulation setup (40 points): 10 points for configuration script(s) used to run your simulations and 5 points for implementing each of the 6 processor models as described in Analysis and simulation: Step I, and Analysis and simulation: Step II.
Report (50 points): 5 points for each question presented in Analysis and simulation, Analysis and simulation: Step I, Analysis and simulation: Step II, and, Analysis and simulation: Step III.
You are required to work on this assignment in teams. You are only allowed to share your scripts and code with your teammate(s). You may discuss high-level concepts with others in the class but all the work must be completed by your team and your team only.
Remember, DO NOT POST YOUR CODE PUBLICLY ON GITHUB! Any code found on GitHub that is not the base template you are given will be reported to SJA. If you want to sidestep this problem entirely, don’t create a public fork and instead create a private repository to store your work.