Due on 2/3 11:59 pm (PST): See Submission for details.
GitHub Classroom link for 154B: https://classroom.github.com/a/TOfrsoGz
GitHub Classroom link for 201A: https://classroom.github.com/a/f4SMUSm8
In this assignment, you’ll be investigating the performance impacts of different out-of-order core designs on a set of RISC-V benchmarks. The research question you will be answering is: What is the bottleneck in the performance of an out-of-order processor?
You will be given a design of an out-of-order processor. Your task is to investigate the performance of this processor on a set of benchmarks and determine what the main bottleneck is. Then, you will design a new processor that improves on the performance of the original processor with minimal impact on area.
In this assignment, you are going to use four workloads to evaluate the performance of your systems. You can read more about each workload, and how to use it in your experiments, below. The workloads are:
- `matrix_multiply_riscv_run`
- `bfs_riscv_run`
- `bubble_riscv_run`
- `queens_riscv_run`

Note that all of these workloads have been compiled with the gem5 annotations enabled.
Together, these workloads are called the “comparch-benchmarks” suite. This is the suite of workloads you will use for your evaluation. When using `obtain_resource`, you can obtain a suite as well as an individual workload; a suite is a list of workloads. You can iterate over the suite to run all of its workloads using the multisim feature of gem5.
These workloads run in approximately 1-10 minutes each.
Note: Multisim in gem5 24.1 has a couple of bugs. However, it should work if you use 1 or 2 simultaneous simulations.
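For example, here is a minimal sketch of obtaining the suite and listing its workloads. It uses only `obtain_resource` and the `get_id()` method that the provided `experiments.py` script (shown later) also uses.

```python
from gem5.resources.resource import obtain_resource

# Obtain the "comparch-benchmarks" suite and list the workloads it contains.
# Each element is a workload that can be passed to board.set_workload().
suite = obtain_resource("comparch-benchmarks")
for workload in suite:
    print(workload.get_id())
```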
You are going to use the same matrix multiplication program from assignment 1 as the first workload. Please note that the matrix size is hardcoded for this assignment, so you do not have to pass the matrix size as an input argument to the `__init__` function of your workload. Below you can find the C++ implementation of the matrix multiplication workload.
```cpp
#include <iostream>

#include "matrix.h"

#ifdef GEM5
#include "gem5/m5ops.h"
#endif

void multiply(double* A, double* B, double* C, int size)
{
    for (int i = 0; i < size; i++) {
        for (int k = 0; k < size; k++) {
            for (int j = 0; j < size; j++) {
                C[i * size + j] += A[i * size + k] * B[k * size + j];
            }
        }
    }
}

int main()
{
    std::cout << "Beginning matrix multiply ..." << std::endl;

#ifdef GEM5
    m5_work_begin(0,0);
#endif

    multiply(A, B, C, SIZE);

#ifdef GEM5
    m5_work_end(0,0);
#endif

    std::cout << "Finished matrix multiply." << std::endl;
    return 0;
}
```
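If you want to try a single workload like this one outside of the provided experiment script, a minimal sketch is shown below. It reuses the component classes and the “little” core parameters that appear later in this assignment; passing the processor parameters as keyword arguments is an assumption based on the documented constructor parameter names.

```python
from components import RISCVBoard, MESITwoLevelCache, DDR4, OutOfOrderCPU
from gem5.resources.resource import obtain_resource
from gem5.simulate.simulator import Simulator

# Build a board with the "little" core parameters used elsewhere in this
# assignment (keyword argument names assumed to match the constructor).
cpu = OutOfOrderCPU(width=4, rob_size=32, num_int_regs=64, num_fp_regs=64)
board = RISCVBoard(
    clk_freq="1GHz",
    processor=cpu,
    cache_hierarchy=MESITwoLevelCache(),
    memory=DDR4(),
)

# Select the matrix multiply workload by its resource ID and run to completion.
board.set_workload(obtain_resource("matrix_multiply_riscv_run"))
Simulator(board=board).run()
```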
BFS is an algorithm for traversing a graph in breadth-first order. The algorithm visits the vertices of a graph in order of their distance from a specific starting vertex, called the root. During the traversal, the algorithm visits all vertices that are 1 edge/hop away from the root before it visits vertices that are 2 or more edges/hops away.
The BFS algorithm is one of the building blocks of graph analytics workloads. Graph analytics workloads have common use cases such as modeling relationships and processes in physical, biological, social, and information systems. Beamer et al. discuss the BFS algorithm and how it can be parallelized in their paper. Below you can find a basic C++ implementation of the algorithm discussed in the paper.
```cpp
#include <iostream>
#include <vector>

#include "graph.h"

#ifdef GEM5
#include "gem5/m5ops.h"
#endif

int main()
{
    std::vector<int> frontier;
    std::vector<int> next;

    frontier.clear();
    next.clear();
    frontier.push_back(0);

    std::cout << "Beginning BFS ..." << std::endl;

#ifdef GEM5
    m5_work_begin(0,0);
#endif

    while (!frontier.empty()) {
        for (auto vertex: frontier) {
            int start = columns[vertex];
            int end = columns[vertex + 1];
            for (int i = start; i < end; i++) {
                int neighbor = edges[i];
                if (visited[neighbor] == 0) {
                    visited[neighbor] = 1;
                    next.push_back(neighbor);
                }
            }
        }
        frontier = next;
        next.clear();
    }

#ifdef GEM5
    m5_work_end(0,0);
#endif

    std::cout << "Finished BFS." << std::endl;
    return 0;
}
```
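The `columns` and `edges` arrays provided by `graph.h` are a compressed-sparse-row style layout: `columns[v]` and `columns[v + 1]` bound the slice of `edges` that holds the neighbors of vertex `v`. The short Python sketch below illustrates that layout with a small made-up graph; it mirrors the traversal above and is only meant to clarify the data structure.

```python
# A tiny, made-up graph in the same layout that graph.h provides to the C++
# code above: columns[v] .. columns[v + 1] is the range of indices into edges
# that holds the neighbors of vertex v.
columns = [0, 2, 4, 5, 6]      # offsets, one entry per vertex plus one
edges   = [1, 2, 0, 3, 3, 1]   # concatenated neighbor lists

visited = [0] * (len(columns) - 1)
frontier, next_frontier = [0], []

while frontier:
    for vertex in frontier:
        for i in range(columns[vertex], columns[vertex + 1]):
            neighbor = edges[i]
            if visited[neighbor] == 0:
                visited[neighbor] = 1
                next_frontier.append(neighbor)
    frontier, next_frontier = next_frontier, []

print(visited)  # every reachable vertex ends up marked as visited
```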
Take a look at `workloads/bfs_workload.py` to learn more about instantiating a BFS workload.
Bubble sort is a classic introductory sorting algorithm. This program sorts an array in ascending order by replacing each element with the smallest element to its right. Below you can find the C++ implementation of bubble sort.
```cpp
#include <iostream>

#include "array.h"

#ifdef GEM5
#include "gem5/m5ops.h"
#endif

int main()
{
    std::cout << "Beginning bubble sort ... " << std::endl;

#ifdef GEM5
    m5_work_begin(0,0);
#endif

    for (int i = 0; i < ARRAY_SIZE - 1; i++) {
        for (int j = i + 1; j < ARRAY_SIZE; j++) {
            if (data[i] > data[j]) {
                int temp = data[i];
                data[i] = data[j];
                data[j] = temp;
            }
        }
    }

#ifdef GEM5
    m5_work_end(0,0);
#endif

    std::cout << "Finished bubble sort." << std::endl;
    return 0;
}
```
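As a rough characterization of this workload (a property of the code above, not an assignment requirement): for an array of `N` elements the inner comparison executes `N(N-1)/2` times regardless of the input values, and each comparison reads `data[i]`, which a swap earlier in the same inner loop may have just written, so when swaps occur the inner-loop iterations form a dependence chain through `data[i]`.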
The N-Queens problem is a classic computer science problem. It involves placing N chess queens on an N×N chessboard so that no two queens threaten each other. This means that no two queens can share the same row, column, or diagonal. The most common method to solve the N-Queens problem is backtracking. Below you can find a C++ implementation:
```cpp
#include <bits/stdc++.h>

#ifdef GEM5
#include "gem5/m5ops.h"
#endif

void printSolution(int **board, int N) {
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf(" %d ", board[i][j]);
        printf("\n");
    }
}

bool isSafe(int **board, int row, int col, int N) {
    for (int i = 0; i < col; i++)
        if (board[row][i])
            return false;
    for (int i = row, j = col; i >= 0 && j >= 0; i--, j--)
        if (board[i][j])
            return false;
    for (int i = row, j = col; j >= 0 && i < N; i++, j--)
        if (board[i][j])
            return false;
    return true;
}

bool solveNQUtil(int **board, int col, int N) {
    if (col >= N)
        return true;
    for (int i = 0; i < N; i++) {
        if (isSafe(board, i, col, N)) {
            board[i][col] = 1;
            if (solveNQUtil(board, col + 1, N))
                return true;
            board[i][col] = 0;
        }
    }
    return false;
}

bool solveNQ(int **board, int N) {
    if (solveNQUtil(board, 0, N) == false) {
        printf("Solution does not exist");
        return false;
    }
    return true;
}

int main(int argc, char *argv[]) {
    // the user needs to input the size of the chessboard
    if (argc == 2) {
        int size = atoi(argv[1]);

        // initialize the board
        int **board = new int*[size];
        for (int i = 0; i < size; i++)
            board[i] = new int[size];
        for (int i = 0; i < size; i++) {
            for (int j = 0; j < size; j++) {
                board[i][j] = 0;
            }
        }

#ifdef GEM5
        m5_work_begin(0,0);
#endif

        solveNQ(board, size);

#ifdef GEM5
        m5_work_end(0,0);
#endif

        printSolution(board, size);
    }
    else {
        printf("N-Queens program. Usage \n $ ./queens <chess-board-size>\n");
    }
    return 0;
}
```
We use an input of 18 for the chessboard size in the workload.
In this assignment, you will use the following base configuration for your computer system:

- `RISCVBoard` as your main board. You can find the model for the board in `components/boards.py`.
- `MESITwoLevelCache` as your cache hierarchy. You can find its model in `components/cache_hierarchies.py`.
- `DDR4` as your memory. You can find its model in `components/memories.py`.
- `1 GHz` as your clock frequency (`clk_freq`).

For your processor, you are going to use `OutOfOrderCPU` to model both high-performance and efficient processor cores.
`OutOfOrderCPU` is based on `O3CPU`, which is an internal model of gem5. Read up on the `O3CPU` in gem5’s documentation. Below, you can find details about `OutOfOrderCPU` and its parameters.
For the purposes of this assignment, you need to configure a processor with the same width in all pipeline stages. In the constructor of `OutOfOrderCPU`, this attribute is named `width`. The width is one of the main constraints on how many instructions can be processed in parallel.
The ROB size is the number of entries in the reorder buffer (ROB). This constrains the number of instructions which can be “in flight” in the processor. In the constructor of `OutOfOrderCPU`, this attribute is named `rob_size`. The number of in-flight instructions is also a main constraint on how many instructions can be processed in parallel. If instructions take more than one cycle to execute, you need enough space in the ROB to keep track of all of the instructions that have been issued but not yet committed. If you don’t have enough space in the ROB, the processor will stall.
The number of physical registers is the number of registers in the physical register file. A processor renames architectural registers to physical registers to resolve false dependences. It also tracks true (read-after-write) dependences through the register file. To learn more about register renaming, read up on Tomasulo’s algorithm.
`OutOfOrderCPU` has two physical register files: one for integer registers and one for floating-point registers. In the constructor of `OutOfOrderCPU`, `num_int_regs` refers to the number of integer physical registers and `num_fp_regs` refers to the number of floating-point physical registers.
NOTE: Both register files must have more than 32 entries or gem5 will hang. This is because the number of physical registers must be at least as large as the number of logical registers, and the RISC-V ISA defines 32 logical registers.
The number of physical registers can also constrain the amount of parallelism in the processor. If there are not enough physical registers, the processor will stall.
So, you need to balance the width, the number of physical registers, and the ROB size to get the best performance per unit area or per unit power. If the processor is out of balance, you will be wasting resources and not getting the best performance.
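One rough way to think about this balance (a back-of-the-envelope heuristic, not something the assignment prescribes): if the core is `width` instructions wide and the average instruction stays in flight for some number of cycles, you need roughly `width × latency` instructions in flight to keep the core busy, one ROB entry for each of them, and a physical register for each in-flight instruction that produces a result, on top of the 32 architectural registers. A minimal sketch of that arithmetic, with assumed example numbers:

```python
# Back-of-the-envelope sizing heuristic (assumed example numbers, not the
# assignment's configurations): estimate how many instructions must be in
# flight to keep a core of a given width busy, then size the ROB and the
# physical register files to hold them.
width = 8            # target instructions per cycle (assumption)
avg_latency = 20     # average cycles an instruction stays in flight (assumption)

in_flight = width * avg_latency   # Little's law style estimate of in-flight instructions
rob_size = in_flight              # one ROB entry per in-flight instruction
num_int_regs = 32 + in_flight     # 32 architectural registers plus one rename register per producer
num_fp_regs = 32 + in_flight

print(f"ROB: {rob_size}, int regs: {num_int_regs}, fp regs: {num_fp_regs}")
```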
You are already given a run script which runs the “comparch-benchmarks”.
We have also included the output for two different configurations of the `OutOfOrderCPU` in the `results` directory.
The first configuration we call the “little” core, and your job is to find the main bottlenecks in this design. The second configuration we call the “big” core. The “big” core has a large width, a large ROB size, and a large number of physical registers, leading to a very large area. You can see below that the speedup of the “big” core over the “little” core is not as large as you might expect. The details of the configurations and the performance results are shown below.
| Core Type | Width | ROB Size | Number of Integer Registers | Number of Floating Point Registers |
|---|---|---|---|---|
| little | 4 | 32 | 64 | 64 |
| big | 12 | 384 | 512 | 512 |
Note: There’s a bug in gem5’s out-of-order model such that when running with small widths sometimes there’s an assertion failure. Therefore, we have set the width to 4 for the “little” core.
| Workload | Little Core IPC | Big Core IPC | Speedup |
|---|---|---|---|
| BFS | 0.525410 | 0.809980 | 1.54 |
| Bubble Sort | 0.879611 | 0.895811 | 1.02 |
| Matrix Multiplication | 1.608251 | 1.685272 | 1.05 |
| N-Queens | 1.000015 | 1.122047 | 1.12 |
Now that you have to run multiple workloads for multiple configurations, you need to automate the process.
You can use the `multisim` feature of gem5 to run multiple simulations in parallel.
Note: `multisim` has some bugs, so if you experience hangs, you can run workloads one-by-one. However, we still suggest using `multisim` even if you run experiments manually, because it helps keep things organized.
We have provided a script, `experiments.py`, that runs the big and little designs. You can extend this script to run your experiments.
In `experiments.py`, we first define a function to get the board. Then, we have a set of configurations that we want to run. You can uncomment the next lines to print the area of each configuration (used to answer some of the questions). Finally, there is a nested for loop that runs each benchmark for each configuration. In this for loop, instead of calling `simulator.run()`, we add the simulator to the multisim object, which will run all of the simulations in parallel when we run the script with the multisim module (as shown below).
```python
from components import (
    RISCVBoard,
    MESITwoLevelCache,
    DDR4,
    OutOfOrderCPU,
)

from gem5.simulate.simulator import Simulator
from gem5.utils.multisim import multisim
from gem5.resources.resource import obtain_resource

multisim.set_num_processes(10)

def get_board(width, rob_size, num_int_regs, num_fp_regs):
    cache = MESITwoLevelCache()
    memory = DDR4()
    cpu = OutOfOrderCPU(width, rob_size, num_int_regs, num_fp_regs)
    board = RISCVBoard(
        clk_freq="1GHz", processor=cpu, cache_hierarchy=cache, memory=memory
    )
    return board

configurations = {
    "little": {"width": 4, "rob_size": 32, "num_int_regs": 64, "num_fp_regs": 64},
    "big": {"width": 12, "rob_size": 384, "num_int_regs": 512, "num_fp_regs": 512},
}

# for name, config in configurations.items():
#     board = get_board(**config)
#     print(f"Area of {name} configuration: {board.get_processor().get_area_score()}")
# exit(0)

for workload in obtain_resource("comparch-benchmarks"):
    for name in ["little", "big"]:
        config = configurations[name]
        board = get_board(**config)
        board.set_workload(workload)
        simulator = Simulator(board=board, id=f"{name}-{workload.get_id()}")
        multisim.add_simulator(simulator)
```
To run the script with the multisim module, you can use the following command.
```sh
gem5 -re -m gem5.utils.multisim experiments.py
```
When you run this, you will find the output in the `m5out` directory, with one subdirectory for each simulation. The names come from the `id` parameter of the `Simulator` object. We also use the `-re` flag to redirect the stdout and stderr of each simulation to files in the `m5out` directory.
You can list all of the names of the simulations with the following command.
```sh
gem5 experiments.py --list
```
Then, you can run a specific simulation with the following command.
```sh
gem5 experiments.py <id>
```
It is particularly useful to run a single simulation if you change some parameters and only need to re-run a small subset of the simulations.
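Once the simulations finish, each simulation’s directory under `m5out` contains a `stats.txt` file. Below is a minimal post-processing sketch for pulling out IPC values; the exact name of the IPC statistic depends on the board and processor configuration, so the `.ipc` suffix filter here is an assumption you may need to adjust.

```python
from pathlib import Path

# Walk every simulation's output directory and print any statistic whose
# name ends in ".ipc". Adjust the filter once you know the full stat path
# for your configuration.
for stats_file in sorted(Path("m5out").glob("*/stats.txt")):
    print(f"==== {stats_file.parent.name} ====")
    for line in stats_file.read_text().splitlines():
        fields = line.split()
        if fields and fields[0].endswith(".ipc"):
            print(line)
```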
Remember, the research question you are trying to answer is: What is the bottleneck in the performance of an out-of-order processor?
You will run three experiments to answer this question, one to test each candidate bottleneck: the width, the number of physical registers, and the reorder buffer size.
Run your experiments using the `experiments.py` script; one way to organize the additional configurations is sketched below.
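This sketch is an illustration only; the parameter values below are placeholders, not the values you are required to use. It adds entries to the `configurations` dictionary in `experiments.py`, each changing a single parameter relative to the “little” baseline so that its effect can be isolated.

```python
# Hypothetical sweep: each configuration changes one parameter relative to
# the "little" baseline. The chosen values are placeholders; pick your own
# based on your analysis.
configurations = {
    "little":    {"width": 4, "rob_size": 32,  "num_int_regs": 64,  "num_fp_regs": 64},
    "wide":      {"width": 8, "rob_size": 32,  "num_int_regs": 64,  "num_fp_regs": 64},
    "big-rob":   {"width": 4, "rob_size": 128, "num_int_regs": 64,  "num_fp_regs": 64},
    "more-regs": {"width": 4, "rob_size": 32,  "num_int_regs": 128, "num_fp_regs": 128},
}
```

If you do this, remember to update the loop over `["little", "big"]` in the provided script to iterate over whichever configuration names you add.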
After running the experiments, compile the results into a table, and include the speedup of each configuration over the “little” core. Using the `board.get_processor().get_area_score()` method, calculate the area of each configuration. Which design is the most area-efficient? The workloads show different behaviors, with some more sensitive to the width of the processor and others more sensitive to the number of physical registers or the ROB size.
You will submit this assignment via GitHub Classroom.
Answer the questions in the `questions.md` file. Make sure you include your run script, an explanation of how to use your script, and the answers to the questions in the `questions.md` file.
Include a detailed explanation of how to use your script and how you use your script to generate your answers. The code included in the “Example command to run the script” section should be able to be copied and pasted into a terminal and run without modification.
You are required to work on this assignment individually. You may discuss high level concepts with others in the class but all the work must be completed by yourself.
Remember, DO NOT POST YOUR CODE PUBLICLY ON GITHUB! Any code found on GitHub that is not the base template you are given will be reported to SJA. If you want to sidestep this problem entirely, don’t create a public fork and instead create a private repository to store your work.