DINO CPU Assignment 4: The Why of Caching.
Originally from ECS 154B Lab 4, Winter 2023.
Caching is among the most influential ideas in computing. In fact, the concept reaches many fields of computer science. Examples can be found everywhere, from almost all hardware architectures, such as translation lookaside buffers (TLBs) and memory cache hierarchies, to various software designs, such as DNS caching or web caching.

In essence, the idea of caching is to store previously acquired data in a cache that is close to the component that uses the data. The idea relies on the assumption that the cost of making a copy of the data is low, and that the cost of querying for the data in a cache is lower than the cost of acquiring the data again. As a result, if the data is frequently requested, the cache can bring down the average cost of requesting that data. In other words, we are trading fast storage capacity for a lower average cost of acquiring the data.

However, caching does not inherently improve the performance of a system. Caching adds the extra cost of querying for the existence of data in the cache, and since the capacity of a cache is limited, the cache system needs extra work to maintain the entries in the cache. Due to physical constraints, such as area, power, and latency, caches closer to the core tend to have drastically smaller capacity than the ones further away from the core. Thus, designing a performant memory cache in a core poses a significant challenge to designers.

In this assignment, we will investigate the performance of a computer system with a pipelined CPU core, a memory system, and various cache design decisions. The assignment is structured as follows: we will introduce each component of the computer system that we are going to investigate, then we will introduce the benchmarks that will subsequently be used for performance evaluation and performance analysis. In terms of implementation, most of the system is already implemented. However, we will ask you to complete the hazard detection unit for the new pipelined CPU.
The GitHub Classroom page for the class is located at https://github.com/ECS154B-WQ24.
The assignment 4 template repo is located at https://github.com/ECS154B-WQ24/dinocpu-assignment4.
Use the following link to access assignment 4: https://classroom.github.com/a/j83-CMv1.

The above link will automatically create a repo in the GitHub Classroom page that only you have access to.

In the event that the template repo is updated, your own repo won't be automatically updated. You don't need to keep track of the template repo unless we find an error in the assignment, in which case we will make an announcement on Piazza and provide ways to update your repo.
Figure 1. Illustration of the systems that we use for evaluating different memory cache structures. On each system, the left arrows are wires responsible for sending instruction memory requests/responses, while the right arrows are wires responsible for sending data memory requests/responses.
We will use the same pipeline that we built in assignment 3. However, the memory interface is slightly tweaked to support memory devices whose request latency is unknown to the core [1].

Note that both instructions and data come from memory devices. As a result, extra care should be taken to ensure that correct instructions enter the pipeline and that the correct data are received. For this assignment, you will update the hazard unit to make sure that the core receives correct instructions and data from the cache system.

We will use a one-level cache system for this assignment. We call this a split L1 Cache.

The split L1 Cache consists of two components: an L1 Instruction Cache (L1I) and an L1 Data Cache (L1D). The Instruction Cache is optimized for read-only accesses. The Data Cache supports both read and write operations. Each of the L1I and L1D is a 4-way set-associative cache with 32 entries. The cache block size is 8 bytes. Each cache uses the LRU replacement policy and the write-back write policy.
Each cache component consists of one CacheTable and one CacheMemory.

A CacheMemory is an array of registers containing all cache blocks. Note that CacheMemory only holds the data; it does not hold any metadata.

A CacheTable is a table of CacheEntrys, where each CacheEntry contains the metadata that uniquely identifies a cache block, metadata indicating the status of the cache block, and a pointer to the cache block's data in the CacheMemory.

A CacheEntry is structured as follows,
```scala
class CacheEntry(val numEntryIndexingBits: Int, val numTagBits: Int) extends Bundle {
  val valid  = Bool()                        // Whether this entry contains valid data.
  val dirty  = Bool()                        // Whether this entry has been written to by the core. Defaults to false; set to true upon a write request.
  val tag    = UInt(numTagBits.W)            // Contains the tag bits.
  val memIdx = UInt(numEntryIndexingBits.W)  // Where the block is in the cache memory.
  val age    = UInt(32.W)                    // When the most recent access to this cache block happened.
}
```
For example,

idx 27 -> CacheEntry(valid -> 1, dirty -> 1, tag -> 211, memIdx -> 27, age -> 17614)

means,
- valid -> 1: The entry contains valid data.
- dirty -> 1: The entry has been written to.
- tag -> 211: The tag bits are 211 in decimal, or 0xd3 in hexadecimal.
- memIdx -> 27: The 28th entry of CacheMemory holds the data of this cache block.
- age -> 17614: The most recent access to this cache block was in cycle 17614.

The RAM Device is the lowest-level memory device in the DINOCPU system. All accesses to the RAM device incur a latency of 30 CPU cycles.
For this assignment, we evaluate the effectiveness of the cache system using a benchmark called stream, which is inspired by the popular STREAM benchmark. The workloads are named as follows,

stream-<n>-stride-<stride>[-noverify].riscv

where,
- n: the number of memory accesses per array per iteration.
- stride: the distance between two consecutive elements.
- -noverify: if present, the workload does not verify the result array.

For this assignment, we will use stream-64-stride-1.riscv and stream-64-stride-4.riscv for verifying the correctness of the pipelined implementation. The stream-64-stride-1-noverify.riscv and stream-64-stride-4-noverify.riscv workloads are used for performance evaluation.
Figure 2. The main part of the stream benchmarks consists of calling the copy() function twice. When running each of the benchmarks, the CPU spends most of its time executing the instructions inside the copy() function.
In this part, you will complete the hazard detection unit for the non-combinational pipelined CPU in components/hazardnoncombin.scala. The descriptions at the top of that file explain the meaning of each signal.

The following subsections describe the changes in the cores as well as in the memory. Note that you don't have to change any component other than the hazard unit for the non-combinational core.
Figure 3. The new memory interface, which includes a bundle of valid/ready/good signals that are used for synchronization between the core and the memory subsystem. Both imem and dmem use the new memory interface. Other than the new signals, the input signals and output signals of the memory are the same. E.g., for the instruction memory interface, the memory request consists of the address signal, and the memory response consists of the instruction signal.

- The memory response is only valid when the good signal is 1. This means that, if the good signal is 0, the response might contain garbage.
- The good signal is only set to one for 1 cycle. This means the response is only guaranteed to be correct in that cycle.

The pipelined CPU that is compatible with non-combinational memory components is located in the pipelined/cpu-noncombin.scala file. You don't have to modify the cpu-noncombin.scala file.
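To make the timing of the good signal concrete, here is a small illustrative model in plain Scala (the names MemResponse, respond, etc. are hypothetical and not the actual DINOCPU interface). It shows why the core has to sample the response exactly in the cycle good is high: in every other cycle the response wires may carry garbage.

```scala
// Toy model of a memory that answers after `latency` cycles and pulses `good`
// for exactly one cycle; outside that cycle the response field is garbage.
object GoodPulseDemo {
  case class MemResponse(good: Boolean, instruction: Long)

  def respond(cycle: Int, requestCycle: Int, latency: Int, data: Long): MemResponse =
    if (cycle == requestCycle + latency) MemResponse(good = true, instruction = data)
    else MemResponse(good = false, instruction = 0xBADBADL) // garbage when good is low

  def main(args: Array[String]): Unit = {
    var captured: Option[Long] = None
    for (cycle <- 0 until 6) {
      val resp = respond(cycle, requestCycle = 0, latency = 3, data = 0x00000013L) // a nop
      if (resp.good) captured = Some(resp.instruction) // sample only while good is high
      println(f"cycle=$cycle good=${resp.good} captured=$captured")
    }
  }
}
```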
The DINO CPU has a new memory interface for both imem and dmem for this assignment. The code for the Non-Combinational Pipelined CPU is mostly the same as the Pipelined CPU. However, we are going to use the HazardUnitNonCombin rather than the HazardUnit.
There are tools implemented for debugging the hazard unit,

runMain dinocpu.singlestep <test_name> 1 pipelined-non-combin

- You can uncomment this line here in cpu-noncombin.scala to see the PCs of the instructions in the pipeline for each cycle.
- You can uncomment this code block here in cpu-noncombin.scala to see the sequence of PCs of the instructions that are in the writeback stage.
For debugging the hazard unit when running large applications, it is useful to compare the committed instruction traces of a benchmark when running the single-cycle cpu and the pipelined-non-combin cpu.
An instruction is "committed" when it has completed its execution. For the single-cycle CPU, since it executes every instruction in 1 cycle, every instruction the CPU executes is committed. In a 5-stage pipelined CPU, an instruction is committed when it goes through all 5 stages. Since an instruction that reaches the WB stage will complete all of its 5 stages, you can track all instructions that reach the WB stage to see the sequence of committed instructions. Since each binary has one program order regardless of the microarchitecture, if the 5-stage CPU is correctly implemented, its commit trace must be exactly the same as the single-cycle CPU commit trace for every binary.
By comparing the traces, you can figure out the problematic cycle, and use the single stepper to step to that cycle and see what happened in that cycle.
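For example, if you dump the committed PCs of each run to a text file (the file names below are hypothetical; use whatever output you get from uncommenting the writeback print), a small sketch like the following can locate the first committed instruction where the two traces diverge:

```scala
// Compares two committed-PC traces, one PC per line, and prints the first divergence.
import scala.io.Source

object CompareTraces {
  def main(args: Array[String]): Unit = {
    val singleCycle = Source.fromFile("trace-single-cycle.txt").getLines().toSeq
    val pipelined   = Source.fromFile("trace-pipelined-non-combin.txt").getLines().toSeq

    val firstDiff = singleCycle.zip(pipelined).indexWhere { case (a, b) => a != b }
    if (firstDiff == -1 && singleCycle.length == pipelined.length)
      println("Commit traces are identical.")
    else {
      val idx = if (firstDiff == -1) math.min(singleCycle.length, pipelined.length) else firstDiff
      println(s"Traces diverge at committed instruction #$idx")
      println(s"  single-cycle : ${singleCycle.lift(idx).getOrElse("<end of trace>")}")
      println(s"  pipelined    : ${pipelined.lift(idx).getOrElse("<end of trace>")}")
    }
  }
}
```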
Lab4 / testOnly dinocpu.SmallTestsPipelinedTesterLab4
Lab4 / testOnly dinocpu.FullApplicationsPipelinedTesterLab4
To run a test,

runMain dinocpu.simulate_predefined_system <system_name> <riscv_binary>

where,
- <system_name> is one of default, single-cycle, pipelined, system1, system2, system3, system4, system5, system6, system7, and system8.

The default system should be used for testing purposes. The default system is structurally the same as system1 except that the RAM latency is 1 cycle instead of the 30 cycles used in the other systems. The single-cycle and pipelined systems are both without caches and can access the RAM with a latency of 0 cycles.
Notes:
- SmallTestsPipelinedTesterLab4 tests: It is expected that, if you are using a system with data caches, some of the store tests (the ones starting with the s prefix) will fail because there is data in the cache that has not been written back to the RAM yet! Those store tests only check the data in RAM rather than in the cache, hence the test failures.
- FullApplicationsPipelinedTesterLab4 tests: This test suite includes stream-64-stride-1.riscv and stream-64-stride-4.riscv, which are used to ensure that the benchmarks we use in Part II and Part III run correctly.

For this assignment, we will consider 4 pipelined systems and 4 single-cycle systems,
System 4: Non-Combinational Pipelined CPU + Instruction Cache + Data Cache
To run a benchmark on a system,

runMain dinocpu.simulate_predefined_system <system_name> <riscv_binary>

where,
- <system_name> is one of default, single-cycle, pipelined, system1, system2, system3, system4, system5, system6, system7, and system8. The default system should be used for testing purposes. The default system is structurally the same as system1 except that the RAM latency is 1 cycle instead of the 30 cycles used in the other systems. The single-cycle and pipelined systems are both without caches and can access the RAM with a latency of 0 cycles.
- <riscv_binary> is the name of the benchmark.

Notes: Since there are multiple ways of dealing with timing in a CPU, each correct implementation might differ slightly in terms of optimality, and thus we don't expect to see exactly the same number of cycles for each data point when comparing your implementation with the solution. However, as long as correctness is ensured, the general trend of the number of cycles should not differ. I.e., there are some systems that are strictly slower than others. Therefore, we opt to use graphs rather than exact numbers for reporting data.
Notes: You don’t have to report raw data or how the calculation was done for question 2, question 3, and question 4. If you are unsure about the correctness of the graph, you can show the formula that you used to generate the data for the graph. However, the graphs must have the X-axis and Y-axis with labels and units.
Determine the number of dynamic instructions of the stream-64-stride-1-noverify.riscv and stream-64-stride-4-noverify.riscv benchmarks.

Hint: Regardless of the microarchitecture, the number of dynamic instructions must be the same for each binary compiled using the RV64IM instruction set. The single-cycle system might be useful for this question as a memory request is completed within a cycle.

Hint: DINOCPU simulations output the number of cycles, but do not output the number of simulated instructions. Part of the Iron Law might be useful for the calculation step.
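One way to read these two hints together, assuming the single-cycle system really does complete exactly one instruction per cycle when the RAM latency is 0 cycles (so its CPI is 1), is sketched below; the cycle count is a placeholder you would read from your own simulation output.

```scala
// Iron Law: cycles = instructions x CPI, so instructions = cycles / CPI.
// On the single-cycle system we assume CPI = 1, so the dynamic instruction
// count equals the reported cycle count.
object DynamicInstructionCount {
  def main(args: Array[String]): Unit = {
    val singleCycleCycles = 100000L // placeholder: take this from the simulation output
    val singleCycleCPI    = 1.0     // assumption: one instruction per cycle on single-cycle
    val dynamicInsts      = (singleCycleCycles / singleCycleCPI).toLong
    println(s"dynamic instructions = $dynamicInsts")
  }
}
```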
Create a graph representing the CPI of the non-combinational CPU in the 8 systems described above with the stream-64-stride-1-noverify.riscv benchmark and the stream-64-stride-4-noverify.riscv benchmark. The X-axis should be grouped by system.
Figure 4. An example of a CPI graph with the collected data. The data on the graph are not the real CPI of running the mentioned benchmarks on the four systems.
Assume that the pipelined non-combinational CPU is clocked at 2.5 GHz and the single-cycle non-combinational CPU is clocked at 1.5 GHz. Create a graph representing the execution time of the 8 systems with the stream-64-stride-1-noverify.riscv benchmark and the stream-64-stride-4-noverify.riscv benchmark.
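Since execution time is just the cycle count divided by the clock frequency (equivalently, instructions x CPI / frequency), a small sketch of the bookkeeping might look like the following; the cycle counts are placeholders, and the frequencies are the ones given in the question.

```scala
// Execution time = cycles / frequency = (instructions x CPI) / frequency.
object ExecutionTime {
  def seconds(cycles: Long, clockHz: Double): Double = cycles / clockHz

  def main(args: Array[String]): Unit = {
    val pipelinedHz   = 2.5e9   // pipelined non-combinational CPU
    val singleCycleHz = 1.5e9   // single-cycle non-combinational CPU

    val pipelinedCycles   = 200000L // placeholder: from a pipelined-system simulation
    val singleCycleCycles = 150000L // placeholder: from a single-cycle-system simulation

    println(f"pipelined    : ${seconds(pipelinedCycles, pipelinedHz) * 1e6}%.2f us")
    println(f"single-cycle : ${seconds(singleCycleCycles, singleCycleHz) * 1e6}%.2f us")
  }
}
```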
Assume that the pipelined non-combinational CPU is clocked at 2.5 GHz. Create a graph illustrating the effective bandwidth of system 4 when running each of the stream-64-stride-1-noverify.riscv and stream-64-stride-4-noverify.riscv benchmarks. Effective bandwidth is defined as the amount of data used by the core per second. For example, when the core executes an ld instruction, it uses 8 bytes (64 bits) of data in that cycle.
Note: The unit can be bytes/second, or a multiple of bytes/second, such as KiB/second and MiB/second.
Note: Instructions are not counted as data.
Hint: Even though a store instruction results in a memory read request and a memory write request, the number of processed elements is still one element per store instruction. However, the number of cache accesses per store instruction is two. Therefore, the cache stats are not reliable for calculating the effective memory bandwidth without further analysis.
Hint: The number of data elements can be derived from the assembly code or the C code depicted in Figure 2. The vast majority of memory accesses of the benchmarks are produced by the loops in Figure 2, so you can ignore the memory accesses outside these loops.
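Putting the definition and the two hints together, a sketch of the effective-bandwidth arithmetic might look like the following. The load/store counts and cycle count are placeholders to be derived from the copy() loops in Figure 2 and from your system 4 runs; the only hard numbers assumed here are the 8 bytes per element and the 2.5 GHz clock from the question.

```scala
// Effective bandwidth = bytes of data used by the core / execution time.
// A store still counts as one 8-byte element even though it causes two cache accesses.
object EffectiveBandwidth {
  def main(args: Array[String]): Unit = {
    val clockHz      = 2.5e9    // pipelined non-combinational CPU clock
    val bytesPerElem = 8L       // an ld/sd moves 8 bytes (64 bits)

    val loads  = 128L           // placeholder: count the loads issued by the copy() loops
    val stores = 128L           // placeholder: count the stores issued by the copy() loops
    val cycles = 100000L        // placeholder: cycle count of the system 4 simulation

    val secs      = cycles / clockHz
    val bytesUsed = (loads + stores) * bytesPerElem
    val bandwidth = bytesUsed / secs // bytes per second
    println(f"effective bandwidth = ${bandwidth / (1 << 10)}%.2f KiB/s")
  }
}
```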
Create a graph representing the L1 data cache hit ratio and the L1 instruction cache hit ratio when running each of the stream-64-stride-1-noverify.riscv and stream-64-stride-4-noverify.riscv benchmarks with system 4.
Between the data cache and the instruction cache, which cache do you think has more impact on performance? Explain why using the data from Part II.
From the data in Part II, you should see that system 4 performs better when running stream-64-stride-1-noverify.riscv compared to running stream-64-stride-4-noverify.riscv. Explain why using the data from Part II.
Use the CPI and execution time from Part II to compare the performance of the pipelined and single-cycle non-combinational CPUs. Explain why the single-cycle non-combinational CPU might have a lower CPI than the pipelined non-combinational CPU.

Hint: Look at the source code of the benchmarks and think about which hazards exist in the pipelined CPU but not in the single-cycle CPU.
Part I will be automatically graded on Gradescope. See the Submission section for more details.
Part | Percentage |
---|---|
Part 4.1 | 20% |
Part 4.2 Question 1 | 5% |
Part 4.2 Question 2 | 10% |
Part 4.2 Question 3 | 10% |
Part 4.2 Question 4 | 15% |
Part 4.3 Question 5 | 10% |
Part 4.3 Question 6 | 10% |
Part 4.3 Question 7 | 10% |
Part 4.3 Question 8 | 10% |
Extra Credits | 10% |
Warning: read the submission instructions carefully. Failure to adhere to the instructions will result in a loss of points.
You will upload the following files to Gradescope on the Assignment 4 - Code Portion assignment,
src/main/scala/components/hazardnoncombin.scala
Once uploaded, Gradescope will automatically download and run your code. This should take less than 40 minutes. For each part of the assignment, you will receive a grade. If all of your tests are passing locally, they should also pass on Gradescope unless you made changes to the I/O, which you are not allowed to do.
Note: Make sure that you comment out or remove all printing statements in your submissions. There are a lot of long tests for this assignment. The auto-grader might fail to complete grading within the allocated time if there are too many output statements.
You will upload your answers for the Assignment 4 - Written Portion assignment to Gradescope. Please upload a separate page for each answer! Additionally, I believe Gradescope allows you to circle the area with your final answer. Make sure to do this!
We will not grade any questions that we are unable to read. Be sure to check your submission to make sure it's legible, right-side-up, etc.
You are to work on this project individually. You may discuss high level concepts with one another (e.g., talking about the diagram), but all work must be completed on your own.
Remember, DO NOT POST YOUR CODE PUBLICLY ON GITHUB! Any code found on GitHub that is not the base template you are given will be reported to SJA. If you want to sidestep this problem entirely, don’t create a public fork and instead create a private repository to store your work. GitHub now allows everybody to create unlimited private repositories for up to three collaborators, and you shouldn’t have any collaborators for your code for this assignment.
Improve the performance of the non-combinational pipelined CPU when running any of the applications in the Full Application tests and their -loops-unrolled variants.
You can find the tests here
and here.
The new CPU performance should match one of the following cases,
Constraints:
However, you can make any changes to the core, such as adding more components to the core. You can also change the internals of the memory components, such as changing the cache replacement policy, allowing the memory interface to send a response and receive a request within the same cycle, or optimizing the state diagram of the cache component.

To submit the extra credit portion, email either the TAs or the instructor with the following files,
The assignment should show that, on a realistic setup, it is very hard to keep all stages of the 5-stage pipeline busy when every memory access is expensive. Even with fast caches, keeping the IPC for this pipeline close to the ideal 1.0 is unattainable if you want to keep the CPU frequency high.
Not only does the 5-stage pipeline suffer from the problem of memory being slower than the core, but to keep the pipeline as busy as possible, modern architectures have a fetch stage with the ability to fetch and issue multiple instructions at a time, and they speculatively fetch instructions into the instruction cache and data into the data cache. These techniques, along with multi-issue, multi-way execution, and load/store queues, are for exploiting ILP.
However, exploiting ILP can only improve performance so much once the CPU has to run memory-intensive applications. Even though a CPU can hold a number of in-flight load/store instructions at a time, there is a limit on how many requests can be processed in parallel. In the next assignment, you will explore a case where, even though the applications are memory-intensive, if the data can be processed independently, there are DLP techniques in hardware that help increase the throughput of memory accesses.
On the hardware/software interaction side, for this assignment, the simple benchmarks should reveal the effectiveness of the cache system on different program behaviors. Having a cache system, even a simple one like in the DINO CPU, drastically improves the performance of the system on a lot of real-world applications. That is why even a low-end chip has some sort of cache system, and why a high-end chip tends to dedicate a lot of its area to the cache system.
However, a cache system does not always improve performance for all workloads. In fact, by introducing a cache system, we are making the latency of pulling data from memory higher. There are algorithms that are naturally cache-unfriendly, and it is still an open question whether the cache system should be able to adapt to a variety of algorithms, or whether software developers should restructure their programs to maximize the utilization of the cache system.
On the other hand, as you can see from the assignment, the timing within a computer system depends significantly on the behavior of the workload itself, and a small change in a system, like a slightly larger cache size, might significantly change the performance of the system. Thus, performance evaluation of a new system usually relies heavily on simulation. For the next assignment, you'll investigate the performance of a system with an even more sophisticated cache system, where the cache system has to handle accesses from multiple CPU cores while ensuring the correctness of the program. We will use another simulator, gem5, which comes with a variety of cache coherency protocols, allowing us to investigate the performance of a multi-core system.
[1] This is not because the latency of a memory device is hidden from the CPU. This is a complication caused by internal activities of a memory device that are abstracted away by the CPU/Memory interface. Consider a miss in the L1D cache, which causes the L1D cache to pull a cache block from memory into the L1D cache. However, since the cache is full, it needs to evict an entry to memory. Note that the cache sends the response to the CPU as soon as it sends the eviction request to memory. On the next instruction, there could be an L1D cache miss again. However, the memory might be busy with the previous eviction request, so the L1D cache must wait for the memory to complete that request before it can pull a new cache block from memory. Thus, the cache miss latency might vary from access to access.