Computer Organization and Design
Section 4.10
There is a limit to the rate we can execute instructions in a pipelined design. Specifically, things like loads and branches cause hazards which push our CPI above 1 by forcing us to insert bubbles and stall the processor. In the video below, we explore how we can reduce the number of hazards by slightly changing our program (or, in other words, leveraging the compiler).
This is a video on how to get parallelism by statically finding independent instructions and how to improve performance with static instruction scheduling.
So, now that we have seen how we can use the compiler to statically “rewrite” our code to hide/overcome hazards and stalls in our pipeline, let’s think of how we can take this to the next step. Part of what we did above was find instructions that were independent which means that there is no reason not to execute them at the same time. Using this insight, we can develop a new kind of ISA in which we can explicitly tell the hardware what instructions are independent so that the hardware can execute them in parallel.
How would you modify the pipelined CPU design discussed previously to implement a VLIW instruction set?
Obviously, there are pros and cons to VLIW-like ISAs. In fact, as you can probably guess, there are more cons than pros (we haven’t talked about VLIW until now, right?). The next video discusses some of why we don’t see VLIW systems today.
So, it’s difficult to always find independent instructions statically. But what about dynamically?
Can you think of a way to modify the pipelined design to issue two instructions (or more!) every cycle?
Rather than listen to me talk about the history of the Itanium, let’s go back to the Turing Lecture and have Dave Patterson explain why it failed!
Use Canvas to complete the quiz
Computer Organization and Design
Section 4.10
Now, can we do the same thing as we discussed above, but do it automatically in hardware?
Dynamic Instruction Scheduling Conway et al.
https://ai.eecs.umich.edu/people/conway/ACS/DIS/DIS.pdf
DIS is one possible implementation of dynamic ILP. It’s one of the first descriptions of a hardware mechanism for ILP. Interestingly, this idea was lost for decades. Computer historians overwrote this history, and if you read your book or most other references it will say that the first algorithm for dynamically issuing instructions is Tomasulo’s algorithm.
Norm Hardy said of DIS: “It was superscalar, and the instructions were rather simple – it was the first design that I had heard of with the idea of issuing more than one instruction per clock cycle – I can recall thinking that that was barely credible at the beginning. It turns out, in retrospect, to be the right way to build a machine. But I can remember being quite in awe of the very idea–in fact, I thought that executing one instruction per clock cycle was quite remarkable”
I strongly encourage you to read about Lynn Conway’s story. Her website is a treasure trove of fascinating computer architecture history. Not only did she likely invent the first method for out of order execution, but she essentially started the VLSI revolution that was a major driver of Moore’s Law!
She has some great talks that she’s given as well. If you have time, please check these out. It’s a great way to see some insight into the human aspects of computer science history. This video is long, but similar to the talk she gave at Davis a couple of years ago. At around the 20 minute mark is where she starts talking about computer architecture.
The dynamic instruction scheduling algorithm developed by Lynn Conway is a quite elegant algorithm. Interestingly, as I discuss in the video below, I didn’t learn about this until a couple of years ago when Lynn came and gave a talk at UC Davis. I discuss more details in the video, but Lynn is a transgender woman, and this identity, at least in part, caused her amazing innovations to be overshadowed by others. This is an example of a less privileged person being “talked over” by history.
The video below discusses some of Lynn’s story (see above for more details) and begins to introduce dynamic instruction scheduling. This video also introduces the concept of “false” dependencies like write after write or WAW and write after read or WAR.
Now, this video dives deeper into the details of the DIS algorithm and how it can be implemented in hardware.
Finally, this video discusses how to overcome WAW and WAR hazards by using register renaming.
Do you think that a dynamically issued out of order processor with register renaming (wow, that’s a mouthful!) consumes more or less power than an in order pipeline (like assignment 3)?
If so, how would you figure out if this is extra power is “worth it”?
Use Canvas to complete the quiz