CS 351 Homework 3/4

Links: [Course Home] [Schedule] [Resources] [Study Guide] [Moodle]


Due Friday, November 10 at 2:00 PM. No late submissions accepted. Solutions will be posted to the course website and Moodle by the end of the day. Please note that this assignment is worth 200 points and counts for 2 homework assignments.

Instructions

Complete the following problems. In order to receive credit, you must show enough work to make it clear how you arrived at your answer.

You may work in groups of up to three students and turn in one solution for the group. Make sure that all group members' names are on the submitted document.

You should submit your assignment as a single PDF via Moodle. It is OK to submit a single PDF consisting of photos of handwritten pages, as long as your work is CLEARLY legible. Corrupted or unreadable files will not receive any credit, and submissions in formats other than PDF may lose points.


Problem 1: Trace the "simple" LEGv8 implementation [40 points]

Use the LEGv8 diagram handed out in class and posted to Moodle to answer this question. A few notes:

Answer the following questions for the following instruction:

CBZ X29, -1

Assume the instruction is stored in memory in bytes 2000-1004. The current value in register X29 is 0.

  1. [8 points] Break down the fields of this instruction. You do not necessarily need to convert it to binary machine code, but you should identify its format (R, I, etc.), and give the value of each of the instruction fields (opcode, rs, etc.).
  2. [24 points] Give the specific numeric (binary, hex, or decimal) values consumed (i.e. input) and produced (i.e. output) by each unit where possible. Remember that most of these units are doing something every cycle, even if that work ends up being thrown away.

    Also, indicate whether the work done by each unit affects the outcome of the instruction. In other words, is the work done by this unit used (even indirectly) to modify the system state, or is it just computed and thrown away?

  3. [8 points] List the value of each control signal for this instruction.

Problem 2: Extend the "simple" LEGv8 implementation [40 points]

The full LEGv8 ISA has a branch-and-link instruction:

BL label

This instruction does two things: changes the PC to (PC + SignExt(label) * 4) for the next instruction, and saves the value (PC+4) in register LR (X30).

For each module listed below, answer the following questions:

Sub-problems

  1. [10 points] Answer the questions above for the register file. (That is, which registers need to be read and written? For register(s) that need to be written, where does the data come from?)
  2. [10 points] Answer the questions above for the ALU.
  3. [10 points] Answer the questions above for the data memory.
  4. [10 points] Answer the questions above for the hardware that selects the next PC.

Problem 3: Timing paths [15 points]

If your processor only had to support ADD instructions, calculate the minimum possible clock cycle time using the latency information at the beginning of Exercise 4.7 on p. 370 of your textbook. For these purposes, you should remove all hardware that doesn't pertain to ADDs or is unnecessary if ADD is the only instruction that will ever be executed.

Problem 4: Pipeline performance [50 points]

For this problem, refer to the information at the beginning of Exercise 4.26 on p. 377 of your textbook. For the EX stage, use "EX (full FW)."

Answer the following questions:

  1. [15 points] Based on this information and on the processor diagram posted to Moodle, what is the minimum clock cycle time for a non-pipelined processor? What is the best-case CPI for this processor, and is it possible for the actual CPI to be higher?
  2. [15 points] Based on this information and Figure 4.45 on p. 311, what is the minimum clock cycle time for a pipelined processor? What is the best-case CPI for this processor, and is it possible for the actual CPI to be higher?
  3. [10 points] How many cycles will the non-pipelined processor take to execute 1 million instructions, assuming no hazards? What about the pipelined processor?
  4. [10 points] Which one is faster for this sequence of 1 million independent instructions, and what is its speedup over the slower one?

Problem 5: Pipeline diagrams and data hazards [50 points]

For the following sequence of instructions:

LDUR X10, [SP, #8]
ADD X10, X10, X4
ORR X4, SP, X10
STUR X4, [SP, #8]
EOR X2, X4, X10
  1. [10 points] List the read-after-write data dependencies among these instructions. For each dependency, be very clear which two instructions and which register are involved. You should list all inter-instruction read-after-write dependencies, even if the instructions are far apart.
  2. [10 points] Draw a pipeline diagram showing the execution of this snippet of code under the following assumptions:
  3. [10 points] Under the assumptions in Part B, how many cycles does this sequence of instructions take to execute? What is its overall CPI?
  4. [10 points] Repeat Part B with all possible forwarding paths enabled. Clearly state which forwarding path is used, if any, in each cycle (e.g. EX-EX).
  5. [10 points] Under the assumptions in Part D, how many cycles does this sequence of instructions take to execute? What is its overall CPI?

Problem 6: Control hazards and branch prediction [5 points; the rest is just for practice]

Consider the B.GE branch in the assembly code for factorial on p. 105. For all of the questions below, assume that the factorial function only ever gets called with an argument of 10 (fact(10)). You are considering the steady-state accuracy of this branch, after it has been executed enough times to warm up the branch predictor.

Assume that branches are resolved in the EX stage.

  1. [5 points] What is the accuracy of predicting that the conditional branch will never be taken? How many total stall cycles or cycles of wasted work will we have during the execution of this loop?
  2. [0 points] What is the accuracy of the 1-bit predictor (assuming we have encountered this loop before)? How many total stall cycles or cycles of wasted work will we have during the execution of this loop?
  3. [0 points] What is the accuracy of the 2-bit saturating branch predictor (assuming that we have executed this loop enough times to reach a steady state)?