Pipelining

MIPS Pipelining

Can we go faster?
Instruction Execution =

  1. Read contents from storage elements
  2. Perform computation through some combinational logic
  3. Write result to storage element

Single-cycle processor

Performance

We have to choose the total time of the longest instruction as our cycle time.

But because we choose the longest, some instructions will finish early and waste part of the cycle.
To execute 100 instructions: 100 * 8ns = 800ns
We define the cycle time for a single-cycle processor as the sum of all the stage times:
CT_single = T1 + T2 + ... + Tn (the sum of Tk over the n stages)

Tk: time taken by step/stage k
n: number of stages

The total execution time for I instructions is (number of cycles) * (cycle time) = I * CT_single.
Each instruction takes one cycle to execute.
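As a rough worked example (the 8ns above is assumed here to break down this way): if the step times are IF = 2ns, ID = 1ns, EX = 2ns, MEM = 2ns, WB = 1ns, then CT_single = 2 + 1 + 2 + 2 + 1 = 8ns, and every instruction is charged the full 8ns even if it skips a step (e.g. an R-type that never accesses data memory).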

Multicycle Implementation

- Break up each instruction into execution steps
- Each execution step takes one clock cycle (shrinking the cycle time, increasing the clock frequency)
- Only the stages a particular instruction needs are enabled
- Instructions don't take the same number of cycles to finish

Average CPI is the average number of cycles per instruction across the instruction mix.


  1. Determine the cycle time by choosing the longest step time: CT_multi = max(Tk)
  2. Work out the average CPI for the instruction mix
  3. Total execution time = I * AverageCPI * CT_multi (see the worked example below)
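A hedged worked example, reusing the assumed step times above (so CT_multi = max(Tk) = 2ns) and assuming an illustrative mix in which R-type takes 4 cycles, lw 5, sw 4 and beq 3: with 50% R-type, 25% lw, 15% sw and 10% beq, AverageCPI = 0.5*4 + 0.25*5 + 0.15*4 + 0.1*3 = 4.15, so 100 instructions take 100 * 4.15 * 2ns = 830ns. The shorter cycle does not automatically beat the single-cycle 800ns; it depends on the mix.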

Pipelining

Stages

  1. IF (Instruction Fetch)
  2. ID (Instruction decode and Register Read)
  3. EX (Execute an operation or calculate an address)
  4. MEM (Access an Operand in Data memory)
  5. WB (Write back the result into a register)
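As a sketch of how the stages overlap once the pipeline is full (one instruction completes per cycle after the fill):

Cycle:  1    2    3    4    5    6    7
i1:     IF   ID   EX   MEM  WB
i2:          IF   ID   EX   MEM  WB
i3:               IF   ID   EX   MEM  WB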

Notice that the branch target address (to be computed) and the result to be written back flow from right to left in the datapath, unlike everything else, which flows left to right.

Datapath

We use latches that maintain a stable value while other instructions are working in other stages.
These are pipeline registers: latches that update only on the rising edge of the clock.
These pipeline registers allow us to isolate the stages from one another.
(In the pipeline diagrams, a shaded right half of an element means it is read in that stage.)

IF/ID
Read the instruction memory at the address in the PC and store the instruction into the pipeline register. We also store PC+4. All 32 instruction bits are latched.
In the decode stage, read data 1 and read data 2 come out of the register file, and the 16-bit immediate is sign-extended to 32 bits.

ID/EX
- Register values read
- 32-bit sign-extended immediate value
- PC+4

EX/MEM
PC+4 is used to calculate the branch target address.
Why can't we just route the value across directly? Because this is a pipelined processor, every component is busy with a different instruction in every cycle; we cannot hand a value to a component that may be working on something else at that moment.
- ALU Result
- isZero signal
- Read data 2 from the register file

MEM/WB
If it is a store word instruction, the ALU result is used as the address for the memory store.
- ALU result
- Memory read data

END (WB stage)
The write-back value is either the ALU result or the memory read data; it is passed back to the register file.
Needed for the write-back:
- Data to be written
- Destination register number

We realise the destination register number may be wrong if we simply take it from the instruction currently sitting in the decode stage: with pipelining, a later instruction occupies decode by the time this one reaches write-back, so that field would have been overwritten.

Solution:
We pass the destination register number along through the pipeline registers, stage by stage, until the instruction reaches the WB stage.

Pipeline control

Use the same control signals as the single-cycle datapath.
The difference: when each signal is actually used.

The control unit generates the signals during decode and passes them along the pipeline registers until they reach the stage where they are needed.

Grouping

We ask when each signal is used and group the signals by the stage in which they are used.
We group them so we know when each signal has to be sent from the pipeline control to the latches (see the typical grouping below).
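For the standard MIPS control signals, the usual textbook grouping (listed here as an assumption, not something specific to these notes) is:
- EX stage: RegDst, ALUOp, ALUSrc
- MEM stage: Branch, MemRead, MemWrite
- WB stage: MemtoReg, RegWrite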

Pipeline Performance

Td: pipeline overhead, e.g. the pipeline register delay
max(Tk): longest stage time among the N stages

Cycle time:
CTpipe = max(Tk) + Td
Total time for I instructions = CTpipe * (I + N - 1), because the first instruction needs N cycles to get through and every later instruction completes one cycle after the previous one.
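Continuing the assumed numbers from above: max(Tk) = 2ns, and say Td = 0.5ns, N = 5, I = 100. Then CTpipe = 2.5ns and the total time is 2.5ns * (100 + 5 - 1) = 2.5ns * 104 = 260ns, versus 800ns for the single-cycle design.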

But how much better can the speedup get?

Ideal speedup

- Each stage takes almost the same time
- So the sum of the stage times ≈ max(Tk) * N (the number of stages)
- Assume there is no pipeline overhead (Td = 0)
- The number of instructions I is much larger than the number of stages N

These assumptions also show where a pipeline loses performance: unbalanced stages, register overhead, and short instruction sequences.


Speedup = time (single cycle) / time (pipe)
        = (I * sum of Tk) / ((I + N - 1) * (max(Tk) + Td))
        ≈ N under the ideal assumptions (balanced stages, Td = 0, I much larger than N)
Conclusion: to increase the ideal speedup, increase N, the number of stages.
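With the assumed numbers above the ideal is not reached: speedup = 800ns / 260ns ≈ 3.1 instead of 5, because the stages are unbalanced (2ns max against an 8ns total) and Td is not zero; even as I grows the ratio only approaches 8ns / 2.5ns = 3.2.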

Pipeline Hazard

3 Problems:
- Structural hazard: caused by a conflict over hardware resources
- Data hazard: caused by a data dependency
- Control hazard: caused by a change in program flow

Structural Hazard

Say we only have one single memory module.
As instructions go through the pipeline, check whether the same hardware is needed twice in one clock cycle.
We cannot have two instructions using the same hardware in the same cycle.

Stall the pipeline

Ask the later instruction to stall until the conflict goes away, then continue. The stalled slot is filled with a NOP (e.g. or $0, $0, $0), which does nothing for one cycle.
But this creates a bubble in the pipeline and wastes time.


Separate Memory
Split memory into 2 separate modules: data memory and instruction memory. A load like LW uses the data memory while instruction fetch uses the instruction memory.

With this split, even when both accesses happen in the same cycle they hit two separate modules, so memory can effectively be accessed twice per cycle.


Read after Write register

Occurs when a later instruction reads from a destination register written by an earlier instruction (a true data dependency).


The register file is fast enough that it can serve two instructions in one cycle: the write happens in the first half of the cycle (on one clock edge) and the read in the second half (on the other), so a register written and read in the same cycle returns the new value.


Data Hazard (Data dependency)

When a relationship between instructions prevents them from being overlapped in the pipeline.

RAW (Read after Write)

i1: add $1, $2, $3
i2: sub $4, $1, $5

If the processor does not handle this properly, i2 will read the old value of $1 rather than the updated one.
There is no way to remove the dependency itself (it is a true data dependency); we can only work around it.
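To see the timing (a sketch, assuming i2 enters the pipeline right behind i1):

Cycle:    1    2    3    4    5
i1 add:   IF   ID   EX   MEM  WB     ($1 written in cycle 5)
i2 sub:        IF   ID   EX   MEM    ($1 read in cycle 3)

i2 reads $1 two cycles before i1 writes it back, so something extra (forwarding or stalling) is needed.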

Forwarding

- Forward the result to any trailing instruction before it is reflected in the register file
The result first exists right after the ALU and is already sitting in a pipeline latch (EX/MEM or MEM/WB), so we take it from there
- Bypass the (stale) data read from the register file

What if the instruction is a load word?

The loaded value only arrives at the end of the MEM stage, one stage later than an ALU result, but the very next instruction needs it at the start of its EX stage in that same cycle. Forwarding alone cannot deliver it in time.
We have no choice but to insert a one-cycle stall and then forward from MEM/WB (see the example below).
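A hypothetical load-use example in the same style as above (register numbers chosen only for illustration):

i1: lw  $1, 0($2)
i2: add $3, $1, $4

i2 must wait one bubble before its EX stage, and then the loaded value is forwarded to it from the MEM/WB latch.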

Control hazard (Control dependency)

Whether an instruction gets executed can depend on an earlier instruction (e.g. a branch).
If control is not handled properly, instructions that should never have run can execute and incorrectly change values.

e.g.
beq $1, $3, 7    target = (PC + 4) + (7 * 4)
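Using the addresses implied by the walk-through below (the beq sits at address 40, so PC + 4 = 44): target = 44 + 7*4 = 44 + 28 = 72.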

IF
The branch is stored into the IF/ID latch at the end of the cycle; the PC becomes 44, the address of the next instruction.
The branch then moves to the decode stage and the PC becomes 48, while the instruction at 44 moves into the IF stage.
By the time we decide whether to take the branch, other instructions are already inside the processor. -> TOO LATE

Stall
We would need to stall until the branch reaches the MEM stage (3 cycles).
But a 3-cycle penalty on every branch is very heavy.

Early Branch resolution
Make the decision in ID stage instead of MEM

1. Calculate branch target address 
2. Check the condition of a branch instruction

We move this work earlier and make the decision in the ID stage rather than waiting until the EX/MEM stages.
With the outcome known at the end of cycle 2 (the branch's decode stage), the branch delay is reduced to 1 clock cycle.

But if the branch compares a register produced by the immediately preceding ALU instruction, normal forwarding cannot fix it (the value would have to travel back in time to the branch's decode stage).
So we stall as well, and we add a new forwarding path from the ALU output to the ID-stage comparison for branch instructions.
This adds another clock cycle of delay.

We also realise that when a load is immediately followed by a branch that uses the loaded register, the correct value only exists after the memory has been read, so it has to be forwarded from the MEM/WB latch; relative to the branch's decode stage that is going back in time again, so we need a double stall.
This adds yet another clock cycle of delay.

-> Total of 3 clock cycles in this worst case

But the compiler can do something useful here: it can fill the stall cycles with other, independent instructions.
There is no other way to improve this scheme further.



Branch Prediction
We predict that all branches are not taken and keep fetching sequentially, treating each branch as if it falls through.

Branch not taken:
Guess is correct
-> no pipeline stall

Branch taken:
Guess is wrong
-> the wrong instructions were fetched
-> flush the successors
We flush by setting the affected pipeline latches to 0 (turning those instructions into NOPs).
It is safe to let them enter the pipeline in the first place: MEM and WB come in the later stages, so the killed instructions never reach the stages that change state.
Total instructions = 1 + 10*2 (a 2-instruction loop body run 10 times) + 1 = 22
Ideal pipeline cycles = 4 + 22*1 = 26  (only the first instruction pays the 4-cycle fill overhead; the code is run once)
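A hedged illustration of the misprediction cost (assuming the loop-closing branch is taken on all 10 iterations and each wrong guess flushes 3 instructions, as in the stall-until-MEM case above): predict-not-taken would spend roughly 26 + 10*3 = 56 cycles, while early resolution in ID (1-cycle penalty per taken branch) would spend roughly 26 + 10*1 = 36 cycles.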

Delayed Branching
Since we are waiting for those cycles anyway, why not do something useful meanwhile.
The cycle(s) right after the branch form the "branch-delay slot".
We need to find another instruction to move into the slot.

e.g
When we see a beq, we need to delay 1 cycle.
We look around the branch for an independent instruction and move it into the slot so it executes during the delay.
That instruction would have been executed regardless of the branch outcome.

e.g
or   xxxxxxxx
beq  xxxxxxxx, exit
(empty branch-delay slot for an independent instruction here)
xor  xxxxxxxxxx
exit:

- Because the or does not depend on the branch and will execute anyway, it is moved into the slot (see the rearranged sequence below).
- So while the beq is being resolved, the processor runs the or.
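The scheduled version of the example above would then look like this (same placeholder operands):

beq  xxxxxxxx, exit
or   xxxxxxxx        <- now in the branch-delay slot, executed whether or not the branch is taken
xor  xxxxxxxxxx
exit: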

But what if we can't find an independent instruction?
- Fill the slot with a NOP (the cycle is wasted).