Tuesday, July 12, 2011

Delay Slots

Delay slots are an artifact of some early pipelined architectures in which  pipeline hazards were not handled explicitly. I was puzzled for while by some unexpected assembly produced by gcc while working on my  own implementation of the MIPS ISA,  further investigation yielded the following results about the branch delay and the load delay slots, both of them occurred in early MIPS architectures.

Load Delay Slot
The load word instruction (lw) loads a word from memory to the specified register, because of the pipelined nature of the architecture, the next instruction(s) execute concurrently with the current instruction,  if the following instruction uses the lw destination register as  one of its source registers then it cannot continue before the lw data is fetched from memory and written back to the destination register, otherwise, it will read invalid data. For example, in the following code snippet, the add instruction uses $s0 as one of its source registers, if it were to read $s0 before it's written by the lw instruction it  would read the old value of $s0.
        lw $s0, 4($0)
        add $s2, $s0, $0

This peculiarity is called a Hazard, specifically a data hazard. Data hazards can be handled by either stalling the pipeline or with register forwarding. The pipeline can be stalled with a nop,  which is sometimes called a bubble because it propagates through the whole pipeline causing every stage to be idle. Register Forwarding, on the other hand, forwards the result of an instruction from the current stage to the previous stage (i.e to the next instruction) bypassing the pipeline, the following image shows register forwarding from the first instruction to the second and third, the following instructions can read the register normally:

If the hazard is not handled by the data path, the assembler introduces a delay slot which it later fills with either a nop or by re-ordering the instructions, if it can find something useful to fill in the delay slot with. This is what the disassembled code looks like, the assembler fills the delay slot with a nop:
00000000 <main>:
   0: 8c100004  lw s0,4(zero)
   4: 00000000  nop
   8: 02008820  add s1,s0,zero

If we change the  add instruction  such that it's not dependant on the lw instruction any more, therefore it can execute in parallel, the assembler removes the nop:
00000000 <main>:
   0: 8c100004  lw s0,4(zero)
   4: 02408820  add s1,s2,zero

Branch Delay Slot
Similarly, due to the parallel execution of the instructions, by the time the branch target gets resolved the following instruction would have been fetched, in  other words, the instruction following a branch always executes whether the branch is taken or not.

This type of hazard is called a control hazard, modern architectures use a branch predictor to avoid flushing the pipeline, it gets flushed only in the case of a branch misprediction. Early MIPS didn't handle this, obviously, and the assembler needed to introduce yet another delay slot and fills it with either a nop or, if possible, by re-ordering the instructions. In the following example, if the branch were to execute without a delay slot it would execute the last addi instruction at each step of the loop which would never finish (because $s0 and $s2 are both incremented)
        addi $s0, $0, 1 
        add $s1, $s0, $0
        bne $s0, $s2, loop    
        addi $s2, $0, 1

Note that the second add instruction does not affect the branch target so its order is irrelevant and can come before or after the branch, the following disassembled code shows how the assembler re-ordered the instructions and used the add instruction to fill the delay slot:
00000000 <loop>:
   0: 20100001  addi s0,zero,1
   4: 1612fffe  bne s0,s2,0 <loop>
   8: 02008820  add s1,s0,zero
   c: 20120001  addi s2,zero,1
If the branch was, otherwise, dependant on the add instruction, its order will  not be changed and the assembler will use a nop to fill the delay slot instead. If we modify the and instruction and have it write to $s2 making the branch dependant on the and result it will use a nop:
00000000 <loop>:
   0: 20100001  addi s0,zero,1
   4: 2012ffff  addi s2,zero,-1
   8: 1612fffd  bne s0,s2,0 <loop>
   c: 00000000  nop
  10: 20120001  addi s2,zero,1
Read more ...