Skip to content

Lec 04/03/2026: MIPS Pipelining, Parallelism, and Hazards -- CSCI 343

Pipelining is how MIPS achieves instruction-level parallelism — instead of finishing one instruction before starting the next, multiple instructions overlap in the pipeline, each moving through a different stage per clock cycle. The result is more instructions completed per unit time, without changing how fast any individual instruction runs.

The main concepts we’ll cover are:

  • The five-stage instruction cycle
  • Parallelism through pipelining
  • Hazards in pipelining: structural, data, and control
  • Techniques to handle data hazards: stalling, forwarding, and delayed branching
  • Why MIPS is ideal for pipelining
  • Pipeline timing and speedup
  • Hardware support for pipeline optimization

Pipelining increases throughput by overlapping stages of multiple instructions. Each pipeline stage operates on a different instruction during the same clock cycle, so there’s no conflict in having multiple instructions in-flight at once. For this to hold, two things must be true:

  • No two instructions can use the same hardware unit in the same clock cycle — if two instructions both needed the ALU at once, one would have to wait, collapsing the overlap
  • All instructions must progress through the stages in order — out-of-order movement would create unpredictable dependencies that the hardware can’t easily track
  • Tasks (stages) for completing a load of laundry: wash → dry → fold → store
  • Instead of waiting for one load to finish before starting another, each stage is continuously active on a different load
  • Pipeline performs like a laundry assembly line: once the line is full, each step outputs a finished product every cycle
    From Patterson & Hennessy

Each MIPS instruction may go through the following five stages:

  1. Instruction Fetch (IF) - Access instruction memory
  2. Instruction Decode (ID) - Decode and read register operands
  3. Execute (EX) - Perform ALU operations
  4. Memory Access (MEM) - Read/write data memory (only for load/store)
  5. Write Back (WB) - Write result to register file

Not every instruction uses every stage actively, but all must pass through each stage in a pipelined design.

Some instructions like R-format skip MEM, and store instructions skip WB. Branch instructions may skip both MEM and WB. However, all must move through all stages to maintain pipeline structure and timing.

Each stage uses a different hardware component in the MIPS datapath: IF uses the instruction memory (IM), ID reads from the register file (Reg), EX runs the ALU, MEM accesses data memory (DM), and WB writes back to the register file. Since no two stages share the same physical component, multiple instructions can be in-flight at once without interfering with each other.

MIPS pipeline stages mapped to hardware components: IM, Reg, ALU, DM, Reg

Note that the register file appears at both ends: ID reads the source registers, and WB writes the result back. These use different ports of the register file, and since WB writes in the first half of a clock cycle while ID reads in the second half, they can safely share the same cycle without conflict.

The diagram below shows how pipelining reduces total clock cycles:

Pipelining Diagram

  • In a non-pipelined (serial) approach, each instruction must complete before the next begins. With five instructions and five stages per instruction, this would take 25 cycles.
  • In a pipelined CPU, overlapping stages allows all five instructions to complete in just 9 clock cycles.
  • Throughput Improvement: Completes more instructions per unit time (not faster individual instructions)
  • Efficiency: Better hardware utilization compared to single-cycle CPUs

  • Pipeline clock cycle is set to the slowest stage (typically MEM or EX)
  • With each instruction taking 5 stages, a pipeline allows completion of one instruction per cycle after the pipeline is full

Pipeline Time:

(n+k1)×t(n + k - 1) × t

where:

n=number of instructionsk=number of stages (5 for MIPS)t=clock cycle time (determined by slowest stage)\begin{align*} n &= \text{number of instructions} \\ k &= \text{number of stages (5 for MIPS)} \\ t &= \text{clock cycle time (determined by slowest stage)} \end{align*}

The formula follows from the fact that the first instruction takes k cycles to work through all stages, and each additional instruction adds one more cycle.

So for 5 instructions through a 5-stage pipeline:

(5+51)×t=9t(5 + 5 - 1) × t = 9t

In the pipeline visualization above:

  • The first few cycles are the fill-up phase, where stages are gradually occupied.
  • Once full, one instruction completes every cycle.
  • Once all instructions have entered, the pipeline drains one stage at a time.
  • Ideal Speedup ≈ number of stages (5), but actual speedup is less due to hazards and pipeline filling/draining overhead
ModelInstructionsClock CyclesClock TimeTotal Time
Single-cycle CPU55 cycles800 ps5 × 800ps = 4000 ps
Pipelined CPU59 cycles200 ps9 × 200ps = 1800 ps
  • The pipeline needs a few cycles to fill at the start and drain at the end, so the performance gain is smaller for short programs.
  • With large programs (e.g., 10,000+ instructions), the pipeline stays fully utilized and that fill/drain overhead becomes negligible.
  • The performance gain comes from throughput — more instructions finished per unit time — not from making any single instruction faster.

MIPS was designed from the start with pipelining in mind. Several of its design decisions directly make pipelining easier to implement:

  • Fixed-length instructions (32 bits) — every instruction is the same size, so fetching and decoding are simple and uniform
  • Regular instruction formats — fields like opcode, rs, and rt always sit in the same bit positions across all instruction types; this means the CPU can start reading source registers from the register file before the opcode is even decoded
  • Load/store architecture — only lw and sw touch memory, so the MEM stage only ever does real work for two instruction types
  • Aligned memory accesses — every memory access is word-aligned, so a single cycle is always enough for MEM

Together, these keep each pipeline stage simple and predictable, with no complicated special-casing in the control logic.


A pipeline hazard is anything that stops the next instruction from starting in the next clock cycle.

Prof kinda glossed over this — said don’t worry too much about it.

  • Cause: Two instructions attempt to use the same hardware component simultaneously
  • Solution: MIPS avoids these by design (e.g., separate instruction/data memory)
  • Cause: Dependency between instructions (e.g., one instruction needs a result from another still in progress)
  • Types:
    • RAW (Read-After-Write): Most common

      e.g.

      The sub instruction needs the value of $t1 produced by the add instruction:

      add $t1, $s1, $s2
      sub $s0, $t1, $t2

The diagram below shows how the two instructions are scheduled to run, one right after the other — the instructions overlap in the pipeline as is our intent for maximizing performance, but with a subtle issue: a data hazard. add starts at cycle 1 and sub starts one cycle later — so at cycle 3, while add is in its EX stage, sub is in its ID stage. This is the root of the hazard: sub needs $t1 at its EX stage (CC4) to feed into the ALU, but add won’t write it back to the register file until WB (CC5) — the hazard here is not that this would crash, but that we risk sub using the old (stale) value of $t1 instead of the new one that we intend as the result from add. In order to avoid this hazard, we need to update the behavior of our hardware to ensure that sub gets the correct value computed by add.

RAW data hazard pipeline: add followed by sub, dependency line from add WB (CC5) backward to sub EX (CC4), showing the hazard

  • Stalling (Bubbles)

    • Pipeline inserts bubbles to delay the dependent instruction.
    • Bubbles represent NOPs (no-ops / no operations) — empty operations that pass through the pipeline without reading, writing, or computing anything. They serve to push subsequent instructions forward by one cycle.
    • Implemented using a hazard detection unit, which detects the dependency and automatically inserts bubbles.

    With two nops inserted, add’s WB and sub’s ID both land at CC5. Since WB writes in the first half of that cycle and ID reads in the second half, they can safely share it — sub reads the correct value from the register file and the hazard is resolved. This half-cycle sharing is also why 2 nops suffice here rather than 3: without it, sub’s ID would have to land strictly after add’s WB.

    RAW data hazard resolved with two nop instructions: nop bubbles space out the instructions so sub reads the correct value from the register file

  • Forwarding (Bypassing)

    • The computed result physically exists on the data wire right after EX completes — it just hasn’t been written to the register file yet.
    • With a forwarding implementation, the dependent instruction’s decoder will still read the old (stale) register value. However, the forwarding unit detects the dependency and overrides that value with the one already sitting on the EX/MEM data line.
    • The dependent instruction doesn’t stall — it proceeds normally, getting the correct value substituted in at the ALU input.
    • Done using a forwarding unit and multiplexers that select between the register file output and the forwarded value.
    • This solution is ideal in our goal of maximizing performance (where we want all our instructions to complete as early as possible), since it eliminates the need for stalls and allows instructions to proceed without delay. However, it does require additional hardware complexity to implement the forwarding paths and control logic.

    We consider that add’s result is computed at the end of its EX stage. Without forwarding, sub can’t use it until add writes it back to the register file at WB, which is two cycles later. With a forwarding unit, there’s a direct hardware path that routes the EX result (now held in the EX/MEM pipeline register) straight to the ALU input of sub’s own EX stage, bypassing the register file. This is an EX-to-EX forward: add’s result is available at the end of CC3 and sub needs it at the start of CC4, so the dependency line is vertical at the CC3/CC4 boundary — no stall needed.

    RAW data hazard with EX-to-EX forwarding: vertical dependency line at CC3/CC4 boundary

This is a special case of a data hazard. Unlike an R-type instruction (where the result exists on the data wire right after EX), a lw’s EX stage only computes the memory address — the actual loaded value doesn’t exist until the end of MEM. This rules out forwarding from EX, since there’s nothing there yet to forward. If sub follows lw immediately, sub’s EX stage starts at the same cycle as lw’s MEM — one cycle too early to receive the value. Inserting exactly one stall shifts sub’s EX forward by one cycle so it aligns with lw’s MEM output, at which point forwarding delivers the value just in time.

e.g.

lw $s0, 0($t0)
sub $t1, $s0, $t2 # Requires 1 stall cycle + forwarding

Solutions:

  • Stall (nop/bubble): Insert a nop between lw and sub to create a gap for the loaded value to become available.
  • Code Scheduling (instruction reordering): The compiler reorders instructions so that an unrelated instruction fills the gap instead of a bubble — eliminating the stall with no extra hardware. Essentially: do loads early, do other work, then use the loaded values.

The diagram below shows the load-use hazard. The dot on lw marks the end of its MEM stage — the earliest point when the loaded value is available. The dot on sub marks the end of its ID stage — the point where it needs the value to proceed into EX. The line connecting the two dots shows the dependency. Notice that sub’s dot (end of cycle 3) falls to the left of lw’s dot (end of cycle 4): sub needs the value one cycle before lw can produce it, which is exactly why a stall must be inserted.

Load-use hazard pipeline: lw followed by sub, dependency line from end of lw MEM to end of sub ID

With a nop inserted between lw and sub, both the end of lw’s MEM and the end of sub’s ID land at the same cycle boundary (CC4). The dependency line is now vertical — the loaded value arrives exactly when sub needs it, so forwarding can deliver it with no further stalling.

Load-use hazard resolved with nop: vertical dependency line at CC4 boundary

  • Cause: Branches alter instruction flow, but the next instruction is fetched before branch resolution
  • Flush: Discard the incorrectly-fetched instruction when the branch resolves the other way
  • Branch prediction: Guess the likely path to avoid stalling; dynamic algorithms can improve accuracy. However, MIPS takes the simpler approach below.
  • MIPS Approach - Delayed Branching:
    • Always fetch the next instruction after the branch
    • The next instruction (delay slot) is always executed, regardless of branch result
    • Compiler reorders code to place useful instructions in the delay slot
    • Static prediction: MIPS assumes branches are not taken

Pipeline registers sit between stages and hold the outputs of each stage until the next stage is ready to consume them. This is what makes overlapping execution possible — each stage can move forward without waiting on the others, because the values are buffered in between.

Pipelined datapath showing IF/ID, ID/EX, EX/MEM, and MEM/WB inter-stage registers

  • IF/ID (64 bits): holds the 32-bit fetched instruction and the 32-bit PC+4 value
  • ID/EX (128 bits): holds PC+4, two 32-bit register read values, the 32-bit sign-extended immediate, and destination register number
  • EX/MEM (97 bits): holds the ALU result, zero flag, write data for stores, and destination register number
  • MEM/WB (64 bits): holds the 32-bit data read from memory (or ALU result) and the 5-bit destination register number, to be written back
  • Contains multiplexers that bypass pipeline registers to feed results directly to dependent stages
  • Used to resolve most RAW data hazards without stalling
  • Detects RAW hazards that forwarding alone can’t resolve (e.g., load-use) and inserts NOPs (bubbles) to stall the pipeline
  • Works in tandem with the forwarding unit: forwarding eliminates most stalls; the hazard detection unit handles the cases where it can’t
  • Modified in a pipelined CPU to coordinate stalls, instruction flushing on mispredicted branches, and pipeline register updates

Consider the following five instructions:

lw $t0, 0($s0) # I1
add $t1, $s1, $s2 # I2
sub $t2, $t0, $s3 # I3
or $t3, $s5, $s6 # I4
add $t4, $s7, $t5 # I5

I3I_3 depends on I1I_1 ($t0), but there is one independent instruction (I2I_2) in between. That gap is enough: by the time I3I_3 reaches its EX stage, I1I_1 has finished MEM and the loaded value can be forwarded directly — no stall needed. All five instructions enter the pipeline back-to-back.

Pipeline diagram for 5 instructions (lw, add, sub, or, add) entering back-to-back with no stalls

Assume the following component delays:

ComponentDelay
Memory unit (IM, DM)200 ps
ALU, adders100 ps
Register file (read or write)50 ps

The pipeline clock is set to the slowest individual stage, which is the memory unit at t=200pst = 200\,\text{ps}. With n=5n = 5 instructions and k=5k = 5 stages:

(n+k1)×t=(5+51)×200ps=9×200ps=1800ps(n + k - 1) \times t = (5 + 5 - 1) \times 200\,\text{ps} = 9 \times 200\,\text{ps} = 1800\,\text{ps}

In a single-cycle CPU, the clock is fixed to the longest instruction:

R-format (add, sub, or)

StageComponentTime
IFRead instruction memory200 ps
IDRead register file50 ps
EXAccess ALU100 ps
WBWrite register file50 ps
Total400 ps

lw

StageComponentTime
IFRead instruction memory200 ps
IDRead register file50 ps
EXAccess ALU (compute address)100 ps
MEMRead data memory200 ps
WBWrite register file50 ps
Total600 ps

lw is the longest at 600 ps, so the clock is set to 600 ps and every instruction takes one full cycle regardless of type:

Single-cycle time=5×600ps=3000ps\text{Single-cycle time} = 5 \times 600\,\text{ps} = 3000\,\text{ps} Speedup=3000ps1800ps1.67\text{Speedup} = \frac{3000\,\text{ps}}{1800\,\text{ps}} \approx 1.67

So the pipelined approach is about 1.67 times faster than the single-cycle baseline for this sequence of instructions, due to the overlapping execution and the ability to use a shorter clock cycle.


  • Hennessy & Patterson textbook chapters on pipelining, data hazards, and control hazards
  • PowerPoint slides posted on Brightspace for pipeline and hazard diagrams
  • Study older student presentations (e.g., by Steven Trowbridge & Raymond Kendall)
  • Practice interpreting pipeline diagrams like the one above