Saturday, December 20, 2025

Separating the Processor from the Instruction Set: A Case Study using RISC-V

Some time ago, as I was beginning to learn about CPUs and register transfer level (RTL) design, I asked myself a question: could a generic processor design be reused for different instruction set architectures (ISAs)? Could a high-level description of a CPU be made flexible and parameterizable enough to fit RISC-V, MIPS, or some new specification? It may not be possible with traditional hardware description languages like SystemVerilog or VHDL, but the advent of newer languages like Chisel, which is embedded in a general-purpose programming language (Scala), inspired me to try.

So far, I've managed to create a basic pipelined processor with partial support for RISC-V integer, logic, and control instructions. The RISC-V-specific parts are separate from the processor implementation. For example, the instruction ADDI, which adds an immediate value to a register value and stores the result in a destination register, is defined below:

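  class ADDI extends RISCVInstructionLogic("addi") {
    // ADDI reads a single register operand
    override val numOps : Int = 1
    // Match the OP-IMM opcode, function field 000, and a nonzero destination register
    override def decode(instr : UInt) : Bool = {
      isOpImm(instr) &
      (getFunc(instr) === "b000".U) &
      (getIDest(instr) =/= 0.U)
    }
    // Register file index of the source operand
    override def getRFIndex(instr : UInt, opIndex : Int) : UInt = {
      getISrc(instr)
    }
    // ADDI always writes its result to the register file
    override def writeRF(instr : UInt) : Bool = true.B
    // Register file index of the destination
    override def getWriteIndex(instr : UInt, ops : Vec[UInt]) : UInt = {
      getIDest(instr)
    }
    // Result: signed sum of the register operand and the immediate
    override def getData(instr : UInt, pc : UInt,
      ops : Vec[UInt]) : UInt = {
      (ops(0).asSInt + getIImm(instr).asSInt).asUInt
    }
  }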
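To make the field extraction concrete, here is a minimal plain-Scala sketch (no Chisel required) of what helpers like isOpImm, getFunc, getIDest, getISrc, and getIImm presumably compute for an I-type instruction. The function names mirror the ones used above, but the bodies are my assumption based on the RISC-V I-type encoding, not code copied from the project.

```scala
// Software model of RISC-V I-type field extraction on a 32-bit instruction word.
// The helper bodies are assumptions based on the I-type layout:
// imm[11:0] | rs1 | funct3 | rd | opcode
object ITypeFields {
  val OpImm = 0x13 // opcode 0010011, shared by ADDI, ANDI, ORI, ...

  def isOpImm(instr: Int): Boolean = (instr & 0x7f) == OpImm
  def getFunc(instr: Int): Int = (instr >>> 12) & 0x7   // funct3
  def getIDest(instr: Int): Int = (instr >>> 7) & 0x1f  // rd
  def getISrc(instr: Int): Int = (instr >>> 15) & 0x1f  // rs1
  def getIImm(instr: Int): Int = instr >> 20            // arithmetic shift sign-extends imm[11:0]

  // Same three conditions as the decode method of the ADDI class
  def isAddi(instr: Int): Boolean =
    isOpImm(instr) && getFunc(instr) == 0 && getIDest(instr) != 0

  // Same arithmetic as getData: register operand plus sign-extended immediate
  def addiResult(instr: Int, rs1Value: Int): Int = rs1Value + getIImm(instr)
}
```

For instance, the word 0x00800f93 (listed as `li t6,8` in the disassembly later in this post, which is an ADDI with the zero register as its source) decodes to destination register 31, source register 0, and immediate 8.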

This instruction and the other instructions are collected in a list inside a subclass of Instructions:

  class RISCVInstructions extends ProcessingModule.Instructions {
    override def logic =
      new ADDI ::
      new AUIPC ::
      new BEQ ::
      new BNE ::
      new BGE ::
      Nil
  }
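As a rough software analogy for how a generic processor can use such a list without knowing anything about RISC-V, the sketch below picks the first definition whose decode function matches an instruction word. InstrLogic, Addi, Beq, and select are hypothetical stand-ins written for this example; in hardware the same idea would presumably become per-instruction match signals that select between the results.

```scala
// Software analogy only: the processor side sees a list of definitions
// and asks each one whether it recognizes the instruction word.
object InstrSelect {
  // Hypothetical, simplified stand-in for an instruction definition
  trait InstrLogic {
    def name: String
    def decode(instr: Int): Boolean
  }

  object Addi extends InstrLogic {
    val name = "addi"
    // OP-IMM opcode (0x13) with funct3 = 000
    def decode(instr: Int): Boolean =
      (instr & 0x7f) == 0x13 && ((instr >>> 12) & 0x7) == 0
  }

  object Beq extends InstrLogic {
    val name = "beq"
    // BRANCH opcode (0x63) with funct3 = 000
    def decode(instr: Int): Boolean =
      (instr & 0x7f) == 0x63 && ((instr >>> 12) & 0x7) == 0
  }

  val logic: List[InstrLogic] = Addi :: Beq :: Nil

  // ISA-agnostic part: no RISC-V knowledge beyond the list itself
  def select(instr: Int): Option[String] =
    logic.find(l => l.decode(instr)).map(l => l.name)
}
```

Swapping in a different list is all it would take to retarget the select function, which is the property the real design is after.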

Finally, the instructions and pre-trap vector (described below) are contained in a subclass of ProcessingModule. Other parameters like register data width and register file depth are also defined. This RISCVProcessingModule class can be simulated, synthesized, and/or instantiated within a larger system.

  class RISCVProcessingModule (rfDepth : Int)
      extends ProcessingModule.ProcessingModule(
        dWidth = 32,
        dAddrWidth = 32,
        iWidth = 32,
        pcWidth = 32,
        pcAlign = 3,
        numOps = 2,
        opWidth = 32,
        rfDepth = rfDepth) {

    override def initInstrs = new RISCVInstructions

    override def initPreTrapVector = 0x2000.U(32.W)
  }

The source code for this work is available on GitHub. The processor implementation is described in the following sections.

Pipeline

At this point, ProcessingModule is a simple five-stage pipeline. The instruction fetch stage contains the program counter and reads instructions from instruction memory. Next, the decode stage parses each instruction and reads the data it needs from the registers. If the instruction is a branch, it is also resolved in the decode stage and the outcome is fed back to the fetch stage. Otherwise, the register and/or immediate data is forwarded to the execute stage, which computes the result. If the result needs to be stored in a register, it is fed back to the decode stage. If data memory needs to be read or written, the result goes on to the load-store stage. For memory reads, the data read from memory is then written back to the registers in the decode stage. Bounding each stage with registers decreases the combinational logic depth, which allows a higher maximum clock frequency (fmax) and therefore higher instruction throughput.
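The effect of pipeline registers on fmax can be illustrated with a back-of-the-envelope model: the clock period must cover the slowest stage's combinational delay plus the register overhead (clock-to-output plus setup time). The sketch below is only that arithmetic; the delay numbers in the usage note are made up for illustration and are not measurements of this design.

```scala
// Rough timing model: all delays in nanoseconds, result in MHz.
// regOverheadNs lumps together register clock-to-output and setup time.
object FmaxModel {
  def fmaxMHz(stageDelaysNs: Seq[Double], regOverheadNs: Double): Double =
    1000.0 / (stageDelaysNs.max + regOverheadNs)
}
```

With a hypothetical 10 ns of logic in a single stage and 0.5 ns of register overhead, this gives roughly 95 MHz. Splitting the same logic evenly into five 2 ns stages gives 400 MHz, which is why the slowest stage, not the total logic, sets the clock rate.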

Instruction Fetching

One of the challenges I faced was handling delays from memory. I decided not to design a tightly coupled interface with predictable timing because I want to experiment with memory caching in the future. On a cache hit, a memory access could take only a single clock cycle, but on a miss, the access could remain outstanding for several cycles, and the pipeline has to stall in the meantime. In the fetch stage, the PC values of outstanding fetches are held in a queue, which is popped when the memory responds with data. In addition, if there's backpressure from the following stages while waiting for instruction memory, an output queue buffers PC-instruction pairs until those stages are ready. Both queues are instances of a parameterizable module of my own design that can be cleared in a single cycle. They need to be cleared on a branch or jump because the instructions currently being fetched are then invalid and must not propagate down the pipeline.
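The bookkeeping for outstanding fetches can be sketched as a small software model: a bounded queue of PCs that is pushed when a fetch is issued, popped when memory responds, and emptied on a branch or jump. PcQueue and its method names are hypothetical and only model the behavior described above. The real queue is a hardware module, and the sketch does not model what happens to responses that arrive after a flush.

```scala
import scala.collection.mutable

// Behavioral model of the PC queue in the fetch stage (names are hypothetical)
object FetchQueueModel {
  class PcQueue(depth: Int) {
    private val pcs = mutable.Queue.empty[Long]

    // Issue a fetch: remember its PC, or refuse if the queue is full
    def issue(pc: Long): Boolean =
      if (pcs.size < depth) { pcs.enqueue(pc); true } else false

    // Memory responded: pair the instruction word with the oldest pending PC
    def respond(instr: Long): Option[(Long, Long)] =
      if (pcs.isEmpty) None else Some((pcs.dequeue(), instr))

    // Branch or jump taken: forget every pending fetch at once
    def flush(): Unit = pcs.clear()

    def pending: Int = pcs.size
  }
}
```

A depth of 2, for example, lets two fetches be in flight; a third issue is refused until memory responds or the queue is flushed.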

Control and Status Registers

RISC-V has an extension that specifies a number of control and status registers (CSRs) divided into several privilege levels. These registers contain information about the state of the processor and can be modified by software to change its behavior. These registers are supposed to be in their own address space, but I left them in the main data memory for now. Some CSRs that have not been implemented yet will need to be automatically updated by the processor. One example is the CSR mcause, which stores the type of exception that occurred last. Since CSRs are specific to an ISA, I plan to add an optional pipeline stage at the end that contains custom logic specified by the user. This stage could contain the CSRs and the logic to update them.
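As an illustration of the kind of logic such a stage might hold, here is a plain-Scala sketch that records an ECALL in a CSR map. The CSR addresses (mepc at 0x341, mcause at 0x342) and the environment-call cause codes (8, 9, and 11 for user, supervisor, and machine mode) come from the RISC-V privileged specification; the Privilege type, the map-based CSR storage, and the onEcall function are assumptions made for this example, not part of the current design.

```scala
// Sketch of automatic CSR updates on an ECALL (structure is hypothetical)
object CsrUpdateModel {
  sealed trait Privilege
  case object User extends Privilege
  case object Supervisor extends Privilege
  case object Machine extends Privilege

  val Mepc = 0x341   // CSR address of mepc
  val Mcause = 0x342 // CSR address of mcause

  // Exception codes for "environment call" from each privilege mode
  def ecallCause(mode: Privilege): Int = mode match {
    case User       => 8
    case Supervisor => 9
    case Machine    => 11
  }

  // Returns the CSR state after an ECALL at the given PC:
  // mcause gets the cause code, mepc gets the PC of the ECALL itself
  def onEcall(csrs: Map[Int, Long], pc: Long, mode: Privilege): Map[Int, Long] =
    csrs + (Mcause -> ecallCause(mode).toLong) + (Mepc -> pc)
}
```

These are the same three codes that the trap_vector routine in the test listing further down compares mcause against.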

Exceptions

Only one type of exception is supported so far: the one raised explicitly by the ECALL instruction. Since exception behavior is architecture-specific, I decided to add a pre-trap vector that the processor jumps to first. It contains user-specified code that calculates the address of the actual trap vector and jumps to it. For RISC-V, the pre-trap vector is just two instructions: a read of the CSR mtvec and a jump to the address it contains. Instructions in the pre-trap vector operate on a few dedicated working registers instead of the main register set.

In the future, more checks will be added to each stage to detect additional exceptions, like instruction misalignment and illegal instructions. Exception information will be sent along with the accompanying instruction and PC to custom logic designed by the user, which will update the CSRs (mentioned above) and perform any further steps. Support for external interrupts will also be added.
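The two-step pre-trap flow for RISC-V described above can be modeled in a few lines of plain Scala. The pre-trap address 0x2000 is the value passed to initPreTrapVector earlier in this post and 0x305 is the CSR address of mtvec; the trapPath function and the map-based CSR storage are hypothetical. For simplicity the sketch treats the whole mtvec value as the handler address and ignores its mode bits.

```scala
// Model of the control flow after an ECALL (function and storage are hypothetical)
object PreTrapModel {
  val PreTrapVector = 0x2000L // from initPreTrapVector in RISCVProcessingModule
  val Mtvec = 0x305           // CSR address of mtvec

  // Returns the PCs the processor jumps to, in order
  def trapPath(csrs: Map[Int, Long]): List[Long] = {
    // Pre-trap instruction 1: read mtvec into a working register
    val work0 = csrs.getOrElse(Mtvec, 0L)
    // Pre-trap instruction 2: jump to the address held in that register
    List(PreTrapVector, work0)
  }
}
```

With mtvec set to 0x80000004, the trap_vector address in the test listing below, the processor would first jump to 0x2000 and then to 0x80000004.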

Testing

Each pipeline stage is a separate module, so it's easy to test them individually with short unit tests. I started with Chisel's built-in PeekPokeTester to drive each input with a value, set expectations on the outputs, and advance the clock one cycle at a time. However, writing lots of poke and expect calls was cumbersome and difficult to correlate with the signal waveforms. To improve the testing process, I wrote a thin wrapper around PeekPokeTester called TableTester. It allows you to specify all input and output values (with don't-cares) for one clock cycle in a single line. When all the values for all the cycles in a test are written, it looks like a table that's sort of a transpose of the waveform viewer. Below is an instance of TableTester created for the decode stage.

  new DecodeTableTester(dut) {

    // Test instructions
    val i2 = 0x9 // incr 2, 000_010_01
    val ec = 0x0 // ecall

    // Initial input values
    poke(dut.io.instrReady, true.B)
    poke(dut.io.instrIn.valid, false.B)
    poke(dut.io.data.valid, false.B)
    poke(dut.io.exData.valid, false.B)

    step(rfDepth) // RF init, steps 1-8

    //              instrIn     data            in  br      instrOut
    //              V  pc   w   V   i   t   w   r   v   b   pc  w  pTr
    stepUpdate(List(0,  0,  0,  0,  0,  0,  0,  1,  0,  x,  x,  x, 0))
    stepUpdate(List(1,  u, i2,  0,  0,  0,  0,  1,  0,  x,  x,  x, 0))
    stepUpdate(List(1,  1, ec,  0,  0,  0,  0,  1,  0,  x,  0, i2, 0))
    stepUpdate(List(0,  u,  u,  1,  2,  0,  1,  1,  1, 20,  x,  x, x))
    stepUpdate(List(0,  u,  u,  0,  0,  0,  0,  1,  0,  x,  x,  x, x))
    stepUpdate(List(1,  2, i2,  0,  0,  0,  0,  1,  0,  x,  x,  x, x))
  }
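The core idea behind the table format, one row of expected values per clock cycle with don't-care entries, can be sketched independently of Chisel. The code below is not the actual TableTester; it is a minimal plain-Scala illustration in which None plays the role of x, and the names Row, v, and mismatches are made up for this example.

```scala
// Minimal illustration of row-based expectations with don't-cares
object TableCheck {
  type Row = List[Option[Int]]

  val x: Option[Int] = None                // don't-care entry
  def v(value: Int): Option[Int] = Some(value)

  // Returns the column indices where an expected value differs from the
  // observed one; don't-care columns never count as mismatches
  def mismatches(expected: Row, observed: List[Int]): List[Int] =
    expected.zip(observed).zipWithIndex.collect {
      case ((Some(e), o), i) if e != o => i
    }
}
```

A row such as `List(v(1), x, v(20))` passes against observed values `List(1, 7, 20)` because the middle column is ignored, while `List(0, 7, 21)` reports columns 0 and 2.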

After testing each stage on its own, we put all of the stages together into the pipeline and perform integration testing. The inputs for this test are the official RISC-V unit tests, which are small assembly programs. I compiled these programs and dumped the output into a text listing. This listing (sample below) gets parsed by a test module that contains the processor along with instruction and data memory. The module monitors the data bus and ends the test when it sees a write to address 0x1000.

Disassembly of section .text.init:

80000000 <_start>:
80000000: 0480006f          j 80000048 <reset_vector>

80000004 <trap_vector>:
80000004: 34202f73          csrr t5,mcause
80000008: 00800f93          li t6,8
8000000c: 03ff0863          beq t5,t6,8000003c <write_tohost>
80000010: 00900f93          li t6,9
80000014: 03ff0463          beq t5,t6,8000003c <write_tohost>
80000018: 00b00f93          li t6,11
8000001c: 03ff0063          beq t5,t6,8000003c <write_tohost>
...
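Parsing a listing like this mostly comes down to recognizing the lines that carry an address and an instruction word. The sketch below shows one way to do that in plain Scala. It is not the parser used in the project; the regular expression and the names ListingParser, parseLine, and parse are assumptions made for this example.

```scala
// Sketch of a listing parser: extracts (address, instruction word) pairs
object ListingParser {
  // A code line looks like "<hex address>: <8 hex digits> <assembly text>"
  private val CodeLine = """^\s*([0-9a-fA-F]+):\s+([0-9a-fA-F]{8})\s+(.*)$""".r

  // Returns None for labels, section headers, blank lines, and "..."
  def parseLine(line: String): Option[(Long, Long)] = line match {
    case CodeLine(addr, word, _) =>
      Some((java.lang.Long.parseLong(addr, 16), java.lang.Long.parseLong(word, 16)))
    case _ => None
  }

  // Builds an address-to-word map that could be used to initialize instruction memory
  def parse(listing: String): Map[Long, Long] =
    listing.split("\n").toList.flatMap(l => parseLine(l)).toMap
}
```

Label lines such as `80000004 <trap_vector>:` are skipped because the address is not immediately followed by a colon.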

References

I had to do a lot of studying to be able to put this pipeline together and debug it. Yuri Panchul's SystemVerilog homework is a set of exercises that's great for learning RTL design fundamentals and useful patterns. For CPU design, the book Digital Design and Computer Architecture by David Money Harris and Sarah L. Harris clearly explains the classic five-stage pipeline design and how to deal with the challenges of branching, hazards, and exceptions. Monte Dalrymple's book Inside an Open-Source Processor provides a good overview of RISC-V and how to begin designing a simple pipelined processor for it. Finally, the RISC-V unprivileged and privileged specifications provide all the details you need to make your implementation compliant.
