Sunday, November 6, 2022

Graphical Chisel Development Environment with Docker

In addition to rewriting a large part of my processing module project, I've also been trying to streamline the development experience. I acquired a new laptop and repurposed my old one as a home development server. Both had a fresh OS installed, so the project needed to be set up from scratch. This used to be a time-consuming, error-prone process with many steps for installing Chisel, configuring Emacs, and downloading all of the dependencies for building and testing. Downloading dependencies is itself tricky to automate, since it involves opening source files in Emacs and waiting for the language server to start up. The language server, which enables fast error checking and autocompletion in Emacs, requires many plugins to be loaded, which I would prefer not to have in my default minimal configuration.

I had previously tried out a Docker image made for the Chipyard project, and it was easy to use and included everything needed to produce a System-on-Chip design. I liked that a single download took care of all the dependencies and utilities. It's also important to me that the project setup not leak out onto the rest of my system. The source code itself should remain on the host filesystem so that every change is preserved, but everything surrounding it should be contained and quick to revert. In addition, the Docker instance started up quickly and did not add much overhead to my system.

So, I started following the very good online documentation to write my own Dockerfile that is based on the Chipyard one. It downloads all of the Chisel dependencies and builds the appropriate version of Verilator. Emacs is included along with a specialized configuration and language server plugins so that my editor runs from within the instance. A terminal emulator is also installed to run on startup. This way, nothing more needs to be downloaded when the Docker instance is running; the system is ready to be used completely offline. When launching the instance, two virtual filesystem mounts are created: one for the source code, and one for an X11-related file (/tmp/.X11-unix) so that GUIs can be displayed with the host window manager. This setup has been tested to work successfully on my Linux home server (Ubuntu 21.10) and my main Windows laptop (WSL2).
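The launch step can be sketched as a shell command along these lines (the image name and source path here are illustrative, not the project's actual values):

```shell
# Hypothetical image name and project path; adjust to your own setup.
# Mount the project source and the host X11 socket so GUI programs
# (Emacs, gtkwave) display through the host window manager.
docker run -it --rm \
  -e DISPLAY=$DISPLAY \
  -v "$HOME/chisel-project":/home/dev/project \
  -v /tmp/.X11-unix:/tmp/.X11-unix \
  chisel-dev-env
```

The `-e DISPLAY` pass-through is what lets X11 clients inside the instance find the host display via the mounted socket.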

Now the setup process has become much easier and cleaner, but there are still a few problems. GUI windows started from the instance don't respect the OS display scaling settings on my high-DPI displays, so they look tiny by default. One workaround is to increase the Emacs face size, which keeps Emacs readable and sharp, but that doesn't help with gtkwave, the waveform viewer. Another solution for WSL2 is to add a .wslgconfig file to scale the windows to a readable size, but this makes everything look pixelated and blurry (see below).

Another potentially serious problem I ran into early on was the ownership and permissions of files created in the source area from within the instance. By default, the instance launches as root, so all new files are owned by root. This includes object files created by git, which can corrupt the repository if it was initialized by a non-root user. Fortunately, I was able to recover from this mistake fairly easily, but I changed the Dockerfile to create a user account that can be chosen when building the image. If the Docker user matches the user that launches the image, there is no issue. However, this does require each individual user to build their own image, which can take around half an hour. The resulting image is also fairly large (3.69 GB), so downloading it can take a while, not to mention uploading it. Despite these issues, Docker has become part of my routine development process, and I'll continue to improve it for this project and adopt it for future projects as well.
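The user-matching approach can be sketched in the Dockerfile with build arguments (the argument names and defaults are illustrative, not the project's actual values):

```dockerfile
# Hypothetical build arguments; pass your own values with e.g.
#   docker build --build-arg USERNAME=$(id -un) --build-arg UID=$(id -u) .
ARG USERNAME=dev
ARG UID=1000
ARG GID=1000

# Create a non-root account matching the host user so files written
# to the mounted source tree keep the right ownership.
RUN groupadd -g ${GID} ${USERNAME} && \
    useradd -m -u ${UID} -g ${GID} -s /bin/bash ${USERNAME}
USER ${USERNAME}
```

Because the UID/GID are baked in at build time, files created inside the instance on the mounted source tree appear on the host as owned by the matching host user.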

Sunday, October 23, 2022

Rewriting the Processing Module Pipeline

A couple of years ago, I presented a project called ProcessingModule at the Chisel Community Conference. Its goal was to generate a simple CPU core given an instruction set architecture (ISA). The ISA would also be written in Chisel but in a format that is easily readable and writeable by those unfamiliar with the language. An example implementation called AdderModule was provided with a basic ISA that only adds integers and accesses memory. There is also a branch instruction that checks if a register value is greater than zero.

I had to crunch to get the first version ready for the conference, but it worked with a handful of unit test cases. Afterwards, I started thinking about how it could support a popular ISA like RISC-V and pass the standard RISC-V test suite. The biggest problem was the possibility of data hazards when nearby instructions depend on each other. Fixing these proved challenging because the code was a single large blob that was difficult to reason about. Pipeline stages were not clearly defined, so it was hard to figure out where hazards should be detected and how they should be handled.

So I decided to go back to the drawing board with the book Digital Design and Computer Architecture by David Money Harris and Sarah L. Harris. I closely studied the chapters on five-stage pipeline design and rewrote the code in a similar fashion. A fetch module accesses instruction memory, a decode module parses the instructions and accesses the register file, an execute module performs integer computations, and a memory module accesses data memory. The decode module also contains logic for detecting hazards and triggers a stall if one occurs. Instruction results are forwarded to the register file early when possible to reduce the frequency of stalls. The code for each pipeline stage is much smaller and simpler than the previous implementation, so it is easier to understand and debug. The stages can also be tested separately, making it possible to write shorter unit tests that cover more scenarios.
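The stall decision in the decode stage can be modeled in plain Scala as a sketch (this is an illustrative software model with invented names, not the project's actual Chisel hardware logic):

```scala
// Illustrative software model of a decode-stage hazard check; the real
// project implements the equivalent decision as Chisel hardware logic.
object HazardSketch {
  // An in-flight instruction: which register it will write (if any)
  // and whether its result comes from a memory load.
  case class InFlight(destReg: Option[Int], isLoad: Boolean)

  // Stall when a source register of the decoding instruction matches
  // the destination of an in-flight instruction whose result is not
  // yet available for forwarding (here: any pending load).
  def needsStall(srcRegs: Seq[Int], execute: Option[InFlight]): Boolean =
    execute.exists(e => e.isLoad && e.destReg.exists(d => srcRegs.contains(d)))
}
```

Non-load results can be forwarded in time, so in this simplified model only a load followed immediately by a dependent instruction forces a stall.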

In addition to the underlying implementation, the ISA interface has also changed. The biggest difference is that a register file is now part of the interface, though the register width and number of registers can still be customized. There are more generator methods in the interface to help the implementation identify data dependencies between instructions. The new interface is shown below:

abstract class InstructionLogic(val name : String) {

  /** Number of operands */
  val numOps : Int = 0

  /** Indicates if the given word matches this instruction class */
  def decode(instr : UInt) : Bool

  /** Index in register file that specified operand is at */
  def getRFIndex(instr : UInt, opIndex : Int) : UInt = 0.U

  /** Indicates if this instruction is a branch */
  def branch() : Bool = false.B

  /** Indicates if this branch instruction is relative to PC */
  def relativeBranch() : Bool = false.B

  /** Returns the instruction address or offset that the branch should jump to */
  def getBranchPC(instr : UInt, ops : Vec[UInt]) : SInt = 0.S

  /** Indicates if this instruction reads from data memory */
  def readMemory() : Bool = false.B

  /** Indicates if this instruction writes to data memory */
  def writeMemory() : Bool = false.B

  /** Indicates if this instruction writes to the register file */
  def writeRF() : Bool = false.B

  /** Returns data address that should be accessed */
  def getAddress(instr : UInt, ops : Vec[UInt]) : UInt = 0.U

  /** Returns index in register file that should be written */
  def getWriteIndex(instr : UInt, ops : Vec[UInt]) : UInt = 0.U

  /** Returns data that should be written or stored */
  def getData(instr : UInt, ops : Vec[UInt]) : UInt = 0.U

  /** Returns memory data that should be written to register file
    * The data returned by this logic is calculated in the memory
    * stage using data just loaded from data memory
    */
  def getRFWriteData(resultData : UInt, memData : UInt) : UInt = memData
}
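To make the interface concrete, here is a hypothetical implementation sketch for an increment instruction. This is not code from the project; the opcode layout and bit fields are invented for illustration:

```scala
// Hypothetical example: an "increment register by immediate" instruction.
// Assumes a made-up encoding where bits(3,0) hold the opcode, bits(7,4)
// the destination/source register, and bits(15,8) the immediate value.
class IncrLogic extends InstructionLogic("incr") {

  // One operand: the register being incremented
  override val numOps : Int = 1

  // Match a made-up opcode of 1
  def decode(instr : UInt) : Bool = instr(3, 0) === 1.U

  // The operand lives at the register index encoded in the instruction
  override def getRFIndex(instr : UInt, opIndex : Int) : UInt = instr(7, 4)

  // The result is written back to the same register
  override def writeRF() : Bool = true.B
  override def getWriteIndex(instr : UInt, ops : Vec[UInt]) : UInt = instr(7, 4)

  // New value: the operand plus the immediate field
  override def getData(instr : UInt, ops : Vec[UInt]) : UInt = ops(0) + instr(15, 8)
}
```

Because decode has no default, it must be implemented; the remaining methods are overridden only where the instruction deviates from the interface's defaults.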

The unit testing strategy has also been changed in several ways. DecoupledTester, described in the last post, is not cycle-accurate: input and output events may happen in any cycle as long as the event order is consistent. I wanted more visibility and control over the exact timing of the pipeline, so I wrote a new test harness called DecoupledPeekPokeTester. It still accepts a sequence of high-level events (instruction/data request/receive), but an event's position in the sequence denotes the exact cycle in which it is expected, as shown below:

"AdderModule" should "increment with memory value with stall" in {
  Driver(() => new AdderModule, "treadle") {
    (dut : AdderModule) => new DecoupledPeekPokeTester(dut){
      def cycles = List(
        List(new InstrReq(addr = 0)), // fetch 0
        List(new InstrRcv(instr = incrData(0, 1)), new InstrReq(addr = 1)), // receive incr1, fetch 1
        Nil, // decode incrData
        List(new InstrRcv(instr = store(0, 6))), // execute incrData, receive store
        Nil, // memory incrData, decode store
        Nil, // stall on incrData
        List(new LoadRcv(4), new LoadReq(addr = 1)), // memory incrData, execute store
        Nil, // writeback incrData, memory store
        Nil, // stall on store
        Nil, // stall on store
        Nil, // stall on store
        Nil, // stall on store
        List(new StoreReq(addr = 6, data = 4))) // writeback store to 6
    }
  } should be (true)
}

With this harness, it is possible to verify that certain instruction sequences achieve an ideal cycles-per-instruction (CPI) rate of 1. DecoupledPeekPokeTester cases are also executed directly in Chisel with the treadle interpreter instead of generating Verilog code that must be compiled by Verilator, which allows them to run much faster. Even with almost three times as many unit tests (24 vs. 9), the whole suite finishes in about a third of the time compared to DecoupledTester (17 seconds vs. 52 seconds).
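With a cycle-accurate schedule, the CPI check reduces to simple counting over the event list. As a rough plain-Scala illustration (invented names, not the harness code itself):

```scala
// Illustrative CPI calculation: with a cycle-accurate schedule, total
// cycles is the length of the event list, and instructions retired can
// be counted from writeback events.
object CpiSketch {
  // Each schedule entry is the set of event names expected that cycle.
  def cpi(schedule: Seq[Seq[String]]): Double = {
    val cycles = schedule.length
    val retired = schedule.flatten.count(_ == "writeback")
    cycles.toDouble / retired
  }
}
```

A sequence that retires one instruction every cycle once the pipeline is full approaches the ideal CPI of 1, while each stall cycle (an empty entry) pushes the ratio higher.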

Now, the path to a RISC-V implementation is much clearer. The next steps are to convert the basic integer instruction set into the format required by ProcessingModule and run the standard instruction test suite. When that is done, we can proceed to integrating the core into the Chipyard system and implementing it on an FPGA.