Hardware Validating an Emulator

In the emulation world, there are a few different ways to determine whether your emulator is accurate. The traditional way has been through the use of test ROMs: ROMs that either act as cartridges for cartridge-based console systems, or replacement system ROMs that execute from the system bootstrap. These ROMs typically run a specified suite of tests and report pass/fail status either on screen or via an I/O port.

The 8088 doesn't have a dedicated test ROM (yet), but test ROMs exist for the 80186 and 80386 processors.

Another, more recent method is to use a CPU instruction test suite, typically comprising thousands of JSON test records, each capturing a single instruction's execution, including cycle states and bus accesses. Each test provides the information needed to set up the CPU state, along with the final state that results from a correct execution of the instruction. A collection of such tests has been compiled by Thomas Harte, available here. But again, no such test suite exists for the 8088 (yet).

Since this post was made, I have contributed a set of JSON CPU Tests for the 8088 CPU to Tom Harte's ProcessorTests repository. Check them out!

The Intel 8088 has traditionally been a difficult CPU to emulate, as its dual-headed design, divided between the Execution Unit and the Bus Interface Unit, gives it somewhat complicated and at times unexpected behavior. Published cycle timing lists for the 8088 are mostly useless for constructing a cycle-accurate 8088 emulator; a strictly accurate cycle timing list for the 8088 would simply say "it depends" for each opcode. Even an instruction that makes no bus access whatsoever can take a variable number of cycles, depending on whether it can execute directly from the instruction queue or must first be fetched.

In pursuit of a cycle-accurate CPU emulator, I took a third tack: I used a commonly available Arduino microcontroller to control a real 8088. With some supporting code, this allowed direct verification of my emulator's operation against a physical CPU, both for single instructions and for continuous, cycle-by-cycle execution.

In the interest of disclosure, I didn't come up with this idea. I first saw it mentioned by phix on the VOGONS emulation forum, and he has made his code and KiCad files available here. He had already used it successfully to validate the per-instruction accuracy of his emulator, VirtualXT.

I took the idea and ran with it. I chose an Arduino MEGA 2560 Rev3 instead of a Raspberry Pi, since the Arduino uses the same 5V logic levels as the 8088 and has quite a few more pins available.

Enter Arduino

One of the advantages of using the Arduino MEGA is that I was able to connect every pin on the 8088, as well as add an i8288 bus controller so the 8088 could run in maximum mode, the same mode used in the IBM PC and XT.

Here's what my original breadboard looked like:


The yellow wire is the clock signal, the green wires are the address and data bus, the orange wires are various status lines, and the blue wires carry various status signals from the i8288. The LED is attached to the ALE signal, allowing me to see when the CPU had successfully reset. Once the CPU is running, the LED dims: ALE is only asserted during T1 of a bus cycle, so its duty cycle stays below 25% and varies with how busy the bus is.

Manipulating the GPIO pins on the Arduino is done through direct register access for speed. 

Executing a single clock of the CPU is straightforward:

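// The clock output is bit 5 of port G; writing the port register directly is far faster than digitalWrite()
#define SET_CLOCK_LOW  (PORTG &= ~0x20)
#define SET_CLOCK_HIGH (PORTG |= 0x20)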

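// Execute one full clock cycle on the CPU: drive CLK high, then low.
// CLOCK_PIN_HIGH_DELAY and CLOCK_PIN_LOW_DELAY (in microseconds) are defined elsewhere.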
void clock_tick() {
  SET_CLOCK_HIGH;
  delayMicroseconds(CLOCK_PIN_HIGH_DELAY);
  SET_CLOCK_LOW;
  delayMicroseconds(CLOCK_PIN_LOW_DELAY);
}

Resetting the CPU is done in this manner:

// Resets the CPU by asserting RESET line for at least 4 cycles and waits for ALE signal.
bool cpu_reset() {

  memset(&CPU, 0, sizeof CPU);
  
  CYCLE_NUM = 0;
  bool ale_went_off = false;
  CPU.state_begin_time = 0;
  change_state(Reset);
  CPU.data_bus = 0x00; 
  init_queue();

  // Hold RESET high for RESET_HOLD_CYCLE_COUNT cycles (the 8088 requires at least 4)
  SET_RESET_HIGH;

  for (int i = 0; i < RESET_HOLD_CYCLE_COUNT; i++) {

    if (READ_ALE_PIN == false) {
      ale_went_off = true;
    }
    clock_tick();
  }

  // CPU didn't reset for some reason.
  if (ale_went_off == false) {
    return false;
  }

  SET_RESET_LOW;

  // Clock CPU while waiting for ALE

  // The first response from the CPU during reset is the queue status lines reporting that the 
  // processor queue has been flushed. This happens on cycle #5 and corresponds with microcode
  // word 1e6.
  CPU.fetch_state = FETCH_IDLE;

  // Reset should only take 7 cycles, but we can try for longer
  for ( int i = 0; i < RESET_CYCLE_TIMEOUT; i++ ) {
    cycle();   

    if (READ_ALE_PIN) {
      // ALE is active! CPU has successfully reset
      CPU.doing_reset = false;
      return true;
    }
  }

  // ALE did not turn on within the specified cycle timeout, so we failed to reset the CPU.
  return false;
}

Bus activity is handled by detecting the bus status signals via the i8288 and reading from or writing to the CPU's address and data lines at the appropriate T-cycle. In this manner, we can feed the CPU an arbitrary sequence of bytes as code as it begins filling its instruction queue.
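The S0-S2 lines encode the type of bus cycle in progress, so the first thing a bus handler has to do is classify them. A minimal sketch follows; the numeric encoding is the one given in the 8088 datasheet, but the names `bus_status`, `decode_status`, `is_read_cycle`, and `bus_status_name` are illustrative and not taken from the Arduino8088 code.

```c
/* 8088 bus status as encoded on S2:S1:S0 (per the Intel datasheet). */
typedef enum {
  BUS_INTA = 0, /* 000: interrupt acknowledge */
  BUS_IOR  = 1, /* 001: read I/O port */
  BUS_IOW  = 2, /* 010: write I/O port */
  BUS_HALT = 3, /* 011: halt */
  BUS_CODE = 4, /* 100: code access (instruction fetch) */
  BUS_MEMR = 5, /* 101: read memory */
  BUS_MEMW = 6, /* 110: write memory */
  BUS_PASV = 7  /* 111: passive, no bus cycle starting */
} bus_status;

/* Combine the three status pin levels into a bus_status value. */
static bus_status decode_status(int s0, int s1, int s2) {
  return (bus_status)(((s2 & 1) << 2) | ((s1 & 1) << 1) | (s0 & 1));
}

/* Nonzero if the CPU expects the outside world to drive the data bus. */
static int is_read_cycle(bus_status st) {
  return st == BUS_CODE || st == BUS_MEMR || st == BUS_IOR || st == BUS_INTA;
}

/* Short labels for trace output (hypothetical; only CODE and PASV appear in the trace). */
static const char *bus_status_name(bus_status st) {
  static const char *names[8] = {"INTA", "IOR", "IOW", "HALT", "CODE", "MEMR", "MEMW", "PASV"};
  return names[st & 7];
}
```

A CODE or MEMR cycle is where the server would supply a byte from its own memory; an IOW or MEMW cycle is where it would capture a byte the CPU writes.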

Since we have access to the CPU's QS0 and QS1 status lines, and are aware of CPU fetches due to the CODE bus status available on S0-S2, we can keep full track of the CPU's instruction queue as well.  We can print a cycle-by-cycle trace of executing a small program from the Arduino's memory:

00000001 E   [11001]    M:... I:... PASV t1           (0 ) | S1s2 F3 [B0A71C  ] <-q F9 STC
00000002 E   [11001]    M:... I:... PASV t1           (0 ) | S1s1  3 [B0A71C  ]
00000003 E A:[F0004]    M:... I:... CODE t1 \         (0 ) | S1s0 F2 [A71C    ] <-q B0 MOV
00000004 E   [F0004] CS M:R.. I:... CODE t2  | <-r CF (0 ) |  1s0  2 [A71C    ]
00000005 E   [F0004] CS M:R.. I:... PASV t3  | <-r CF (0 ) | S2s1 S1 [1C      ] <-q A7
00000006 E   [F0004] CS M:... I:... PASV t4 /         (0 ) | S2s0  1 [1C      ]
00000007 E A:[F0005]    M:... I:... CODE t1 \         (0 ) |  1s0 F1 [CF      ] <-q 1C SBB
00000008 E   [F0005] CS M:R.. I:... CODE t2  | <-r 90 (0 ) |  1s0  1 [CF      ]
00000009 E   [F0005] CS M:R.. I:... PASV t3  | <-r 90 (0 ) | S2s1 S0 [        ] <-q CF
00000010 E   [F0005] CS M:... I:... PASV t4 /         (0 ) | S2s0  0 [        ]
00000011 E A:[F0006]    M:... I:... CODE t1 \         (0 ) |  1s0  1 [90      ]

The first column is the cycle number. A: marks an active ALE signal, followed by the address bus in brackets. Next come the active segment, the memory and I/O read/write signals, the S0-S2 bus status, the T-cycle, and any byte transferred on the data bus; the final columns show the prefetch status, the queue contents, and any byte read from the queue.
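Keeping a copy of the queue in sync needs only the two queue status lines and the completed CODE fetches. The following is a hypothetical shadow-queue sketch, not the Arduino8088 implementation: each completed CODE fetch pushes a byte, a first-byte or subsequent-byte status pops one, and an empty-queue status flushes it. The QS1:QS0 encoding comes from the 8088 datasheet; `shadow_queue`, `queue_push`, and `queue_update` are invented names.

```c
#include <stdint.h>

#define QUEUE_SIZE 4 /* the 8088 prefetch queue holds 4 bytes */

/* QS1:QS0 encoding from the 8088 datasheet. */
typedef enum { QS_NONE = 0, QS_FIRST = 1, QS_FLUSH = 2, QS_SUBSEQUENT = 3 } queue_op;

typedef struct {
  uint8_t bytes[QUEUE_SIZE];
  int len;
} shadow_queue;

/* Called when a CODE bus cycle completes: the fetched byte enters the queue. */
static int queue_push(shadow_queue *q, uint8_t b) {
  if (q->len >= QUEUE_SIZE) return 0; /* overflow means our tracking is out of sync */
  q->bytes[q->len++] = b;
  return 1;
}

/* Apply the queue status reported for a cycle.
   Returns the byte read from the queue, or -1 if no byte was read. */
static int queue_update(shadow_queue *q, int qs0, int qs1) {
  queue_op op = (queue_op)(((qs1 & 1) << 1) | (qs0 & 1));
  if (op == QS_FLUSH) { q->len = 0; return -1; }
  if (op == QS_NONE || q->len == 0) return -1;
  uint8_t b = q->bytes[0];
  for (int i = 1; i < q->len; i++) q->bytes[i - 1] = q->bytes[i]; /* shift the rest down */
  q->len--;
  return b;
}
```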

Running solely on the Arduino, we cannot emulate the 8088's entire 1 MB address space, since the microcontroller has only 8 KB of RAM, but we can simulate a few hundred bytes at arbitrary addresses, which is enough to execute simple programs.
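One way to fit a program into a few hundred bytes of SRAM is to map a small window at an arbitrary physical address and answer everything else with a fixed value. This sketch assumes a single 512-byte window and returns 0xFF for unmapped reads; the window size, the open-bus value, and the name `mem_window` are all assumptions made for illustration.

```c
#include <stdint.h>

#define WINDOW_SIZE 512 /* hypothetical: a few hundred bytes fits easily in 8 KB of SRAM */
#define OPEN_BUS 0xFF   /* assumed value returned for unmapped addresses */

typedef struct {
  uint32_t base;              /* physical address where the window starts */
  uint8_t data[WINDOW_SIZE];
} mem_window;

static uint8_t window_read(const mem_window *w, uint32_t addr) {
  addr &= 0xFFFFF;            /* the 8088 has a 20-bit address bus */
  if (addr >= w->base && addr - w->base < WINDOW_SIZE) return w->data[addr - w->base];
  return OPEN_BUS;
}

static void window_write(mem_window *w, uint32_t addr, uint8_t value) {
  addr &= 0xFFFFF;
  if (addr >= w->base && addr - w->base < WINDOW_SIZE) w->data[addr - w->base] = value;
  /* writes outside the window are silently dropped */
}
```

Placing the window just below the top of memory covers the reset vector at FFFF0h; a second window could be added if the program lives elsewhere.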

Arduino8088

Having proven the concept, I decided to design and order a custom PCB that would act as a hat for the Arduino MEGA. This was a lot neater than the old breadboard, and I took the time to connect the rest of the pins I had omitted:


It was time to integrate this as a CPU validator for my emulator.  

I needed a way to control the 8088 from my PC, so I devised a simple binary serial protocol for a CPU server running on the Arduino, which converts commands into toggles of the various GPIO lines. Each command is a single byte, optionally followed by parameter bytes. The server may then send back response bytes, and every command ends with a one-byte status code.
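The framing described above can be sketched from the PC side as follows. The command byte values would come from the `server_command` enum, but `STATUS_OK`, `build_request`, and `parse_reply` are hypothetical; the real server defines its own status codes and its own response length for each command.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define STATUS_OK 0x01 /* hypothetical status value */

/* Build a request: one command byte followed by optional parameter bytes.
   Returns the frame length, or 0 if it does not fit in out. */
static size_t build_request(uint8_t cmd, const uint8_t *params, size_t nparams,
                            uint8_t *out, size_t cap) {
  if (1 + nparams > cap) return 0;
  out[0] = cmd;
  if (nparams) memcpy(out + 1, params, nparams);
  return 1 + nparams;
}

/* Split a reply into its response bytes and the trailing status byte.
   expected is the number of response bytes the command returns.
   Returns the status byte, or -1 if the reply is too short. */
static int parse_reply(const uint8_t *reply, size_t len, size_t expected,
                       uint8_t *response) {
  if (len < expected + 1) return -1;
  if (expected) memcpy(response, reply, expected);
  return reply[expected];
}
```

Because every command ends with a status byte, the client always knows where a reply stops, even for commands that return no data.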

Here are the various server commands available:

typedef enum {
  CmdNone            = 0x00,
  CmdVersion         = 0x01,
  CmdReset           = 0x02,
  CmdLoad            = 0x03,
  CmdCycle           = 0x04,
  CmdReadAddress     = 0x05,
  CmdReadStatus      = 0x06,
  CmdRead8288Command = 0x07,
  CmdRead8288Control = 0x08, 
  CmdReadDataBus     = 0x09,
  CmdWriteDataBus    = 0x0A,
  CmdFinalize        = 0x0B,
  CmdBeginStore      = 0x0C,
  CmdStore           = 0x0D,
  CmdQueueLen        = 0x0E,
  CmdQueueBytes      = 0x0F,
  CmdWritePin        = 0x10,
  CmdReadPin         = 0x11,
  CmdGetProgramState = 0x12,
  CmdLastError       = 0x13,
  CmdGetCycleStatus  = 0x14,
  CmdInvalid         = 0x15,
} server_command;

Many of these ended up unused, since querying individual status bits one at a time is an inefficient use of the limited serial bandwidth. Instead, I rolled most of the status signals into a single command that returns them all: CmdGetCycleStatus (0x14). CmdWritePin is still useful for driving some of the more interesting pins, such as NMI, INTR, or READY, to simulate interrupts or wait states.
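Rolling several signals into one reply mostly comes down to bit packing. The layout below (S0-S2 in bits 0-2, QS0-QS1 in bits 3-4, ALE in bit 5, READY in bit 6) is purely hypothetical and is not the actual CmdGetCycleStatus reply format, which may carry more fields such as the i8288 command outputs.

```c
#include <stdint.h>

/* Hypothetical per-cycle status, packed into one byte for the serial link. */
typedef struct {
  uint8_t s;      /* S2:S1:S0 bus status, 0-7 */
  uint8_t qs;     /* QS1:QS0 queue status, 0-3 */
  uint8_t ale;    /* ALE level */
  uint8_t ready;  /* READY level */
} cycle_status;

static uint8_t pack_status(const cycle_status *c) {
  return (uint8_t)((c->s & 0x07) | ((c->qs & 0x03) << 3) |
                   ((c->ale & 1) << 5) | ((c->ready & 1) << 6));
}

static cycle_status unpack_status(uint8_t b) {
  cycle_status c = { (uint8_t)(b & 0x07), (uint8_t)((b >> 3) & 0x03),
                     (uint8_t)((b >> 5) & 1), (uint8_t)((b >> 6) & 1) };
  return c;
}
```

One byte per cycle replaces several command/response round trips, which is where the serial savings come from.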

CmdLoad and CmdStore deserve a mention: these commands load and save the CPU register state, respectively. This is done by feeding the 8088 small assembly routines served from the Arduino. The load routine first executes POPF to set the flags, then a series of MOV instructions that are hot-patched with the desired register contents, followed by a patched far JMP that effectively sets CS:IP. You can view that code here.
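To illustrate hot-patching, here is a hypothetical load-program builder. It uses the standard 8086 encodings for POPF (9Dh), MOV r16, imm16 (B8h+r), and the direct far JMP (EAh, then offset, then segment, each little-endian). A real routine also has to load DS, ES, and SS, which this sketch omits, and the flags word is assumed to be supplied by the server when POPF reads the stack. `load_regs` and `build_load_program` are invented names.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint16_t ax, bx, cx, dx, sp, bp, si, di, cs, ip; } load_regs;

/* Emit MOV r16, imm16: opcode B8+reg, then the immediate, low byte first. */
static size_t emit_mov_imm16(uint8_t *p, uint8_t reg, uint16_t value) {
  p[0] = (uint8_t)(0xB8 + reg);
  p[1] = (uint8_t)(value & 0xFF);
  p[2] = (uint8_t)(value >> 8);
  return 3;
}

/* Build POPF; MOV AX..DI, imm16; JMP FAR cs:ip into p (needs 30 bytes). */
static size_t build_load_program(const load_regs *r, uint8_t *p) {
  size_t n = 0;
  p[n++] = 0x9D; /* POPF */
  /* Register numbers in x86 encoding order: AX=0 CX=1 DX=2 BX=3 SP=4 BP=5 SI=6 DI=7 */
  const uint16_t vals[8] = { r->ax, r->cx, r->dx, r->bx, r->sp, r->bp, r->si, r->di };
  for (uint8_t i = 0; i < 8; i++) n += emit_mov_imm16(p + n, i, vals[i]);
  p[n++] = 0xEA; /* JMP ptr16:16 */
  p[n++] = (uint8_t)(r->ip & 0xFF);
  p[n++] = (uint8_t)(r->ip >> 8);
  p[n++] = (uint8_t)(r->cs & 0xFF);
  p[n++] = (uint8_t)(r->cs >> 8);
  return n;
}
```

Since the layout is fixed, the server could equally keep one template and overwrite only the immediate bytes, which is what hot-patching amounts to.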

The store routine does the opposite, this time using OUT instructions to a magic port number to record the contents of each register, PUSHF to record flags, and a CALL to record CS:IP. The store routine can be seen here.  Both routines are adapted from phix's original code, so credit where due. 
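On the server side, the store routine's output has to be collected from the bus. A hypothetical capture sketch follows: the server appends every byte written during a store cycle, and 16-bit values are reassembled on the assumption that each arrives low byte first. The buffer size, the ordering, and the names `store_capture`, `capture_byte`, and `capture_word` are assumptions.

```c
#include <stddef.h>
#include <stdint.h>

/* Bytes written by the store routine, in the order the CPU wrote them. */
typedef struct {
  uint8_t bytes[64];
  size_t len;
} store_capture;

static void capture_byte(store_capture *c, uint8_t b) {
  if (c->len < sizeof c->bytes) c->bytes[c->len++] = b;
}

/* Reassemble the i-th 16-bit value, assuming low byte then high byte.
   Returns 0 if that value has not been fully captured yet. */
static int capture_word(const store_capture *c, size_t i, uint16_t *out) {
  if (2 * i + 1 >= c->len) return 0;
  *out = (uint16_t)(c->bytes[2 * i] | (c->bytes[2 * i + 1] << 8));
  return 1;
}
```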

Emulator Validation

Using our hardware validator, validating an emulator on an instruction-by-instruction basis is straightforward. Before executing each instruction, we send the CPU server a 'CmdLoad' command with the emulator's current register state. We then execute the instruction on both the CPU server and the emulator, and issue a 'CmdStore'. The post-execution register states of the CPU and the emulator can then be compared directly.
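The comparison step can be a simple register-by-register check that reports which registers differ. In this sketch (register layout and names hypothetical), `flags_mask` lets the caller ignore flags an instruction leaves undefined; for example, MUL leaves SF, ZF, AF, and PF undefined. `FLAGS_DEFINED` covers CF, PF, AF, ZF, SF, TF, IF, DF, and OF.

```c
#include <stdint.h>

enum { REG_AX, REG_BX, REG_CX, REG_DX, REG_SP, REG_BP, REG_SI, REG_DI,
       REG_CS, REG_DS, REG_ES, REG_SS, REG_IP, REG_FLAGS, NUM_REGS };

typedef struct { uint16_t r[NUM_REGS]; } regs8088;

/* Defined 8088 flag bits: OF DF IF TF SF ZF AF PF CF. */
#define FLAGS_DEFINED 0x0FD5u

/* Returns a bitmask with bit i set if register i differs (bit 13 = FLAGS).
   Only the flag bits in flags_mask are compared. */
static uint16_t compare_regs(const regs8088 *hw, const regs8088 *emu, uint16_t flags_mask) {
  uint16_t diff = 0;
  for (int i = 0; i < NUM_REGS; i++) {
    uint16_t mask = (i == REG_FLAGS) ? flags_mask : 0xFFFF;
    if ((hw->r[i] & mask) != (emu->r[i] & mask)) diff |= (uint16_t)(1u << i);
  }
  return diff;
}
```

Returning a mask rather than a boolean makes it easy to print exactly which registers diverged.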

The CPU server can detect the end of an instruction by inspecting the QS0 and QS1 status lines, which signal when the first byte of an instruction is read from the queue. When a first-byte read occurs after the current instruction has already started, the next instruction is about to begin, which means the current one has ended.
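Detecting the boundary needs only the queue status. A minimal sketch, assuming it is fed the QS0/QS1 levels once per cycle: the first "first byte" read starts the instruction, and the next one marks its end. `boundary_tracker` and `instruction_ended` are hypothetical names.

```c
/* QS1:QS0 = 01 means "first byte of an opcode was taken from the queue". */
#define QS_FIRST_BYTE 1

typedef struct {
  int started; /* set once the current instruction's first byte has been read */
} boundary_tracker;

/* Feed one cycle's queue status. Returns 1 when a first-byte read is seen
   after the current instruction has started, i.e. that instruction has ended. */
static int instruction_ended(boundary_tracker *t, int qs0, int qs1) {
  int qs = ((qs1 & 1) << 1) | (qs0 & 1);
  if (qs != QS_FIRST_BYTE) return 0;
  if (!t->started) { t->started = 1; return 0; }
  return 1;
}
```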

The queue status lines were needed because of the design of the 8087, which had to maintain its own copy of the CPU's instruction queue and needed to know when the CPU read bytes from it. Not all CPUs have such a convenient way to indicate the boundary between instructions, which is something to keep in mind when adapting this technique to other architectures. In theory, we could also use the trap flag for this purpose, at the cost of taking an interrupt after every instruction.

Unfortunately, the serial link turns out to be a serious bottleneck. The validator cannot run fast enough to boot the system; it would likely take hours just to get through the BIOS memory test. Some optimizations are available: we can mark instructions as already seen so we do not repeat validation in tight loops, or skip validation for entire address ranges so we do not validate code from the BIOS at all.
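Both optimizations fit in a small filter keyed on the linear address CS*16+IP. This sketch keeps one bit per address of the 1 MB space (128 KB, so it belongs on the PC side rather than the Arduino) plus a single skipped range; `validate_filter` and its helpers are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

typedef struct {
  uint8_t *seen;               /* one bit per linear address */
  uint32_t skip_lo, skip_hi;   /* inclusive range never validated, e.g. the BIOS */
} validate_filter;

static int filter_init(validate_filter *f, uint32_t skip_lo, uint32_t skip_hi) {
  f->seen = calloc(1u << 17, 1); /* 2^20 bits = 128 KB */
  f->skip_lo = skip_lo;
  f->skip_hi = skip_hi;
  return f->seen != NULL;
}

static void filter_free(validate_filter *f) { free(f->seen); f->seen = NULL; }

static uint32_t linear(uint16_t cs, uint16_t ip) {
  return (((uint32_t)cs << 4) + ip) & 0xFFFFF; /* wraps at 1 MB like the 8088 */
}

/* Returns 1 if the instruction at cs:ip should be validated, and marks it seen. */
static int should_validate(validate_filter *f, uint16_t cs, uint16_t ip) {
  uint32_t a = linear(cs, ip);
  if (a >= f->skip_lo && a <= f->skip_hi) return 0;
  uint8_t bit = (uint8_t)(1u << (a & 7));
  if (f->seen[a >> 3] & bit) return 0;
  f->seen[a >> 3] |= bit;
  return 1;
}
```

Keying on the linear address means two different CS:IP pairs that alias the same byte are treated as one instruction.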

Perhaps a more efficient approach is to build an instruction fuzzer and send random instructions to both your emulator and the CPU server. Surprisingly, the 8088 tolerates random data well. It has no concept of an invalid instruction; for any particular sequence of bytes, the CPU will perform some deterministic operation. Of course, we can keep a list of valid opcodes and skip the technically 'invalid' ones if we aren't interested in researching undefined behavior. Similarly, we can mask undefined flags, so we can choose to validate our emulated CPU only within the realm of well-documented behavior.

By randomizing register state and memory contents, we can comprehensively execute various opcodes over thousands of iterations quite easily. It's like having an infinite supply of JSON CPU tests.
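A fuzzer needs little more than a seeded PRNG and an opcode filter. This sketch uses xorshift32 so that a failing case can be replayed from its seed. The skip list (0x60-0x6F and 0xF1, which are undocumented on the 8088) is only an example, and all names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* xorshift32: small deterministic PRNG; the state must start nonzero. */
static uint32_t xorshift32(uint32_t *state) {
  uint32_t x = *state;
  x ^= x << 13;
  x ^= x >> 17;
  x ^= x << 5;
  return *state = x;
}

/* Example skip list: opcodes we choose not to test. Adjust to taste. */
static int opcode_allowed(uint8_t op) {
  return !(op >= 0x60 && op <= 0x6F) && op != 0xF1;
}

/* Fill buf with an allowed opcode followed by random bytes, which the CPU will
   interpret as ModR/M, displacement, or immediate fields as the opcode requires. */
static void random_instruction(uint32_t *state, uint8_t *buf, size_t len) {
  uint8_t op;
  do { op = (uint8_t)xorshift32(state); } while (!opcode_allowed(op));
  buf[0] = op;
  for (size_t i = 1; i < len; i++) buf[i] = (uint8_t)xorshift32(state);
}

/* Randomize register values the same way, so the seed reproduces the whole test. */
static void random_regs(uint32_t *state, uint16_t *regs, size_t n) {
  for (size_t i = 0; i < n; i++) regs[i] = (uint16_t)xorshift32(state);
}
```

Logging the seed alongside each failure is enough to regenerate the exact instruction bytes and register state later.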

Validating Cycle-Accuracy

One problem with this instruction-based approach is that setting up and reading out the register state not only takes a long time, but also disrupts the state of the CPU, leaving us on an unspecified T-cycle with the contents of our instruction queue destroyed.

It would be faster and more efficient to reset the 8088 once and let it run in lockstep with the emulator, cycle by cycle. We lose the ability to read out register state, but this isn't as big a problem as it might seem: by comparing the bus operations the real and virtual CPUs perform, we can still catch most deviations from correct behavior. To do this, the emulator sends the validation module a list of cycle-state structures for each instruction it executes; the validation module then runs the instruction on the 8088 and produces the same list from hardware. Right off the bat we can check that both lists are the same length, but we can also verify that each bus access had the same status, latched the same address, and put the same byte on the data bus.
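Comparing the two cycle lists might look like the sketch below. The `cycle_state` fields are a hypothetical subset of what a real trace records; addresses are compared on T1, when ALE latches them, and data on T3. The function returns the index of the first divergent cycle so the mismatch can be located in a side-by-side trace.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-cycle record. */
typedef struct {
  uint8_t  bus_status;  /* S2:S1:S0, 0-7 */
  uint8_t  t_state;     /* 1-4 for T1-T4, 0 for idle */
  uint32_t address;     /* latched 20-bit address */
  uint8_t  data;        /* byte on the data bus */
  uint8_t  queue_op;    /* QS1:QS0 */
} cycle_state;

/* Returns -1 if the traces match, otherwise the index of the first differing
   cycle (a length mismatch reports the length of the shorter trace). */
static long compare_cycles(const cycle_state *hw, size_t hw_len,
                           const cycle_state *emu, size_t emu_len) {
  size_t n = hw_len < emu_len ? hw_len : emu_len;
  for (size_t i = 0; i < n; i++) {
    const cycle_state *a = &hw[i], *b = &emu[i];
    if (a->bus_status != b->bus_status || a->t_state != b->t_state ||
        a->queue_op != b->queue_op) return (long)i;
    /* Address and data are only meaningful at specific T-states. */
    if (a->t_state == 1 && a->address != b->address) return (long)i;
    if (a->t_state == 3 && a->data != b->data) return (long)i;
  }
  return hw_len == emu_len ? -1 : (long)n;
}
```

Including the queue status in every comparison is what lets this catch queue reads that happen on the wrong cycle.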

Even better, we can now enable the instruction queue on the emulated CPU and validate its behavior against the hardware, which is essential for investigating the rather curious and complicated algorithm that drives the 8088's BIU. We can then verify that queue reads happen exactly when they should.

This is what a typical validation operation looks like:


The cycles executed by the physical CPU are on the left, and the cycles executed by the emulator are on the right. In this case, validation failed: the emulated instruction took 3 fewer cycles than the hardware. This was due to two missing PASV states at cycle #16, likely caused by a prefetch abort.

The 8088 is full of these sorts of delays, which can appear arbitrary without a full grasp of the underlying logic behind them. Hardware validation was essential for making my emulator cycle-accurate; I can't see how I would have accomplished it without it.

I hope to eventually produce a series of JSON CPU tests for the 8088 directly from hardware to provide some of the benefits of this validation technique for those who can't or won't go to all this trouble, but that's a topic for a future post.

All the code and KiCad files for my Arduino8088 are available here on GitHub.
You can see the way I integrate the validator with my emulator here and here.

Happy Validating!
