The 8088MPH CPU Test
Much has been written about the PC demo 8088MPH.
If you haven't seen this demo yet, take a moment to watch it. If you're at all familiar with the IBM PC, you will immediately understand why this demo is so impressive. If not, consider that the IBM PC was a very limited, business-oriented machine not designed for fancy graphics, and its optional CGA graphics adapter was best known for producing a maximum of 4 garish neon colors.
8088MPH has become something of a litmus test for PC emulator accuracy in the years since its release, and the gauntlet begins before the demo even shows a single effect.
8088MPH performs a CPU speed test when it starts, and if your CPU fails to complete it in the time an 8088 CPU at 4.77 MHz should take, you are greeted with this:
8088MPH CPU Test from DOSBox-X 0.83.24, @240 cycles/ms
MartyPC 0.1.4 passing the test
The 8088MPH CPU test has become an unofficial yardstick for an emulator's accuracy, and emulator authors have worked diligently to reduce the reported percentage of deviation, or in the best case scenario, actually pass it.
As it turns out, passing the test is harder than actually running the demo.
Off To The Races
The way the test works is conceptually very simple. The PC's Programmable Interval Timer is used as a stopwatch: timer readings bookend a block of code that exercises a wide variety of 8088 opcodes. The number of cycles 8088MPH reports is not, in fact, CPU cycles - it is timer cycles.
The test begins like so:
test_cpu_01 proc far

var_4 = word ptr -4
var_2 = word ptr -2
var_s0 = word ptr 0

    push bp
    mov bp, sp
    sub sp, 4
    mov al, 54h ; Timer 1, LSB, Mode 2
    out 43h, al
    mov al, 12h ; Timer 1 = 18
    out 41h, al
    cli
    call start_timer_for_cpu_test
After setting up the stack frame, the CPU test resets Timer 1 - the timer channel that triggers DMA for DRAM refresh - using its default value of 18. This restarts the refresh timer, so that DMA occurs at predictable intervals throughout the test. If this weren't done, the DRAM refresh timer would be at an unpredictable phase relative to the start of the test, and the test could run slightly too fast or too slow. Even with this timer reset, we see a little variance in the results when running the test multiple times.
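To make the 54h control byte less opaque, here is a small sketch that splits an 8253 control word into its bit fields. The function name and the dictionary layout are my own for illustration; only the bit positions come from the 8253 datasheet.

```python
def decode_pit_control_word(value: int) -> dict:
    """Split an 8253 PIT control word into its bit fields."""
    access_names = {0: "latch", 1: "LSB only", 2: "MSB only", 3: "LSB then MSB"}
    mode = (value >> 1) & 0b111
    # The top mode bit is a don't-care for modes 2 and 3,
    # so encodings 6 and 7 alias to modes 2 and 3.
    if mode >= 6:
        mode -= 4
    return {
        "channel": (value >> 6) & 0b11,               # bits 7-6: counter select
        "access": access_names[(value >> 4) & 0b11],  # bits 5-4: read/load format
        "mode": mode,                                 # bits 3-1: operating mode
        "bcd": bool(value & 1),                       # bit 0: BCD vs binary counting
    }


refresh_setup = decode_pit_control_word(0x54)
reload_value = 0x12  # the byte written to port 41h, i.e. 18 decimal
```

Decoding 54h this way gives channel 1, LSB-only access and Mode 2, matching the comments in the disassembly.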
Trixter, the test's author, seems to have accounted for this - a passing result is anywhere between 1668 and 1688 ticks. This 20 tick window represents 80 CPU cycles, or at most a couple of instructions' worth.
It's worth pointing out that this means that accurately modelling DMA is critical for passing the test - the wait states inserted by the DMA controller are part of what is being timed so precisely. The test doesn't seem to access CGA VRAM, so it does not depend on whether you emulate CGA wait states correctly.
For more information on DMA emulation, see my previous article: Exploring DMA on the IBM PC.
The function start_timer_for_cpu_test sets Timer channel #0 to Mode #2 (Rate Generator) and programs the maximum counter value of 65536.
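The stopwatch arithmetic can be sketched as follows. This is an illustration of the general technique rather than the demo's own readout routine: the counter counts down and reloads at 65536, so the elapsed time is the difference between two reads, taken modulo 65536. The function and constant names are hypothetical, and the sketch assumes the timed code finishes in less than one full counter period.

```python
PIT_RELOAD = 65536  # a programmed count of 0 means 65536 on the 8253


def elapsed_ticks(start_count: int, end_count: int) -> int:
    """Ticks elapsed between two reads of a down-counting 16-bit timer.

    Assumes fewer than PIT_RELOAD ticks passed between the two reads,
    so at most one reload (wraparound) happened in between.
    """
    return (start_count - end_count) % PIT_RELOAD


# No wraparound: the counter simply counted down from 5000 to 3322.
simple = elapsed_ticks(5000, 3322)

# Wraparound: the counter passed through the reload point between the reads.
wrapped = elapsed_ticks(1000, 64858)
```

Choosing the maximum count keeps the arithmetic simple, since a test of well under 65536 ticks can wrap at most once.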
The body of the test exercises a vast assortment of 8088 instruction forms. Here it is in its entirety (Happy scrolling!):
    mov ax, 1234h
    xor bx, bx
    mov cx, bx
    mov dx, 5678h
    mov si, bx
    mov di, bx
    add ax, 1234h
    add dx, 1234h
    add al, 12h
    add dl, 12h
    add [12D6h], ax
    add [12D6h], dx
    add ax, [12D6h]
    add dx, [12D6h]
    add ax, dx
    add dx, ax
    add al, dl
    add dl, al
    push es
    pop es
    or ax, 1234h
    or dx, 1234h
    or al, 12h
    or dl, 12h
    or [12D6h], ax
    or [12D6h], dx
    or ax, [12D6h]
    or dx, [12D6h]
    or ax, dx
    or dx, ax
    or al, dl
    or dl, al
    push cs
    pop es
    adc ax, 1234h
    adc dx, 1234h
    adc al, 12h
    adc dl, 12h
    adc [12D6h], ax
    adc [12D6h], dx
    adc ax, [12D6h]
    adc dx, [12D6h]
    adc ax, dx
    adc dx, ax
    adc al, dl
    adc dl, al
    push ax
    mov ax, sp
    push ss
    pop ss
    mov sp, ax
    pop ax
    sbb ax, 1234h
    sbb dx, 1234h
    sbb al, 12h
    sbb dl, 12h
    sbb [12D6h], ax
    sbb [12D6h], dx
    sbb ax, [12D6h]
    sbb dx, [12D6h]
    sbb ax, dx
    sbb dx, ax
    sbb al, dl
    sbb dl, al
    push ds
    pop ds
    and ax, 1234h
    and dx, 1234h
    and al, 12h
    and dl, 12h
    and [12D6h], ax
    and [12D6h], dx
    and ax, [12D6h]
    and dx, [12D6h]
    and ax, dx
    and dx, ax
    and al, dl
    and dl, al
    mov ax, es:[bx]
    daa
    sub ax, 1234h
    sub dx, 1234h
    sub al, 12h
    sub dl, 12h
    sub [12D6h], ax
    sub [12D6h], dx
    sub ax, [12D6h]
    sub dx, [12D6h]
    sub ax, dx
    sub dx, ax
    sub al, dl
    sub dl, al
    mov ax, cs:[bx]
    das
    xor ax, 1234h
    xor dx, 1234h
    xor al, 12h
    xor dl, 12h
    xor [12D6h], ax
    xor [12D6h], dx
    xor ax, [12D6h]
    xor dx, [12D6h]
    xor ax, dx
    xor dx, ax
    xor al, dl
    xor dl, al
    mov ax, ss:[bx]
    aaa
    cmp ax, 1234h
    cmp dx, 1234h
    cmp al, 12h
    cmp dl, 12h
    cmp [12D6h], ax
    cmp [12D6h], dx
    cmp ax, [12D6h]
    cmp dx, [12D6h]
    cmp ax, dx
    cmp dx, ax
    cmp al, dl
    cmp dl, al
    db 3Eh
    lodsw
    aas
    inc ax
    inc cx
    inc dx
    inc bx
    inc si
    inc di
    dec ax
    dec cx
    dec dx
    dec bx
    dec si
    dec di
    push ax
    push cx
    push dx
    push bx
    push bp
    push si
    push di
    pop di
    pop si
    pop bp
    pop bx
    pop dx
    pop cx
    pop ax
    xor cx, cx
    dec cx
    stc
    jb short _test_jump_01
    nop
_test_jump_01:
    clc
    jb short _test_jump_01
    inc cx
    jcxz _test_jump_01
    sub cx, 2
    jmp short _test_jump_03
_test_jump_02:
    inc cx
    clc
_test_jump_03:
    jbe short _test_jump_02
    mov cx, 2
_test_loop01:
    nop
    loop _test_loop01
    test ax, 1234h
    test dx, 1234h
    test al, 12h
    test dl, 12h
    test [12D6h], ax
    test [12D6h], dx
    test [12D6h], ax
    test [12D6h], dx
    test ax, dx
    test dx, ax
    test al, dl
    test dl, al
    lea ax, [12D6h]
    mov es, word [bx+si+1234h]
    nop
    xchg ax, [12D6h]
    xchg dx, [12D6h]
    xchg ax, [12D6h]
    xchg dx, [12D6h]
    xchg ax, dx
    xchg ax, dx
    xchg dl, al
    xchg al, dl
    cbw
    push ds
    pop es
    mov di, si
    movsb
    movsw
    movsb
    movsw
    lodsb
    stosb
    lodsw
    stosw
    lodsb
    stosb
    lodsw
    stosw
    cmpsb
    cmpsw
    cmpsb
    cmpsw
    scasb
    scasw
    scasb
    scasw
    mov al, 12h
    mov cl, 12h
    mov dl, 12h
    mov bl, 12h
    mov ah, 12h
    mov ch, 12h
    mov dh, 12h
    mov bh, 12h
    mov ax, 1234h
    mov cx, 1234h
    mov dx, 1234h
    mov bx, 1234h
    mov si, 1234h
    mov di, 1234h
    les bx, [1234h]
    mov bx, 0FFFFh
    rol bl, 1
    rol byte [12DCh], 1
    ror bl, 1
    ror byte [12DCh], 1
    rcl bl, 1
    rcl byte [12DCh], 1
    rcr bl, 1
    rcr byte [12DCh], 1
    shl bl, 1
    shl byte [12DCh], 1
    shr bl, 1
    shr byte [12DCh], 1
    shl bl, 1
    shl byte [12DCh], 1
    sar bl, 1
    sar byte [12DCh], 1
    rol bx, 1
    rol word [12D6h], 1
    ror bx, 1
    ror word [12D6h], 1
    rcl bx, 1
    rcl word [12D6h], 1
    rcr bx, 1
    rcr word [12D6h], 1
    shl bx, 1
    shl word [12D6h], 1
    shr bx, 1
    shr word [12D6h], 1
    shl bx, 1
    shl word [12D6h], 1
    sar bx, 1
    sar word [12D6h], 1
    mov cl, 4
    rol bl, cl
    rol byte [12DCh], cl
    ror bl, cl
    ror byte [12DCh], cl
    rcl bl, cl
    rcl byte [12DCh], cl
    rcr bl, cl
    rcr byte [12DCh], cl
    shl bl, cl
    shl byte [12DCh], cl
    shr bl, cl
    shr byte [12DCh], cl
    shl bl, cl
    shl byte [12DCh], cl
    sar bl, cl
    sar byte [12DCh], cl
    rol bx, cl
    rol word [12D6h], cl
    ror bx, cl
    ror word [12D6h], cl
    rcl bx, cl
    rcl word [12D6h], cl
    rcr bx, cl
    rcr word [12D6h], cl
    shl bx, cl
    shl word [12D6h], cl
    shr bx, cl
    shr word [12D6h], cl
    shl bx, cl
    shl word [12D6h], cl
    sar bx, cl
    sar word [12D6h], cl
    aad
    nop
    nop
    nop
    nop
    nop
    nop
    aam
    nop
    nop
    nop
    nop
    nop
    nop
    xlatb
    mov ax, 1234h
    mov dx, 5678h
    cmc
    not dl
    not ax
    neg dl
    neg ax
    mov dx, 20BDh
    mul dx
    mov bx, 2710h
    div bx
    nop
    nop
    nop
    nop
    nop
    nop
    imul dx
    nop
    nop
    nop
    nop
    nop
    nop
    idiv bx
    clc
    stc
    pushf
    cld
    std
    popf
    mov ax, 1234h
    mov dx, 1234h
    mov al, 12h
    mov dl, 12h
    mov [12D6h], ax
    mov [12D6h], dx
    mov ax, [12D6h]
    mov dx, [12D6h]
    mov ax, dx
    mov dx, ax
    mov al, dl
    mov dl, al
    mov dx, cs:[bx]
    mov dx, [bp+0]
    mov dx, es:[si]
    mov dx, [di]
    lea bx, [0Ah]
    push word [bx]
    pop word [bx]
The instruction gauntlet starts at CS:07A5 with mov ax, 1234h and ends when the last instruction pop word [bx] is complete. Of course, there's more code here in the real test; there are routines to set up and read out the timer before and after, but these are the instructions being measured.
The test hits some opcodes that even the demo effects in 8088MPH are unlikely to use. MUL and DIV are glacially slow on the 8088, and their microcoded implementations have variable cycle-timings based on the values of their operands.
It's worth noting the futility of trying to use published cycle timings to pass this, or to make a cycle-accurate 8088 emulator in general - all published cycle times for 8088 instructions are "best case" figures that assume no bus delays and a full instruction queue, a state that is unlikely to persist for long in practice. Full and accurate emulation of the CPU prefetch algorithm, the instruction queue, and BIU delays is therefore necessary.
You can find more information on the 8088 prefetch algorithm in my previous blog on the topic.
Measuring Up
I mentioned before that a passing result falls between 1668 and 1688 timer ticks. The CPU is clocked 4 times faster than the timer, so that works out to between 6672 and 6752 CPU cycles. Not all of that is spent in the main body of the CPU test; some cycles are spent in the routines that read the timer on both ends, and at the end an adjustment is made to the measured timer value.
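The conversion between timer ticks and CPU cycles is simple enough to write down. In this sketch the constant and function names are my own, and treating the window as inclusive at both ends is an assumption; the 4:1 clock ratio and the 1668 to 1688 tick window are the figures given above.

```python
CPU_CYCLES_PER_TICK = 4  # the CPU clock runs 4x faster than the timer clock
PASS_MIN_TICKS = 1668
PASS_MAX_TICKS = 1688


def ticks_to_cpu_cycles(ticks: int) -> int:
    """Convert timer ticks to the equivalent number of CPU cycles."""
    return ticks * CPU_CYCLES_PER_TICK


def is_passing(ticks: int) -> bool:
    """True if a measured tick count falls inside the pass window.

    Assumption: both ends of the window are inclusive.
    """
    return PASS_MIN_TICKS <= ticks <= PASS_MAX_TICKS


window_in_cycles = (
    ticks_to_cpu_cycles(PASS_MIN_TICKS),
    ticks_to_cpu_cycles(PASS_MAX_TICKS),
)
```

An emulator that is off by even a few dozen CPU cycles over the whole run can therefore land outside the window.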
Using an Arduino interface to an 8088, we can time the execution from the first 'mov' to the end of the last 'pop', just before the call to cpu_test_read_timer. We find it executes in exactly 6328 CPU cycles, or 1582 timer ticks. That's a lot less than 8088MPH expects, but this was run without DMA, which means that our instructions execute faster than they would otherwise.
My Arduino8088 is not the only game in town, however, and we can get a cycle trace of the CPU test using reenigne's bus-sniffer enabled xtserver. This produces cycle traces off a real IBM XT system, so the DMA timings and resulting wait states are exact. xtserver can only capture a limited number of cycles due to memory limitations of the microcontroller in use, so the test was split up into four sections at different cycle offsets and reassembled.
The instruction sequence was preceded by a reprogramming of Timer Channel #1 to reset DRAM refresh DMA into a predictable state.
Measuring from mov ax, 1234h to pop word [bx] gives us 6676 cycles, or 1669 timer ticks - just barely within our 1668-1688 tick window, but the extra cycles spent reading out the timer at the end make this less of a close call.
It's interesting to compare the results with and without DMA enabled. With DRAM refresh DMA on, execution is roughly 5.5% slower (6676 cycles versus 6328), a rather painful penalty to pay on a system that was not exactly speedy in the first place.
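The two measurements allow a quick back-of-the-envelope look at the cost of refresh. The sketch below uses only the figures quoted in this article (6328 and 6676 cycles, a Timer 1 reload of 18 ticks, 4 CPU cycles per tick). The per-refresh figure is plain arithmetic on those numbers, not a measured hardware value, and it assumes the whole difference between the two runs comes from refresh DMA.

```python
CPU_CYCLES_PER_TICK = 4
REFRESH_RELOAD_TICKS = 18  # Timer 1 reload value programmed by the test

cycles_without_dma = 6328  # Arduino8088 measurement, no DMA
cycles_with_dma = 6676     # xtserver measurement on a real XT

extra_cycles = cycles_with_dma - cycles_without_dma
slowdown = extra_cycles / cycles_without_dma

# One refresh request every 18 timer ticks, i.e. every 72 CPU cycles.
refresh_period_cycles = REFRESH_RELOAD_TICKS * CPU_CYCLES_PER_TICK
refresh_requests = cycles_with_dma / refresh_period_cycles

# Rough average cost per refresh; assumes DMA explains the whole difference.
extra_cycles_per_refresh = extra_cycles / refresh_requests

print(f"extra cycles: {extra_cycles}")
print(f"slowdown: {slowdown:.1%}")
print(f"refresh requests during the test: {refresh_requests:.1f}")
print(f"average extra cycles per refresh: {extra_cycles_per_refresh:.2f}")
```

On these numbers the average cost works out to a little under four CPU cycles per refresh request.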
Here's the assembled trace log from xtserver.
Combining my DMA state logic with my new BIU / prefetch logic, I was able to get MartyPC to execute the 8088MPH CPU test in cycle-perfect sync with xtserver.
Here's the corresponding trace log from MartyPC. The column headers have comments explaining what each column is. Of perhaps particular interest is the biu_state column - that corresponds with the BIU states I explained in my previous blog about the 8088 prefetch algorithm.
Here is the source code of the raw binary as executed by MartyPC in the trace above. Some setup is performed to ensure that certain instructions like AAA execute in the same time as on the xtserver.
If your emulator can produce cycle trace logs (I highly recommend implementing that as a feature!) comparing with the provided logs could give you the clues you need to pass the test yourself.
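As a starting point for that comparison, here is a minimal sketch that finds the first cycle at which two traces disagree. It assumes both logs have already been normalised to the same one-line-per-cycle text format, which the raw xtserver and MartyPC logs are not; that conversion step, and the function name, are hypothetical and left to you.

```python
from itertools import zip_longest


def first_divergence(reference, candidate):
    """Return (index, reference_line, candidate_line) at the first mismatch.

    Both arguments are sequences of already-normalised trace lines, one per
    cycle. Returns None if the traces are identical. If one trace is shorter,
    the missing side is reported as None.
    """
    pairs = zip_longest(reference, candidate)
    for index, (ref_line, cand_line) in enumerate(pairs):
        if ref_line != cand_line:
            return index, ref_line, cand_line
    return None


# Toy example with made-up line contents, purely to show the return shape.
hardware = ["T1", "T2", "T3", "T4"]
emulator = ["T1", "T2", "Tw", "T3"]
mismatch = first_divergence(hardware, emulator)
```

The first mismatch is usually the interesting one, since every later cycle will be shifted by whatever went wrong there.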
Source Code
In the process of working through the 8088MPH CPU test I had some correspondence with its author, Jim Leonard (aka Trixter). He was kind enough to release the original Pascal source code of the 8088MPH CPU test under the MIT license, for which I am very grateful.
So if you're curious, here's the original source code of the 8088MPH CPU test!