The 8088MPH CPU Test

Much has been written about the PC demo 8088MPH.

If you haven't seen this demo yet, take a moment to watch it.  If you're at all familiar with the IBM PC, you will immediately understand why this demo is so impressive.  If not, consider that the IBM PC was a very limited, business-oriented machine not designed for fancy graphics, and its optional CGA graphics adapter was best known for producing at most 4 garish neon colors.

8088MPH has become something of a litmus test for PC emulator accuracy in the years since its release, and the gauntlet begins before the demo even shows a single effect.

8088MPH performs a CPU speed test when it starts, and if your CPU fails to complete it in exactly the same time as an 8088 CPU at 4.77 MHz should, you are greeted with this:

8088MPH CPU Test from DOSBox-X 0.83.24, @240 cycles/ms

Whereas if you pass, you are greeted with a brief pat on the back before the demo begins automatically:

MartyPC 0.1.4 passing the test

The 8088MPH CPU test has become an unofficial yardstick for an emulator's accuracy, and emulator authors have worked diligently to reduce the reported percentage of deviation, or in the best case scenario, actually pass it.  

As it turns out, passing the test is harder than actually running the demo. 

Off To The Races

The way the test works is conceptually very simple. The PC's Programmable Interval Timer is used as a sort of stopwatch, with measurements bookending a block of code that exercises a wide variety of 8088 opcodes. The number of cycles 8088MPH reports is not, in fact, a count of CPU cycles - it is a count of timer cycles.
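The relationship between the two kinds of cycles follows from the PC's clock tree: the CPU clock and the timer input clock are both divided down from the same 14.31818 MHz crystal, by 3 and by 12 respectively. A minimal sketch of the conversion (the constant and function names are illustrative, not taken from the demo):

```python
from fractions import Fraction

# IBM PC/XT clock tree: one 14.31818 MHz crystal feeds both clocks.
CRYSTAL_HZ = Fraction(14_318_180)
CPU_HZ = CRYSTAL_HZ / 3    # ~4.77 MHz 8088 clock
PIT_HZ = CRYSTAL_HZ / 12   # ~1.19 MHz timer input clock

# Ratio of the two clocks: exactly 4 CPU cycles per timer tick.
CPU_CYCLES_PER_TICK = CPU_HZ / PIT_HZ


def ticks_to_cpu_cycles(ticks: int) -> int:
    """Convert a timer tick count to CPU cycles."""
    return int(ticks * CPU_CYCLES_PER_TICK)


def cpu_cycles_to_ticks(cycles: int) -> Fraction:
    """Convert CPU cycles to timer ticks (may be fractional)."""
    return cycles / CPU_CYCLES_PER_TICK
```

Because the ratio is an exact 4:1, a timer reading can only resolve the test's duration to within 4 CPU cycles.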

The test begins like so:

test_cpu_01     proc far

var_4           = word ptr -4
var_2           = word ptr -2
var_s0          = word ptr  0

                push    bp
                mov     bp, sp
                sub     sp, 4
                mov     al, 54h         ; Timer 1, LSB, Mode 2
                out     43h, al         ; 
                mov     al, 12h         ; Timer 1 = 18
                out     41h, al         ; 
                cli
                call    start_timer_for_cpu_test

After setting up the stack frame, the CPU test begins by resetting Timer 1 - the timer channel that triggers DMA for DRAM refresh. It uses the default value of 18. This restarts the refresh timer so that DMA occurs at predictable intervals throughout the test.  If this weren't done, the DRAM refresh timer would be at an unpredictable phase relative to the start of the test, and the test could therefore run too fast or too slow. Even with this timer reset, we see a little variance in results when running the test multiple times.
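To see why the byte 54h written to port 43h means "Timer 1, LSB, Mode 2", here is a small sketch that decodes an 8253 control word according to the standard bit layout (counter select in bits 7-6, access mode in bits 5-4, operating mode in bits 3-1, BCD flag in bit 0). The function name and the returned dictionary are illustrative only:

```python
def decode_pit_control(byte: int) -> dict:
    """Decode an 8253 PIT control word into its fields."""
    access_modes = {
        0: "latch",
        1: "LSB only",
        2: "MSB only",
        3: "LSB then MSB",
    }
    mode = (byte >> 1) & 0b111
    if mode > 5:
        mode -= 4  # bit patterns 110 and 111 alias modes 2 and 3
    return {
        "channel": (byte >> 6) & 0b11,
        "access": access_modes[(byte >> 4) & 0b11],
        "mode": mode,
        "bcd": bool(byte & 1),
    }


# The control word the test writes before reloading the refresh timer.
refresh_setup = decode_pit_control(0x54)

# With a reload value of 18 (12h), a refresh request is raised every
# 18 timer ticks, which is 72 CPU cycles at 4 CPU cycles per tick.
REFRESH_PERIOD_CPU_CYCLES = 0x12 * 4
```

Decoding 54h this way gives channel 1, LSB-only access, mode 2 (rate generator) and binary counting, which matches the comment in the listing.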

Trixter, the test's author, seems to have accounted for this - a passing result is anywhere between 1668 and 1688 ticks. This 20-tick window represents 80 CPU cycles, or at most a couple of instructions' worth.
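The pass window can be written down as a simple range check. This is only a sketch: the bounds come from the text above, but treating both ends as inclusive is an assumption, and the names are made up for illustration.

```python
PASS_MIN_TICKS = 1668
PASS_MAX_TICKS = 1688
CPU_CYCLES_PER_TICK = 4


def passes_cpu_test(ticks: int) -> bool:
    """True if a measured tick count falls inside the pass window.

    Assumption: both bounds are inclusive.
    """
    return PASS_MIN_TICKS <= ticks <= PASS_MAX_TICKS


# Width of the window expressed in CPU cycles: 20 ticks * 4 = 80 cycles.
WINDOW_CPU_CYCLES = (PASS_MAX_TICKS - PASS_MIN_TICKS) * CPU_CYCLES_PER_TICK
```

An emulator whose total is off by more than a few dozen CPU cycles over the whole run will land outside this range.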

It's worth pointing out that this means that accurately modelling DMA is critical for passing the test - the wait states inserted by the DMA controller are part of what is being timed so precisely.  The test doesn't seem to access CGA VRAM; so it's not going to depend on whether you emulated CGA wait states correctly or not. 

For more information on DMA emulation, see my previous article: Exploring DMA on the IBM PC.

The function start_timer_for_cpu_test sets Timer channel #0 to Mode #2 (Rate Generator) and programs the maximum counter value of 65536. 
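Since the timer counts down from its reload value, the elapsed time is the difference between two readings, taken modulo the reload value in case the counter wrapped in between. A minimal sketch of that stopwatch arithmetic follows; it is a generic illustration, not the demo's actual read-out routine (which also applies an adjustment not modelled here):

```python
RELOAD = 65536  # a programmed count of 0 means 65536 on the 8253


def elapsed_ticks(start_count: int, end_count: int) -> int:
    """Ticks elapsed between two reads of a down-counting timer.

    Only valid if fewer than RELOAD ticks passed between the reads,
    otherwise the counter has wrapped more than once.
    """
    return (start_count - end_count) % RELOAD
```

Using the maximum reload value gives the widest possible range before the counter wraps, comfortably more than the roughly 1700 ticks the test needs.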

The body of the test exercises a vast assortment of 8088 instruction forms. Here it is in its entirety (Happy scrolling!):

        mov     ax, 1234h
        xor     bx, bx
        mov     cx, bx
        mov     dx, 5678h
        mov     si, bx
        mov     di, bx
        add     ax, 1234h
        add     dx, 1234h
        add     al, 12h
        add     dl, 12h
        add     [12D6h], ax
        add     [12D6h], dx
        add     ax, [12D6h]
        add     dx, [12D6h]
        add     ax, dx
        add     dx, ax
        add     al, dl
        add     dl, al
        push    es
        pop     es
        or      ax, 1234h
        or      dx, 1234h
        or      al, 12h
        or      dl, 12h
        or      [12D6h], ax
        or      [12D6h], dx
        or      ax, [12D6h]
        or      dx, [12D6h]
        or      ax, dx
        or      dx, ax
        or      al, dl
        or      dl, al
        push    cs
        pop     es
        adc     ax, 1234h
        adc     dx, 1234h
        adc     al, 12h
        adc     dl, 12h
        adc     [12D6h], ax
        adc     [12D6h], dx
        adc     ax, [12D6h]
        adc     dx, [12D6h]
        adc     ax, dx
        adc     dx, ax
        adc     al, dl
        adc     dl, al
        push    ax
        mov     ax, sp
        push    ss
        pop     ss
        mov     sp, ax
        pop     ax
        sbb     ax, 1234h
        sbb     dx, 1234h
        sbb     al, 12h
        sbb     dl, 12h
        sbb     [12D6h], ax
        sbb     [12D6h], dx
        sbb     ax, [12D6h]
        sbb     dx, [12D6h]
        sbb     ax, dx
        sbb     dx, ax
        sbb     al, dl
        sbb     dl, al
        push    ds
        pop     ds
        and     ax, 1234h
        and     dx, 1234h
        and     al, 12h
        and     dl, 12h
        and     [12D6h], ax
        and     [12D6h], dx
        and     ax, [12D6h]
        and     dx, [12D6h]
        and     ax, dx
        and     dx, ax
        and     al, dl
        and     dl, al
        mov     ax, es:[bx]
        daa
        sub     ax, 1234h
        sub     dx, 1234h
        sub     al, 12h
        sub     dl, 12h
        sub     [12D6h], ax
        sub     [12D6h], dx
        sub     ax, [12D6h]
        sub     dx, [12D6h]
        sub     ax, dx
        sub     dx, ax
        sub     al, dl
        sub     dl, al
        mov     ax, cs:[bx]
        das
        xor     ax, 1234h
        xor     dx, 1234h
        xor     al, 12h
        xor     dl, 12h
        xor     [12D6h], ax
        xor     [12D6h], dx
        xor     ax, [12D6h]
        xor     dx, [12D6h]
        xor     ax, dx
        xor     dx, ax
        xor     al, dl
        xor     dl, al
        mov     ax, ss:[bx]
        aaa
        cmp     ax, 1234h
        cmp     dx, 1234h
        cmp     al, 12h
        cmp     dl, 12h
        cmp     [12D6h], ax
        cmp     [12D6h], dx
        cmp     ax, [12D6h]
        cmp     dx, [12D6h]
        cmp     ax, dx
        cmp     dx, ax
        cmp     al, dl
        cmp     dl, al
        db      3Eh
        lodsw
        aas
        inc     ax
        inc     cx
        inc     dx
        inc     bx
        inc     si
        inc     di
        dec     ax
        dec     cx
        dec     dx
        dec     bx
        dec     si
        dec     di
        push    ax
        push    cx
        push    dx
        push    bx
        push    bp
        push    si
        push    di
        pop     di
        pop     si
        pop     bp
        pop     bx
        pop     dx
        pop     cx
        pop     ax
        xor     cx, cx
        dec     cx
        stc
        jb      short _test_jump_01
        nop

_test_jump_01:
        clc
        jb      short _test_jump_01
        inc     cx
        jcxz    _test_jump_01
        sub     cx, 2
        jmp     short _test_jump_03

_test_jump_02:
        inc     cx
        clc

_test_jump_03:
        jbe     short _test_jump_02
        mov     cx, 2

_test_loop01:
        nop
        loop    _test_loop01
        test    ax, 1234h
        test    dx, 1234h
        test    al, 12h
        test    dl, 12h
        test    [12D6h], ax
        test    [12D6h], dx
        test    [12D6h], ax
        test    [12D6h], dx
        test    ax, dx
        test    dx, ax
        test    al, dl
        test    dl, al
        lea     ax, [12D6h]
        mov     es, word [bx+si+1234h]
        nop
        xchg    ax, [12D6h]
        xchg    dx, [12D6h]
        xchg    ax, [12D6h]
        xchg    dx, [12D6h]
        xchg    ax, dx
        xchg    ax, dx
        xchg    dl, al
        xchg    al, dl
        cbw
        push    ds
        pop     es
        mov     di, si
        movsb
        movsw
        movsb
        movsw
        lodsb
        stosb
        lodsw
        stosw
        lodsb
        stosb
        lodsw
        stosw
        cmpsb
        cmpsw
        cmpsb
        cmpsw
        scasb
        scasw
        scasb
        scasw
        mov     al, 12h
        mov     cl, 12h
        mov     dl, 12h
        mov     bl, 12h
        mov     ah, 12h
        mov     ch, 12h
        mov     dh, 12h
        mov     bh, 12h
        mov     ax, 1234h
        mov     cx, 1234h
        mov     dx, 1234h
        mov     bx, 1234h
        mov     si, 1234h
        mov     di, 1234h
        les     bx, [1234h]
        mov     bx, 0FFFFh
        rol     bl, 1
        rol     byte [12DCh], 1
        ror     bl, 1
        ror     byte [12DCh], 1
        rcl     bl, 1
        rcl     byte [12DCh], 1
        rcr     bl, 1
        rcr     byte [12DCh], 1
        shl     bl, 1
        shl     byte [12DCh], 1
        shr     bl, 1
        shr     byte [12DCh], 1
        shl     bl, 1
        shl     byte [12DCh], 1
        sar     bl, 1
        sar     byte [12DCh], 1
        rol     bx, 1
        rol     word [12D6h], 1
        ror     bx, 1
        ror     word [12D6h], 1
        rcl     bx, 1
        rcl     word [12D6h], 1
        rcr     bx, 1
        rcr     word [12D6h], 1
        shl     bx, 1
        shl     word [12D6h], 1
        shr     bx, 1
        shr     word [12D6h], 1
        shl     bx, 1
        shl     word [12D6h], 1
        sar     bx, 1
        sar     word [12D6h], 1
        mov     cl, 4
        rol     bl, cl
        rol     byte [12DCh], cl
        ror     bl, cl
        ror     byte [12DCh], cl
        rcl     bl, cl
        rcl     byte [12DCh], cl
        rcr     bl, cl
        rcr     byte [12DCh], cl
        shl     bl, cl
        shl     byte [12DCh], cl
        shr     bl, cl
        shr     byte [12DCh], cl
        shl     bl, cl
        shl     byte [12DCh], cl
        sar     bl, cl
        sar     byte [12DCh], cl
        rol     bx, cl
        rol     word [12D6h], cl
        ror     bx, cl
        ror     word [12D6h], cl
        rcl     bx, cl
        rcl     word [12D6h], cl
        rcr     bx, cl
        rcr     word [12D6h], cl
        shl     bx, cl
        shl     word [12D6h], cl
        shr     bx, cl
        shr     word [12D6h], cl
        shl     bx, cl
        shl     word [12D6h], cl
        sar     bx, cl
        sar     word [12D6h], cl
        aad
        nop
        nop
        nop
        nop
        nop
        nop
        aam
        nop
        nop
        nop
        nop
        nop
        nop
        xlatb
        mov     ax, 1234h
        mov     dx, 5678h
        cmc
        not     dl
        not     ax
        neg     dl
        neg     ax
        mov     dx, 20BDh
        mul     dx
        mov     bx, 2710h
        div     bx
        nop
        nop
        nop
        nop
        nop
        nop
        imul    dx
        nop
        nop
        nop
        nop
        nop
        nop
        idiv    bx
        clc
        stc
        pushf
        cld
        std
        popf
        mov     ax, 1234h
        mov     dx, 1234h
        mov     al, 12h
        mov     dl, 12h
        mov     [12D6h], ax
        mov     [12D6h], dx
        mov     ax, [12D6h]
        mov     dx, [12D6h]
        mov     ax, dx
        mov     dx, ax
        mov     al, dl
        mov     dl, al
        mov     dx, cs:[bx]
        mov     dx, [bp+0]
        mov     dx, es:[si]
        mov     dx, [di]
        lea     bx, [0Ah]
        push    word [bx]
        pop     word [bx]

The instruction gauntlet starts at CS:07A5 with mov ax, 1234h and ends when the last instruction pop word [bx] is complete. Of course, there's more code here in the real test; there are routines to set up and read out the timer before and after, but these are the instructions being measured. 

The test hits some opcodes that even the demo effects in 8088MPH are unlikely to use. MUL and DIV are glacially slow on the 8088, and their microcoded implementations have variable cycle-timings based on the values of their operands.

It's worth noting the futility of trying to use published cycle timings to pass this, or to make a cycle-accurate 8088 emulator in general - all published cycle times for 8088 instructions are "best case", assuming no bus delays and a full instruction queue, a state that is unlikely to persist for long in practice. So full and accurate emulation of the CPU prefetch algorithm, the instruction queue, and BIU delays is necessary.
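A rough way to see the gap: the 8088 fetches one byte per bus cycle, and a bus cycle takes at least 4 clocks, so straight-line code cannot sustain a rate faster than 4 clocks per instruction byte once the queue has drained. The sketch below compares that fetch floor with the sum of published execution clocks for three register-only instructions from the test. The clock counts are the commonly published figures as best I recall them, and the model deliberately ignores queue contents, memory operands, DMA and every other BIU subtlety, so treat it as an illustration rather than a timing model.

```python
BUS_CYCLE_CLOCKS = 4  # minimum 8088 bus cycle; the bus moves one byte per cycle

# (mnemonic, encoded length in bytes, published execution clocks)
# Assumption: clock counts are the commonly published best-case figures.
SAMPLE = [
    ("mov ax, 1234h", 3, 4),
    ("add ax, dx", 2, 3),
    ("nop", 1, 3),
]


def published_total(instrs) -> int:
    """Sum of the published best-case execution clocks."""
    return sum(clocks for _, _, clocks in instrs)


def fetch_floor(instrs) -> int:
    """Clocks needed just to fetch every instruction byte over the bus."""
    return sum(length for _, length, _ in instrs) * BUS_CYCLE_CLOCKS


table_clocks = published_total(SAMPLE)  # what the tables promise
bus_clocks = fetch_floor(SAMPLE)        # what the bus needs at minimum
```

For this sample the tables add up to 10 clocks while fetching the 6 instruction bytes alone needs 24, which is why a table-driven emulator runs this kind of code far too fast.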

You can find more information on the 8088 prefetch algorithm in my previous blog on the topic.

Measuring Up

I mentioned before that a successful test passes between 1668 and 1688 timer ticks.  The CPU is clocked 4 times faster than the timer is, so that works out to between 6672 and 6752 CPU cycles. Some of that isn't spent in the main body of the CPU test; cycles are spent in the routines that read the timer on both ends, and at the end an adjustment is made to the measured timer value.

Using an Arduino interface to an 8088, we can time execution from the first 'mov' to the end of the last 'pop', just before the call to cpu_test_read_timer. It executes in exactly 6328 CPU cycles, or 1582 timer ticks. That's a lot less than 8088MPH expects, but this run is without DMA, which means that our instructions execute faster than they otherwise would.


My Arduino8088 is not the only game in town, however, and we can get a cycle trace of the CPU test using reenigne's bus-sniffer-enabled xtserver. This produces cycle traces from a real IBM XT system, so the DMA timings and resulting wait states are exact. xtserver can only capture a limited number of cycles due to memory limitations of the microcontroller in use, so the test was split up into four sections at different cycle offsets and reassembled.

The instruction sequence was prefixed with a reprogramming of Timer Channel #1 to reset DRAM refresh DMA into a predictable state.

Measuring from mov ax, 1234h to pop word [bx] gives us 6676 cycles, or 1669 timer ticks - just barely within our 1668-1688 tick window, but the extra cycles spent reading out the timer at the end make this less of a close call.

It's interesting to compare the results with and without DMA enabled.  With DRAM refresh DMA on, execution is about 5.5% slower (6676 cycles versus 6328), a rather painful penalty to pay on a system that was not exactly speedy in the first place.
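The arithmetic behind that comparison can be worked through from the two measurements. The sketch below assumes the whole difference between the two runs is due to DRAM refresh, which is a simplification since they were captured on different hardware; the variable names are illustrative.

```python
CYCLES_WITHOUT_DMA = 6328  # Arduino8088 measurement, no refresh DMA
CYCLES_WITH_DMA = 6676     # xtserver measurement on a real XT

CPU_CYCLES_PER_TICK = 4
REFRESH_RELOAD_TICKS = 18  # Timer 1 reload value programmed by the test

# Extra cycles in the run with refresh DMA active.
extra_cycles = CYCLES_WITH_DMA - CYCLES_WITHOUT_DMA

# Relative slowdown, as a percentage of the DMA-free run.
slowdown_percent = extra_cycles / CYCLES_WITHOUT_DMA * 100

# One refresh request every 18 ticks = 72 CPU cycles.
refresh_period_cycles = REFRESH_RELOAD_TICKS * CPU_CYCLES_PER_TICK

# Whole refresh periods that fit in the measured run.
refresh_count = CYCLES_WITH_DMA // refresh_period_cycles

# Average cost of one refresh, under the assumption stated above.
avg_cycles_per_refresh = extra_cycles / refresh_count
```

Under these assumptions each refresh costs the CPU a little under 4 cycles on average, which adds up over the roughly 92 refreshes that occur during the test.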

Here's the assembled trace log from xtserver.

Combining my DMA state logic with my new BIU / prefetch logic, I was able to get MartyPC to execute the 8088MPH CPU test in cycle-perfect sync with xtserver. 

Here's the corresponding trace log from MartyPC. The column headers have comments explaining what each column is. Of perhaps particular interest is the biu_state column - that corresponds with the BIU states I explained in my previous blog about the 8088 prefetch algorithm.

Here is the source code of the raw binary as executed by MartyPC in the trace above. Some setup is performed to ensure that certain instructions like AAA execute in the same time as on the xtserver.

If your emulator can produce cycle trace logs (I highly recommend implementing that as a feature!) comparing with the provided logs could give you the clues you need to pass the test yourself.
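As a starting point for that kind of comparison, here is a minimal sketch that reports the first cycle at which two traces disagree. It assumes both logs have already been normalized to one comparable line per cycle; the real xtserver and MartyPC logs use their own column layouts, so the sample lines below are purely hypothetical.

```python
from itertools import zip_longest


def first_divergence(reference, candidate):
    """Return (index, ref_line, cand_line) for the first differing cycle.

    Returns None if the traces are identical. A trace that ends early
    shows up as a divergence with None on the shorter side.
    """
    pairs = zip_longest(reference, candidate)
    for index, (ref_line, cand_line) in enumerate(pairs):
        if ref_line != cand_line:
            return index, ref_line, cand_line
    return None


# Hypothetical normalized traces: "<bus state> <address>" per cycle.
hardware = ["T1 007A5", "T2 007A5", "T3 007A5", "T4 007A5", "T1 007A6"]
emulator = ["T1 007A5", "T2 007A5", "T3 007A5", "Tw 007A5", "T4 007A5"]

mismatch = first_divergence(hardware, emulator)
```

The first mismatch is usually the most informative one, since every later cycle tends to be shifted as a consequence of it.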

Source Code

In the process of working through the 8088MPH CPU test I had some correspondence with its author, Jim Leonard (aka Trixter).  He was kind enough to release the original Pascal source code of the 8088MPH CPU test under MIT license, for which I am very grateful.







