The 8088 Prefetch Algorithm
⚠
This post is now outdated. Although the logic described here will correctly model the externally observed behavior of the 8088, it has been replaced by a far simpler model of the 8088 bus logic described here. I am leaving this post up for reference, but the logic described here should not be used to develop an 8088 emulator.
Preface
Much of the terminology used in this post is my own. It may not match other resources or official references. The algorithm described here may not be identical to the actual prefetch algorithm of the 8088 in all scenarios; it's difficult to test every corner case. This information is provided for the benefit of emulator authors and curious geeks, but comes with no guarantees.
Before we get started, I would recommend taking a read through Ken Shirriff's blog post on the 8086 prefetch circuitry. I was already deep into research on the prefetch algorithm of the 8088 when this blog post appeared. He came to many of the same conclusions, and our posts cover much of the same ground, although the prefetching system of the 8086 differs in significant ways from the 8088.
The algorithm described here cannot effectively implemented unless you implement instruction timings from the 8088 microcode. That means your emulator must issue bus requests and suspend and resume prefetching on the same cycle that a real 8088 would.
For more information and a disassembly of the 8088 microcode, see reenigne's blog post.
Terminology
EU - Execution unit. The part of the CPU responsible for executing instruction microcode.
BIU - Bus Interface Unit. The part of the CPU responsible for reading and writing to the bus, as well as prefetching bytes from the bus into the processor instruction queue.
Instruction Queue - A four-byte FIFO buffer that holds bytes fetched by the BIU.
PC - Program Counter. An index into the current code segment that points to the next instruction to be fetched into the queue.
M-Cycle - A sequence of four or more cycles that make up a single bus transfer operation, whether it is a memory read, memory write, IO read or IO write.
T-Cycle - One of several individual cycle states the CPU can be in during an m-cycle (T1, T2, T3, Tw, or T4) or when the bus is idle (Ti)
Fetch - a bus transfer performed by the BIU to read a byte into the instruction queue from memory.
This post will refer to the process of reading a byte from memory into the queue as either 'prefetching' or 'fetching' interchangeably. A prefetch is simply any fetch that occurs when the instruction queue is not empty - i.e., we are fetching ahead of the currently executing instruction.
Prefetching on the 8088
The 8088 has until recently been a difficult CPU to emulate in a cycle accurate way. The Execution Unit (EU) and the Bus Interface Unit (BIU) operate asynchronously. The EU can be working on executing an instruction's microcode while the BIU is off doing its own thing, usually fetching bytes into the instruction queue. The algorithm that controls when and how this is all coordinated has never been officially documented to my knowledge, and the resulting behavior can seem nonsensical until the rules that govern it are understood. Prior to the disassembly of the 8088 microcode by reenigne, much of the initiating bus timing was a mystery.
The instruction queue on the 8088 is a four-byte buffer which the BIU fills one byte a time by reading ahead from the currently executing instruction, taking advantage of the time spent when the EU is not utilizing the bus. The PC register on the 8088 points to the next instruction to be fetched - often pointing past the address of the currently executing instruction. In fact, the IP 'register' you may be familiar with is not a real register at all - the value of IP is produced as needed from PC by subtracting the length of the instruction queue from PC. This is done with the CORRECT microcode routine.
Long instructions such as DIV and MUL will completely fill the queue during execution, since the BIU can fetch bytes into the queue while the lengthy calculation occurs on the EU. In contrast, repeated short instructions like NOP will drain the queue as they operate faster than bytes can be fetched to replace them.
The EU cannot execute instructions directly from memory. The EU must read bytes from the instruction queue, sequentially, one byte at a time. In general, it takes one cycle to read a byte from the queue. If there are no bytes in the queue, then in order to execute the next instruction a byte must first be fetched from memory by the BIU, placed into the queue, and then read out of the queue by the EU.
Since everything the EU does is via the queue, bytes in the queue shadow bytes in memory. If an instruction byte in memory is modified after being fetched into the queue, the byte in the queue will be executed and the modified byte ignored. This has ramifications for self-modifying code.
The queue is completely emptied or "flushed" when an control flow instruction such as JMP or CALL is executed (and the FLUSH microcode routine executed), as the queue would otherwise contain instructions past the jump that are no longer valid.
The 8086 instruction set includes instructions that can directly modify the value of the CS register. Modifying CS causes a change in program control flow, however modifying CS does not flush the instruction queue. Therefore after CS is changed, any instruction bytes present in the queue will execute before any bytes fetched at the new CS:PC. This will likely result in unpredictable and unwanted behavior, making intentional use of these instructions difficult. The opcode for the notorious instruction POP CS was redefined in future Intel processors as an instruction extension opcode, since it had little utility otherwise.
Before we dive in, a few words about the various status lines of the CPU. They will help explain the CPU cycle traces to follow.
8088 Status Lines
Queue Status Lines in Maximum Mode
The 8088 contains a very convenient feature for investigation of CPU instruction timings. It maintains two status lines directly off the CPU called QS0 and QS1. Together, these status lines allow us to peek inside the operation of the instruction queue on the previous cycle. That one cycle delay is important to keep in mind.
The rationale for exposing this internal CPU status was to integrate with the 8087 FPU, which had to snoop on the bus and maintain an identical mirror of the CPU's instruction queue in order to know when an FPU instruction was to be executed.
Two status lines gives us 4 possible states. We can see when the first byte of an instruction has been read, when a subsequent byte has been read (modrm or displacement), and when the queue was emptied (flushed). The distinction between the first byte of an instruction read and a subsequent byte of an instruction read is extremely useful - we can determine the time spent executing a specific instruction by measuring the cycles between two "First Byte" statuses, with some caveats.
A segment override prefix is treated by the queue status lines as its own instruction, so an instruction with prefixes will have more than one "First Byte". Additionally, measuring this way includes the cycles after an instruction has ended (RNI) spent fetching the next instruction.
Bus Transfer Types
The CPU also maintains status lines that inform us of the type of bus transfer in effect.
Of particular note is the "Code Access" status. This indicates that the BIU is executing an instruction fetch. Note that these status lines are only valid on T1 and T2 of an m-cycle, with the bus reading Passive on T3 and T4.
Basic Fetch Logic
The Intel white paper for the 8088 has this to say about instruction fetching:
The 8088 BIU will fetch a new instruction to load into the queue each time there is a 1 byte hole (space available) in the queue.
As we will see, this is a bit of an oversimplification. There are a set of rules that the BIU follows in order to actually perform a fetch.
Fetch Decisions
The BIU makes a decision to attempt a fetch when the bus is available at least two cycles before the fetch will begin.
A fetch decision can occur in these three scenarios:
- After the FLUSH microcode routine
- During T3 of the last byte of any bus transfer, including another code fetch
- After a byte is read out from a full instruction queue, when a full queue had previously suspended prefetching
Prefetching can be suppressed in the following scenarios:
- The SUSPEND microcode routine has suspended fetching
- The instruction queue is full
- A bus request from the EU has arrived before T3 of the last byte of any bus transfer
When prefetching is suppressed, fetch decisions still happen - they just will decide not to fetch. We'll see the result of that later on.
One might anticipate that if no other conditions apply, the bus is idle and the CPU is executing Ti states, then a fetch could be scheduled - but in general, the rules above cover all scenarios such that a 'default' rule is not needed.
Once the BIU has made a decision to perform a fetch, the current bus m-cycle has ended, and any transition between BIU states is complete, one of two possible things happen:
- If there is room in the queue, the BIU begins a bus transfer of the Code status type indicating a code fetch, and proceeds to execute an m-cycle to read one byte from memory at PC, making the byte available in the instruction queue on T4.
- If a bus request from the EU occurred on T3, Tw, or T4 of the previous m-cycle (i.e., after the fetch decision was made), then the BIU performs a prefetch abort, consisting of two TI or idle cycles, after which the bus operation requested by the EU will begin.
Prefetch Aborts
Prefetch aborts are the most common type of bus delay encountered when operating the 8088, occurring in a wide variety of instructions - potentially any instruction that performs a bus operation. The abort mechanism cuts the maximum delay to an EU requested bus operation of 4 CPU cycles (an entire Code m-cycle) in half. But why do we suffer this penalty at all?
A Simple Model
In the model I present, the BIU is a simple state machine with three states:
BIU State Machine |
The BIU can be in the idle (I) state, where it is not servicing any request. Or, it can be in one of two 'bus request' states: either servicing the prefetcher or the EU.
In this model, changing states from PF (service prefetcher) to EU (service EU), or vice-versa, takes 2 CPU cycles. Changing states from Idle to either bus request state takes 3 CPU cycles. A change in state cannot be interrupted when it has started.
The time it takes to switch from either of the request states to the Idle state doesn't seem critical to instruction timing; I model it as a single cycle.
From this point on, I will refer to the three BIU states as B_pf, B_eu, and B_idle.
The prefetch decision performed on T3 of the last byte of a bus m-cycle is a decision to change states as follows:
- If no bus request from the EU is received on T1 or T2, and there is room in the instruction queue, then the BIU makes a decision to shift to the B_pf state on T3. This shift is complete after T4 so that a new bus m-cycle can immediately begin on T1, or potentially after some fetch delay (more on that later).
- If an EU bus request was received on T1 or T2 (or during the previous byte of an atomic word transfer), then a pending bus request flag is set. On T3 we then shift to the B_eu state to be ready to service the EU bus request. The shift is complete after T4 so that a new bus m-cycle can immediately begin with a new T1 state.
- If an EU bus request is received on T3, Tw, or T4, then it comes a bit too late to prevent the shift to the B_pf state. The shift started on T3, and can't be interrupted, however, an abort flag is set that will trigger a prefetch abort at the end of the m-cycle.
- Note that adding any number of wait states will not affect a resulting prefetch abort. It might be reasonable to think that if an EU request comes in on T3, and there are four wait states, there would be enough time to shift to the B_pf state and then shift back to the B_eu state; allowing the prefetch abort cycles to overlap with the wait states, saving time. But this optimization unfortunately does not occur. The only time a state shift can occur during an m-cycle is on T3.
- After the end of the current m-cycle, if the abort flag is set, then two idle Ti states occur - time in which the BIU is shifting back to the B_eu state to service the bus request. In other words, prefetch abort cycles are simply the time it takes for the BIU to change states.
- Once the m-cycle is completed and the BIU has changed to the B_pf state, it is too late for a bus request to abort the code fetch, however the pending bus request flag is set for the new m-cycle.
That's a lot of words, so here are some diagrams showing the logic resulting from a EU bus request received on various cycles of a Code fetch m-cycle. These diagrams assume that there is room in the instruction queue for a fetch without incurring delays.
Fetch decision timings |
Since fetch decisions can occur during fetches, the BIU will execute fetches until the queue is full if not interrupted by bus requests from the EU.
Note that although the 8088 divides 16-bit word operations into two 8-bit bus transfers, the CPU is still aware of when a word operation is in progress. A fetch will not be scheduled during the first byte/m-cycle of a word operation. This ensures that both 8-bit transfers of a word-sized instruction operation happen sequentially without interruption.
Fetch Delays
During the prefetch decision on T3, a special condition may apply. If we are actively executing a Code fetch m-cycle, the CPU decides to prefetch, and the length of the instruction queue was 3 at the start of T3*, then a 3-cycle fetch delay comes into effect. We still spend two cycles to transition into the B_pf state, but instead of immediately beginning a Code fetch m-cycle, we execute 3 Ti delay states. It's important to note that this fetch delay is not an additional delay in changing BIU state - state changes cannot be interrupted, whereas a fetch delay can be.
*This is an important detail - a byte may be read out of the queue on T3 by the EU bringing the queue size down to two, but the logic that determines whether to incur a fetch delay will have already occurred.
If an EU bus request comes in on T3, Tw or T4 of the m-cycle in which the fetch delay was calculated, then the fetch delay can be aborted in the same manner as a normal prefetch abort. Instead of executing the delay cycles, we will execute two cycles to transition to the B_eu state.
There are a few possible explanations for this delay. Consider the case where the BIU is performing prefetches back to back. During T3 of the fourth Code fetch, the fourth byte has not yet been made available into the instruction queue. The BIU will see that the queue length is 3, and we will decide to perform another fetch. Then, one cycle later, the currently fetched byte is made available in the instruction queue, and the instruction queue is now full. Now we have a scenario where a prefetch is scheduled to begin when the queue is already full. The delay mechanism gives a small window where a byte could be read out of the queue by the EU by the currently executing instruction, allowing prefetching to continue smoothly.
Many times this doesn't happen, however, and the queue is still full at the expiry of the 3 fetch delay cycles. In this event, the BIU transitions to the B_idle state.
The BIU Idle State
The BIU is not usually in the idle state during normal operation. This usually only occurs in the scenarios where either the instruction queue is full and prefetching cannot occur, or prefetching has been deliberately suspended.
In the former case, the BIU will not transition from B_idle to B_pf until a byte is read from the queue, making room to allow another fetch to occur.
In the latter case, the BIU will not transition from B_idle to B_pf until the FLUSH microcode operation is executed.
In either case, transitioning from B_idle to B_pf takes 3 cycles.
Consider what happens when the EU makes a bus request while the BIU state is B_idle.
- The BIU must first transition from B_idle to B_eu, at the cost of 3 cycles.
- The requested bus operation will occur. On T3 of the last byte of the bus operation, a prefetch decision will be made. Since in this scenario prefetching cannot occur, the BIU has no choice but to transition back to the B_idle state.
This means that when prefetching is disabled all bus requests from the EU are delayed by 3 cycles due to repeated state transitions. This is commonly seen in string operations with the REP prefix such as REP MOVSB.
Prefetch Suspension
Program flow instructions such as jumps and calls will perform two special queue operations, SUSPEND and FLUSH. These are special operators executed by the instruction's microcode.
SUSPEND will, as the name suggests, suspend prefetching. It also transitions the BIU to the B_idle state. When SUSPEND is executed, any currently executing m-cycle will be allowed to complete before SUSPEND returns. Correspondingly, this lengthens the duration of the SUSPEND operation by the remaining cycles of the current m-cycle, making it one of the few microcode operations that can take more than one CPU cycle (CORRECT and jumps being other examples).
FLUSH clears the instruction queue of all bytes, resumes prefetching, and begins transitioning the BIU to the B_pf state. A Code fetch m-cycle will occur 3 cycles after FLUSH is executed.
There are certain instructions that will issue bus requests from the EU after SUSPEND and before FLUSH, while the BIU is in the B_idle state. In this scenario, the BIU transitions to the B_eu state, taking 3 cycles. Then, on the last byte transfer of the bus operation during T3, the CPU will transition back to the B_idle state since prefetching is disabled.
In this way, prefetch suspension from the SUSPEND command looks and behaves identically to the scenario where the BIU has transitioned to the B_idle state due to the instruction queue being full.
Bootstrapping Prefetch
It's worth considering how the first instruction fetch is scheduled. This happens during the CPU reset procedure. The CPU reset procedure is microcoded just like the average CPU instruction, but of course it has no dedicated opcode. The RESET microcode routine is executed in response to the CPU's reset line going high.
In the IBM PC, the reset line to the CPU is controlled by the Intel 8284A chip, which should in theory issue a reset to the CPU shortly after the system is turned on and power is verified good from the PSU.
The CPU begins a code fetch immediately after the RESET microcode routine is complete.
Under this model, I assume that the BIU begins in the B_idle state; but it's mostly a moot point since the RESET procedure immediately issues SUSPEND on its first microcode cycle, so we transition to B_idle in any case.
Two cycles later, RESET issues the FLUSH operation, beginning the transition to the B_pf state over the next 3 cycles, and then finally begins a Code fetch m-cycle.
Here's what it looks like:
00 [00000] M:... I:... PASV T1 [ ] | I | | 01 [00000] M:... I:... PASV T1 [ ] | I | 1E4: ZERO -> RD | none SUSP 02 [00000] M:... I:... PASV T1 [ ] | I | 1E5: ONES -> RC | 03 [00000] M:... I:... PASV T1 [ ] | >PF | 1E6: ZERO -> PC | FLUSH none 04 [00000] M:... I:... PASV T1 [ ] | >PF | 1E7: ZERO -> F | 05 [00000] M:... I:... PASV T1 [ ] | >PF | 1E8: ZERO -> RA | 06 A:[FFFF0] CS M:... I:... CODE T1 [ ] | PF | 1E9: ZERO -> RS | none RNI
We can see the flush on cycle #3, the three transition states to B_PF, and finally on cycle #6, the beginning of the first code fetch, after which the first instruction (at the CPU reset vector) will begin.
Hardware Examples
I don't expect you to just take my word for all of this; after all, this model is derived from observation, not analysis of silicon. Lets examine some cycle traces from hardware to see how well the model explains various delays seen from the real thing.
These traces were generated by using an Arduino MEGA to control an 8088 CPU. See my previous blog entry, Hardware Validating an Emulator for more information on how this works.
Here is a basic legend explaining the various fields in these cycle traces:
Prefetch Abort Example
05 [F0101] CS M:R.. I:... CODE T2 F[ ] q-> E6 | PF | out 41h, al 06 [F0101] CS M:R.. I:... PASV T3 r-> 41 [ ] | PF | Prefetch decision 07 [F0101] CS M:... I:... PASV T4 [ ] | PF | 08 A:[F0102] M:... I:... CODE T1 [41 ] | PF | 09 [F0102] CS M:R.. I:... CODE T2 S[ ] q-> 41 | PF | 10 [F0102] CS M:R.. I:... PASV T3 [ ] | PF | Prefetch decision 11 [F0102] CS M:... I:... PASV T4 [ ] | PF | Bus request from EU 12 [F0102] M:... I:... PASV Ti [90 ] | >EU | Abort cycle 13 [F0102] M:... I:... PASV Ti [90 ] | >EU | Abort cycle 14 A:[00041] M:... I:... IOW T1 [90 ] | EU | 15 [00041] CS M:... I:.A. IOW T2 [90 ] | EU | 16 [00041] CS M:... I:.AW PASV T3 <-w 24 [90 ] | >PF | Prefetch decision
Here we have an OUT instruction that encounters a prefetch abort. The OUT instruction issues a bus request to the BIU on cycle #11. This is on T4 of a Code fetch m-cycle, which was too late to prevent a code fetch decision, but early enough that we can abort the code fetch. This abort occurs on cycles #12 and #13 as the BIU transitions back to the B_eu state to allow the requested bus operation to begin on cycle #14.
To prove that the prefetch decision is made on T3, let's simulate 3 wait state memory and execute the same operation:
08 [F0101] CS M:R.. I:... CODE T2 r-> 41 F[ ] q-> E6 | PF | out 41h, al 09 [F0101] CS M:R.. I:... PASV T3 r-> 41 [ ] | PF | Prefetch decision 10 [F0101] CS M:... I:... PASV Tw [ ] | PF | 11 [F0101] CS M:... I:... PASV Tw [ ] | PF | 12 [F0101] CS M:... I:... PASV Tw [ ] | PF | 13 [F0101] CS M:... I:... PASV T4 [ ] | PF | 14 A:[F0102] M:... I:... CODE T1 [41 ] | PF | 15 [F0102] CS M:R.. I:... CODE T2 S[ ] q-> 41 | PF | 16 [F0102] CS M:R.. I:... PASV T3 r-> 90 [ ] | PF | Prefetch decision 17 [F0102] CS M:... I:... PASV Tw [ ] | PF | Bus request from EU 18 [F0102] CS M:... I:... PASV Tw [ ] | PF | 19 [F0102] CS M:... I:... PASV Tw [ ] | PF | 20 [F0102] CS M:... I:... PASV T4 [ ] | PF | 21 [F0102] M:... I:... PASV Ti [90 ] | >EU | Abort cycle 22 [F0102] M:... I:... PASV Ti [90 ] | >EU | Abort cycle 23 A:[00041] M:... I:... IOW T1 [90 ] | EU | Begin bus transfer 24 [00041] CS M:... I:.A. IOW T2 [90 ] | EU | 25 [00041] CS M:... I:.AW PASV T3 <-w 24 [90 ] | >PF | Prefetch decision
With the additional wait states, the OUT instruction now issues a bus request on cycle #17. This is now during a Tw state. Note that we have 4 cycles to go before the end of the m-cycle, but we still have to spend two cycles, #21 and #22, performing the prefetch abort transition back to the B_eu state. The implication is that the prefetch decision must occur on T3, and T3 is the only period during an m-cycle in which a BIU state change can be initiated.
SUSPEND Delay Example
A few instructions perform EU bus requests after SUSPEND, notably CALL FAR. Let's take a look.
05 [F0101] CS M:R.. I:... CODE T2 F[ ] q-> FF | PF | callf word ptr ds:[bx] 06 [F0101] CS M:R.. I:... PASV T3 r-> 1F [ ] | PF | 07 [F0101] CS M:... I:... PASV T4 [ ] | PF | 08 A:[F0102] M:... I:... CODE T1 [1F ] | PF | 09 [F0102] CS M:R.. I:... CODE T2 S[ ] q-> 1F | PF | 10 [F0102] CS M:R.. I:... PASV T3 r-> 90 [ ] | PF | 037: HL -> tmpa | UNC EAOFFSET 11 [F0102] CS M:... I:... PASV T4 [ ] | PF | JMP: | 12 A:[F0103] M:... I:... CODE T1 [90 ] | PF | 1E0: SIGMA-> tmpa | UNC EAFINISH 13 [F0103] CS M:R.. I:... CODE T2 [90 ] | PF | 1E1: tmpa -> IND 6 | R DD,P0 14 [F0103] CS M:R.. I:... PASV T3 r-> 90 [90 ] | >EU | 15 [F0103] CS M:... I:... PASV T4 [90 ] | >EU | 16 A:[1F591] M:... I:... MEMR T1 [9090 ] | EU | 17 [1F591] DS M:R.. I:... MEMR T2 [9090 ] | EU | 18 [1F591] DS M:R.. I:... PASV T3 r-> 00 [9090 ] | EU | 19 [1F591] DS M:... I:... PASV T4 [9090 ] | EU | 20 A:[1F592] M:... I:... MEMR T1 [9090 ] | EU | 21 [1F592] DS M:R.. I:... MEMR T2 [9090 ] | EU | 22 [1F592] DS M:R.. I:... PASV T3 r-> 00 [9090 ] | >PF | 23 [1F592] DS M:... I:... PASV T4 [9090 ] | >PF | 1E2: OPR -> tmpb | none RTN 24 A:[F0104] M:... I:... CODE T1 [9090 ] | PF | Return cycle 25 [F0104] CS M:R.. I:... CODE T2 [9090 ] | PF | Group delay cycle 26 [F0104] CS M:R.. I:... PASV T3 r-> 90 [9090 ] | PF | 068: IND -> tmpc | INC2 tmpc 27 [F0104] CS M:... I:... PASV T4 [9090 ] | PF | 069: SIGMA-> IND 6 | R DD,P0 28 [F0104] M:... I:... PASV Ti [909090 ] | >EU | Prefetch Abort (PF -> EU) 29 [F0104] M:... I:... PASV Ti [909090 ] | >EU | Prefetch Abort (PF -> EU) 30 A:[1F593] M:... I:... MEMR T1 [909090 ] | EU | 31 [1F593] DS M:R.. I:... MEMR T2 [909090 ] | EU | 32 [1F593] DS M:R.. I:... PASV T3 r-> 00 [909090 ] | EU | 33 [1F593] DS M:... I:... PASV T4 [909090 ] | EU | 34 A:[1F594] M:... I:... MEMR T1 [909090 ] | EU | 35 [1F594] DS M:R.. I:... MEMR T2 [909090 ] | EU | 36 [1F594] DS M:R.. I:... PASV T3 r-> 00 [909090 ] | >PF | 37 [1F594] DS M:... I:... PASV T4 [909090 ] | >PF | 06A: OPR -> tmpa | DEC2 tmpc 38 A:[F0105] M:... I:... CODE T1 [909090 ] | >I | 06B: SP -> tmpc | none SUSP 39 [F0105] CS M:R.. I:... CODE T2 [909090 ] | I | Suspend bus wait 40 [F0105] CS M:R.. I:... PASV T3 r-> 90 [909090 ] | I | Suspend bus wait 41 [F0105] CS M:... I:... PASV T4 [909090 ] | I | Suspend bus wait 42 [F0105] M:... I:... PASV Ti [90909090] | I | 06C: SIGMA-> IND | none CORR 43 [F0105] M:... I:... PASV Ti [90909090] | I | CORR cycle 2 44 [F0105] M:... I:... PASV Ti [90909090] | >EU | 06D: RC -> OPR 6 | w DS,M2 45 [F0105] M:... I:... PASV Ti [90909090] | >EU | (IDLE -> EU) 46 [F0105] M:... I:... PASV Ti [90909090] | >EU | (IDLE -> EU) 47 A:[2D80C] M:... I:... MEMW T1 [90909090] | EU | 48 [2D80C] SS M:.A. I:... MEMW T2 [90909090] | EU | 49 [2D80C] SS M:.AW I:... PASV T3 <-w 00 [90909090] | EU | 50 [2D80C] SS M:... I:... PASV T4 [90909090] | EU | 51 A:[2D80D] M:... I:... MEMW T1 [90909090] | EU | 52 [2D80D] SS M:.A. I:... MEMW T2 [90909090] | EU | 53 [2D80D] SS M:.AW I:... PASV T3 <-w F0 [90909090] | >I | 54 [2D80D] SS M:... I:... PASV T4 [90909090] | I | 06E: tmpa -> RC | PASS tmpc 55 [2D80D] M:... I:... PASV Ti [90909090] | I | 06F: PC -> OPR | UNC NEARCALL 56 [2D80D] M:... I:... PASV Ti [90909090] | I | Jump cycle 57 [2D80D] M:... I:... PASV Ti [90909090] | >PF | 077: tmpb -> PC | FLUSH none 58 [2D80D] M:... I:... PASV Ti E[ ] | >PF | 078: IND -> tmpc | 59 [2D80D] M:... I:... PASV Ti [ ] | >PF | 079: SIGMA-> IND | 60 A:[00000] M:... I:... CODE T1 [ ] | PF | 07A: SIGMA-> SP 6 | W,RNI DS,P0 61 [00000] CS M:R.. I:... CODE T2 [ ] | PF | 62 [00000] CS M:R.. I:... PASV T3 r-> 00 [ ] | >EU | 63 [00000] CS M:... I:... PASV T4 [ ] | >EU | 64 A:[2D80A] M:... I:... MEMW T1 [00 ] | EU | 65 [2D80A] SS M:.A. I:... MEMW T2 [00 ] | EU | 66 [2D80A] SS M:.AW I:... PASV T3 <-w 02 [00 ] | EU | 67 [2D80A] SS M:... I:... PASV T4 [00 ] | EU | 68 A:[2D80B] M:... I:... MEMW T1 [00 ] | EU | 69 [2D80B] SS M:.A. I:... MEMW T2 [00 ] | EU | 70 [2D80B] SS M:.AW I:... PASV T3 <-w 01 [00 ] | >PF |
Of particular note here are the three passive cycles from #44 to #46, which have no explanation in the microcode itself. The bus request from the EU that pushes CS to the stack occurs on cycle #44, but does not begin until cycle #47.
Under our model, we can explain this because these cycles are spent transitioning from the B_idle (I) state to the B_eu state. On cycle #53, since prefetching is still disabled, we transition back to the B_idle state. However, no further bus requests are made until prefetching is resumed by FLUSH on cycle #57.
Queue Full Delay Example
We see bus delays during long-running instructions when the instruction queue is full. This is typically during string instructions.
17 [F0104] CS M:R.. I:... CODE T2 F[ ] q-> F3 | PF | rep movsb 18 [F0104] CS M:R.. I:... PASV T3 r-> A4 [ ] | PF | 19 [F0104] CS M:... I:... PASV T4 [ ] | PF | 20 A:[F0105] M:... I:... CODE T1 [A4 ] | PF | 21 [F0105] CS M:R.. I:... CODE T2 F[ ] q-> A4 | PF | 22 [F0105] CS M:R.. I:... PASV T3 r-> 90 [ ] | PF | 12C: | F1 RPTS 23 [F0105] CS M:... I:... PASV T4 [ ] | PF | JMP: | 24 A:[F0106] M:... I:... CODE T1 [90 ] | PF | 112: BC -> tmpc | PASS tmpc 25 [F0106] CS M:R.. I:... CODE T2 [90 ] | PF | 113: SIGMA-> no dest | DEC tmpc 26 [F0106] CS M:R.. I:... PASV T3 r-> 90 [90 ] | PF | 114: | NZ 10 27 [F0106] CS M:... I:... PASV T4 [90 ] | PF | JMP: | 28 A:[F0107] M:... I:... CODE T1 [9090 ] | PF | 116: | none RTN 29 [F0107] CS M:R.. I:... CODE T2 [9090 ] | PF | Return cycle 30 [F0107] CS M:R.. I:... PASV T3 r-> 90 [9090 ] | PF | 12D: IJ -> IND 6 | R DD,BL 31 [F0107] CS M:... I:... PASV T4 [9090 ] | PF | 32 [F0107] M:... I:... PASV Ti [909090 ] | >EU | Prefetch Abort (PF -> EU) 33 [F0107] M:... I:... PASV Ti [909090 ] | >EU | Prefetch Abort (PF -> EU) 34 A:[1D89D] M:... I:... MEMR T1 [909090 ] | EU | 35 [1D89D] DS M:R.. I:... MEMR T2 [909090 ] | EU | 36 [1D89D] DS M:R.. I:... PASV T3 r-> 00 [909090 ] | >PF | 37 [1D89D] DS M:... I:... PASV T4 [909090 ] | >PF | 12E: IND -> IJ | X0 8 38 A:[F0108] M:... I:... CODE T1 [909090 ] | PF | 12F: IK -> IND 6 | w DA,BL 39 [F0108] CS M:R.. I:... CODE T2 [909090 ] | PF | 40 [F0108] CS M:R.. I:... PASV T3 r-> 90 [909090 ] | >EU | 41 [F0108] CS M:... I:... PASV T4 [909090 ] | >EU | 42 A:[1DA47] M:... I:... MEMW T1 [90909090] | EU | 43 [1DA47] ES M:.A. I:... MEMW T2 [90909090] | EU | 44 [1DA47] ES M:.AW I:... PASV T3 <-w 00 [90909090] | >I | 45 [1DA47] ES M:... I:... PASV T4 [90909090] | I | 130: IND -> IK | NF1 7 46 [1DA47] M:... I:... PASV Ti [90909090] | I | 131: SIGMA-> tmpc | INT RPTI 47 [1DA47] M:... I:... PASV Ti [90909090] | I | 132: tmpc -> BC | NZ 1 48 [1DA47] M:... I:... PASV Ti [90909090] | I | JMP: | 49 [1DA47] M:... I:... PASV Ti [90909090] | >EU | 12D: IJ -> IND 6 | R DD,BL 50 [1DA47] M:... I:... PASV Ti [90909090] | >EU | (IDLE -> EU) 51 [1DA47] M:... I:... PASV Ti [90909090] | >EU | (IDLE -> EU) 52 A:[1D89E] M:... I:... MEMR T1 [90909090] | EU | 53 [1D89E] DS M:R.. I:... MEMR T2 [90909090] | EU | 54 [1D89E] DS M:R.. I:... PASV T3 r-> 00 [90909090] | >I | 55 [1D89E] DS M:... I:... PASV T4 [90909090] | I | 12E: IND -> IJ | X0 8 56 [1D89E] M:... I:... PASV Ti [90909090] | >EU | 12F: IK -> IND 6 | w DA,BL 57 [1D89E] M:... I:... PASV Ti [90909090] | >EU | (IDLE -> EU) 58 [1D89E] M:... I:... PASV Ti [90909090] | >EU | (IDLE -> EU) 59 A:[1DA48] M:... I:... MEMW T1 [90909090] | EU | 12E: IND -> IJ | X0 8 60 [1DA48] ES M:.A. I:... MEMW T2 [90909090] | EU | 12E: IND -> IJ | X0 8 61 [1DA48] ES M:.AW I:... PASV T3 <-w 00 [90909090] | >I | 12E: IND -> IJ | X0 8 62 [1DA48] ES M:... I:... PASV T4 [90909090] | I | 130: IND -> IK | NF1 7 63 [1DA48] M:... I:... PASV Ti [90909090] | I | 131: SIGMA-> tmpc | INT RPTI 64 [1DA48] M:... I:... PASV Ti [90909090] | I | 132: tmpc -> BC | NZ 1 65 [1DA48] M:... I:... PASV Ti [90909090] | I | JMP: | 66 [1DA48] M:... I:... PASV Ti [90909090] | >EU | 12C: | F1 RPTS 67 [1DA48] M:... I:... PASV Ti [90909090] | >EU | 12C: | F1 RPTS 68 [1DA48] M:... I:... PASV Ti [90909090] | >EU | 12C: | F1 RPTS 69 A:[1D89F] M:... I:... MEMR T1 [90909090] | EU | 12C: | F1 RPTS 70 [1D89F] DS M:R.. I:... MEMR T2 [90909090] | EU | 12C: | F1 RPTS 71 [1D89F] DS M:R.. I:... PASV T3 r-> 00 [90909090] | >I | 12C: | F1 RPTS 72 [1D89F] DS M:... I:... PASV T4 [90909090] | I | 12E: IND -> IJ | X0 8 73 [1D89F] M:... I:... PASV Ti [90909090] | >EU | 12E: IND -> IJ | X0 8 74 [1D89F] M:... I:... PASV Ti [90909090] | >EU | 12E: IND -> IJ | X0 8 75 [1D89F] M:... I:... PASV Ti [90909090] | >EU | 12E: IND -> IJ | X0 8 76 A:[1DA49] M:... I:... MEMW T1 [90909090] | EU | 12E: IND -> IJ | X0 8 77 [1DA49] ES M:.A. I:... MEMW T2 [90909090] | EU | 12E: IND -> IJ | X0 8 78 [1DA49] ES M:.AW I:... PASV T3 <-w 00 [90909090] | >I | 12E: IND -> IJ | X0 8 79 [1DA49] ES M:... I:... PASV T4 [90909090] | I | 130: IND -> IK | NF1 7 80 [1DA49] M:... I:... PASV Ti [90909090] | I | 131: SIGMA-> tmpc | INT RPTI 81 [1DA49] M:... I:... PASV Ti [90909090] | I | 132: tmpc -> BC | NZ 1 82 [1DA49] M:... I:... PASV Ti [90909090] | I | JMP: | 83 [1DA49] M:... I:... PASV Ti [90909090] | >EU | 12C: | F1 RPTS 84 [1DA49] M:... I:... PASV Ti [90909090] | >EU | 12C: | F1 RPTS 85 [1DA49] M:... I:... PASV Ti [90909090] | >EU | 12C: | F1 RPTS 86 A:[1D8A0] M:... I:... MEMR T1 [90909090] | EU | 12C: | F1 RPTS 87 [1D8A0] DS M:R.. I:... MEMR T2 [90909090] | EU | 12C: | F1 RPTS 88 [1D8A0] DS M:R.. I:... PASV T3 r-> 00 [90909090] | >I | 12C: | F1 RPTS 89 [1D8A0] DS M:... I:... PASV T4 [90909090] | I | 12E: IND -> IJ | X0 8 90 [1D8A0] M:... I:... PASV Ti [90909090] | >EU | 12E: IND -> IJ | X0 8 91 [1D8A0] M:... I:... PASV Ti [90909090] | >EU | 12E: IND -> IJ | X0 8 92 [1D8A0] M:... I:... PASV Ti [90909090] | >EU | 12E: IND -> IJ | X0 8 93 A:[1DA4A] M:... I:... MEMW T1 [90909090] | EU | 12E: IND -> IJ | X0 8 94 [1DA4A] ES M:.A. I:... MEMW T2 [90909090] | EU | 12E: IND -> IJ | X0 8 95 [1DA4A] ES M:.AW I:... PASV T3 <-w 00 [90909090] | >I | 12E: IND -> IJ | X0 8 96 [1DA4A] ES M:... I:... PASV T4 [90909090] | I | 130: IND -> IK | NF1 7 97 [1DA4A] M:... I:... PASV Ti [90909090] | I | 131: SIGMA-> tmpc | INT RPTI 98 [1DA4A] M:... I:... PASV Ti [90909090] | I | 132: tmpc -> BC | NZ 1 99 [1DA4A] M:... I:... PASV Ti [90909090] | >PF | 132: tmpc -> BC | NZ 1
This example shows two iterations of REP MOVSB.
Note cycle #44, where a prefetch decision is made when the queue is full. Under our model, we transition to the B_idle state at this point. On cycle #49, a bus request is made, requiring a transition from B_idle to B_eu over 3 cycles. The next prefetch decision on cycle #54 again returns us to the B_idle state. The next EU bus request is on cycle #56, and we transition back to B_eu, and so on.
These state transitions continue for the entire execution time of REP MOVSB, in this case accounting for 6 cycles per iteration.
Fetch Delay Example
38 [F0108] CS M:R.. I:... PASV T3 r-> 90 [E64290 ] | PF | Fetch Decision : Delay 3 39 [F0108] CS M:... I:... PASV T4 F[4290 ] q-> E6 | PF | out 42h, al 40 [F0108] M:... I:... PASV Ti [429090 ] | PF | Fetch Delay Cycle 1 41 [F0108] M:... I:... PASV Ti S[9090 ] q-> 42 | PF | Fetch Delay Cycle 2 42 [F0108] M:... I:... PASV Ti [9090 ] | PF | Fetch Delay Cycle 3 43 A:[F0109] M:... I:... CODE T1 [9090 ] | PF | Bus request from EU 44 [F0109] CS M:R.. I:... CODE T2 r-> 90 [9090 ] | PF | 45 [F0109] CS M:R.. I:... PASV T3 r-> 90 [9090 ] | >EU | Fetch Decision 46 [F0109] CS M:... I:... PASV T4 [9090 ] | >EU | 47 A:[00042] M:... I:... IOW T1 [909090 ] | EU | Bus request serviced 48 [00042] CS M:... I:.A. IOW T2 <-w 90 [909090 ] | EU | 49 [00042] CS M:... I:.AW PASV T3 <-w 00 [909090 ] | >PF | Fetch Decision
On the cycle before this OUT instruction began on cycle #39, a fetch decision was made when the length of the instruction queue was 3. This resulted in a 3 cycle fetch delay. Note that in this case, the EU did read bytes out of the instruction queue during the delay window, preventing a BIU transition to the B_idle state.
Summary
This model seems relatively simple in retrospect, but was surprisingly difficult to discover on my own. I think that the proposition that the CPU needs time to prepare for different bus operations came as a bit of a surprise, so I wasn't initially considering it.
In any case, this new logic replaced some pretty convoluted delay handling logic in my emulator, and at least one type of delay I was handling turned out to just be a particular instance of the general state machine and could be removed entirely.
I haven't covered all possible bus scenarios in this article - there may be additional states to consider, such as a state for Interrupt Acknowledge bus cycles, and this model might provide some clue to the timings of the HALT instruction. That is all a bit outside of the scope of the prefetch algorithm itself, so I hope to cover more ground in future blog articles.
This all seems to match my observations too. The extra fetch delay when there are 3 bytes in the queue may be a policy choice rather than because of the possibility of a byte pending into the queue - see the discussion on Ken's post that you linked to at the top (under the "Policy" heading and my comment on his post). There (on the 8086) the delay happens when there are 3 or 4 bytes in the queue but I think the equivalent on the 8088 is that it happens when there are 3 bytes in the queue.
ReplyDelete