DMA, or Direct Memory Access, is a scheme that allows peripherals access to system memory without the assistance of the CPU, which normally would require the use of 'mov' instructions to shuffle data around. With DMA, a hard disk controller can read and write to memory itself. However, the CPU and the DMA-requesting device must not attempt to use the memory bus at the same time. Some sort of coordination is required, and there are as many schemes to accomplish this as there are computer architectures. Here we will discuss the DMA design and implementation of the original IBM PC and XT.
DMA and Emulation
Emulating the IBM PC using the IBM BIOS requires implementing the Intel 8237A DMA controller. There's not much of a way around it - DMA was used by the floppy drive controller and the BIOS will not attempt PIO mode, so unless you are content to fiddle around in ROM BASIC for eternity you will need DMA at some point.
The easiest thing for an emulator author to do is have DMA transfers magically occur by writing to the emulator's memory directly, oblivious to the concerns of wait states or bus status. This actually works quite well, and admittedly improves performance and responsiveness of your emulated system over a real system where the DMA controller must coordinate with the CPU for access to the bus.
In an emulated system, your memory cells will not lose charge, so there is no need to actually perform DMA for DRAM refresh, as DMA channel #0 is set up to do. If your goal is to simply run most games and applications, this approach is sufficient.
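As a rough illustration of the "magic" approach, here is a minimal Python sketch. The function name and the flat `bytearray` memory are assumptions made for the example, as is the incrementing address mode. The one hardware detail it leans on that is not covered in this article is that the 8237 only drives a 16-bit address, with the upper four bits supplied by a separate page register on the motherboard, so the offset wraps inside a 64K page.

```python
def instant_dma_write(memory: bytearray, page: int, offset: int, data: bytes) -> int:
    """Copy device data straight into emulated RAM; returns the updated offset.

    Hypothetical helper: no wait states, no bus arbitration, no S-states.
    """
    for byte in data:
        # Upper address bits come from the page register and do not change
        # when the 16-bit offset wraps around.
        physical = ((page & 0x0F) << 16) | (offset & 0xFFFF)
        memory[physical] = byte
        offset = (offset + 1) & 0xFFFF  # assumes address-increment mode
    return offset


ram = bytearray(1024 * 1024)
next_offset = instant_dma_write(ram, page=0x1, offset=0xFFFE, data=b"\xAA\xBB\xCC")
```

The third byte lands at the start of the same 64K page rather than spilling into the next one, which is the usual surprise when implementing this shortcut.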
If you're interested in a high degree of accuracy, however, you're going to have to gather your courage and at least implement a simulation of DRAM refresh DMA. This process is constantly running while the computer is on, cannot be disabled, and affects the cycle timings of every software program.
The DRAM on the IBM PC must be constantly refreshed by some sort of periodic access, or it will lose its contents. The IBM PC does not contain dedicated DRAM refresh circuitry, so IBM employed one of the three system timer channels and one of four DMA channels specifically for this purpose. The DMA controller does not actually need to move any data; it is enough to strobe the address lines of the DRAM to refresh them.
So how exactly does DMA on the IBM PC work? Let's dive in.
DMA on the IBM PC
If you haven't implemented CPU wait states yet in your emulator, now is the time to go do that, because DMA relies on wait states for its operation.
If you've looked at the pinouts for the Intel 8088, you may have noticed the HOLD and HOLDA lines on the CPU that are used for negotiating control over the system bus. You'd be forgiven for thinking these might be involved in DMA in some manner; after all, the DMA controller must negotiate control over the bus from the CPU. But they are not. Instead, the READY line to the CPU is de-asserted, which causes the CPU to pause a bus transfer on T3 and insert wait states until READY is reasserted. While this is happening, bus signals from the 8288 bus controller chip are suppressed as well. It is while the CPU is idled and disconnected from addressing logic in this manner that the DMA controller performs its job.
The 8237A is provided the same clock as the CPU. Therefore when referring to clocks or cycles from now on, they are equivalent to CPU cycles.
I would suggest reading the 8237 white paper to familiarize yourself with the basic operation of the chip and its registers.
Refer to the pinout of the 8237:
When the DMA controller gets a request for service on one of its four DREQ lines, it will assert its HRQ (Hold Request) line one cycle later if that DMA channel is not masked and can be serviced. HRQ is interpreted by miscellaneous TTL logic on the motherboard to eventually produce a HOLDA (Hold Acknowledge) signal fed back to the DMA controller to tell it to proceed with a DMA transfer. The DMA controller sends DACK to the requesting device, which puts its data on the bus, and the transfer happens.
Let's take a look at how that is accomplished. The IBM 5160 simplified some of the DMA logic, so we'll be examining that version of things for clarity.
The logic for this is on Sheet #2 of the system diagrams in the IBM 5160 Technical Reference:
It's an intimidating mess of 74 series logic chips, but fortunately, most of it we can ignore. Here is a transcribed portion that handles the HRQ signal from the DMA chip:
The U90 OR gate in the upper left is detecting whether or not we are writing a command to the DMA controller (Both !XIOW and !DMACS active-low), in which case we do not want to perform DMA service at the same time, so we suppress the HRQ signal.
Take note of the NAND gate U57. Four signals are fed into it. Essentially, the motherboard checks for four conditions simultaneously: LOCK must not be asserted, S0 and S1 must be HIGH, and HRQ must be active. If all of these conditions are true, and we are not writing to an IO port of the DMA controller, then we pass the HRQ signal on.
LOCK can be asserted by a LOCK prefix (0xF0), but is also asserted during interrupt acknowledgement. Interrupt acknowledgement consists of two Interrupt Acknowledge bus states, the second of which provides the interrupt vector onto the data bus. This use of LOCK in the HOLDA delay logic prevents the DMA process from interfering with the reading of the interrupt vector.
S0 and S1 are status lines off the CPU that provide the bus state. For reference:
If both S0 and S1 are high, we can see that the bus status is either Halt or Passive. The 5150's circuitry actually checks that S0, S1 and S2 are all high with a larger NAND gate, so it will only accept the Passive state. This is only a minuscule difference as the Halt state is only present on one cycle after a halt instruction, but it theoretically could cause DMA timing differences between a 5150 and 5160 system.
Note that the status line state for a bus cycle is only available on T1 and T2. The status lines will read passive in T3-T4. So we are not quite waiting for the bus to be completely inactive - we are specifically waiting for either a passive bus or the CPU to be in state T3-T4.
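The gating conditions described above can be sketched in Python. The status encoding is the standard 8088 maximum-mode table; the function names are made up for this example, all inputs are logical values (`lock=True` means LOCK is asserted, regardless of the pin's electrical polarity), and the 5150 variant assumes the rest of the gating matches the 5160 apart from the extra S2 input.

```python
# 8088 status lines, indexed as (S2 << 2) | (S1 << 1) | S0
BUS_STATUS = {
    0b000: "Interrupt Acknowledge",
    0b001: "Read I/O Port",
    0b010: "Write I/O Port",
    0b011: "Halt",
    0b100: "Code Access",
    0b101: "Read Memory",
    0b110: "Write Memory",
    0b111: "Passive",
}


def decode_status(s2: int, s1: int, s0: int) -> str:
    return BUS_STATUS[(s2 << 2) | (s1 << 1) | s0]


def hrq_passes_5160(hrq: bool, lock: bool, s1: bool, s0: bool,
                    dma_io_write: bool) -> bool:
    """5160: HRQ is passed on when S0 and S1 are high (Halt or Passive)."""
    return bool(hrq and not lock and s1 and s0 and not dma_io_write)


def hrq_passes_5150(hrq: bool, lock: bool, s2: bool, s1: bool, s0: bool,
                    dma_io_write: bool) -> bool:
    """5150: additionally requires S2 high, so only Passive qualifies."""
    return hrq_passes_5160(hrq, lock, s1, s0, dma_io_write) and bool(s2)
```

With S2 low and S0/S1 high (the Halt status), the 5160 check passes while the 5150 check does not, which is the one-cycle difference described above.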
This gated HRQ signal proceeds through two flip-flops clocked by the CPU clock and emerges on the right side of the diagram as HOLDA which goes directly back to the DMA controller. The first flip flop essentially delays HOLDA by one clock cycle. The second flip-flop is more interesting:
The inverted CPU clock signal is used to clock this flip-flop. Since a falling edge of the CPU cycle is now a rising edge, this produces a half-cycle delay in HOLDA. Additionally, HRQ itself is used as the flip-flop's reset condition. Therefore, HOLDA will go low shortly after HRQ is de-asserted.
If our assumptions are correct, HOLDA can be asserted a minimum of one and a half clock cycles after HRQ.
Let's confirm with an oscilloscope.
[Figure: DMA operation]
Here we have the DREQ0 line (for DMA channel 0, DRAM refresh DMA) in yellow, HRQ in magenta, HOLDA in cyan and the CPU clock in green. The vertical scale is offset for each signal for readability.
We can confirm the following:
- HRQ is asserted one cycle after DREQ.
- HOLDA is delayed by either 1.5, 2.5, or 3.5 clock cycles.
- HOLDA is de-asserted shortly after HRQ is.
Notice that there are 3 different delays to HOLDA seen here, representing 3 possible bus states when HRQ is asserted:
- The bus is idle (passive), or the CPU is in T3-T4. Assert HOLDA after 1.5 clock cycles.
- The CPU is in T2. Assert HOLDA after 2.5 clock cycles.
- The CPU is in T1. Assert HOLDA after 3.5 clock cycles.
Bear in mind there may be even more delays if DMA overlaps with an IO write to the DMA controller or an interrupt acknowledge is in process due to the logic mentioned previously.
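One way to express the three observed delays is as "cycles until the status lines read passive" plus the fixed 1.5 cycles from the two flip-flops. This is a sketch only; the state labels and names are assumptions, and it ignores the extra LOCK and DMA I/O write delays mentioned above.

```python
# Cycles the gating logic must wait before S0/S1 read high, by CPU bus state
# at the moment HRQ is asserted.
CYCLES_UNTIL_PASSIVE = {
    "T1": 2,
    "T2": 1,
    "T3": 0,
    "T4": 0,
    "IDLE": 0,
}

# One full cycle from the first flip-flop, half a cycle from the second.
FLIP_FLOP_DELAY = 1.5


def holda_delay(bus_state: str) -> float:
    """Delay from HRQ to HOLDA, in CPU cycles."""
    return CYCLES_UNTIL_PASSIVE[bus_state] + FLIP_FLOP_DELAY
```

For example, `holda_delay("T1")` gives 3.5 and `holda_delay("T4")` gives 1.5, matching the scope capture.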
We can also see that DREQ0 and HRQ appear to occur in the middle of a CPU cycle. Why?
Let's take a brief detour to investigate.
DRAM Refresh DMA Operation
The BIOS configures timer channel #1 for Rate Generator mode with a count register of 18, which will trigger every 72 CPU cycles. If you are familiar with the 8253, you may understand that the output of the timer in Rate Generator mode is normally high, going low for one timer cycle on a count of 1.
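To make the arithmetic concrete, here is a small sketch of Rate Generator output in Python. It samples the output once per timer clock and is not a full 8253 model; the function names are invented for the example. The timer is clocked at 1/4 of the CPU clock, so a count of 18 gives 18 x 4 = 72 CPU cycles.

```python
CPU_CYCLES_PER_TIMER_TICK = 4  # timer input clock is 1/4 of the CPU clock


def rate_generator(reload: int, ticks: int) -> list:
    """Return the OUT level (1 = high, 0 = low) for each timer clock tick."""
    levels = []
    count = reload
    for _ in range(ticks):
        # Output is normally high and goes low for one tick on a count of 1.
        levels.append(0 if count == 1 else 1)
        count = reload if count == 1 else count - 1
    return levels


def refresh_period_cpu_cycles(reload: int) -> int:
    return reload * CPU_CYCLES_PER_TIMER_TICK


levels = rate_generator(18, 36)
low_ticks = [i for i, level in enumerate(levels) if level == 0]
```

Here `low_ticks` holds the two ticks where the output pulses low, 18 timer ticks apart.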
[Figure: Timer Channel #1 and DREQ0]
This produces the signal seen in yellow. DREQ0 is in magenta. If timer channel #1 output was tied directly to the DMA controller's DREQ0 line, it would be constantly requesting DMA service. Therefore, the signal is passed first through a flip-flop clocked on channel #1's output with DACK as the clear condition and an input tied to +5V. Let's zoom in:
[Figure: Timer Channel #1 and DREQ0 (Zoomed)]
The Rate Generator output actually goes high again on a falling edge of the input clock, which is 1/4 the CPU clock. Therefore we see the output go high again approximately 4.5 CPU cycles after it goes low. We should note this 1/2 cycle delay in DREQ0. This means that the 1/2 cycle delay in HOLDA actually puts HOLDA (cyan) back in phase with the CPU clock.
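The DREQ0 flip-flop described above is simple enough to model directly. This is a hedged sketch: the class name is made up, `dack` is a logical "asserted" flag rather than the electrical level of the pin, and the clear input is assumed to win over a simultaneous clock edge.

```python
class Dreq0FlipFlop:
    """D flip-flop: D tied to +5V, clocked by timer channel #1, cleared by DACK."""

    def __init__(self) -> None:
        self.q = False
        self._last_clk = True  # timer output idles high

    def update(self, timer_out: bool, dack: bool) -> bool:
        rising_edge = timer_out and not self._last_clk
        self._last_clk = timer_out
        if dack:
            self.q = False      # clear overrides the clock
        elif rising_edge:
            self.q = True       # latch the +5V on the D input
        return self.q


ff = Dreq0FlipFlop()
trace = [
    ff.update(timer_out=True, dack=False),   # idle
    ff.update(timer_out=False, dack=False),  # timer pulses low
    ff.update(timer_out=True, dack=False),   # rising edge: DREQ0 set
    ff.update(timer_out=True, dack=False),   # held until acknowledged
    ff.update(timer_out=True, dack=True),    # DACK clears DREQ0
]
```

Because the clear input dominates, a rising edge that arrives while DACK is still asserted produces no request at all.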
DMA States
Like the CPU itself, the DMA controller has a set of states it proceeds through during operation, but these are called S-states instead of T-states. They are as follows:
- SI - Idle state. The DMA controller is not providing DMA service and can receive and respond to IO commands.
- S0 - The DMA controller has received an unmasked DREQ, and has asserted its HRQ line. The DMA controller will execute S0 states until it receives a HOLDA signal.
- S1 - The DMA controller has received a HOLDA signal and is beginning active service.
- S2 - The DMA controller asserts AEN, puts the address of DMA service on the address bus, asserts DACK and either MEMR or MEMW.
- S3 - The address is held valid on the bus.
- S4 - The transfer is completed. Either the DMA service concludes (in single transfer mode) or repeats at S2 for additional transfers.
Note that the DMA controller is not writing data itself; it is simply coordinating the transfer between the device requesting service and memory. Also, just like the CPU, the DMA controller and the requesting device are subject to the effect of wait states on memory access. A READY signal to the RDY pin on the DMA controller controls whether the DMA controller inserts Sw wait states after S2. On the IBM PC, this would really only affect DMA transfers to and from video memory.
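The states listed above map naturally onto a small state machine stepped once per clock. The sketch below assumes single transfer mode and an unmasked channel, and leaves out Sw wait states entirely; the enum and function names are invented for the example.

```python
from enum import Enum


class DmaState(Enum):
    SI = "SI"
    S0 = "S0"
    S1 = "S1"
    S2 = "S2"
    S3 = "S3"
    S4 = "S4"


def next_state(state: DmaState, dreq: bool, holda: bool) -> DmaState:
    """Advance the controller by one clock (single transfer mode, no Sw)."""
    if state is DmaState.SI:
        return DmaState.S0 if dreq else DmaState.SI
    if state is DmaState.S0:
        # Spin in S0 until the motherboard logic returns HOLDA.
        return DmaState.S1 if holda else DmaState.S0
    if state is DmaState.S1:
        return DmaState.S2
    if state is DmaState.S2:
        return DmaState.S3
    if state is DmaState.S3:
        return DmaState.S4
    return DmaState.SI  # S4: transfer complete, back to idle


# (dreq, holda) per clock: HOLDA arrives on the third clock.
inputs = [(True, False), (True, False), (True, True),
          (False, True), (False, True), (False, True), (False, True)]
state = DmaState.SI
states = []
for dreq, holda in inputs:
    state = next_state(state, dreq, holda)
    states.append(state.value)
```

With these inputs the controller spends two clocks in S0 waiting for HOLDA, then runs S1 through S4 and returns to SI.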
Knowing this, we should be able to overlay DMA states onto our scope measurement:
[Figure: DMA States]
Notice that the DMA controller does not proceed to S1 until after receipt of HOLDA. Delay in HOLDA by waiting for passive bus state delays the entire process. Once begun, however, HOLDA will remain active for a minimum of 5 cycles while the DMA controller proceeds from S0 through S4 states.
DACK is asserted on S2, and as we saw earlier, DACK is the reset condition for DREQ0, so we can see the yellow DREQ0 de-assert after S2 as expected.
Recall that the mechanism by which the CPU is restricted from performing bus transfers during this time is the READY signal to the CPU. This is controlled via the !DMAWAIT signal, which is one of two RDY inputs on the 8284 clock generator that provides the READY signal to the CPU.
[Figure: READY input to CPU]
Through the magic of Photoshop, let's add a 5th probe to the !DMAWAIT line:
[Figure: DMA States with !DMAWAIT]
We can see that !DMAWAIT (red) is low for 5 cycles. !DMAWAIT is generated by passing HOLDA (cyan) through a flip-flop twice, incurring a two-cycle delay. Let's look at the relationship between !DMAWAIT (RDY1) and the READY output of the 8284:
[Figure: !DMAWAIT and READY]
We can see that the falling edge of !DMAWAIT is reflected almost immediately in the corresponding READY signal to the CPU, but the rising edge of !DMAWAIT (yellow) takes a little more than a cycle to reflect in the READY output, due to internal flip-flops in the 8284. !DMAWAIT is therefore effectively extended by one cycle, so the total minimum number of wait states the CPU will incur is 6.
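At whole-cycle granularity, this asymmetry can be approximated by letting READY drop as soon as !DMAWAIT drops, but only rise once !DMAWAIT has been high for two consecutive samples. This is a simplification of the 8284's internal synchronization rather than a model of it, and the function name is an assumption.

```python
def ready_from_dmawait(dmawait_n: list) -> list:
    """Derive per-cycle READY levels from per-cycle !DMAWAIT levels (1 = high)."""
    ready = []
    previous = 1
    for level in dmawait_n:
        # Low immediately; high only if the input was also high last cycle.
        ready.append(level & previous)
        previous = level
    return ready


# !DMAWAIT low for 5 cycles, as measured on the scope.
dmawait_n = [1, 1, 0, 0, 0, 0, 0, 1, 1, 1]
ready = ready_from_dmawait(dmawait_n)
wait_states = ready.count(0)
```

Five low cycles on !DMAWAIT turn into six low cycles on READY, which is where the minimum of 6 wait states comes from.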
Addressing
One piece of the puzzle we haven't looked at yet is addressing. Normally, there is a set of latches that save the address emitted by the CPU on T1. There must be a mechanism to substitute the address output by the DMA controller during DMA operation. Don't be confused by the AEN pin on the 8237; it's not even connected. Instead, AEN and !AEN signals are generated by passing HOLDA through a flip-flop for a one cycle delay. These signals are fed to the !AEN and CEN pins on the 8288 chip which is normally responsible for generating the bus signals for the CPU. The effect of this is to disable the outputs on the 8288 immediately, so that no bus signals are generated. Meanwhile, the AEN line connects to the !OE (Output Enable) pins on the main address latches, allowing the DMA controller to drive the address bus by itself.
Note that there is a 1/2 cycle delay (112ns) between AEN going low and the 8288 re-enabling its command outputs.
Reducing DMA Timing
It is worth investigating what happens when DMA happens back to back. Let's reduce the Channel #1 DRAM refresh DMA timer to 2. This will effectively trigger DMA every 8 CPU cycles.
We can do this with a simple assembly program:
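cpu 8086
bits 16
org 100h                    ; DOS .COM program

%include "macros.asm"       ; supplies the pit_* and dos_exit macros used below

section .text

COUNT_VALUE EQU 2           ; new reload value for timer channel #1

start:
    ; The arguments presumably select channel 1, LSB-only access,
    ; mode 2 (Rate Generator) and binary counting.
    pit_set_mode 1, PIT_RWM_LSB, 2, 0
    mov al, COUNT_VALUE
    pit_write_byte 1        ; write the new count to channel 1
    dos_exit 1              ; return to DOS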
Let's see what this looks like in the scope. In the next set of images, the colors have switched a bit: DREQ is magenta and HRQ yellow.
[Figure: Compressed DMA Timing]
We can see that with this new aggressive timing the DMA requests are nearly back to back. It's hard to see in this image, but the 2nd rising edge of HRQ (yellow) overlaps the last falling edge of the first pulse, so there is no actual phase where it is rising and falling at the same time. Let's break this out into individual snapshots for clarity:
[Figure: Compressed DMA timing (Single)]
Here's the shortest cycle, when HOLDA is not delayed, and the entire process lasts for about eight clock cycles. The DMA transfers happen back to back with one cycle between the end of HOLDA and the new DREQ0. There are 8 cycles between DREQs as expected.
[Figure: Compressed DMA timing - HOLDA delayed]
Delay HOLDA by one cycle, however, and something interesting happens. We miss the 2nd DREQ0 entirely. Recall that DACK0 is the reset condition for the DREQ0-producing flip-flop. DACK (green) is low long enough to suppress the next DREQ0.
Putting It All Together
The basic operational phases of a DRAM refresh DMA operation are as follows:
- Timer channel #1 counts down to 1 and goes low. It then rises on the next falling edge of the clock, causing the flip-flop it is connected to to output high. The output of this flip-flop is connected to DREQ0 on the DMA controller.
- The DMA controller sees the DREQ0 request and with DMA channel 0 being unmasked, decides to provide DMA service. It asserts HRQ one cycle after DREQ0 and waits for HOLDA.
- The motherboard logic waits until the bus is idle or the CPU is in T3-T4, LOCK is not asserted, and no write to the DMA chip is in progress. It then raises HOLDA. This incurs a delay of 0, 1 or 2 (or more) cycles.
- The DMA controller proceeds through states S0-S4 to effect the DMA transfer.
- Two cycles after HOLDA, during S2, !DMAWAIT is asserted. READY to the CPU is immediately de-asserted and the CPU is prevented from making bus transfers.
- During S2, DACK is asserted, which resets the DREQ0 flip-flop state, de-asserting DREQ0.
- The DMA transfer completes on S4 and HRQ is de-asserted.
- The !DMAWAIT signal continues for another two clock cycles.
- READY to the CPU is re-asserted after an additional cycle.
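The phases above can be turned into a rough timeline, measured in CPU cycles from the rising edge of DREQ0. This is a sketch built only from the figures quoted in this article (1 cycle to HRQ, 1.5 cycles plus any bus delay to HOLDA, 2 cycles to !DMAWAIT, 5 cycles of !DMAWAIT, 1 more cycle for READY); the function and key names are assumptions.

```python
def refresh_timeline(dreq_at: float, extra_holda_delay: int = 0) -> dict:
    """Return the time of each signal edge for one refresh DMA operation."""
    hrq = dreq_at + 1.0                      # HRQ one cycle after DREQ0
    holda = hrq + 1.5 + extra_holda_delay    # flip-flops plus bus-state delay
    dmawait_low = holda + 2.0                # !DMAWAIT asserted during S2
    dmawait_high = dmawait_low + 5.0         # !DMAWAIT low for 5 cycles
    ready_high = dmawait_high + 1.0          # 8284 releases READY a cycle later
    return {
        "hrq": hrq,
        "holda": holda,
        "dmawait_low": dmawait_low,
        "dmawait_high": dmawait_high,
        "ready_high": ready_high,
    }


timeline = refresh_timeline(0.0)
stalled_cycles = timeline["ready_high"] - timeline["dmawait_low"]
```

`stalled_cycles` comes out as 6 regardless of how long HOLDA was delayed; the delay only shifts where in the instruction stream those wait states land.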
Let's look at this in context now with the various DMA states overlaid on top of a cycle trace courtesy of reenigne. This trace was taken using a bus sniffer card attached to an IBM 5160.
We can see the triggering DREQ and initial HRQ. HOLDA was not delayed as we were in T4. We have DMAWAIT for 5 cycles for a total of 6 wait states.
Emulation Application
There are some additional details I haven't gone into here, but if you have followed along thus far you are probably well prepared to investigate them on your own. In any case, the information provided here is sufficient for an emulator accurate enough to run Area 5150, assuming you have a means to simulate DMA on a cycle-accurate basis.
It's not sufficient to hard-code an assumption of DMA occurring every 72 clock cycles. Demos like 8088MPH and Area 5150 use a trick to slightly slow down DMA refresh by adjusting the timer channel #1 reload value from 18 to 19. This causes DMA to occur every 76 clock cycles instead, still within the limits of stability, but has the advantage of being an even divisor of the 304 clock cycles it takes to draw a single line of a CGA display. This makes DMA refresh a predictable occurrence on each scanline instead of a random annoyance. In addition, to achieve lockstep, these demos will initially set the DRAM refresh timer to a value as low as 1. You will need a mechanism to adjust your DRAM refresh scheduler when the value of timer channel #1 is updated.
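A minimal sketch of such a mechanism in Python follows. The class and method names are hypothetical, and it assumes a new reload value takes effect at the next reload rather than immediately; a real emulator would more likely derive refresh requests from its timer model than keep a separate countdown.

```python
class RefreshScheduler:
    """Raises DRAM refresh requests at a period tied to timer channel #1."""

    CPU_CYCLES_PER_TIMER_TICK = 4

    def __init__(self, reload: int = 18) -> None:
        self.period = reload * self.CPU_CYCLES_PER_TIMER_TICK
        self.countdown = self.period

    def set_reload(self, reload: int) -> None:
        # Assumption: the current countdown finishes with the old period.
        self.period = reload * self.CPU_CYCLES_PER_TIMER_TICK

    def tick(self, cpu_cycles: int) -> int:
        """Advance by cpu_cycles and return how many requests were raised."""
        raised = 0
        self.countdown -= cpu_cycles
        while self.countdown <= 0:
            raised += 1
            self.countdown += self.period
        return raised


scheduler = RefreshScheduler(18)
first = scheduler.tick(72)      # one request after 72 cycles
scheduler.set_reload(19)        # demo retunes refresh to 76 cycles
second = scheduler.tick(72)     # old period still finishes first
per_scanline = scheduler.tick(304)
```

With the reload value at 19, the period is 76 cycles and exactly four requests fall in each 304-cycle CGA scanline.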