Early handheld gaming device from Nintendo

Older hardware like the Gameboy provides a good opportunity to learn how to write software in a resource-constrained environment. These constraints help one better understand costs and encourage creativity (in both data structures and code) to minimize them.

Emulation

Emulator architecture

The most important question to ask when beginning a Gameboy emulator is how your emulator will keep time. Generally, a computer system does nothing unless it is attached to a clock. This clock ticks (raising a voltage from low to high and back to low) at a certain frequency, and on each tick the hardware makes one tiny step of forward progress. Tick fast enough and you have a computer system that can do interesting things at reasonable speed. This means the basic mechanism of forward progress in an emulator should be some kind of tick() function.

The Gameboy has a system clock that runs at ~4 MHz. However, the CPU is memory bound and the memory is clocked at ~1 MHz. Therefore, there are two units of time that can be used to tick() a Gameboy emulator:

T-cycle: one tick of the ~4 MHz system clock.
M-cycle: one tick of the ~1 MHz memory clock, equal to 4 T-cycles.

Both of these are viable as the tick() unit, but they result in significantly different implementations. The three possible timing architectures are:

T-cycle based

Everything is ticked at the rate of 1 T-cycle. This has the benefit of accuracy but potentially wastes a lot of time doing very little, as the CPU needs at least 4 T-cycles (1 M-cycle) per step.

M-cycle based

Everything is ticked at the rate of 1 M-cycle. This is a good model if you plan to have the CPU drive the rest of the system. Some hardware devices that tick more frequently, like timers and the PPU, may need to do multiple rounds of work per tick.

Mixed

The system is ticked at 1 T-cycle, but some hardware components are ticked less frequently. This is a fairly natural compromise between the two, but it does make CPU stepping more difficult.
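
As a rough illustration of the M-cycle based approach, the following Python sketch advances the CPU by one M-cycle per system tick and lets a faster device do four T-cycles of work in that same tick. The class and method names (Timer, Cpu, System, tick_t_cycle, tick_m_cycle) are assumptions made for this example, and the timer is a simple counter rather than a model of the real timer hardware.

```python
T_CYCLES_PER_M_CYCLE = 4  # ~4 MHz system clock vs. ~1 MHz memory clock


class Timer:
    """Hypothetical stand-in for a device that advances every T-cycle."""

    def __init__(self):
        self.t_cycles = 0

    def tick_t_cycle(self):
        self.t_cycles += 1


class Cpu:
    """Hypothetical CPU stub that advances one step per M-cycle."""

    def __init__(self):
        self.m_cycles = 0

    def tick_m_cycle(self):
        self.m_cycles += 1


class System:
    """M-cycle based system: the CPU drives everything else."""

    def __init__(self):
        self.cpu = Cpu()
        self.timer = Timer()

    def tick(self):
        # One system tick is one M-cycle for the CPU...
        self.cpu.tick_m_cycle()
        # ...so faster devices do several rounds of work per tick.
        for _ in range(T_CYCLES_PER_M_CYCLE):
            self.timer.tick_t_cycle()


system = System()
for _ in range(10):
    system.tick()
```

A T-cycle based or mixed design would invert this: the system tick would advance the timer once and only step the CPU on every fourth call.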

CPU architecture

It's tempting when writing a Gameboy emulator to create a CPU state machine something like the following:

FETCH EXECUTE CRASH HALT STOP

However, this is subtly incorrect. The Gameboy CPU has a pipelined fetch / execute architecture. Every opcode technically takes an additional cycle during which the opcode is fetched from memory and decoded. In general, this is hidden from timing tests because the next opcode is prefetched and decoded during the last cycle of the currently executing opcode.

Therefore, a correct state machine actually looks more like the following:

FETCH / EXECUTE CRASH HALT STOP

The FETCH / EXECUTE state should have logic similar to the following:

/* Tick CPU for 1 M-cycle */
fn tick_cpu() {
  /* executing_instruction() is TRUE if an instruction is executing. If a
  ** previously executing instruction finishes during tick_instruction() it
  ** becomes FALSE.
  */
  if (executing_instruction())
    tick_instruction()

  /* Either the current instruction finished or we were never executing one.
  ** Either way prefetch the next instruction on this same cycle.
  */
  if (!executing_instruction())
    prefetch_next_instruction()
}
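
To see the overlap in action, here is a minimal runnable Python model of the logic above. It is a toy under stated assumptions: each instruction is represented only by its M-cycle cost, and the Cpu class with its fields (program, remaining, retired) is hypothetical and exists purely for illustration.

```python
class Cpu:
    """Toy CPU where each instruction is just a number of M-cycles of work."""

    def __init__(self, program):
        self.program = list(program)  # M-cycle cost of each instruction
        self.pc = 0                   # index of the next instruction to fetch
        self.remaining = 0            # M-cycles left in the current instruction
        self.retired = 0              # instructions completed so far

    def executing_instruction(self):
        return self.remaining > 0

    def tick_instruction(self):
        self.remaining -= 1
        if self.remaining == 0:
            self.retired += 1

    def prefetch_next_instruction(self):
        if self.pc < len(self.program):
            self.remaining = self.program[self.pc]
            self.pc += 1

    def tick(self):
        """Tick the CPU for 1 M-cycle."""
        if self.executing_instruction():
            self.tick_instruction()
        # Either the instruction just finished or none was running:
        # prefetch on this same cycle.
        if not self.executing_instruction():
            self.prefetch_next_instruction()


# Three instructions costing 1, 2 and 1 M-cycles.
cpu = Cpu([1, 2, 1])
ticks = 0
while cpu.retired < 3:
    cpu.tick()
    ticks += 1
```

In this model the loop finishes with ticks equal to 5: one initial cycle to fetch the very first opcode, then exactly 1 + 2 + 1 cycles of execution, because every later fetch overlaps the final cycle of the previous instruction.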

CB-prefixed opcode timings

CB-prefixed opcodes have the in-memory form [0xCB, 0xNN] where 0xCB is the prefix that indicates the following byte (0xNN) is the coprocessor opcode. Because these opcodes are 2 bytes rather than 1 byte long, they pay an additional 1 M-cycle penalty for fetching and decoding the coprocessor opcode.

Tip: Published tables of opcode timings for the CB-prefixed opcodes generally include the extra 1 M-cycle (4 T-cycles) for fetching and decoding the coprocessor opcode. Keep this in mind, as using the table value directly can produce incorrect opcode timings when the coprocessor opcode fetch is handled separately from the CB-prefixed opcode implementation, since the fetch is then counted twice.
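
The accounting can be sketched in Python as follows. The function name cb_execute_m_cycles and its prefix_handled_separately flag are assumptions for this example. The three timings are the usual table values for those opcodes, with the prefix fetch included.

```python
PREFIX_FETCH_M_CYCLES = 1  # cost of fetching/decoding the byte after 0xCB

# Table timings in M-cycles, prefix fetch included, for a few CB opcodes.
CB_TABLE_M_CYCLES = {
    0x00: 2,  # RLC B
    0x06: 4,  # RLC [HL]
    0x46: 3,  # BIT 0, [HL]
}


def cb_execute_m_cycles(cb_opcode, prefix_handled_separately):
    """M-cycles to charge when executing the opcode that follows 0xCB."""
    table = CB_TABLE_M_CYCLES[cb_opcode]
    if prefix_handled_separately:
        # The 0xCB byte was already charged 1 M-cycle as its own step,
        # so charging the full table value would double count it.
        return table - PREFIX_FETCH_M_CYCLES
    return table
```

Whichever way the emulator is structured, the total for a CB-prefixed opcode should come out to the table value and no more.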

Programming

16-bit Addition / Subtraction

The Gameboy hardware is 8-bit but has a 16-bit address space, making 16-bit math a necessity, especially when traversing address space. The instruction set offers only limited 16-bit arithmetic (ADD HL, r16; INC r16 / DEC r16; ADD SP, e8) and no 16-bit subtraction, so a common pattern is the addition / subtraction of an 8-bit value against a 16-bit register pair using 8-bit operations.

16-bit addition proceeds by adding an 8-bit value to the lower register of a 16-bit register pair (for example, adding a value to E for DE). The carry (C) flag can then be examined to determine whether the addition overflowed and the upper register of the pair needs incrementing. Subtraction works the same way, with a borrow decrementing the upper register. This works in all cases because adding or subtracting an 8-bit value can never carry or borrow more than 1 into the upper byte. The following is an example of how to perform 16-bit addition and subtraction.

;;; Perform 16-bit addition on DE with value in A.
add e                ; Add E to contents in A
jr  nc, .noCarryAdd  ; If no carry, D can remain unchanged.
inc d                ; Else, increment D
.noCarryAdd:
ld e, a              ; Copy lower result back into E.
                     ; DE contains addition result.

;;; Perform 16-bit subtraction on DE with value in A. Use B to store value in A
;;; so subtraction can proceed with the correct order of operands.
ld b, a              ; Move A into B
ld a, e              ; Move E into A
sub b                ; Subtract B (Equivalent to E - A)
jr nc, .noCarrySub   ; If no carry, D can remain unchanged.
dec d                ; Else, decrement D
.noCarrySub:
ld e, a              ; Copy lower result back into E
                     ; DE contains subtraction result.
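
One way to convince yourself the carry handling is right is to mirror both routines in Python and compare them against ordinary 16-bit arithmetic. This is only a model of the assembly above, not emulator code. The function names add8_to_de and sub8_from_de are made up for this sketch.

```python
def add8_to_de(d, e, a):
    """Mirror of the addition routine: returns (d, e) after DE += A."""
    total = a + e                 # add e
    a = total & 0xFF
    carry = total > 0xFF          # C flag
    if carry:                     # jr nc, .noCarryAdd
        d = (d + 1) & 0xFF        # inc d
    e = a                         # ld e, a
    return d, e


def sub8_from_de(d, e, a):
    """Mirror of the subtraction routine: returns (d, e) after DE -= A."""
    b = a                         # ld b, a
    a = e                         # ld a, e
    borrow = a < b                # C flag after sub b
    a = (a - b) & 0xFF            # sub b
    if borrow:                    # jr nc, .noCarrySub
        d = (d - 1) & 0xFF        # dec d
    e = a                         # ld e, a
    return d, e


# $12F0 + $20 = $1310, and $1310 - $20 = $12F0.
added = add8_to_de(0x12, 0xF0, 0x20)
subtracted = sub8_from_de(0x13, 0x10, 0x20)
```

Here added is (0x13, 0x10) and subtracted is (0x12, 0xF0). Like the hardware, the model wraps around at $FFFF and $0000.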

Comparisons

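On Gameboy hardware, comparisons are done with the CP opcode, which subtracts its operand from A, discards the result, and sets the zero (Z) and carry (C) flags: Z is set when the values are equal and C is set when A is less than the operand (unsigned). Below is a cheatsheet for how to do the various kinds of comparisons: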

;;; a == 20
cp 20
jr z, .label

;;; a != [hl]
cp [hl]
jr nz, .label

;;; a < b
cp b
jr c, .label

;;; a > c
cp c
jr z, .skip          ; Equal, so not greater.
jr nc, .label        ; Not equal and no carry: a > c.
.skip:

;;; a <= b
cp b
jr z, .label
jr c, .label

;;; a >= c
cp c
jr nc, .label        ; No carry means a >= c (this covers the equal case).
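
The flag logic behind the cheatsheet can be checked with a small Python model. It assumes unsigned 8-bit operands and models only the Z and C flags. The helper names are invented for this sketch.

```python
def cp(a, operand):
    """Return the (Z, C) flags that CP sets for 8-bit unsigned values."""
    zero = a == operand    # Z: A - operand is zero
    carry = a < operand    # C: the subtraction needed a borrow
    return zero, carry


def less_than(a, x):          # jr c
    _, c = cp(a, x)
    return c


def greater_than(a, x):       # neither Z nor C
    z, c = cp(a, x)
    return not z and not c


def less_or_equal(a, x):      # jr z or jr c
    z, c = cp(a, x)
    return z or c


def greater_or_equal(a, x):   # jr nc
    _, c = cp(a, x)
    return not c
```

Because C is only set when A is strictly less than the operand, a clear carry flag on its own already means "greater than or equal".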

Jump Tables (Switch Statement)

Often, when making Gameboy software, it is useful to have separate logic that is executed based on some state value, similar to a switch statement in C. For example, a table of functions for processing input based on whether the character is in a crouch, jump, walk, or run state. This forum post has the following suggestion for how to handle this:

;;; InvokeJumpTable takes a jump table base address on the stack and a jump
;;; table offset in A and jumps to the address contained in memory at that
;;; offset.
;;; @param stack Top of stack contains base address of jump table.
;;; @param A offset into the jump table.
;;; @return
;;;        A contains low byte of jump address.
;;;       HL contains jump address.
;;;       Stack is popped by one item.
SECTION "InvokeJumpTable", ROM0[$08]
InvokeJumpTable::
  pop hl             ; Get base address of table from stack.
  add a, l           ; Add offset in A to table base.
  ld  l, a
  jr  nc, .dojump    ; If no carry, complete jump
  inc h              ; If carry, increment H to account for carry.
.dojump:
  ; HL now points to entry we want to jump to in the jump table.
  ld a, [hl+] ; Load low byte of target address from HL.
  ld h, [hl]  ; Load high byte of target address.
  ld l, a     ; Load low byte of target address into L. HL now has target.
  jp hl       ; Jump to the target address

Example usage:

;;; NOTE: Values of wPlayer.state should increment by 2. Each jump table entry
;;; contains 2 bytes, so the offset between adjacent entries is 2.
ld  a, [wPlayer.state]
rst InvokeJumpTable
;;; RST pushes the address of instruction after it onto the stack. So we can
;;; simply place our jump table after the invocation instruction.
dw  .player_crouch
dw  .player_jump
dw  .player_walk
dw  .player_run
; ...
.player_crouch:
  ld a, [wPlayer.posx]
  ; ...
.player_jump:
  ; ...
.player_walk:
  ; ...
.player_run:
  ; ...
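
To make the address arithmetic concrete, the following Python sketch mirrors InvokeJumpTable step by step against a fake ROM image. The table location ($01FE) and the handler addresses are arbitrary values chosen for illustration. The function name and the rom array are likewise assumptions of this example.

```python
def invoke_jump_table(memory, table_base, a):
    """Return the target address the routine would jump to.

    memory     -- bytes-like ROM image (hypothetical, for illustration)
    table_base -- address pushed by RST, i.e. the first dw entry
    a          -- byte offset into the table (entry index * 2)
    """
    h, l = table_base >> 8, table_base & 0xFF   # pop hl
    total = a + l                                # add a, l
    l = total & 0xFF                             # ld l, a
    if total > 0xFF:                             # jr nc, .dojump
        h = (h + 1) & 0xFF                       # inc h
    entry = (h << 8) | l
    low = memory[entry]                          # ld a, [hl+]
    high = memory[(entry + 1) & 0xFFFF]          # ld h, [hl]
    return (high << 8) | low                     # jp hl


# A 64 KiB image with a 4-entry table placed at $01FE so that the
# offset addition carries into H for every entry after the first.
rom = bytearray(0x10000)
targets = [0x0300, 0x0320, 0x0340, 0x0360]
for i, target in enumerate(targets):
    rom[0x01FE + 2 * i] = target & 0xFF          # dw stores little-endian
    rom[0x01FE + 2 * i + 1] = target >> 8
```

With this layout, an offset of 4 selects the third entry, which lives at $0202 once the carry into H is applied.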


Last update on 7E6A0E, edited 1 times. 1/1thh