Hacking CGA

This is meant to be a short post to talk about some CGA idiosyncrasies and how you can bypass them.

My video library VGALIB now supports CGA in addition to VGA, EGA support is planned too. Adding VGA was simple and that’s why I did it first; VGA implements a 320×200 linear framebuffer. A linear framebuffer is one where each pixel is represented by a simple lookup and the pixels are contiguous in the memory region. The formula width*y + x is commonly used to perform linear buffer address resolution. It is because of this simplicity that I made the internal representation of images 8 bit linear buffers. Each pixel is represented by 1 byte that can hold 1 of 256 colors.

CGA was developed when more technical challenges existed and the creators made some curious choices that result in more work for programmers. The CGA graphical modes are interlaced, the even rows are stored contiguously at 0xB800:0000 and the odd rows are stored contiguously at 0xB800:2000. The pixel format is a packed 2 bits per pixel, so 1 byte contains 4 contiguous pixels in little-endian format. The first pixel is the leftmost 2 bits and the 4th pixel is the rightmost 2 bits. Converting a linear 8bpp framebuffer into a packed 2bpp framebuffer is most easily, and quickly, done in assembly language. It is jobs like this where assembly language really shines because C doesn’t contain the full expression of data operations that assembly offers.

You can see the challenge now, I have to convert a 320×200 8bpp framebuffer that is 64,000 bytes into a 2bpp packed and interlaced framebuffer of 16,000 bytes. Below is the code that makes it happen.

case CGALO2:
	_DX=_height>>1;
	_SI=FP_OFF(src);
	_DI=FP_OFF(_buffer);
	_CX=FP_SEG(src);
	_AX=FP_SEG(_buffer);
	_BX=_row_bytes;
	_DS=_CX;
	_ES=_AX;

xlatelo:
	asm {
		mov cx,bx
	}
xeven:
	asm {
		lodsd				// 5
								//                       HHHH HHHH LLLL LLLL
		and		eax, 0x03030303	// 2 0000 00xx 0000 00xx 0000 00xx 0000 00xx
		shl		al, 2       	// 3 0000 00xx 0000 00xx 0000 00xx 0000 xx00
		or		al, ah      	// 2 0000 00xx 0000 00xx 0000 00xx 0000 xxxx
		ror		eax, 16     	// 3 0000 00xx 0000 xxxx 0000 00xx 0000 00xx
		shl		al, 2       	// 3 0000 00xx 0000 xxxx 0000 00xx 0000 xx00
		or		al, ah      	// 2 0000 00xx 0000 xxxx 0000 00xx 0000 xxxx
		shl		ax, 12			//   0000 00xx 0000 xxxx xxxx 0000 0000 0000
		shr		eax,12			//   0000 0000 0000 0000 00xx 0000 xxxx xxxx
		stosb               	// 4

		loop	xeven       	//

		push	di
		add		di, 0x1FB0
		mov		cx,bx
	}                       // 44
xodd:
	asm {

		lodsd				// 5
								//                       HHHH HHHH LLLL LLLL
		and		eax, 0x03030303	// 2 0000 00xx 0000 00xx 0000 00xx 0000 00xx
		shl		al, 2       	// 3 0000 00xx 0000 00xx 0000 00xx 0000 xx00
		or		al, ah      	// 2 0000 00xx 0000 00xx 0000 00xx 0000 xxxx
		ror		eax, 16     	// 3 0000 00xx 0000 xxxx 0000 00xx 0000 00xx
		shl		al, 2       	// 3 0000 00xx 0000 xxxx 0000 00xx 0000 xx00
		or		al, ah      	// 2 0000 00xx 0000 xxxx 0000 00xx 0000 xxxx
		shl		ax, 12			//   0000 00xx 0000 xxxx xxxx 0000 0000 0000
		shr		eax,12			//   0000 0000 0000 0000 00xx 0000 xxxx xxxx
		stosb               	// 4

		loop	xodd       	//

		pop		di

		dec		dx
		jnz		xlatelo
	}                       // 44

There are a few things in that code which may cause you to scratch your head, so let me explain a few things to those who are not familiar with Borland C and x86 ASM. Borland C allows inline assembler but labels must be C labels, so it makes the syntax a little messy when writing a long stanza of assembler.

The pseudo-registers in the C code must be loaded in a specific order because the macros FP_OFF() and FP_SEG() will use intermediate registers, clobbering data you’ve already loaded. You must load the segment registers last by copying from another register.

The loop instruction uses the CX register as a loop counter, so if you have nested loops then you must allocate another register and do the decrement and checking manually. CX is used to count the number of bytes in a line of pixels, while DX is used as the counter for the number of lines on the screen (divided by 2, because we inline the even and odd lines in one iteration).

The algorithm uses 32bit instructions for speed and convenience, since 4 bytes in the source buffer are translated to 1 byte in the screen buffer. I used lodsd, which loads a double-word, to get 4 contiguous pixels. To the right of each instruction you will see the effective bit pattern after each is executed. The 8bit values are masked to 2 bits, this instruction could be eliminated, but it makes debugging easier.

The Intel 386 register layout is partitioned in this manner:

eax 0x01020304 32bits contiguous
 ax     0x0304 16bits low word of eax
 ah     0x03    8bits high byte of ax
 al       0x04  8bits low byte of ax

So ax, ah, and al allow you to address incrementally smaller portions of eax, which is useful for doing bitwise operations. I do a shift left of al so it’s above the 2 lower bits of ah, then I bitwise OR them to create a 4 bit value in the low nibble of al. Then I do a rotate of eax by 16bits so I can work on the high word. The same shift and OR is performed to get a 4bit nibble in the low byte. The magic comes from shifting ax left by 12 bits so there are 8 contiguous bits split across the boundary of the upper word of eax and the lower word (ax). You can’t directly address the upper word of eax at the byte level, so it’s effectively masked. By moving the bits in the low word and shifting everything down, I end up with 8 bits that are in reverse order of the source, are packed, and can be directly stored in CGA video ram.

I realize we’re talking about technology that was released 38 years ago, and these techniques wouldn’t be used in the 80’s because wasting memory on linear buffers would seem crazy, but today it’s fine as an educational tool. This technique could be adapted to a packed pixel format, but the more memory efficient you make the buffer, the more cycles it will take to act on the buffer. The method I chose is memory inefficient, but much more cycle efficient. When you are targeting slow computers, cycles are a precious resource you can’t change, but memory restrictions can be worked around through overlays and just-in-time loading from disk. I intend to rewrite all of the 32bit assembler sequences to also support 16bit CPUs, meaning it will be possible to run the code on original PC/XT and AT hardware. For now the minimum CPU requirement is a 386.

Leave a Reply

Your email address will not be published. Required fields are marked *