notes.txt

https://github.com/kyz/libmspack/tree/master

COMPRESS.EXE file formats: SZDD and KWAJ

This document describes the SZDD and KWAJ file formats which are implemented in the MS-DOS commands COMPRESS.EXE and EXPAND.EXE.

Both formats compress a single file to another single file, replacing the last character in the filename with an underscore or dollar character, e.g. README.TXT becomes README.TX_ or README.TX$.
SZDD file format

An SZDD file begins with this fixed header:
SZDD header format Offset	Length	Description
0x00	8	"SZDD" signature: 0x53,0x5A,0x44,0x44,0x88,0xF0,0x27,0x33
0x08	1	Compression mode: only "A" (0x41) is valid here
0x09	1	The character missing from the end of the filename (0=unknown)
0x0A	4	The integer length of the file when unpacked

The header is immediately followed by the compressed data. The following pseudocode explains how to unpack this data; it's a form of the LZSS algorithm.
SZDD decompression pseudocode

char window[4096];
int pos = 4096 - 16;
memset(window, 0x20, 4096); /* window initially full of spaces */
for (;;) {
    int control = GETBYTE();
    if (control == EOF) break; /* exit if no more to read */
    for (int cbit = 0x01; cbit & 0xFF; cbit <<= 1) {
        if (control & cbit) {
            /* literal */
            PUTBYTE(window[pos++] = GETBYTE());
        }
        else {
            /* match */
            int matchpos = GETBYTE();
            int matchlen = GETBYTE();
            matchpos |= (matchlen & 0xF0) << 4;
            matchlen = (matchlen & 0x0F) + 3;
            while (matchlen--) {
                PUTBYTE(window[pos++] = window[matchpos++]);
                pos &= 4095; matchpos &= 4095;
            }
        }
    }
}

There is also a variant SZDD format seen in the installation package for QBasic 4.5, so I call it the QBasic variant. It has a different header and the pos variable in the pseudocode above is set to 4096-18 instead of 4096-16.
QBasic SZDD variant header format Offset	Length	Description
0x00	8	"SZ" signature: 0x53,0x5A,0x20,0x88,0xF0,0x27,0x33,0xD1
0x08	4	The integer length of the file when unpacked
KWAJ file format

A KWAJ file begins with this fixed header:
KWAJ header format Offset	Length	Description
0x00	8	"KWAJ" signature: 0x4B,0x57,0x41,0x4A,0x88,0xF0,0x27,0xD1
0x08	2	compression method (0-4)
0x0A	2	file offset of compressed data
0x0C	2	header flags to mark header extensions
Compression methods

The "compression method" field indicates the type of data compression used:

    No compression
    No compression, data is XORed with byte 0xFF
    The same compression method as the QBasic variant of SZDD
    LZ + Huffman "Jeff Johnson" compression
    MS-ZIP

Header extensions

Header extensions immediately follow the header.

If you don't care about the header extensions, use the file offset to skip to the compressed data.

The header extensions appear in this order:

When header flags bit 0 is set
    4 bytes: decompressed length of file
When header flags bit 1 is set
    2 bytes: unknown purpose
When header flags bit 2 is set
    2 bytes: length of data, followed by that many bytes of (unknown purpose) data
When header flags bit 3 is set
    1-9 bytes: null-terminated string with max length 8: file name
When header flags bit 4 is set
    1-4 bytes: null-terminated string with max length 3: file extension
When header flags bit 5 is set
    2 bytes: length of data, followed by that many bytes of (arbitrary text) data

KWAJ compression method 3

Compression method 3 is unique to the KWAJ format. It's an LZ+Huffman algorithm created by Jeff Johnson.

Bits are always read from MSB to LSB, one byte at a time.

There are three parts:

    The data starts off with 6 nybbles; 4 bits each. Each nybble is between 0-3 and is the encoding type of the 5 huffman length lists to follow. The 6th nybble is just padding.
    Then follow 5 huffman code length lists.
    Then follows the compressed data, which is a mix of huffman symbols and raw bits.

Huffman code length lists

KWAJ uses 5 huffman trees. They always have the same number of symbols in them. They are, in order:

    16 symbol tree (0-15) to store match run lengths (MATCHLEN)
    16 symbol tree (0-15) to store match run lengths immediately following a short literal run (MATCHLEN2)
    32 symbol tree (0-31) to store literal run lengths (LITLEN)
    64 symbol tree (0-63) to store the upper 6 bits of match distances (OFFSET)
    256 symbol tree (0-255) to store literals (LITERAL)

Canonical huffman codes are used, which means you simply need to know how many symbols in each huffman tree (given above), and how long each huffman symbol is

How the symbol lengths are encoded depends on the encoding type, as given by the 6 nybbles at the start of the compressed data.

Symbol lengths are read in ascending order, and the number of symbols to read is implied by which tree you're defining.

Huffman code length list, encoding type 0
    All symbol have the same length, implied by the number of symbols in the tree:

        16 symbols -> all symbols are length 4
        32 symbols -> all symbols are length 5
        64 symbols -> all symbols are length 6
        256 symbols -> all symbols are length 8

    You don't need to read anything.

Huffman code length list, encoding type 1
    A run-length encoding is used:

        read 4 bits for the first symbol length (0-15)
        LOOP:
            read 1 bit == 0 if symbol length is the same as the previous, OTHERWISE:
            read 1 bit == 0 if symbol length is previous + 1, OTHERWISE:
            read 4 bits for symbol length (0-15)

Huffman code length list, encoding type 2
    Another run-length encoding is used:

        read 4 bits for the first symbol length (0-15)
        LOOP:
            read 2 bits as selector (0-3):
                selector == 3: read 4 bits for symbol length, OTHERWISE:
                symbol length is previous symbol + (selector-1), i.e. -1, 0 or +1

Huffman code length list, encoding type 3
    There is no compression. Read 4 bits per symbol (0-15).

Compressed data

At this point, the compressed data begins.

We have a 4096 byte ring buffer, initially filled with byte 0x20 (ASCII space). Unlike the SZDD format, the starting position in the buffer is irrelevant, as match positions are stored relative to the current position in the window, not as absolute positions in the window.

Pseudo-code:

 ring buffer position = 4096-17
 selected table = MATCHLEN
 LOOP:
     code = read huffman code using selected table (MATCHLEN or MATCHLEN2)
     if EOF reached, exit loop
     if code > 0, this is a match:
         match length = code + 2
         x = read huffman code using OFFSET table
         y = read 6 bits
         match offset = current ring buffer position - (x<<6 | y)
         copy match as output and into the ring buffer
         selected table = MATCHLEN
     if code == 0, this is a run of literals:
         x = read huffman code using LITLEN table
         if x != 31, selected table = MATCHLEN2
         read {x+1} literals using LITERAL huffman table, copy as output and into the ring buffer

MS-ZIP
KWAJ type 4 compression is called MS-ZIP, because it is almost identical to the MS-ZIP compression found in Microsoft Cabinet files. Each 32768 bytes of data is compressed independently using Phil Katz's DEFLATE algorithm. However, the history window is shared between blocks, so they must be unpacked in order. The format of each block is as follows:
KWAJ MS-ZIP block format Offset	Length	Description
0	2	Compressed length of this block (n). Stored in Intel byte order. Doesn't include these two bytes.
2	2	"CK" in ASCII (0x43, 0x4B)
4	n-2	Data compressed in DEFLATE format

The final block will unpack to 1-32768 bytes. It will be followed by two zero bytes.