-
Notifications
You must be signed in to change notification settings - Fork 0
/
notes.txt
179 lines (138 loc) · 7.6 KB
/
notes.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
https://github.com/kyz/libmspack/tree/master
COMPRESS.EXE file formats: SZDD and KWAJ
This document describes the SZDD and KWAJ file formats which are implemented in the MS-DOS commands COMPRESS.EXE and EXPAND.EXE.
Both formats compress a single file to another single file, replacing the last character in the filename with an underscore or dollar character, e.g. README.TXT becomes README.TX_ or README.TX$.
SZDD file format
An SZDD file begins with this fixed header:
SZDD header format Offset Length Description
0x00 8 "SZDD" signature: 0x53,0x5A,0x44,0x44,0x88,0xF0,0x27,0x33
0x08 1 Compression mode: only "A" (0x41) is valid here
0x09 1 The character missing from the end of the filename (0=unknown)
0x0A 4 The integer length of the file when unpacked
The header is immediately followed by the compressed data. The following pseudocode explains how to unpack this data; it's a form of the LZSS algorithm.
SZDD decompression pseudocode
char window[4096];
int pos = 4096 - 16;
memset(window, 0x20, 4096); /* window initially full of spaces */
for (;;) {
int control = GETBYTE();
if (control == EOF) break; /* exit if no more to read */
for (int cbit = 0x01; cbit & 0xFF; cbit <<= 1) {
if (control & cbit) {
/* literal */
PUTBYTE(window[pos++] = GETBYTE());
}
else {
/* match */
int matchpos = GETBYTE();
int matchlen = GETBYTE();
matchpos |= (matchlen & 0xF0) << 4;
matchlen = (matchlen & 0x0F) + 3;
while (matchlen--) {
PUTBYTE(window[pos++] = window[matchpos++]);
pos &= 4095; matchpos &= 4095;
}
}
}
}
There is also a variant SZDD format seen in the installation package for QBasic 4.5, so I call it the QBasic variant. It has a different header and the pos variable in the pseudocode above is set to 4096-18 instead of 4096-16.
QBasic SZDD variant header format Offset Length Description
0x00 8 "SZ" signature: 0x53,0x5A,0x20,0x88,0xF0,0x27,0x33,0xD1
0x08 4 The integer length of the file when unpacked
KWAJ file format
A KWAJ file begins with this fixed header:
KWAJ header format Offset Length Description
0x00 8 "KWAJ" signature: 0x4B,0x57,0x41,0x4A,0x88,0xF0,0x27,0xD1
0x08 2 compression method (0-4)
0x0A 2 file offset of compressed data
0x0C 2 header flags to mark header extensions
Compression methods
The "compression method" field indicates the type of data compression used:
No compression
No compression, data is XORed with byte 0xFF
The same compression method as the QBasic variant of SZDD
LZ + Huffman "Jeff Johnson" compression
MS-ZIP
Header extensions
Header extensions immediately follow the header.
If you don't care about the header extensions, use the file offset to skip to the compressed data.
The header extensions appear in this order:
When header flags bit 0 is set
4 bytes: decompressed length of file
When header flags bit 1 is set
2 bytes: unknown purpose
When header flags bit 2 is set
2 bytes: length of data, followed by that many bytes of (unknown purpose) data
When header flags bit 3 is set
1-9 bytes: null-terminated string with max length 8: file name
When header flags bit 4 is set
1-4 bytes: null-terminated string with max length 3: file extension
When header flags bit 5 is set
2 bytes: length of data, followed by that many bytes of (arbitrary text) data
KWAJ compression method 3
Compression method 3 is unique to the KWAJ format. It's an LZ+Huffman algorithm created by Jeff Johnson.
Bits are always read from MSB to LSB, one byte at a time.
There are three parts:
The data starts off with 6 nybbles; 4 bits each. Each nybble is between 0-3 and is the encoding type of the 5 huffman length lists to follow. The 6th nybble is just padding.
Then follow 5 huffman code length lists.
Then follows the compressed data, which is a mix of huffman symbols and raw bits.
Huffman code length lists
KWAJ uses 5 huffman trees. They always have the same number of symbols in them. They are, in order:
16 symbol tree (0-15) to store match run lengths (MATCHLEN)
16 symbol tree (0-15) to store match run lengths immediately following a short literal run (MATCHLEN2)
32 symbol tree (0-31) to store literal run lengths (LITLEN)
64 symbol tree (0-63) to store the upper 6 bits of match distances (OFFSET)
256 symbol tree (0-255) to store literals (LITERAL)
Canonical huffman codes are used, which means you simply need to know how many symbols in each huffman tree (given above), and how long each huffman symbol is
How the symbol lengths are encoded depends on the encoding type, as given by the 6 nybbles at the start of the compressed data.
Symbol lengths are read in ascending order, and the number of symbols to read is implied by which tree you're defining.
Huffman code length list, encoding type 0
All symbol have the same length, implied by the number of symbols in the tree:
16 symbols -> all symbols are length 4
32 symbols -> all symbols are length 5
64 symbols -> all symbols are length 6
256 symbols -> all symbols are length 8
You don't need to read anything.
Huffman code length list, encoding type 1
A run-length encoding is used:
read 4 bits for the first symbol length (0-15)
LOOP:
read 1 bit == 0 if symbol length is the same as the previous, OTHERWISE:
read 1 bit == 0 if symbol length is previous + 1, OTHERWISE:
read 4 bits for symbol length (0-15)
Huffman code length list, encoding type 2
Another run-length encoding is used:
read 4 bits for the first symbol length (0-15)
LOOP:
read 2 bits as selector (0-3):
selector == 3: read 4 bits for symbol length, OTHERWISE:
symbol length is previous symbol + (selector-1), i.e. -1, 0 or +1
Huffman code length list, encoding type 3
There is no compression. Read 4 bits per symbol (0-15).
Compressed data
At this point, the compressed data begins.
We have a 4096 byte ring buffer, initially filled with byte 0x20 (ASCII space). Unlike the SZDD format, the starting position in the buffer is irrelevant, as match positions are stored relative to the current position in the window, not as absolute positions in the window.
Pseudo-code:
ring buffer position = 4096-17
selected table = MATCHLEN
LOOP:
code = read huffman code using selected table (MATCHLEN or MATCHLEN2)
if EOF reached, exit loop
if code > 0, this is a match:
match length = code + 2
x = read huffman code using OFFSET table
y = read 6 bits
match offset = current ring buffer position - (x<<6 | y)
copy match as output and into the ring buffer
selected table = MATCHLEN
if code == 0, this is a run of literals:
x = read huffman code using LITLEN table
if x != 31, selected table = MATCHLEN2
read {x+1} literals using LITERAL huffman table, copy as output and into the ring buffer
MS-ZIP
KWAJ type 4 compression is called MS-ZIP, because it is almost identical to the MS-ZIP compression found in Microsoft Cabinet files. Each 32768 bytes of data is compressed independently using Phil Katz's DEFLATE algorithm. However, the history window is shared between blocks, so they must be unpacked in order. The format of each block is as follows:
KWAJ MS-ZIP block format Offset Length Description
0 2 Compressed length of this block (n). Stored in Intel byte order. Doesn't include these two bytes.
2 2 "CK" in ASCII (0x43, 0x4B)
4 n-2 Data compressed in DEFLATE format
The final block will unpack to 1-32768 bytes. It will be followed by two zero bytes.