Skip to content

BPS (old doc)

Boris Timofeev edited this page Jan 21, 2017 · 1 revision

This text from old site http://byuu.org. A copy of the page is in Internet Archive

BPS file format specification

char[4] signature ("BPS1")
varint source-file-size
varint target-file-size
varint metadata-size
utf8[metadata-size] metadata
uint8[] patchdata
uint32 source-file-checksum
uint32 target-file-checksum
uint32 patch-file-checksum

varint encoding and decoding

BPS supports encoding of infinitely-sized numbers. It does this by using a bit-streaming technique. C implementations follow:

void encode(uint64_t data) {
    while(true) {
        uint64_t x = data & 0x7f;
        data >>= 7;
        if(data == 0) {
            write(0x80 | x);
            break;
        }
        write(x);
        data--;
    }
}


uint64_t decode() {
    uint64_t data = 0, shift = 1;
    while(true) {
        uint8_t x = read();
        data += (x & 0x7f) * shift;
        if(x & 0x80) break;
        shift <<= 7;
        data += shift;
    }
    return data;
}

The idea is that the top bit indicates if we are finished decoding yet, and the lower bits make up part of the address. In this way, and combined with relative offset addressing, we can amortize all values to a single byte.

That is to say, while IPS always requires three bytes for every offset, and two bytes for every length; BPS averages to just barely over one byte per offset or length encoding.

Metadata

Metadata stores an XML file in UTF-8 encoding. The string is not null-terminated. Because binary differencing has unlimited potential uses, there is no rigid standard for this data. XML is extensible, which is the whole point here.

In the future, specifications for certain types of patches, eg SNES patches, may be defined.

Patch Data

The patch data contains zero or more blocks that specify how to create the modified file from the patch and original file.

The way BPS works, is that it encodes every single byte of the modified file. It does not have seek commands that skip around in the file, and it does not store target write addresses. It streams every byte, from the very first, to the very last.

To do this, BPS implements four commands:

%00 - SourceRead(length)
%01 - TargetRead(length, data[length])
%10 - SourceCopy(length, offset)
%11 - TargetCopy(length, offset)

To process the patch data, you simply keep parsing it until the remaining bytes left in the last are less than or equal to 12 (eg once the checksum footer section is reached.)

First, you decode() a varint. The low two bits of this value specify the command mode, as listed above. Shift this value right by two, and then add one, and you have the length, which is used by all commands. We add one, because it makes no sense to store a zero-length write command.

If this is a SourceCopy or TargetCopy command, call decode() again. This time, the lowest bit represents the sign. 0 = positive, 1 = negative. Shift the value right by one, and you have your offset. If the sign is 1 (negative), then negate the offset. The reason we store the sign-bit at the bottom, is because the encode/decode method does not allow us to encode twos-complement negative numbers in the traditional manner.

SourceRead

SourceRead copies bytes from the original file to the target output file. It copies from sourceData[outputOffset] to targetData[outputOffset], where outputOffset is the current write location for the output file.

This command is used to store data that is unchanged between the original and modified files.

TargetRead

TargetRead copies bytes from the patch file to the target output file. Immediately after the decoded varint, there is a stream of bytes equal to the length. Copy these and write them to targetData[outputOffset++].

This is the primary method for storing changes to the original data.

SourceCopy

This command treats the source file as a large dictionary. When patch application begins, sourceRelativeOffset is set to zero. The offset will be a signed value that is added to the sourceRelativeOffset. Once done, for the number of bytes requested by length, copy sourceData[sourceRelativeOffset++] to targetData[outputOffset++].

The sourceRelativeOffset now points to the end of the copy that was just done. The reason we store this as a relative value, is because smaller numbers consume less bytes with our encode/decode method. The general thinking is that the source and target files progress linearly, so most likely the next SourceCopy command will be a small number of bytes forward. In this way, we can amortize the encoded offset size to a single byte.

TargetCopy

This command treats all of the data that we have already written to the new target file as a large dictionary, and is also used for RLE encoding.

The idea is that sometimes there will be repeating data in the newly modified data, and this command can catch and encode that, without the need to store all of the bytes over and over with TargetRead.

It is also very effective for RLE encoding, which is very useful for expanded file padding. The trick to using this as an RLE command is to store a given number of bytes using TargetRead. Let's say TargetRead just copied over { 0x00, 0xff }. We can now perform a TargetCopy, starting from outputOffset - 2, with a length of 65,534. As the patch applier works one byte at a time, it will read the first two bytes as it writes them to the output. Now it is pointing at the newly written second instance of the two-byte pattern, and it will repeat. We have now encoded a 64KB binary sequence into two very small commands, typically about five bytes in size. Also note how unlike traditional RLE, we can encode more than one-byte repetitions as well.

TargetCopy has its own index, targetRelativeOffset, which is also initialized to zero at the start of the patch application process.

Checksum Data

At the end of the file, the source, target and patch CRC32 checksums are stored. The reason we store these at the bottom of the file, is so that the patch creator or applier can compute them as the files are processed, and patch creation will not need to seek back and rewrite sections of the header.

The patch checksum itself is special: it does not include the checksum value itself as part of the checksum. So this is equal to the checksum of patchData(patchSize - 4). The checksum would obviously change as the checksum was being written, so we don't count that part. Although it's possible to use trickery to do this anyway, it goes against BPS' goal of remaining a simple patching format. Speaking of which, this is why CRC32 is used, instead of stronger hashing algorithms such as SHA256: the checks are meant to ensure valid data is there, not to provide security against malicious attacks that try and disguise modified files as the originals.

Clone this wiki locally