Junk data

Jocelyn Beedie edited this page Jun 13, 2020 · 2 revisions

If you see me refer to some data in a file as 'junk data', I literally mean that the data is completely meaningless. A lot of files contain data that has no real 'function' in relation to the rest of the file. This makes the files difficult to understand and pick apart, so knowing what is and is not 'junk' is very beneficial.

How junk data is created

Junk data can exist for many reasons. Typically it comes from padding data to disc sectors, leftover bytes in fixed-length C strings, struct alignment, or conditional data.

Padding

A lot of junk data tends to be used for padding. For example, a lot of the data in DGC archives is zeroes, so that the contents of the archive align to 0x800-byte boundaries. Since the GameCube was disc based, this made disc reads faster.
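As a rough sketch of the arithmetic involved, the helpers below compute how many padding bytes follow a block that ends at a given offset. The names `SECTOR_SIZE`, `align_up`, and `padding_after` are made up for illustration; only the 0x800 alignment comes from the example above.

```c
#include <stddef.h>

/* Alignment from the example above (0x800 = 2048 bytes). */
#define SECTOR_SIZE 0x800u

/* Round offset up to the next multiple of SECTOR_SIZE. */
size_t align_up(size_t offset)
{
    return (offset + SECTOR_SIZE - 1) / SECTOR_SIZE * SECTOR_SIZE;
}

/* Number of padding bytes needed after a block that ends at offset. */
size_t padding_after(size_t offset)
{
    return align_up(offset) - offset;
}
```

When picking a file apart, a run of zeroes that ends exactly at a multiple of 0x800 is a strong hint that it is padding of this kind rather than real data.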

C strings

This usually arises when uninitialized memory is written to a file. Uninitialized memory is memory that has been allocated, but has not been written to. Consider the following code:

char data[20];
strcpy(data, "Hello, world!"); // requires <string.h>

The first 14 bytes of 'data' hold the string "Hello, world!" (null terminator included). If 'data' is a local variable, the last 6 bytes are uninitialized, and could be anything (but this doesn't matter in practice because everything after the null terminator is ignored). Note that initializing the array directly, as in char data[20] = "Hello, world!";, would zero-fill the remaining bytes instead.
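When reading such a field back out of a file, only the bytes before the first null terminator matter. The sketch below shows one way to separate the real string from the leftover bytes in a fixed-length field; `field_strlen` and `field_junk_bytes` are hypothetical helper names, not part of any existing tool.

```c
#include <stddef.h>
#include <string.h>

/* Length of the meaningful string in a fixed-size field: everything
 * before the first null byte, or the whole field if there is none. */
size_t field_strlen(const char *field, size_t field_size)
{
    const char *nul = memchr(field, '\0', field_size);
    return nul ? (size_t)(nul - field) : field_size;
}

/* Number of bytes after the null terminator that carry no meaning. */
size_t field_junk_bytes(const char *field, size_t field_size)
{
    size_t len = field_strlen(field, field_size);
    return len < field_size ? field_size - len - 1 : 0;
}
```

For a 20-byte field holding "Hello, world!", this gives a string length of 13 and 6 junk bytes, matching the breakdown above.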

Struct alignment

Sometimes junk data arises from struct alignment. For example, consider the following structure:

struct MyData {
    uint8_t x;
    uint32_t y;
};

Many platforms can only read multi-byte values from suitably aligned addresses, or read them more slowly otherwise. A single byte can be read from anywhere in memory, but a 32-bit integer typically must sit at an address that is a multiple of 4. To keep y aligned to 4 bytes, a C compiler lays the structure out as if it had been written like this:

struct MyData {
    uint8_t x;
    uint8_t __padding[3]; // 3 bytes of padding
    uint32_t y;
};

When creating instances of MyData, the padding is typically left uninitialized, and if the structure is written to a file as raw bytes, that same uninitialized data is written along with it. For example, the unknown13 value in LIGHT files is padded with three junk bytes.
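The amount of padding a compiler inserts can be measured rather than guessed. The following is a minimal sketch using `offsetof` and `sizeof`; the helper names are illustrative, and the exact numbers depend on the platform and compiler (3 bytes before y is typical, but not guaranteed).

```c
#include <stddef.h>
#include <stdint.h>

struct MyData {
    uint8_t x;
    uint32_t y;
};

/* Bytes the compiler inserted between x and y. */
size_t padding_before_y(void)
{
    return offsetof(struct MyData, y)
         - (offsetof(struct MyData, x) + sizeof(uint8_t));
}

/* Total bytes in the struct that belong to no member. */
size_t total_padding(void)
{
    return sizeof(struct MyData) - (sizeof(uint8_t) + sizeof(uint32_t));
}
```

If a file format was produced by dumping structs like this directly, small unexplained gaps before wider fields are good candidates for alignment padding.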

Conditional data

Some data is only written to under certain conditions, and stays uninitialized otherwise. This kind of data is the hardest to identify, since it appears to be junk data, and on occasion may only be initialized in extreme conditions. Although not known for sure, there is some data in MATERIAL files that might be conditional data.

Determining junk data

It can be difficult to discern junk data from real data. Fortunately, there are many methods of determining what is junk and what isn't.

A lot of files tend to be copied between different Totem archives. For example, consider the file "DB:>PERSO>PATRICK>TEXTURES>PATRICK.TGA", which is copied between seven different archives. This is a BITMAP file, so it contains texture data. However, if you compare the MD5 sums of these files, you get the following:

e74bfa380e5b01feff04f18f62ab2da6  extracted/rotfd/DATA/JF/LVL_JFCL/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
a10b4032c762e19918c30ca298264a64  extracted/rotfd/DATA/GL/LVL_GLBE/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
a037999ce3a3ec135d26420801e17fcc  extracted/rotfd/DATA/BB/LVL_BBSH/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
459549f6f810bcabf07c6bc4b9cfb422  extracted/rotfd/DATA/BB/LVL_BBTP/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
eadaad30a042836ca0c124fc0a77ca87  extracted/rotfd/DATA/BB/LVL_BBEX/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
9dd38052e384393569c70130b2df4b95  extracted/rotfd/DATA/DG/LVL_DGBA/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA
2114a4af911ab5f0ace8960b28d09eb2  extracted/rotfd/DATA/DG/LVL_DGDS/DB__PERSO_PATRICK_TEXTURES_PATRICK_DEE350CD_.TGA

None of these checksums match! Yet these are all copies of the same texture, so the actual image data they contain is exactly the same. A neat little command line tool called "VBinDiff" can be used to determine roughly where the junk data is. Comparing two of the files shows where the junk data lies:

[Image: VBinDiff output for patrick textures]

A lot of the junk data still matches between the two files (of course, there are only 256 values that a byte can take, and zeroes tend to be very common among junk data), so this method isn't perfect. But it can provide a good starting visualization for junk data.
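The same comparison can be scripted when there are many copies to check. The sketch below is not how VBinDiff works internally; it is just a simple byte-by-byte scan that reports runs where two equally sized buffers differ. `DiffRange` and `next_diff_range` are hypothetical names, and loading the files into memory is left out.

```c
#include <stddef.h>

/* A half-open byte range [start, end). */
struct DiffRange {
    size_t start;
    size_t end;
};

/* Find the first run of differing bytes at or after `from`.
 * Returns 1 and fills `out` if a run was found, 0 otherwise. */
int next_diff_range(const unsigned char *a, const unsigned char *b,
                    size_t len, size_t from, struct DiffRange *out)
{
    size_t i = from;
    while (i < len && a[i] == b[i])   /* skip matching bytes */
        i++;
    if (i >= len)
        return 0;
    out->start = i;
    while (i < len && a[i] != b[i])   /* extend over differing bytes */
        i++;
    out->end = i;
    return 1;
}
```

Calling it repeatedly with `from` set to the previous `out->end` walks through every differing run. As noted above, bytes that happen to match may still be junk, so the reported ranges are a lower bound.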

Sometimes, however, only one version of a file exists. In that case, data can sometimes be identified as junk by looking for sections of the file that appear to contain uninitialized memory. A common tell is data that belongs elsewhere, such as stray file paths or other text (usually cut off at the beginning and/or end).
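One way to look for such stray text is to scan for runs of printable ASCII, much like the Unix `strings` utility does. The following is a minimal sketch under that assumption; `next_text_run` and its parameters are made up for illustration, and a hit only marks a region worth inspecting, since legitimate data can contain text too.

```c
#include <stddef.h>

/* True for printable ASCII (space through tilde). */
static int is_printable(unsigned char c)
{
    return c >= 0x20 && c <= 0x7E;
}

/* Find the first run of at least `min_len` printable bytes at or
 * after `from`. On success, stores the run's start and length and
 * returns 1; returns 0 if no such run exists. */
int next_text_run(const unsigned char *buf, size_t len, size_t from,
                  size_t min_len, size_t *start, size_t *run_len)
{
    size_t i = from;
    while (i < len) {
        size_t begin = i;
        while (i < len && is_printable(buf[i]))
            i++;
        if (i > begin && i - begin >= min_len) {
            *start = begin;
            *run_len = i - begin;
            return 1;
        }
        if (i == begin)
            i++;                      /* skip one non-printable byte */
    }
    return 0;
}
```

A fragment like "ATRICK.TGA" turning up in the middle of a file that has no reason to store file names is the kind of result that suggests leftover memory.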
