tar archives with large UIDs/GIDs not detected #307

chrisnovakovic · 2022-07-13T13:16:50Z

Attach the file for which the detection is inaccurate

Run this through xxd -r. (I'll also include it as a new test case in the PR I'm about to open.)

Expected MIME type

application/x-tar

Returned MIME type

application/octet-stream (no match)

Version of the library you are using

v1.4.1 (although the bug also exists on master)

Output of go version

go version go1.17.8 linux/amd64

Additional context

I work on a Linux system with large UIDs and GIDs - both of mine are 895295090. Creating a tar archive from files owned by this user with GNU tar 1.32 results in the following tar fields for those files:

uid (byte offset 0x6c): 80 00 00 00 35 5d 1e 72
gid (byte offset 0x74): 80 00 00 00 35 5d 1e 72

GNU tar's internals documentation explains the format of these fields (emphasis mine):

For fields containing numbers or timestamps that are out of range for the basic format, the GNU format uses a base-256 representation instead of an ASCII octal number. If the leading byte is 0xff (255), all the bytes of the field (including the leading byte) are concatenated in big-endian order, with the result being a negative number expressed in two’s complement form. If the leading byte is 0x80 (128), the non-leading bytes of the field are concatenated in big-endian order, with the result being a positive number expressed in binary form. Leading bytes other than 0xff, 0x80 and ASCII octal digits are reserved for future use, as are base-256 representations of values that would be in range for the basic format.
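To make the quoted rules concrete, here is a minimal Go sketch of decoding one numeric header field. `parseTarNumeric` is a hypothetical helper written for this illustration; it is not part of mimetype or GNU tar. It assumes the value fits in an `int64`, and it treats any leading byte other than 0x80 or 0xff as the start of a padded ASCII octal number.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

var errOverflow = errors.New("tar numeric field does not fit in int64")

// parseTarNumeric (hypothetical helper) decodes one numeric tar header field,
// following the rules quoted from the GNU tar documentation.
func parseTarNumeric(field []byte) (int64, error) {
	if len(field) == 0 {
		return 0, errors.New("empty tar numeric field")
	}
	switch field[0] {
	case 0x80:
		// Positive base-256: concatenate the non-leading bytes, big-endian.
		var v int64
		for _, b := range field[1:] {
			if v>>55 != 0 { // the next shift would spill into the sign bit
				return 0, errOverflow
			}
			v = v<<8 | int64(b)
		}
		return v, nil
	case 0xff:
		// Negative base-256: all bytes, including the leading one, form a
		// big-endian two's complement number. Start from -1 to sign-extend.
		v := int64(-1)
		for _, b := range field {
			if v>>55 != -1 { // the next shift would lose the sign
				return 0, errOverflow
			}
			v = v<<8 | int64(b)
		}
		return v, nil
	default:
		// Basic format: ASCII octal digits padded with spaces and/or NULs.
		s := strings.Trim(string(field), " \x00")
		if s == "" {
			return 0, nil
		}
		return strconv.ParseInt(s, 8, 64)
	}
}

func main() {
	// The uid field bytes reported above (byte offset 0x6c).
	uid := []byte{0x80, 0x00, 0x00, 0x00, 0x35, 0x5d, 0x1e, 0x72}
	v, err := parseTarNumeric(uid)
	fmt.Println(v, err) // 895295090 <nil>
}
```

Decoding the eight bytes above gives 895295090, matching the UID and GID mentioned earlier.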

mimetype's tar detection logic inspects bytes at a number of offsets within specific fields and checks whether they fall within given ranges. These ranges come from work by the National Archives, who derived them empirically from a corpus of tar archives they'd collected. This isn't very reliable, because (as demonstrated here) different archivers store values in certain fields in different formats, and presumably their corpus simply didn't contain any archives with UIDs or GIDs large enough to overflow tar's basic format and have to be stored in the GNU format instead.

IMO, it would be far more reliable to detect tar files based on the value of the magic field at byte offset 0x101 - I've yet to come across an archiver that doesn't set the first five bytes of this field to 75 73 74 61 72 (ustar). GNU tar fills the rest of the field with spaces (i.e. 75 73 74 61 72 20 20).
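A rough Go sketch of that check, for illustration only. `hasUstarMagic` and the constant names are made up here and are not mimetype's actual API; the sketch assumes the input starts at the first 512-byte header block of the archive.

```go
package main

import (
	"bytes"
	"fmt"
)

const (
	tarBlockSize = 512 // size of one tar header block
	magicOffset  = 257 // 0x101, start of the magic field
)

// hasUstarMagic (hypothetical helper) reports whether the first header block
// in raw carries "ustar" at the start of its magic field.
func hasUstarMagic(raw []byte) bool {
	magic := []byte("ustar")
	if len(raw) < magicOffset+len(magic) {
		return false
	}
	return bytes.HasPrefix(raw[magicOffset:], magic)
}

func main() {
	// POSIX layout: "ustar" + NUL, followed by the version "00".
	posix := make([]byte, tarBlockSize)
	copy(posix[magicOffset:], "ustar\x0000")

	// GNU layout: "ustar" padded with spaces.
	gnu := make([]byte, tarBlockSize)
	copy(gnu[magicOffset:], "ustar  \x00")

	// A v7-style header has no magic field at all.
	v7 := make([]byte, tarBlockSize)

	fmt.Println(hasUstarMagic(posix), hasUstarMagic(gnu), hasUstarMagic(v7)) // true true false
}
```

Comparing only the first five bytes sidesteps the difference between the NUL-terminated POSIX form and the space-padded GNU form. Headers without the magic field are not matched by this check.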

chrisnovakovic · 2022-07-13T13:45:56Z

Having just noticed testdata/tar.v7.tar, I guess we need to retain detection of pre-POSIX tar archives, so perhaps it's enough to add in the ustar magic field check and leave the rest in place to detect v7 archives.

gabriel-vasile · 2022-07-13T14:17:57Z

Yes, I think so too.

One side question: if you create a v7 tar on your machine, is it successfully detected with mimetype v1.4.1?

chrisnovakovic · 2022-07-13T14:20:10Z

> One side question: if you create a v7 tar on your machine, is it successfully detected with mimetype v1.4.1?

I was just wondering the same thing 🙂 I'm currently creating test archives for all the formats generated by GNU tar to make sure mimetype can detect them.
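For anyone without a large UID to hand, a similar header can probably be produced in-process with Go's standard `archive/tar` package rather than GNU tar itself. This is only a sketch of a stand-in: `gnuHeaderWithLargeID` is a hypothetical helper, and it exercises Go's GNU-format writer, not the GNU tar binary used to create the archives in this issue.

```go
package main

import (
	"archive/tar"
	"bytes"
	"fmt"
	"time"
)

// gnuHeaderWithLargeID (hypothetical helper) writes a one-entry GNU-format
// archive in memory and returns its first 512-byte header block.
func gnuHeaderWithLargeID(id int) ([]byte, error) {
	var buf bytes.Buffer
	tw := tar.NewWriter(&buf)
	hdr := &tar.Header{
		Typeflag: tar.TypeReg,
		Name:     "file.txt",
		Mode:     0644,
		Uid:      id,
		Gid:      id,
		ModTime:  time.Unix(0, 0),
		Format:   tar.FormatGNU, // ask explicitly for the GNU format
	}
	if err := tw.WriteHeader(hdr); err != nil {
		return nil, err
	}
	if err := tw.Close(); err != nil {
		return nil, err
	}
	return buf.Bytes()[:512], nil
}

func main() {
	h, err := gnuHeaderWithLargeID(895295090)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("uid   (0x6c):  % x\n", h[0x6c:0x74])
	fmt.Printf("gid   (0x74):  % x\n", h[0x74:0x7c])
	fmt.Printf("magic (0x101): %q\n", h[0x101:0x109])
}
```

If Go's writer encodes out-of-range values the same way GNU tar does, the uid and gid bytes printed here should match the ones reported at the top of this issue, and the magic field should begin with `ustar`.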

chrisnovakovic · 2022-07-13T15:05:43Z

> One side question: if you create a v7 tar on your machine, is it successfully detected with mimetype v1.4.1?

It is. I've included it as a new test case in #308.

Detect UStar tar archives

UStar tar archives have a `magic` header field at byte offset 257 (0x101) in each entry whose value begins with the string `ustar`. Identify them with the MIME type `application/x-tar`.

Also add test cases for a number of UStar-compatible formats, created by GNU tar 1.29 (with `--format=<format-name>`):

* `tar.gnu.tar`
* `tar.oldgnu.tar`
* `tar.posix.tar`
* `tar.ustar.tar`

as well as `tar.star.tar` (created by star 1.6) and, for completeness, `tar.v7-gnu.tar` (a v7 tar archive created by GNU tar 1.29).

Fixes #307.

chrisnovakovic mentioned this issue Jul 13, 2022

Detect UStar tar archives #308

Merged

gabriel-vasile closed this as completed in #308 Jul 18, 2022
