tar archives with large UIDs/GIDs not detected #307

Closed
chrisnovakovic opened this issue Jul 13, 2022 · 4 comments · Fixed by #308

Comments

@chrisnovakovic
Contributor

chrisnovakovic commented Jul 13, 2022

Attach the file for which the detection is inaccurate

Run this through xxd -r. (I'll also include it as a new test case in the PR I'm about to open.)

Expected MIME type

application/x-tar

Returned MIME type

application/octet-stream (no match)

Version of the library you are using

v1.4.1 (although the bug also exists on master)

Output of go version

go version go1.17.8 linux/amd64

Additional context

I work on a Linux system with large UIDs and GIDs - both of mine are 895295090. Creating a tar archive from files owned by this user with GNU tar 1.32 results in the following tar fields for those files:

  • uid (byte offset 0x6c): 80 00 00 00 35 5d 1e 72
  • gid (byte offset 0x74): 80 00 00 00 35 5d 1e 72

GNU tar's internals documentation explains the format of these fields (emphasis mine):

For fields containing numbers or timestamps that are out of range for the basic format, the GNU format uses a base-256 representation instead of an ASCII octal number. If the leading byte is 0xff (255), all the bytes of the field (including the leading byte) are concatenated in big-endian order, with the result being a negative number expressed in two’s complement form. If the leading byte is 0x80 (128), the non-leading bytes of the field are concatenated in big-endian order, with the result being a positive number expressed in binary form. Leading bytes other than 0xff, 0x80 and ASCII octal digits are reserved for future use, as are base-256 representations of values that would be in range for the basic format.
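To make the quoted rules concrete, here is a rough sketch of a decoder for a numeric header field. `parseNumeric` is a hypothetical helper written for this issue, not a function from mimetype or GNU tar, and it treats any leading byte other than 0x80 and 0xff as the start of an ASCII octal number.

```go
package main

import (
	"errors"
	"fmt"
	"strconv"
	"strings"
)

// parseNumeric decodes a tar numeric field that is either ASCII octal
// (basic format) or base-256 (GNU format).
func parseNumeric(field []byte) (int64, error) {
	if len(field) == 0 {
		return 0, errors.New("empty field")
	}
	switch field[0] {
	case 0x80:
		// Positive: concatenate the non-leading bytes in big-endian order.
		var v int64
		for _, b := range field[1:] {
			if v>>55 != 0 {
				return 0, errors.New("value overflows int64")
			}
			v = v<<8 | int64(b)
		}
		return v, nil
	case 0xff:
		// Negative: concatenate all bytes, including the leading one,
		// as a two's complement number. Starting from -1 sign-extends.
		v := int64(-1)
		for _, b := range field {
			if v>>55 != -1 {
				return 0, errors.New("value overflows int64")
			}
			v = v<<8 | int64(b)
		}
		return v, nil
	default:
		// Basic format: ASCII octal padded with spaces and/or NULs.
		s := strings.Trim(string(field), " \x00")
		if s == "" {
			return 0, nil
		}
		return strconv.ParseInt(s, 8, 64)
	}
}

func main() {
	uid := []byte{0x80, 0x00, 0x00, 0x00, 0x35, 0x5d, 0x1e, 0x72}
	v, err := parseNumeric(uid)
	fmt.Println(v, err) // prints: 895295090 <nil>
}
```

Decoding the uid bytes above gives 0x355d1e72, which is 895295090, so the field really does hold my UID in the base-256 form.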

mimetype's tar detection logic inspects bytes at a number of offsets within specific fields and checks whether they fall within given ranges. These ranges come from work by the National Archives, who derived them empirically from a corpus of tar archives they'd collected. This isn't very reliable, because (as demonstrated here) different archivers store values in certain fields in different formats, and presumably their corpus simply didn't contain any archives with UIDs or GIDs large enough to overflow tar's basic format and require the GNU base-256 representation instead.

IMO, it would be far more reliable to detect tar files based on the value of the magic field at byte offset 0x101 - I've yet to come across an archiver that doesn't set the first five bytes of this field to 75 73 74 61 72 (ustar). GNU tar fills the rest of the field with spaces (i.e. 75 73 74 61 72 20 20).
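Roughly what I have in mind, as a sketch only: `hasUstarMagic` is a hypothetical name and this is not mimetype's existing detector. It assumes the caller passes at least the first header block of the file.

```go
package main

import (
	"bytes"
	"fmt"
)

// magicOffset is the position of the magic field within a tar header block.
const magicOffset = 0x101 // 257 in decimal

// hasUstarMagic reports whether the first five bytes of the magic field
// spell "ustar". The bytes after that differ between archivers, so they
// are deliberately ignored.
func hasUstarMagic(raw []byte) bool {
	if len(raw) < magicOffset+5 {
		return false
	}
	return bytes.Equal(raw[magicOffset:magicOffset+5], []byte("ustar"))
}

func main() {
	// Synthetic header block with the GNU-style magic described above.
	gnu := make([]byte, 512)
	copy(gnu[magicOffset:], "ustar  \x00")

	// An all-zero block has no magic at all.
	empty := make([]byte, 512)

	fmt.Println(hasUstarMagic(gnu), hasUstarMagic(empty))
}
```

Comparing only the first five bytes means the check does not depend on how a given archiver pads the remainder of the field.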

@chrisnovakovic
Contributor Author

Having just noticed testdata/tar.v7.tar, I guess we need to retain detection of pre-POSIX tar archives, so perhaps it's enough to add in the ustar magic field check and leave the rest in place to detect v7 archives.

@gabriel-vasile
Owner

Yes, I think so too.

One side question: if you create a v7 tar on your machine, is it successfully detected with mimetype v1.4.1?

@chrisnovakovic
Contributor Author

> One side question: if you create a v7 tar on your machine, is it successfully detected with mimetype v1.4.1?

I was just wondering the same thing 🙂 I'm currently creating test archives for all the formats generated by GNU tar to make sure mimetype can detect them.

@chrisnovakovic
Contributor Author

> One side question: if you create a v7 tar on your machine, is it successfully detected with mimetype v1.4.1?

It is. I've included it as a new test case in #308.

chrisnovakovic added a commit to chrisnovakovic/mimetype that referenced this issue Jul 13, 2022
UStar tar archives have a `magic` header field at byte offset 0x101 in
each entry whose value begins with the string `ustar`. Identify them
with the MIME type `application/x-tar`.

Also add test cases for a number of UStar-compatible formats, created by
GNU tar 1.29 (with `--format=<format-name>`):

* `tar.gnu.tar`
* `tar.oldgnu.tar`
* `tar.posix.tar`
* `tar.ustar.tar`

as well as `tar.star.tar` (created by star 1.6) and, for completeness,
`tar.v7-gnu.tar` (a v7 tar archive created by GNU tar 1.29).

Fixes gabriel-vasile#307.
chrisnovakovic added a commit to chrisnovakovic/mimetype that referenced this issue Jul 13, 2022
UStar tar archives have a `magic` header field at byte offset 257 in
each entry whose value begins with the string `ustar`. Identify them
with the MIME type `application/x-tar`.

Also add test cases for a number of UStar-compatible formats, created by
GNU tar 1.29 (with `--format=<format-name>`):

* `tar.gnu.tar`
* `tar.oldgnu.tar`
* `tar.posix.tar`
* `tar.ustar.tar`

as well as `tar.star.tar` (created by star 1.6) and, for completeness,
`tar.v7-gnu.tar` (a v7 tar archive created by GNU tar 1.29).

Fixes gabriel-vasile#307.
gabriel-vasile pushed a commit that referenced this issue Jul 18, 2022
* Detect UStar tar archives

UStar tar archives have a `magic` header field at byte offset 257 in
each entry whose value begins with the string `ustar`. Identify them
with the MIME type `application/x-tar`.

Also add test cases for a number of UStar-compatible formats, created by
GNU tar 1.29 (with `--format=<format-name>`):

* `tar.gnu.tar`
* `tar.oldgnu.tar`
* `tar.posix.tar`
* `tar.ustar.tar`

as well as `tar.star.tar` (created by star 1.6) and, for completeness,
`tar.v7-gnu.tar` (a v7 tar archive created by GNU tar 1.29).

Fixes #307.