Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

archive/tar: package understanding of GNU format is wrong #12594

Closed
dsnet opened this issue Sep 11, 2015 · 14 comments
Closed

archive/tar: package understanding of GNU format is wrong #12594

dsnet opened this issue Sep 11, 2015 · 14 comments
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Milestone

Comments

@dsnet
Copy link
Member

dsnet commented Sep 11, 2015

Using go1.5

Also discovered this while fixing other archive/tar issues (and I found fair number of them, mostly minor). However, fixing this will change the way archive/tar reads and writes certain formats.

What the current archive/tar thinks the GNU format is:

What the GNU manual actually says the format is:

The GNU manual says that the format for headers using this magic is the following (in Go syntax):

type headerGNU struct {
    // Original V7 header
    name     [100]byte //   0
    mode     [8]byte   // 100
    uid      [8]byte   // 108
    gid      [8]byte   // 116
    size     [12]byte  // 124
    mtime    [12]byte  // 136
    chksum   [8]byte   // 148
    typeflag [1]byte   // 156
    linkname [100]byte // 157

    // This section is based on the Posix standard.
    magic      [6]byte         // 257: "ustar "
    version    [2]byte         // 263: " \x00"
    uname      [32]byte        // 265
    gname      [32]byte        // 297
    devmajor   [8]byte         // 329
    devminor   [8]byte         // 337

    // The GNU format replaces the prefix field with this stuff.
    // The fact that GNU replaces the prefix with this makes it non-compliant.
    atime      [12]byte        // 345
    ctime      [12]byte        // 357
    offset     [12]byte        // 369
    longnames  [4]byte         // 381
    unused     [1]byte         // 385
    sparse     [4]headerSparse // 386
    isextended [1]byte         // 482
    realsize   [12]byte        // 483
                               // 495
}

type headerSparse struct {
    offset   [12]byte //  0
    numbytes [12]byte // 12
                      // 24
}

In fact, the structure for GNU swaps out the prefix section of POSIX, for a bunch of extra fields for atime, ctime, and sparse file support (contrary to what Go thinks).

Regarding the use of base-256 encoding, it seems that GNU was the first to introduce this encoding back in 1999. Since then, pretty much every tar decoder handles reading base-256 encoding regardless of whether it is GNU format or not. Marking the format as GNU may or may not be necessary just because base-256 encoding was used.

Problem 1:

When reading, if the decoder detects the GNU magic number, it will attempt to read 155bytes for the prefix. This is just plain wrong and will start to read the atime, ctime, etc instead. This causes the prefix to be incorrect.

See this playground example
The paths there have something like "12574544345" prepended to it. This is because when the tar archive tries to read the the prefix, it is actually reading the atime (which is in ASCII octal and is null terminated). Thus, it incorrectly uses the atime as the prefix.

This probably went undetected for so long since the "incremental" mode of GNU tar is rarely used, and thus the atime and ctime fields are never filled out and left as null bytes. This happens to work in the common case, since the cstring for this field ends up being an empty string.

Problem 2:

When writing, if a numeric field was ever too large to represent in octal format, it would trigger the usedBinary flag and cause the library to output the GNU magic numbers, but subsequently fail to encode in the GNU format. Since it believes that the GNU format has a prefix field, it erroneously tries to use it, losing some information in the process.

This is ultimately what causes #9683, but is rare in practice since the perfect conditions need to be met for GNU format to be used. There is a very narrow range between the use cases of USTAR and PAX where the logic will use GNU.

Solution:

When decoding, change it so that the reader doesn't read the 155byte prefix field (since this is just plain wrong). Optionally, support parsing of the atime and ctime from the GNU format. Nothing needs to change for sparse file support since that logic correctly understood the GNU format.

When encoding, I propose the following order of precedence:

  • First, use the 1988 POSIX (USTAR) standard when possible for maximum backwards compatibility.
  • If any numeric field goes beyond the octal representation, or any names are longer than what is supported, just use the 2001 POSIX (PAX) standard.

Let's avoid writing the GNU format. In fact the GNU manual itself, says the following under the POSIX section:

This archive format will be the default format for future versions of GNU tar.

The only advantages that GNU offers over USTAR is:

  • Unlimited length filenames (only ASCII)
  • Relatively large filesizes
  • Possibly atime and ctime

However, PAX offers all of these over USTAR and far more:

  • Unlimited length strings (including UTF-8) support for filenames, usernames, etc.
  • Unlimited large integers for filesizes, uids, etc.
  • Sub-second resolution times.
  • No need for base-256 encoding (and assuming that decoders can handle them) since PAX has its own well-defined method of encoding arbitrarily large integers.

Not to mention, we are already outputting PAX in many situations. What's the point of straggling between 3 different output formats?

Thoughts?

@ianlancetaylor
Copy link
Member

My thought is that you obviously know more about this than I do. Do you have any actual question? If not it seems like you know how to proceed.

CC @dsymonds

@dsnet
Copy link
Member Author

dsnet commented Sep 12, 2015

Given that this is obviously a bug for the reader. The course of action there is clear.

As for the writer side, should it still output GNU format under certain conditions? Or should we just completely remove support for it and use PAX.

@ianlancetaylor
Copy link
Member

I think we should always output PAX format. It's old enough that I don't think we need to worry about generating older formats.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/14623 mentions this issue.

@dsnet
Copy link
Member Author

dsnet commented Dec 2, 2015

Fixing this bug (for the Writer) is not trivial. I have a fix for it available, but it would require multiple CLs and I dont have the bandwidth to go through the code reviews for them.

I suggest moving this to Go1.7 milestone.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/14669 mentions this issue.

gopherbot pushed a commit that referenced this issue May 6, 2016
The Reader and Writer have hard-coded constants regarding the
offsets and lengths of certain fields in the tar format sprinkled
all over. This makes it harder to verify that the offsets are
correct since a reviewer would need to search for them throughout
the code. Instead, all information about the layout of header
fields should be centralized in one single file. This has the
advantage of being both centralized, and also acting as a form
of documentation about the header struct format.

This method was chosen over using "encoding/binary" since that
method would cause an allocation of a header struct every time
binary.Read was called. This method causes zero allocations and
its logic is no longer than if structs were declared.

Updates #12594

Change-Id: Ic7a0565d2a2cd95d955547ace3b6dea2b57fab34
Reviewed-on: https://go-review.googlesource.com/14669
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@dsnet dsnet self-assigned this May 9, 2016
@rsc
Copy link
Contributor

rsc commented May 17, 2016

We're in the Go 1.7 freeze; any bugs in archive/tar not introduced in the Go 1.7 dev cycle aren't worth the risk to fix. If users do encounter problems with archive/tar, it is straightforward to run a forked, fixed copy until the standard version is fixed in a future release.

@rsc rsc modified the milestones: Go1.8, Go1.7 May 17, 2016
@rsc rsc added the NeedsFix The path to resolution is known, but the work has not been done. label Sep 29, 2016
@rsc rsc modified the milestones: Go1.8Maybe, Go1.8 Sep 29, 2016
@gopherbot
Copy link
Contributor

CL https://golang.org/cl/31444 mentions this issue.

gopherbot pushed a commit that referenced this issue Oct 19, 2016
The GNU format does not have a prefix field, so we should make
no attempt to read it. It does however have atime and ctime fields.
Since Go previously placed incorrect values here, we liberally
read the atime and ctime fields and ignore errors so that old tar
files written by Go can at least be partially read.

This fixes half of #12594. The Writer is much harder to fix.

Updates #12594

Change-Id: Ia32845e2f262ee53366cf41dfa935f4d770c7a30
Reviewed-on: https://go-review.googlesource.com/31444
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
@dsnet
Copy link
Member Author

dsnet commented Oct 19, 2016

Moving milestone to Go1.9. The fix is performed on the Reader. The fix for the Writer is more involved.

The current logic for the Writer is to the assume that it is writing in one format, and then it tries to backtrack if it can't use that format and switch to another. However, it's complicated trying to keep track of what state needs to be undone and what writes have already occurred (or not). A better approach is to verify up-front what formats can be used for the given input file and commit to using that format. There should be no back-tracking.

@gopherbot
Copy link
Contributor

CL https://golang.org/cl/32234 mentions this issue.

gopherbot pushed a commit that referenced this issue Nov 2, 2016
The proper fix for the Writer is too involved to be done in time
for Go 1.8. Instead, we do a localized fix that simply disables the
prefix encoding logic. While this will prevent some legitimate uses
of prefix, it will ensure that we don't keep outputting invalid
GNU format files that have the prefix field populated.

For headers with long filenames that could have used the prefix field,
they will be promoted to use the PAX format, which ensures that we
will still be able to encode all headers that we were able to do before.

Updates #12594
Fixes #17630
Fixes #9683

Change-Id: Ia97b524ac69865390e2ae8bb0dfb664d40a05add
Reviewed-on: https://go-review.googlesource.com/32234
Reviewed-by: Russ Cox <rsc@golang.org>
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
@dsnet dsnet modified the milestones: Go1.10, Go1.9 May 22, 2017
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/54433 mentions this issue: archive/tar: check for permissible output formats first

gopherbot pushed a commit that referenced this issue Aug 11, 2017
The current logic in writeHeader attempts to encode the Header in one
format and if it discovered that it could not it would attempt to
switch to a different format mid-way through. This makes it very
hard to reason about what format will be used in the end and whether
it will even be a valid format.

Instead, we should verify from the start what formats are allowed
to encode the given input Header. If no formats are possible,
then we can return immediately, rejecting the Header.

For now, we continue on to the hairy logic in writeHeader, but
a future CL can split that logic up and specialize them for each
format now that we know what is possible.

Update #9683
Update #12594

Change-Id: I8406ea855dfcb8b478a03a7058ddf8b2b09d46dc
Reviewed-on: https://go-review.googlesource.com/54433
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/55237 mentions this issue: archive/tar: implement specialized logic for GNU format

gopherbot pushed a commit that referenced this issue Aug 14, 2017
Rather than going through writeHeader, which attempts to handle all formats,
implement writeGNUHeader, which only has an understanding of the GNU format.

Currently, the implementation is nearly identical to writeUSTARHeader, except:
* formatNumeric is used instead of formatOctal
* the GNU magic value is used

This is kept as a separate method since it makes more logical sense
when we add support for sparse files, long filenames, and atime/ctime fields,
which do not affect USTAR.

Updates #12594

Change-Id: I76efc0b39dc649efc22646dfc9867a7c165f34a8
Reviewed-on: https://go-review.googlesource.com/55237
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Martin Möhrmann <moehrmann@google.com>
@gopherbot
Copy link
Contributor

Change https://golang.org/cl/55550 mentions this issue: archive/tar: remove writeHeader and writePAXHeaderLegacy

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/55574 mentions this issue: archive/tar: re-implement USTAR path splitting

gopherbot pushed a commit that referenced this issue Aug 15, 2017
The logic for USTAR was disabled because a previous implementation of
Writer had a wrong understanding of the differences between USTAR and GNU,
causing the prefix field is incorrectly be populated in GNU files.

Now that this issue has been fixed, we can re-enable the logic for USTAR
path splitting, which allows Writer to use the USTAR for a wider range
of possible inputs.

Updates #9683
Updates #12594
Updates #17630

Change-Id: I9fe34e5df63f99c6dd56fee3a7e7e4d6ec3995c9
Reviewed-on: https://go-review.googlesource.com/55574
Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com>
TryBot-Result: Gobot Gobot <gobot@golang.org>
Reviewed-by: Ian Lance Taylor <iant@golang.org>
@golang golang locked and limited conversation to collaborators Aug 15, 2018
@rsc rsc unassigned dsnet Jun 23, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
FrozenDueToAge NeedsFix The path to resolution is known, but the work has not been done.
Projects
None yet
Development

No branches or pull requests

4 participants