-
Notifications
You must be signed in to change notification settings - Fork 17.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
archive/tar: package understanding of GNU format is wrong #12594
Comments
My thought is that you obviously know more about this than I do. Do you have any actual question? If not it seems like you know how to proceed. CC @dsymonds |
Given that this is obviously a bug for the reader. The course of action there is clear. As for the writer side, should it still output GNU format under certain conditions? Or should we just completely remove support for it and use PAX. |
I think we should always output PAX format. It's old enough that I don't think we need to worry about generating older formats. |
CL https://golang.org/cl/14623 mentions this issue. |
Fixing this bug (for the Writer) is not trivial. I have a fix for it available, but it would require multiple CLs and I dont have the bandwidth to go through the code reviews for them. I suggest moving this to Go1.7 milestone. |
CL https://golang.org/cl/14669 mentions this issue. |
The Reader and Writer have hard-coded constants regarding the offsets and lengths of certain fields in the tar format sprinkled all over. This makes it harder to verify that the offsets are correct since a reviewer would need to search for them throughout the code. Instead, all information about the layout of header fields should be centralized in one single file. This has the advantage of being both centralized, and also acting as a form of documentation about the header struct format. This method was chosen over using "encoding/binary" since that method would cause an allocation of a header struct every time binary.Read was called. This method causes zero allocations and its logic is no longer than if structs were declared. Updates #12594 Change-Id: Ic7a0565d2a2cd95d955547ace3b6dea2b57fab34 Reviewed-on: https://go-review.googlesource.com/14669 Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
We're in the Go 1.7 freeze; any bugs in archive/tar not introduced in the Go 1.7 dev cycle aren't worth the risk to fix. If users do encounter problems with archive/tar, it is straightforward to run a forked, fixed copy until the standard version is fixed in a future release. |
CL https://golang.org/cl/31444 mentions this issue. |
The GNU format does not have a prefix field, so we should make no attempt to read it. It does however have atime and ctime fields. Since Go previously placed incorrect values here, we liberally read the atime and ctime fields and ignore errors so that old tar files written by Go can at least be partially read. This fixes half of #12594. The Writer is much harder to fix. Updates #12594 Change-Id: Ia32845e2f262ee53366cf41dfa935f4d770c7a30 Reviewed-on: https://go-review.googlesource.com/31444 Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Brad Fitzpatrick <bradfitz@golang.org>
Moving milestone to Go1.9. The fix is performed on the Reader. The fix for the Writer is more involved. The current logic for the Writer is to the assume that it is writing in one format, and then it tries to backtrack if it can't use that format and switch to another. However, it's complicated trying to keep track of what state needs to be undone and what writes have already occurred (or not). A better approach is to verify up-front what formats can be used for the given input file and commit to using that format. There should be no back-tracking. |
CL https://golang.org/cl/32234 mentions this issue. |
The proper fix for the Writer is too involved to be done in time for Go 1.8. Instead, we do a localized fix that simply disables the prefix encoding logic. While this will prevent some legitimate uses of prefix, it will ensure that we don't keep outputting invalid GNU format files that have the prefix field populated. For headers with long filenames that could have used the prefix field, they will be promoted to use the PAX format, which ensures that we will still be able to encode all headers that we were able to do before. Updates #12594 Fixes #17630 Fixes #9683 Change-Id: Ia97b524ac69865390e2ae8bb0dfb664d40a05add Reviewed-on: https://go-review.googlesource.com/32234 Reviewed-by: Russ Cox <rsc@golang.org> Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org>
Change https://golang.org/cl/54433 mentions this issue: |
The current logic in writeHeader attempts to encode the Header in one format and if it discovered that it could not it would attempt to switch to a different format mid-way through. This makes it very hard to reason about what format will be used in the end and whether it will even be a valid format. Instead, we should verify from the start what formats are allowed to encode the given input Header. If no formats are possible, then we can return immediately, rejecting the Header. For now, we continue on to the hairy logic in writeHeader, but a future CL can split that logic up and specialize them for each format now that we know what is possible. Update #9683 Update #12594 Change-Id: I8406ea855dfcb8b478a03a7058ddf8b2b09d46dc Reviewed-on: https://go-review.googlesource.com/54433 Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Change https://golang.org/cl/55237 mentions this issue: |
Rather than going through writeHeader, which attempts to handle all formats, implement writeGNUHeader, which only has an understanding of the GNU format. Currently, the implementation is nearly identical to writeUSTARHeader, except: * formatNumeric is used instead of formatOctal * the GNU magic value is used This is kept as a separate method since it makes more logical sense when we add support for sparse files, long filenames, and atime/ctime fields, which do not affect USTAR. Updates #12594 Change-Id: I76efc0b39dc649efc22646dfc9867a7c165f34a8 Reviewed-on: https://go-review.googlesource.com/55237 Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Martin Möhrmann <moehrmann@google.com>
Change https://golang.org/cl/55550 mentions this issue: |
Change https://golang.org/cl/55574 mentions this issue: |
The logic for USTAR was disabled because a previous implementation of Writer had a wrong understanding of the differences between USTAR and GNU, causing the prefix field is incorrectly be populated in GNU files. Now that this issue has been fixed, we can re-enable the logic for USTAR path splitting, which allows Writer to use the USTAR for a wider range of possible inputs. Updates #9683 Updates #12594 Updates #17630 Change-Id: I9fe34e5df63f99c6dd56fee3a7e7e4d6ec3995c9 Reviewed-on: https://go-review.googlesource.com/55574 Run-TryBot: Joe Tsai <thebrokentoaster@gmail.com> TryBot-Result: Gobot Gobot <gobot@golang.org> Reviewed-by: Ian Lance Taylor <iant@golang.org>
Using
go1.5
Also discovered this while fixing other archive/tar issues (and I found fair number of them, mostly minor). However, fixing this will change the way archive/tar reads and writes certain formats.
What the current archive/tar thinks the GNU format is:
"ustar\x20\x20\x00"
(this is correct).What the GNU manual actually says the format is:
The GNU manual says that the format for headers using this magic is the following (in Go syntax):
In fact, the structure for GNU swaps out the prefix section of POSIX, for a bunch of extra fields for atime, ctime, and sparse file support (contrary to what Go thinks).
Regarding the use of base-256 encoding, it seems that GNU was the first to introduce this encoding back in 1999. Since then, pretty much every tar decoder handles reading base-256 encoding regardless of whether it is GNU format or not. Marking the format as GNU may or may not be necessary just because base-256 encoding was used.
Problem 1:
When reading, if the decoder detects the GNU magic number, it will attempt to read 155bytes for the prefix. This is just plain wrong and will start to read the atime, ctime, etc instead. This causes the prefix to be incorrect.
See this playground example
The paths there have something like "12574544345" prepended to it. This is because when the tar archive tries to read the the prefix, it is actually reading the atime (which is in ASCII octal and is null terminated). Thus, it incorrectly uses the atime as the prefix.
This probably went undetected for so long since the "incremental" mode of GNU tar is rarely used, and thus the atime and ctime fields are never filled out and left as null bytes. This happens to work in the common case, since the cstring for this field ends up being an empty string.
Problem 2:
When writing, if a numeric field was ever too large to represent in octal format, it would trigger the
usedBinary
flag and cause the library to output the GNU magic numbers, but subsequently fail to encode in the GNU format. Since it believes that the GNU format has a prefix field, it erroneously tries to use it, losing some information in the process.This is ultimately what causes #9683, but is rare in practice since the perfect conditions need to be met for GNU format to be used. There is a very narrow range between the use cases of USTAR and PAX where the logic will use GNU.
Solution:
When decoding, change it so that the reader doesn't read the 155byte prefix field (since this is just plain wrong). Optionally, support parsing of the atime and ctime from the GNU format. Nothing needs to change for sparse file support since that logic correctly understood the GNU format.
When encoding, I propose the following order of precedence:
Let's avoid writing the GNU format. In fact the GNU manual itself, says the following under the POSIX section:
The only advantages that GNU offers over USTAR is:
However, PAX offers all of these over USTAR and far more:
Not to mention, we are already outputting PAX in many situations. What's the point of straggling between 3 different output formats?
Thoughts?
The text was updated successfully, but these errors were encountered: