-
Notifications
You must be signed in to change notification settings - Fork 4.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix handling of non-ASCII characters in archive entry file names #18448
Conversation
// USTAR tar headers use an unspecified encoding whereas PAX tar headers always use UTF-8. | ||
// Since this would result in ambiguity with Bazel forcing the default JVM encoding to be | ||
// ISO-8859-1, we explicitly specify the encoding as UTF-8. | ||
TarArchiveInputStream tarStream = new TarArchiveInputStream(decompressorStream, UTF_8.name()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There doesn't seem to be an easy way to keep reading legacy tars in Latin-1 (that is, raw bytes) while still being able to distinguish them from UTF-8 names.
In order to preserve backwards compatibility with tars that use non-ASCII, non-UTF-8 file names without PAX headers, I could inject a custom Charset
via CharsetProvider
that somehow marks all strings it produces in a way that lets us distinguish them from those created by the default UTF-8 charset (e.g. via a WeakIdentityHashMap
). Since that would definitely be a hack, I didn't want to do it without getting feedback on this point first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lberki Can you take a look here? I'm not quite familiar with encoding magic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @tetromino (since he's been thnking about Unicode recently)
My professional opinion is that I don't know. What would his hack be useful for? I thought that what this change was doing was to parse the UTF-8 data into proper Unicode then convert that to Bazel's weird "UTF-8 bytes in ISO8859-1" encoding, but that seems doable without resorting to somehow tagging the String
s produced by it by (essentially) doing "if valid UTF-8, decode it as such, otherwise go directly to the weird Bazel encoding".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's more messy than that: The Latin-1 trick only works if all strings are encoded in Latin-1 from source to sink, otherwise the "opaque bag of bytes" property is irrecoverably lost.
But commons-compress breaks this invariant: While tar ustar headers respect the default string encoding (which can also be set via the constructor), the more modern tar pax headers always use UTF-8.
Since Bazel can't determine which encoding was used to produce the file names returned by commons-compress, I found the only non-hacky option to be to force UTF-8 for ustar headers as well, decode UTF-8 and then reencode as Latin-1.
As a result, this change would break users of tars with non-UTF-8 ustar headers. Previously, these were handled correctly via the Latin-1 trick (but any pax header with non-ASCII chars were broken), but now they would be decoded as UTF-8 and thus turn into garbage.
Ideally we could tweak commons-compress to return byte arrays or at least tell us which type of header it parser, but I didn't find a way to do this. We could raise an issue with them, hack around this limitation by tracking the strings via a custom charset or accept that life isn't fair and UTF-8 deserves to be the winner. ;-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, actually, it's even more messy than that, if I understand correctly: the nice thing about the Latin-1 hack is that it works as long as the encoding on the input is the same as the encoding on the output. However, apparently, pax headers insist that they are UTF-8, which means that they constrain the output of Bazel to UTF-8 also.
Now I see where you are coming from: commons-compress doesn't tell whether there are pax headers and thus it doesn't tell you whether UTF-8 decoding was required. So what you could do is to keep track of String
s that were emitted by the Charset
, thus you know which String
s went through the UTF-8 decoding and thus undo it. Is this correct?
I don't claim any knowledge of either commons-compress or the intricacies of tar files, but TarArchiveEntry
seems to have some methods that have Pax
in their name. Maybe those could be used to disambiguate instead of the hack you proposed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tetromino Friendly ping
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lberki Since I wanted to know whether the hack would work, I went ahead and implemented it. It does seem to work and I didn't even have to use a weak cache.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lberki @tetromino Friendly ping, I would really like to get this into Bazel 7. We currently have to maintain a number of hacks in rules_go just to be able to extract a Go SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My gut reaction to the encoding hack was pretty negative. Then I reread the code and description a couple times, and decided that the hack is only ugly because the apache api is insufficiently flexible for bad clients like us that want to do weird things with encodings. I'm ok with the hack provided it is sufficiently explained in comments, and doesn't leak outside this file.
I would like us to be pretty explicit in comments about what the format of the input headers and returned string are. I.e., Javadoc for the class CompressedTarFunction
or its decompress()
method, saying that USTAR headers are interpreted as latin-1, PAX headers are interpetted as UTF-8, and both types are turned into Java Strings holding the UTF-8 code units that represent the string content, in order to match our string behavior elsewhere.
One more question. We eventually want to fix Bazel's decoding of Starlark source files to be UTF-8 instead of latin-1. Does anything in this PR make that harder? It seems to me this PR is just preserving an existing behavior (USTAR) while fixing an erroneous one (PAX).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added more extensive comments, including Javadocs for the class.
One more question. We eventually want to fix Bazel's decoding of Starlark source files to be UTF-8 instead of latin-1. Does anything in this PR make that harder? It seems to me this PR is just preserving an existing behavior (USTAR) while fixing an erroneous one (PAX).
I don't think that it would make this harder. It does introduce one or two new places in which ISO_8859_1
would need to be replaced with UTF-8
, but those are only related to prefix stripping, not the actual entry decoding logic that interfaces with commons-compress and relies on the custom decoder hack.
754a626
to
cef4b5a
Compare
cef4b5a
to
e408e31
Compare
20e7202
to
ebd3f3c
Compare
@tetromino is not available these days. I don't have any context on this pull request so I wouldn't be able to give it the attention it deserves, but let me try to find someone who has all the bits swapped into their brain. |
src/main/java/com/google/devtools/build/lib/bazel/repository/BUILD
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/bazel/repository/DecompressorValue.java
Outdated
Show resolved
Hide resolved
src/main/java/com/google/devtools/build/lib/bazel/repository/DecompressorValue.java
Outdated
Show resolved
Hide resolved
When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.
ebd3f3c
to
c10070b
Compare
c10070b
to
fe1d084
Compare
Thanks for the additional javadoc/comments. |
@bazel-io flag |
@brandjon Is this still on track for getting merged? |
@brandjon is still fixing some internal breakages. |
Thanks for the update. I didn't know this code is used at Google - isn't it only relevant for external repos? Just curious. |
Might be just some irrelevant breakages, Jon is OOO until next Tuesday, I can help merge this. |
When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations. Fixes bazelbuild#12986 Fixes bazel-contrib/rules_go#2771 Closes bazelbuild#18448. PiperOrigin-RevId: 571857847 Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89
…mes (#19765) When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths. Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations. Fixes #12986 Fixes bazel-contrib/rules_go#2771 Closes #18448. PiperOrigin-RevId: 571857847 Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89 Fixes #19671
Sorry, internal presubmits failed and then I went OOO. Thanks Yun for completing the merge! |
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
A feature gate for bazelbuild/bazel#18448
A feature gate for bazelbuild/bazel#18448 Required for bazel-contrib/rules_go#3872
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765
* Remove Latin-1 workaround on Bazel 6.4.0+ The fix for Latin-1 encoded files was landed in [6.4.0] after being landed on Bazel [master]. We can conditionally use the old, unhermetic, `tar` workaround on modern Bazel versions. [master]: bazelbuild/bazel#18448 [6.4.0]: bazelbuild/bazel#19765 * Use `versions` rather than `bazel_features` to avoid circular load pattern Due to the polyfilling of `bazel_features` to avoid double step loading in `WORKSPACE`.
When creating a
PathFragment
from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths.Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.
Fixes #12986
Fixes bazel-contrib/rules_go#2771