Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix handling of non-ASCII characters in archive entry file names #18448

Closed
wants to merge 2 commits into from

Conversation

fmeum
Copy link
Collaborator

@fmeum fmeum commented May 19, 2023

When creating a PathFragment from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths.

Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.

Fixes #12986
Fixes bazel-contrib/rules_go#2771

// USTAR tar headers use an unspecified encoding whereas PAX tar headers always use UTF-8.
// Since this would result in ambiguity with Bazel forcing the default JVM encoding to be
// ISO-8859-1, we explicitly specify the encoding as UTF-8.
TarArchiveInputStream tarStream = new TarArchiveInputStream(decompressorStream, UTF_8.name());
Copy link
Collaborator Author

@fmeum fmeum May 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There doesn't seem to be an easy way to keep reading legacy tars in Latin-1 (that is, raw bytes) while still being able to distinguish them from UTF-8 names.

In order to preserve backwards compatibility with tars that use non-ASCII, non-UTF-8 file names without PAX headers, I could inject a custom Charset via CharsetProvider that somehow marks all strings it produces in a way that lets us distinguish them from those created by the default UTF-8 charset (e.g. via a WeakIdentityHashMap). Since that would definitely be a hack, I didn't want to do it without getting feedback on this point first.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lberki Can you take a look here? I'm not quite familiar with encoding magic.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @tetromino (since he's been thnking about Unicode recently)

My professional opinion is that I don't know. What would his hack be useful for? I thought that what this change was doing was to parse the UTF-8 data into proper Unicode then convert that to Bazel's weird "UTF-8 bytes in ISO8859-1" encoding, but that seems doable without resorting to somehow tagging the Strings produced by it by (essentially) doing "if valid UTF-8, decode it as such, otherwise go directly to the weird Bazel encoding".

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's more messy than that: The Latin-1 trick only works if all strings are encoded in Latin-1 from source to sink, otherwise the "opaque bag of bytes" property is irrecoverably lost.

But commons-compress breaks this invariant: While tar ustar headers respect the default string encoding (which can also be set via the constructor), the more modern tar pax headers always use UTF-8.

Since Bazel can't determine which encoding was used to produce the file names returned by commons-compress, I found the only non-hacky option to be to force UTF-8 for ustar headers as well, decode UTF-8 and then reencode as Latin-1.

As a result, this change would break users of tars with non-UTF-8 ustar headers. Previously, these were handled correctly via the Latin-1 trick (but any pax header with non-ASCII chars were broken), but now they would be decoded as UTF-8 and thus turn into garbage.

Ideally we could tweak commons-compress to return byte arrays or at least tell us which type of header it parser, but I didn't find a way to do this. We could raise an issue with them, hack around this limitation by tracking the strings via a custom charset or accept that life isn't fair and UTF-8 deserves to be the winner. ;⁠-⁠)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, actually, it's even more messy than that, if I understand correctly: the nice thing about the Latin-1 hack is that it works as long as the encoding on the input is the same as the encoding on the output. However, apparently, pax headers insist that they are UTF-8, which means that they constrain the output of Bazel to UTF-8 also.

Now I see where you are coming from: commons-compress doesn't tell whether there are pax headers and thus it doesn't tell you whether UTF-8 decoding was required. So what you could do is to keep track of Strings that were emitted by the Charset, thus you know which Strings went through the UTF-8 decoding and thus undo it. Is this correct?

I don't claim any knowledge of either commons-compress or the intricacies of tar files, but TarArchiveEntry seems to have some methods that have Pax in their name. Maybe those could be used to disambiguate instead of the hack you proposed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tetromino Friendly ping

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lberki Since I wanted to know whether the hack would work, I went ahead and implemented it. It does seem to work and I didn't even have to use a weak cache.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lberki @tetromino Friendly ping, I would really like to get this into Bazel 7. We currently have to maintain a number of hacks in rules_go just to be able to extract a Go SDK.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My gut reaction to the encoding hack was pretty negative. Then I reread the code and description a couple times, and decided that the hack is only ugly because the apache api is insufficiently flexible for bad clients like us that want to do weird things with encodings. I'm ok with the hack provided it is sufficiently explained in comments, and doesn't leak outside this file.

I would like us to be pretty explicit in comments about what the format of the input headers and returned string are. I.e., Javadoc for the class CompressedTarFunction or its decompress() method, saying that USTAR headers are interpreted as latin-1, PAX headers are interpetted as UTF-8, and both types are turned into Java Strings holding the UTF-8 code units that represent the string content, in order to match our string behavior elsewhere.

One more question. We eventually want to fix Bazel's decoding of Starlark source files to be UTF-8 instead of latin-1. Does anything in this PR make that harder? It seems to me this PR is just preserving an existing behavior (USTAR) while fixing an erroneous one (PAX).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added more extensive comments, including Javadocs for the class.

One more question. We eventually want to fix Bazel's decoding of Starlark source files to be UTF-8 instead of latin-1. Does anything in this PR make that harder? It seems to me this PR is just preserving an existing behavior (USTAR) while fixing an erroneous one (PAX).

I don't think that it would make this harder. It does introduce one or two new places in which ISO_8859_1 would need to be replaced with UTF-8, but those are only related to prefix stripping, not the actual entry decoding logic that interfaces with commons-compress and relies on the custom decoder hack.

@fmeum fmeum force-pushed the fix-archive-file-names branch from 754a626 to cef4b5a Compare May 19, 2023 09:04
@fmeum fmeum marked this pull request as ready for review May 19, 2023 09:17
@fmeum fmeum requested review from Wyverald and meteorcloudy May 19, 2023 09:17
@github-actions github-actions bot added awaiting-review PR is awaiting review from an assigned reviewer team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. labels May 19, 2023
@meteorcloudy meteorcloudy requested a review from lberki June 6, 2023 11:42
@fmeum fmeum force-pushed the fix-archive-file-names branch from cef4b5a to e408e31 Compare September 5, 2023 06:57
@fmeum fmeum requested a review from tetromino September 5, 2023 11:10
@fmeum fmeum force-pushed the fix-archive-file-names branch 5 times, most recently from 20e7202 to ebd3f3c Compare September 6, 2023 08:59
@lberki
Copy link
Contributor

lberki commented Sep 25, 2023

@tetromino is not available these days. I don't have any context on this pull request so I wouldn't be able to give it the attention it deserves, but let me try to find someone who has all the bits swapped into their brain.

@brandjon brandjon self-requested a review September 26, 2023 20:14
@brandjon brandjon self-assigned this Sep 26, 2023
When creating a `PathFragment` from a ZIP or TAR entry file name, the
raw bytes of the name are now wrapped into a Latin-1 encoded String,
which is how Bazel internally represents file paths.

Previously, ZIP entries as well as TAR entries with PAX headers would
result in ordinary decoded Java strings, resulting in corrupted file
names when passed to Bazel's file system operations.
@fmeum fmeum force-pushed the fix-archive-file-names branch from ebd3f3c to c10070b Compare September 27, 2023 12:01
@fmeum fmeum force-pushed the fix-archive-file-names branch from c10070b to fe1d084 Compare September 27, 2023 12:04
@fmeum fmeum requested a review from brandjon September 27, 2023 12:20
@brandjon
Copy link
Member

Thanks for the additional javadoc/comments.

@fmeum
Copy link
Collaborator Author

fmeum commented Sep 28, 2023

@bazel-io flag

@bazel-io bazel-io added the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Sep 28, 2023
@meteorcloudy meteorcloudy removed the awaiting-review PR is awaiting review from an assigned reviewer label Sep 28, 2023
@bazel-io bazel-io removed the potential release blocker Flagged by community members using "@bazel-io flag". Should be added to a release blocker milestone label Sep 28, 2023
@fmeum
Copy link
Collaborator Author

fmeum commented Oct 6, 2023

@brandjon Is this still on track for getting merged?

@meteorcloudy
Copy link
Member

@brandjon is still fixing some internal breakages.

@fmeum
Copy link
Collaborator Author

fmeum commented Oct 6, 2023

@brandjon is still fixing some internal breakages.

Thanks for the update. I didn't know this code is used at Google - isn't it only relevant for external repos? Just curious.

@meteorcloudy
Copy link
Member

Might be just some irrelevant breakages, Jon is OOO until next Tuesday, I can help merge this.

@copybara-service copybara-service bot closed this in 10169bb Oct 9, 2023
@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label Oct 9, 2023
@fmeum fmeum deleted the fix-archive-file-names branch October 9, 2023 07:46
fmeum added a commit to fmeum/bazel that referenced this pull request Oct 9, 2023
When creating a `PathFragment` from a ZIP or TAR entry file name, the raw bytes of the name are now wrapped into a Latin-1 encoded String, which is how Bazel internally represents file paths.

Previously, ZIP entries as well as TAR entries with PAX headers would result in ordinary decoded Java strings, resulting in corrupted file names when passed to Bazel's file system operations.

Fixes bazelbuild#12986

Fixes bazel-contrib/rules_go#2771

Closes bazelbuild#18448.

PiperOrigin-RevId: 571857847
Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89
meteorcloudy pushed a commit that referenced this pull request Oct 9, 2023
…mes (#19765)

When creating a `PathFragment` from a ZIP or TAR entry file name, the
raw bytes of the name are now wrapped into a Latin-1 encoded String,
which is how Bazel internally represents file paths.

Previously, ZIP entries as well as TAR entries with PAX headers would
result in ordinary decoded Java strings, resulting in corrupted file
names when passed to Bazel's file system operations.

Fixes #12986

Fixes bazel-contrib/rules_go#2771

Closes #18448.

PiperOrigin-RevId: 571857847
Change-Id: Ie578724e75ddbefbe05255601b0afab706835f89

Fixes #19671
@brandjon
Copy link
Member

Sorry, internal presubmits failed and then I went OOO. Thanks Yun for completing the merge!

mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 24, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 24, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 24, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/bazel_features that referenced this pull request Feb 24, 2024
fmeum pushed a commit to bazel-contrib/bazel_features that referenced this pull request Feb 24, 2024
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 25, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 25, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 25, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 25, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 25, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 25, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
mattyclarkson added a commit to mattyclarkson/rules_go that referenced this pull request Feb 26, 2024
The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765
fmeum pushed a commit to bazel-contrib/rules_go that referenced this pull request Feb 26, 2024
* Remove Latin-1 workaround on Bazel 6.4.0+

The fix for Latin-1 encoded files was landed in [6.4.0] after
being landed on Bazel [master].

We can conditionally use the old, unhermetic, `tar` workaround
on modern Bazel versions.

[master]: bazelbuild/bazel#18448
[6.4.0]: bazelbuild/bazel#19765

* Use `versions` rather than `bazel_features` to avoid circular load pattern

Due to the polyfilling of `bazel_features` to avoid double step loading in
`WORKSPACE`.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file.
Projects
None yet
7 participants