
Update commons-compress to 1.26.1 #22213

Status: Closed (wanted to merge 13 commits)
Conversation

@mark-thm (Contributor) commented May 2, 2024

Fixes #20269.

Update commons-compress to 1.26.1 and swap the use of GZIPInputStream for commons-compress's GzipCompressorInputStream, which deals correctly with concatenated gz files. Add a test to demonstrate that this fixes the ruff extraction (thanks, fmeum), and update all related lockfiles.
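
As background (an illustrative sketch, not part of the PR): a gzip file may consist of several independently compressed members laid back to back, and a decoder that stops after the first member silently truncates the payload. Python's stdlib can show both behaviors:

```python
import gzip
import zlib

# Two independent gzip members concatenated into one stream,
# as produced by e.g. `cat a.gz b.gz > ab.gz`.
data = gzip.compress(b"hello") + gzip.compress(b"world")

# A single-member decoder stops at the end of the first member.
single = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
first = single.decompress(data)
print(first)                        # only the first member's payload
print(single.eof)                   # True: the decoder considers itself done
print(len(single.unused_data) > 0)  # the second member was left unread

# A concatenation-aware decoder (the behavior GzipCompressorInputStream
# provides on the Java side) keeps reading members until the stream ends.
print(gzip.decompress(data))        # payload of both members
```

The analogy is loose (the JDK's GZIPInputStream has its own multi-member handling with known edge cases), but it shows why the choice of decoder matters for archives like the ruff release tarballs.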

@github-actions github-actions bot added the awaiting-review PR is awaiting review from an assigned reviewer label May 2, 2024
@iancha1992 iancha1992 added the team-ExternalDeps External dependency handling, remote repositiories, WORKSPACE file. label May 2, 2024
@sgowroji sgowroji added awaiting-user-response Awaiting a response from the author and removed awaiting-review PR is awaiting review from an assigned reviewer labels May 2, 2024
@fmeum (Collaborator) commented May 2, 2024

@mark-thm (Contributor, Author) commented May 2, 2024

@fmeum yes, that’d be helpful, thanks. What I know so far:

  • In 1.26.x, commons-compress started calling Charset.forName(encoding) for the encoding passed to TarArchiveInputStream.
  • There’s no path (today) to pass a Charset directly instead of a String name.
  • It looks like the CharsetProvider you created should work (I’m curious whether newer versions of Java depend on availableCharsets for some optimization, but I haven’t fully chased down the docs/code).

It seems plausible we could fix the second issue by contributing a new constructor to ArchiveInputStream, but I’m otherwise out of ideas.

@fmeum (Collaborator) commented May 2, 2024

Turns out this was only an issue in Bazel's bootstrap script. Could you try adding this patch?

diff --git a/scripts/bootstrap/compile.sh b/scripts/bootstrap/compile.sh
index 3c49679e1a..dcc7e7f846 100755
--- a/scripts/bootstrap/compile.sh
+++ b/scripts/bootstrap/compile.sh
@@ -155,12 +155,17 @@ function create_deploy_jar() {
   local output=$3
   shift 3
   local packages=""
-  for i in $output/classes/*; do
+  # Only keep the services subdirectory of META-INF (needed for AutoService).
+  for i in $output/classes/META-INF/*; do
     local package=$(basename $i)
-    if [[ "$package" != "META-INF" ]]; then
-      packages="$packages -C $output/classes $package"
+    if [[ "$package" != "services" ]]; then
+      rm -r "$i"
     fi
   done
+  for i in $output/classes/*; do
+    local package=$(basename $i)
+    packages="$packages -C $output/classes $package"
+  done

   log "Creating $name.jar..."
   echo "Main-Class: $mainClass" > $output/MANIFEST.MF
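
The intent of the patch above, sketched in Python for clarity (a hypothetical helper, not part of Bazel): drop everything under META-INF except the services directory (needed for AutoService), then package all remaining top-level entries:

```python
import shutil
import tempfile
from pathlib import Path

def prune_and_list(classes_dir):
    """Mirror the bootstrap patch: keep only META-INF/services, then
    return every top-level entry that would be packaged into the jar."""
    classes = Path(classes_dir)
    meta_inf = classes / "META-INF"
    if meta_inf.is_dir():
        for entry in meta_inf.iterdir():
            if entry.name != "services":
                if entry.is_dir():
                    shutil.rmtree(entry)
                else:
                    entry.unlink()
    return sorted(p.name for p in classes.iterdir())

# Demo with a throwaway layout resembling $output/classes.
root = Path(tempfile.mkdtemp())
(root / "com" / "google").mkdir(parents=True)
(root / "META-INF" / "services").mkdir(parents=True)
(root / "META-INF" / "maven").mkdir(parents=True)
packages = prune_and_list(root)
print(packages)  # META-INF (now services-only) plus the regular packages
```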

@fmeum (Collaborator) commented May 2, 2024

@mark-thm The java.desktop module increases the size of the Bazel binary by 5MB. What kind of error do you see if you remove it?

@mark-thm (Contributor, Author) commented May 2, 2024

@fmeum there’s a difftest for that file that fails without the change. I haven’t looked yet at why the update causes it to be required.

@mark-thm (Contributor, Author) commented May 2, 2024

Via jdeps, for commons-compress and its dependencies (commons-io, commons-codec, and commons-lang3):

org.apache.commons.compress -> java.desktop
   org.apache.commons.compress.harmony.pack200        -> java.beans           java.desktop
   org.apache.commons.compress.java.util.jar          -> java.beans           java.desktop
   requires java.desktop
org.apache.commons.lang3 -> java.desktop
   org.apache.commons.lang3.concurrent                -> java.beans           java.desktop

I don't think Bazel is using these packages. I see there's a denylist and could instead add these classes to it, but I'm unsure what the protocol is here.
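
The deny-listing idea, sketched in Python (the prefixes below are hypothetical examples derived from the jdeps output; the real denylist lives in Bazel's build tooling): strip the entries that pull in java.beans/java.desktop when repackaging the jar:

```python
import io
import zipfile

# Hypothetical denylist: prefixes of the classes that jdeps showed
# dragging in java.beans / java.desktop.
DENYLIST = (
    "org/apache/commons/compress/harmony/",
    "org/apache/commons/compress/java/util/jar/",
    "org/apache/commons/lang3/concurrent/",
)

def strip_denied(jar_bytes):
    """Copy a jar, dropping every entry under a deny-listed prefix."""
    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(jar_bytes)) as src, \
         zipfile.ZipFile(out, "w") as dst:
        for entry in src.infolist():
            if not entry.filename.startswith(DENYLIST):
                dst.writestr(entry, src.read(entry))
    return out.getvalue()

# Demo: one denied entry, one kept entry.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as z:
    z.writestr("org/apache/commons/compress/harmony/pack200/X.class", b"")
    z.writestr("org/apache/commons/compress/archivers/tar/T.class", b"")
with zipfile.ZipFile(io.BytesIO(strip_denied(buf.getvalue()))) as z:
    print(z.namelist())  # only the non-denied entry survives
```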

@fmeum (Collaborator) commented May 2, 2024

Yeah, maybe try that first. We can be pretty confident that Bazel tests are at least loading all packages that Bazel depends on.

@mark-thm (Contributor, Author) commented May 2, 2024

@fmeum it looks like deny-listing the classes that were pulling in java.beans/java.desktop has reined in the bloat from ~50MB to around 3MB -- and weirdly the 3MB only seems to show up as an overage on macOS.

@mark-thm (Contributor, Author) commented May 2, 2024

In case it's helpful:

364K commons-codec-1.17.0.jar
1.0M commons-compress-1.26.1.jar
497K commons-io-2.16.1.jar
643K commons-lang3-3.14.0.jar

@fmeum fmeum requested a review from Wyverald May 2, 2024 16:33
@fmeum fmeum added awaiting-review PR is awaiting review from an assigned reviewer and removed awaiting-user-response Awaiting a response from the author labels May 2, 2024
@Wyverald (Member) commented May 2, 2024

Does this actually fix #20269, though? It seems to include a change that allows bzip2 files to be concatenated, but #20269 is about a TAR file.

@fmeum (Collaborator) commented May 2, 2024

Good point, the mailing list post referenced in the JIRA issue actually refers to the unrelated COMPRESS-651, not COMPRESS-654.

@mark-thm (Contributor, Author) commented May 2, 2024

Before I started down this path, I repeated the failing test linked at the top of COMPRESS-654, then updated commons-compress to 1.26.1 and found that with the new commons-compress I was able to successfully extract ruff. I figured I'd start with the simple version bump before getting into test authoring, since the bump seemed easier to get going with. Live and learn.

Re: Adding a test:

  • I'm not able to get http_archive to fail on the ruff artifacts with a production 7.1.1 build, or in a test on 7.1.1, master, or this branch, for that matter.
  • I'm also not able to get rctx.download_and_extract to fail in a test setup on the 7.1.1 tag or on master; it does appear to fail on a production 7.1.1 build in a real-world use case, so my test authoring is probably bad. To test rctx.download_and_extract, I modified bazel_workspace_test's test_download_and_extract to point at ruff with the right sha256, and the output verification checks show that the expected sha256 gets produced.

Maybe this one isn't a good starter contrib?

@fmeum (Collaborator) commented May 2, 2024

Running bazel test //src/test/shell/bazel:bazel_workspaces_test --test_output=streamed --test_filter=test_sparse_tar on this PR with the patch below applied still fails for me with Truncated TAR archive:

git diff | cat
diff --git a/src/test/shell/bazel/bazel_workspaces_test.sh b/src/test/shell/bazel/bazel_workspaces_test.sh
index 8056929902..7f562bd0a0 100755
--- a/src/test/shell/bazel/bazel_workspaces_test.sh
+++ b/src/test/shell/bazel/bazel_workspaces_test.sh
@@ -504,6 +504,22 @@ function test_extract_default_zip_non_ascii_utf8_file_names() {
   ensure_output_contains_exactly_once "external/repo/out_dir/Ä_foo_∅.txt" "bar"
 }

+function test_sparse_tar() {
+  set_workspace_command "
+  repository_ctx.download_and_extract(
+      url='https://github.com/astral-sh/ruff/releases/download/v0.1.6/ruff-aarch64-apple-darwin.tar.gz',
+      sha256='0b626e88762b16908b3dbba8327341ddc13b37ebe6ec1a0db3f033ce5a44162d',
+  )"
+
+  build_and_process_log --exclude_rule "repository @@local_config_cc"
+
+  ensure_contains_exactly 'location: .*repos.bzl:3:25' 1
+  ensure_contains_atleast 'context: "repository @@repo"' 2
+  ensure_contains_exactly 'download_and_extract_event' 1
+
+  [[ -f $(bazel info output_base)/external/repo/ruff-aarch64-apple-darwin/ruff ]] || fail "Expected ruff binary to be extracted"
+}
+
 function test_file() {
   set_workspace_command 'repository_ctx.file("filefile.sh", "echo filefile", True)'

@fmeum (Collaborator) commented May 2, 2024

But this works, so it looks like we need to do more to pick up the fix:

#!/usr/bin/env bash

set -o errexit -o nounset

echo "Downloading commons-compress"
wget https://repo1.maven.org/maven2/org/apache/commons/commons-compress/1.26.1/commons-compress-1.26.1.jar
wget https://repo1.maven.org/maven2/commons-io/commons-io/2.16.1/commons-io-2.16.1.jar
echo "Downloading sample sparse archive"
wget https://github.com/astral-sh/ruff/releases/download/v0.1.6/ruff-aarch64-apple-darwin.tar.gz
gunzip ruff-aarch64-apple-darwin.tar.gz

echo "Testing with system tar"
tar -tf ruff-aarch64-apple-darwin.tar
echo "Testing with commons-compress"
java -cp commons-compress-1.26.1.jar:commons-io-2.16.1.jar org.apache.commons.compress.archivers.Lister ruff-aarch64-apple-darwin.tar

@mark-thm (Contributor, Author) commented May 2, 2024

Yep, that's what I tried before getting started.

ensure_contains_atleast 'context: "repository @@repo"' 2
ensure_contains_exactly 'download_and_extract_event' 1

[[ -f "$(bazel info output_base)/external/repo/ruff" ]] || fail "Expected ruff binary to be extracted"
@mark-thm (Contributor, Author) commented on the diff:

The ruff .tar.gz archives have no prefix folder.

@mark-thm (Contributor, Author) commented May 3, 2024

@Wyverald ready for another look when you have a moment. I've updated the PR description to explain the fix that's applied here, and with @fmeum's help there's now a test that demonstrates #20269 is resolved.

@Wyverald (Member) left a review comment:

nice! thanks for the fix.

@Wyverald Wyverald added awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally and removed awaiting-review PR is awaiting review from an assigned reviewer labels May 6, 2024
@Wyverald (Member) commented May 6, 2024

Just realized I had this comment in draft, which may or may not be correct/useful... but leaving it here for posterity:

looks like the Lister app uses TarFile, while Bazel uses TarArchiveInputStream. Amazingly, these two things seem completely disjoint -- neither uses the other... Does this mean we actually need to switch to TarFile in Bazel?
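
For context (a Python stdlib analogy, not the commons-compress API): the split described above mirrors tarfile's two read modes; "r|gz" is a forward-only streaming reader (like TarArchiveInputStream), while "r:gz" is a seekable whole-file reader (like TarFile), and for well-formed archives both should list the same entries:

```python
import io
import tarfile

# Build a tiny .tar.gz in memory; "ruff" is just a stand-in entry name.
buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tf:
    payload = b"#!/bin/sh\n"
    info = tarfile.TarInfo(name="ruff")
    info.size = len(payload)
    tf.addfile(info, io.BytesIO(payload))
raw = buf.getvalue()

# Streaming reader: one pass over the stream, no seeking
# (analogous to TarArchiveInputStream).
with tarfile.open(fileobj=io.BytesIO(raw), mode="r|gz") as tf:
    streamed = [member.name for member in tf]

# Random-access reader: seeks within the archive
# (analogous to commons-compress's TarFile).
with tarfile.open(fileobj=io.BytesIO(raw), mode="r:gz") as tf:
    seekable = tf.getnames()

print(streamed, seekable)  # both readers see the same entries
```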

@mark-thm (Contributor, Author) commented May 6, 2024

I believe your comment is correct but, as the tests show, it's not required to switch to TarFile to pick up the improvement.

I also looked through the other compression formats to see whether a similar switch from the format's native InputStream implementations to the commons-compress ones was required, and found that for Zstd there's no option to flip, and for Xz we are already using the correct option. This PR fixes both the gz and bz2 formats, the latter since we were here anyway.

@fmeum (Collaborator) commented May 6, 2024

@bazel-io fork 7.2.0

@mark-thm (Contributor, Author) commented May 7, 2024

What’s the procedure for actually getting this merged?

Also, what’s your preferred approach at this point for resolving module lockfile conflicts … should I just continue rebasing?

@Wyverald (Member) commented May 7, 2024

No action required on your side. We'll import this into Google's codebase, and then Copybara will copy it back out. We'll fix the lockfile during the import.

@copybara-service copybara-service bot closed this in 2330a11 May 7, 2024
@github-actions github-actions bot removed the awaiting-PR-merge PR has been approved by a reviewer and is ready to be merge internally label May 7, 2024
@mark-thm mark-thm deleted the patch-1 branch May 7, 2024 20:05
Wyverald pushed a commit that referenced this pull request May 8, 2024
Fixes #20269.

Update commons-compress to 1.26.1 and swap use of GZIPInputStream to commons-compress' GzipCompressorInputStream, which [deals correctly with concatenated gz files](https://github.com/apache/commons-compress/blob/53c5e19208caaf63946a41d2763cda1f1b7eadc8/src/main/java/org/apache/commons/compress/compressors/gzip/GzipCompressorInputStream.java#L38-L70). Add a test to demonstrate this fixes the ruff extraction (thanks, fmeum) and update all related lockfiles.

Closes #22213.

PiperOrigin-RevId: 631509796
Change-Id: I4038244bfbdfbace747554e988587663ca580c16
github-merge-queue bot pushed a commit that referenced this pull request May 9, 2024
Original PRs/commits:

* #22017
* #22213
* 81117aa

---------

Co-authored-by: Mark Elliot <123787712+mark-thm@users.noreply.github.com>
Co-authored-by: Fabian Meumertzheim <fabian@meumertzhe.im>
Kila2 pushed a commit to Kila2/bazel that referenced this pull request May 13, 2024
Merging this pull request may close: "Truncated TAR archive error during decompressing tar file".