
fix(gzip): change compression block size #508

Closed
rchincha wants to merge 1 commit

Conversation

rchincha
Copy link
Contributor

Several tools set this as the default value.

image/copy/copy.go
56:var compressionBufferSize = 1048576

skopeo/vendor/github.com/containers/image/v5/copy/compression.go
24: compressionBufferSize = 1048576

Choosing a different value causes the final gzip'ed layer to differ even though the source tar is identical, breaking content-addressable invariants.

Signed-off-by: Ramkumar Chinchani <rchincha@cisco.com>
@cyphar
Copy link
Member

cyphar commented Oct 11, 2023

Unfortunately, a lot of things can cause the output blobs to change despite having the same input -- updates to the gzip compression library will cause changes, for instance (this is why we have tests for this in CI, and is the reason I haven't updated the gzip library we use -- but I think the data is too small to detect the difference in the buffer size).

causing content-addressable invariants to break

I'm not sure what invariants you're referring to -- there is no guarantee that you can get the same gzip blob from the same input with different programs/libraries (like all compression algorithms, gzip does not have a canonical representation). The best we've been able to do with umoci is try our best to guarantee that we will not produce different blobs for the same input across versions. (IMHO it seems pretty clear at this point that baking in compression to the image blob format was a mistake.)

Or are you running into this because you use umoci as a library as well as containers/image, and the different settings cause the gzip stream of the same input to have different outputs? That's annoying, but it seems possible for github.com/klauspost/pgzip to have non-deterministic output and so assuming that two gzip streams of the same input will have the same hash seems like a questionable assumption (presumably this is the issue you're having?).

(FWIW, I don't mind merging this, though it will mean that we will start outputting streams with different hashes to previous versions...)

@codecov-commenter
Copy link

codecov-commenter commented Oct 11, 2023

Codecov Report

Merging #508 (fe16290) into main (782bbb1) will decrease coverage by 0.05%.
Report is 13 commits behind head on main.
The diff coverage is 0.00%.


@@            Coverage Diff             @@
##             main     #508      +/-   ##
==========================================
- Coverage   73.53%   73.48%   -0.05%     
==========================================
  Files          60       60              
  Lines        4881     4884       +3     
==========================================
  Hits         3589     3589              
- Misses        933      935       +2     
- Partials      359      360       +1     
Files Coverage Δ
mutate/compress.go 43.33% <0.00%> (ø)

... and 3 files with indirect coverage changes

@rchincha
Copy link
Contributor Author

rchincha commented Oct 11, 2023

(IMHO it seems pretty clear at this point that baking in compression to the image blob format was a mistake.)

Without question - anything that affects the final output should have been considered.

In the OCI world, unfortunately, no single tool does every job, and one has to be careful that they interoperate well. Our use case is deduplicating content, and unfortunately changing compression parameters changes the output even though the input is identical (across tools, of course).

As you may already be aware, we use umoci as a library in https://github.com/project-stacker/stacker.

(FWIW, I don't mind merging this, though it will mean that we will start outputting streams with different hashes to previous versions...)

At least speaking just for us, we are ok with this ^

@mikemccracken
Copy link
Contributor

I think this might be better off as a configuration, there may indeed be cases where this will break assumptions in tools using this.
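For illustration, a configuration knob could take the usual functional-option shape. None of these names exist in umoci today — `compressorConfig`, `WithBlockSize`, and the 256 KiB alternative below are purely hypothetical; only the 1 MiB default reflects the value this PR proposes:

```go
package main

import "fmt"

// compressorConfig is a hypothetical settings bag; blockSize defaults to the
// 1 MiB value proposed in this PR, while callers that depend on the previous
// byte streams could pin whatever value they used before.
type compressorConfig struct {
	blockSize int
}

// Option mutates a compressorConfig (standard functional-option pattern).
type Option func(*compressorConfig)

// WithBlockSize overrides the per-block buffer size (illustrative only).
func WithBlockSize(n int) Option {
	return func(c *compressorConfig) { c.blockSize = n }
}

func newConfig(opts ...Option) compressorConfig {
	c := compressorConfig{blockSize: 1 << 20} // proposed default: 1 MiB
	for _, o := range opts {
		o(&c)
	}
	return c
}

func main() {
	def := newConfig()
	pinned := newConfig(WithBlockSize(256 << 10)) // arbitrary example override
	fmt.Println("default block size:", def.blockSize)
	fmt.Println("pinned block size: ", pinned.blockSize)
}
```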

@rchincha
Copy link
Contributor Author

containers/image@5b23c4b
^ Adding more context with regard to this issue.

@rchincha
Copy link
Contributor Author

rchincha commented Oct 13, 2023

@cyphar et al
Please also take a look at #509.

@rchincha
Copy link
Contributor Author

Closing this in favor of PR #509

@rchincha rchincha closed this Oct 13, 2023