Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Support adding sparse files to archives #375

Merged
merged 2 commits into from
Sep 20, 2024

Conversation

xzfc
Copy link
Collaborator

@xzfc xzfc commented Aug 22, 2024

This PR extends Builder to support archiving sparse files in the GNU format. This is the format that GNU tar produces by default giving the --sparse option, assuming POSIXLY_CORRECT is not set.

Implementing sparse archives gives two useful properties:

  • Archiving a small on-disk file will result in a small archive, even if the file contains large holes. (although this could be solved by piping it into a compressor, e.g. zstd)
  • Pack-unpack roundtrip will preserve exact on-disk file sizes. (not solved by piping archive into a compressor)

Design choices in this PR

  1. It uses the gnu/oldgnu format rather than the newer GNU.sparse extension to the PAX format.
    Reasons:

    1. The builder coupled with the GnuHeader struct already.
    2. Tar-rs already supports reading the old format since Implement sparse files support #62.
      The PR to support the PAX format Add support for PAX Format, Version 1.0 #298 remains open.
  2. It is enabled by default like BSD tar does, and unlike GNU tar does.

  3. It's configured using Builder::sparse(&mut self, sparse: bool) method.
    If you think it should be configured via HeaderMode instead, please suggest the desired enum layout.

  4. It uses SEEK_HOLE to detect holes in the file. For comparison, other methods and implementations are listed below.

    Method System This PR GNU tar libarchive
    BSD tar
    Raw Any
    SEEK_HOLE Unix-ish,
    Linux 3.1+
    FIOMAP Linux 2.6.28+
    FSCTL_QUERY_ALLOCATED_RANGES Windows

    The "Raw" method of GNU tar ignores actual holes and works by reading the file in 512-byte blocks and comparing them with zeros.
    I did not include it as I believe it puts this feature into the "compression algorithm" realm, rather than the "preserve filesystem properties" realm.

    I think implementing FSCTL_QUERY_ALLOCATED_RANGES method for Windows would be a useful addition, but currently, I don't have much interest in it.

  5. Linux, Solaris-like, and FreeBSD-like systems are supported.
    After Add SeekData and SeekHole to Whence for hurd/apple targets nix-rust/nix#2473 got into the release (v0.30.0 I think), it could be possible to extend it to Hurd and Apple.

    System Support status In this PR
    Linux Supported. Enabled.
    freebsd Supported with a caveat on UFS: the last block (32KiB) of a file is always dense. Source: my tests on 14.1. Not a huge problem, but needs a special case in unit test. Enabled.
    hurd
    solaris
    illumus
    Supported according to docs, didn't tested. Disabled due to lack of testing from my side.
    dragonfly
    netbsd
    openbsd
    Not supported at all. Source: [dragonfly lseek.c], [gnulib lseek doc]. Disabled.
    macos Claimed to be supported, but highly problematic. See [gnulib unistd.h.in], [gnulib lseek.c], [gnulib lseek doc]. Also, I did catch a weird behavior on macOS github runners: files loose their sparseness right after the moment you close them. Disabled.
  6. A new dependency nix is pulled for convenience.
    As a consequence, the MSRV is raised to 1.69.
    Although it's possible to rewrite it with unsafe libc calls if that's preferred.

  7. During archive creation, it allocates 16 bytes of memory per hole. (see struct SparseEntries)
    In the worst pathological case, it will allocate 2 MiB per GiB of a file in a "zebra" pattern. (4096 bytes of data followed by a 4096-byte hole)
    If it's a concern, I think, it's possible to implement it in an allocation-free way for Builder<SW> where W: Seek + Write.

  8. Two tests involving sparse files are added.
    Tests are run unconditionally, but strict checks are only done in the following cases:

    • tests/all.rs writing_sparse: strict only on Linux. Not strict on FreeBSD due to the UFS caveat mentioned above.
    • src/builder.rs test_find_sparse_entries: strict on Linux and FreeBSD.

    Potentially, these strict checks might fail when run on a filesystem that doesn't support holes.

@xzfc xzfc force-pushed the sparse-builder branch 2 times, most recently from 63f17d1 to ef748ac Compare August 23, 2024 13:03
Copy link
Owner

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR! I'll apologize though in that I don't have a huge amount of time and energy for this crate so while this looks quite good it's also large-ish so will probably take some time to land. If you're up for making some changes though I think this can be made easier to land perhaps.

  • For turning this on-by-default the main thing that I'd be worried about is that this can affect Cargo. Cargo wants good compatibility with older versions of Cargo as well so I suspect that Cargo will want to disable this by default when updating. Would you be up for helping to coordinate that? I'll note though that I agree turning this on-by-default is reasonable.
  • For depending on nix, I'd personally prefer not to. How difficult would it be to use libc and syscalls there?
  • Could you write some more comments in various places? I don't have the tar specification in my head unfortunately so if you're able to link to various places as to what sparse is and how it works that'd be helpful. For example there's one location where it says let mut it = entries.entries.iter().skip(4).peekable(); but I'm not sure why 4 entries are skipped.
  • Do you think there's some extra tests that could be written? The tests of find_sparse_entries look pretty good (thanks!) but the integration tests in tests/all.rs are relatively light. If you feel everything is covered though that's also ok.

And finally, this is only if you're interested and is optional, but passing around sparse: bool like the other parameters looks like it's not the greatest thing in the world. It might be best to wrap up a options like that in some sort of struct stored in the builder which can be passed aroudn to avoid having lots of parameters. If you're interested in such a refactoring I think it'd be worth doing that, but again it's optional and I don't think it's required for this.

tests/all.rs Outdated
Comment on lines 1214 to 1215
assert!(data.len() <= 0x4_000); // Typically 0x1800 but potentially might
// vary based on a filesystem.
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this entire test run unconditionally on all systems, and this assert be limited to only the known OSes that implement sparse file support?

@xzfc
Copy link
Collaborator Author

xzfc commented Sep 5, 2024

[…] I suspect that Cargo will want to disable this by default when updating. Would you be up for helping to coordinate that? […]

Sure. What we need is to add a few builder.sparse(false); calls. I'll create PR after this PR gets merged.
Perhaps it is also worth mentioning (maybe in release notes) potential compatibility issues:

Note

Starting from this version, the Builder will create sparse archives by default, unless disabled explicitly by calling builder.sparse(false). Ability to read sparse entries support was added in tar-rs 0.4.6. Earlier versions of this crate won't be able to extract sparse files from archives.

Also, for a timeline, the corresponding tar-rs bump for Cargo was rust-lang/cargo#2973, which is included in Cargo 0.13 which is included in rust 1.12.

For depending on nix, I'd personally prefer not to. How difficult would it be to use libc and syscalls there?

Not difficult, just a bit more of unsafe calls. I've updated the PR to remove it.

Could you write some more comments in various places? […] I'm not sure why 4 entries are skipped.

Done. As for 4, I've added a const GNU_SPARSE_HEADERS_COUNT and a comment about it with a link.

Do you think there's some extra tests that could be written? […] the integration tests in tests/all.rs are relatively light.

Updated the integration test to be more comprehensive.

[…] passing around sparse: bool like the other parameters looks like it's not the greatest thing in the world. […] If you're interested in such a refactoring I think it'd be worth doing that […]

I've prepended a commit to introduce BuilderOptions.


Besides that, I've updated the list of supported platforms, see today's edits of the PR description.

@alexcrichton
Copy link
Owner

I apologize for the delay in getting back to this, but I also want to express many thanks again for your work on this! I really appreciate your willingness to send a PR here and put up with me being slow, and I think this has shaped up well!

Given the age of sparse file support I think it might be reasonable to start off opening an issue on the Cargo repo alerting maintainers to the status of the situation and asking them if they think it's worth adding sparse(false) calls. Rust 1.12 is so old I think it'd be reasonable to leave as-is, but it's worth looping them in regardless.

@alexcrichton alexcrichton merged commit 7a12150 into alexcrichton:main Sep 20, 2024
7 checks passed
@alexcrichton
Copy link
Owner

Also, if you're willing, I'd love to have help maintaining this repo if you're interested!

@xzfc
Copy link
Collaborator Author

xzfc commented Sep 24, 2024

I've created an issue at the Cargo repo (see the link above).

Also, if you're willing, I'd love to have help maintaining this repo if you're interested!

Yes, I'd be happy to help maintain this repo.

@alexcrichton
Copy link
Owner

Thanks! That also reminds me that I'd forgotten to publish these changes, which I've done so now.

I've added you and @cgwalters to this repo now. Feel free to review/merge PRs and/or help with issue triage as you see fit! I'll continue to help out with publication and actually running cargo publish for now if that's ok.

@cgwalters
Copy link
Collaborator

I created #379 to track followup to this (having important discussion on a closed PR feels odd).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants