-
Notifications
You must be signed in to change notification settings - Fork 193
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for unpacking using custom Unpacker API #355
base: main
Are you sure you want to change the base?
Conversation
This introduces a new Unpacker trait to the API, and new `unpack_with` methods to Archive and Entry. Custom unpackers can be used to abstract/optimize IO and filesystem operations, for example by using io_uring on linux. The current/default unpacking code is moved to DefaultUnpacker.
The high level goal makes total sense. That said though, is your software already doing decompression in a separate thread from the one doing unpacking? If you just implement this software in the obvious way of putting a zstd decoder reader in front of I'm curious, is your program also processing multiple entries asynchronously? IOW when you have multiple regular files in a row, you're writing them to disk simultaneously? This all said...this is a big PR, and it's not clear to me it's really a win versus just directly processing the entries externally. Or maybe instead of this PR, what you could do is maintain e.g. a tokio-tar-uring crate (that's a whole other thing in this, there's also https://docs.rs/tokio-tar/latest/tokio_tar/ and many people doing stuff like this want to do async too). |
Thanks for taking a look! :)
Yes I'm writing multiple entries asynchronously, and I'm doing a bunch of other stat/link etc syscalls asynchronously with io-uring too. In my case the tarball is large and contains many entries (~2 million files), so syscall and fs overhead are significant. I want to do writes and all other syscalls as fast as possible. I have profiled what I'm doing extensively, wrote a custom profiler for it and gave a talk about it at a rustau meetup a while ago if you're interested https://www.youtube.com/watch?v=JX0aVnpHomk.
Not entirely sure I understand what you mean? Externally how?
Sure I can fork, but I'd rather not fork if I can avoid it. |
First, bear in mind I have no “official” maintainer role on this repository. I just commented because I have other outstanding PRs here and I try to help.
> This all said...this is a big PR, and it's not clear to me it's *really* a win versus just directly processing the entries externally.
>
Not entirely sure I understand what you mean? Externally how?
Maybe I am misunderstanding but it seems like you are changing the unpacking code here to have it drive a virtual file system API effectively.
But you can just use the existing ability to read the data directly from the tar archive and have your own code do the unpacking, right? The cost here would just be needing to handle some of the “normalization” (especially security related) that the code here does. That said IMO since you are only targeting Linux you can use openat2 RESOLVE_BENEATH anyways.
Sure I can fork, but I'd rather not fork if I can avoid it.
Yeah, of course. That said maybe what we really need in the ecosystem is a “sans-io” tar crate.
|
I figured, all help is welcome :)
It's not "just" normalization tho, It preserves mtimes/ownerships handles tar extensions etc. Sure I can do it in my code, but I'm trying to avoid having to do exactly that and for everyone who wants to extract large tarballs efficiently to have to do it too.
Yeah - isn't this PR a step in that direction? There's a default IO implementation, or you can bring your own via the Unpacker trait. I did call it My view is this: this crate is very popular and it does do IO today. It works well for reasonably large files. It doesn't work so well for very large files, and there's nothing a consumer of the API can do about it other than forking and becoming a tar expert. We could create another sans-io crate, but that wouldn't fix the issue that unpacking large files with this crate is not great, and that seems like a fixable thing not something that justifies a whole new crate. If we had something like the Unpacker trait in this PR, then I would be happy to maintain the fast io-uring code in a separate crate. In this crate we could add docs to Lastly, I concede that the PR is big, but it's mostly shuffling existing nested function definitions around. I can probably reduce the diff - I didn't spend too much time trying to make the perfect PR because I was seeking feedback on the general direction. And again, I do appreciate your feedback! |
The default As far as preserving mtimes and ownership - that's the thing, ultimately you as a consumer of this "custom unpacking api" would need to bridge between the mtimes provided and an io-uring or whatever other API anyways, right? It seems to me the main advantage of an "unpacker API" here is that the implementation would effectively be forced to handle all the tar attributes (or at least, explicitly ignore them). As far as other tar extensions like xattrs; I think we should have a nice high level API to access them that's used by the unpacking code. (This is very tangential but we have a different problem in that there's no API to build archives with some of those extensions) |
I wanted to update you here with a bit of feedback; see #379 (comment) At the current time I personally think it makes more sense for us to aim more twards a "sans-io" style crate, and leave unpacking/packing from the filesystem like we have today as "sugar". This is going in a bit of a contrary direction to that, and in the end it's not doing anything that one couldn't do external to this crate today either. Bridging between tar and io-uring is cool of course, but I think what we want here is another crate that is that bridge. IOW nothing stops you from publishing a "tar-event-stream" or something crate that is basically this "unpacker callback API' right? |
Hello!
This PR proposes introducing a new
Unpacker
trait to the API, and newunpack_with
methods toArchive
andEntry
. The aim is to offer users a way to use specialized fs and IO operations to speed up unpacking.Background: I work on a program that unpacks a really large .tar.zst (60G compressed, 188G uncompressed) at launch. Using tar-rs main which uses regular fs ops and io::copy, it takes about 4m30s to unpack. Using io_uring on linux, with pre-allocated huge page buffers etc, it takes about 1m30s uncompressing at ~1.8GB/s which is the max zstd can do.
This PR introduces the API I need to plumb io_uring, without it being specific to io_uring.