Streaming archive/unarchive capabilities #2815
I think this is a great idea! I think the best way to approach this would be a separate subcommand. So I was imagining something like `rclone zip source destination`, where source could be any local or remote directory and destination could be any local or remote file, with the analogous `rclone unzip` for extraction. However, having looked at your archive command, I think you are probably thinking a bit more ambitiously than just zip, so the command could be a more general `rclone archive`. Or maybe detection of the file extension should be enough to work out what it should do, perhaps with some flags to say unambiguously what was required.
How would you do the detection? With the file name? That would be easy enough. Rclone can read mime types of things too.
Hmm, perhaps it would be better to deal with this using your archiver binary and piping the data through rclone?
So, yeah -- the archiver package mostly encourages, and in some cases enforces, file extensions for detecting the file type. When reading archives using high-level functions that aren't specific to a certain format, it prefers the actual file header (but we need to add support for compressed tar.* files by header). When writing files, it requires that the extension matches the format, to avoid the kind of confusion I once had to deal with that set me back by HOURS. Anyway, file extension is a fine way to go generally, and people who want a different extension can rename before or after the archival operation.
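For the curious, header-based detection mostly comes down to sniffing well-known magic bytes. A minimal Go illustration of the idea (not archiver's actual implementation):

```go
package archdemo

import "bytes"

// sniffFormat identifies a handful of common archive/compression formats
// from their well-known magic bytes. Pass at least the first 512 bytes.
func sniffFormat(header []byte) string {
	switch {
	case bytes.HasPrefix(header, []byte("PK\x03\x04")):
		return "zip"
	case bytes.HasPrefix(header, []byte{0x1f, 0x8b}):
		return "gzip"
	case bytes.HasPrefix(header, []byte("BZh")):
		return "bzip2"
	case bytes.HasPrefix(header, []byte{0xfd, '7', 'z', 'X', 'Z', 0x00}):
		return "xz"
	case len(header) >= 262 && bytes.Equal(header[257:262], []byte("ustar")):
		return "tar" // the ustar magic sits at offset 257 in the first header block
	default:
		return "unknown"
	}
}
```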
How would that work? The main reason I didn't implement streaming archival operations in the binary/command (the library supports this, though) is because I wasn't sure which protocol to use to differentiate the files in the stream. We could of course read in a tar stream, but then the program outputting that stream might as well be writing the tar file itself! See what I mean? If the program emitting the stream can delineate the files and write them out, it might as well write the tar stream itself. I agree, though, that if there's a way for users to just glue existing commands together, that'd be better. If this was added to rclone as something like `rclone archive source destination/file.zip`, that is a good interface, I like it. Similarly, an `rclone extract source/file.zip destination` would read the zip file and extract its contents into the destination. With examples like these, I might even be able to submit a PR just for fun, if you think it's not feature creep. Let me know what you decide!
I was thinking that the files would be local and the compressed archive would be streamed in or out. That isn't as general purpose though. It does give me an idea that either the source or the destination could be stdin/stdout. I could see something like `rclone archive /path/to/files -` writing the archive to stdout. You can already do something similar for single files with `rclone cat` and `rclone rcat`.
Would you use your archiver library for that? What formats would you support?
The only thing about streaming zip files is that when reading them (not writing), you need to know the length of the stream when you begin.
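For reference, the Go standard library makes that asymmetry visible in its API: reading a zip takes an io.ReaderAt plus the total size, while a tar can be read from a plain forward-only stream. A minimal sketch:

```go
package archdemo

import (
	"archive/tar"
	"archive/zip"
	"bytes"
	"fmt"
	"io"
)

// listZip needs random access and the total length of the data up front.
func listZip(data []byte) error {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return err
	}
	for _, f := range zr.File {
		fmt.Println("zip entry:", f.Name)
	}
	return nil
}

// listTar happily consumes a forward-only stream of unknown length.
func listTar(r io.Reader) error {
	tr := tar.NewReader(r)
	for {
		hdr, err := tr.Next()
		if err == io.EOF {
			return nil
		}
		if err != nil {
			return err
		}
		fmt.Println("tar entry:", hdr.Name)
	}
}
```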
True; it could be nice to make it cross-platform.
Yes, and probably all the formats that the archiver library supports. I guess I'll shelve this proposal for now until I have a more concrete understanding of how the API/interface will work, and to see if there's any other demand for it.
Rclone objects know their size. You can seek them too but it is a bit clunky - you have to close them and re-open the streams.
:-)
As I have said in #2891, I think one solution is the approach outlined there. There are other issues to consider if we want to take this approach, though.
I think the tar support should be written in Go. I have done a PHP implementation to stream a directory as tar: https://github.com/Ciantic/archive-to-output-stream/blob/master/tardir.php (it just streams the whole directory as a tar file, so it should work for arbitrarily big directories). Ideally the Go implementation should have similar properties: it should be streaming, so that it doesn't gather up the files first, and it should apply no compression either.
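A minimal Go sketch of that behaviour, using only the standard library (walk the tree and copy each file straight into the tar stream, no compression, nothing buffered):

```go
package main

import (
	"archive/tar"
	"io"
	"io/fs"
	"log"
	"os"
	"path/filepath"
)

func main() {
	root := os.Args[1]
	tw := tar.NewWriter(os.Stdout) // the archive goes straight to stdout
	defer tw.Close()

	err := filepath.WalkDir(root, func(path string, d fs.DirEntry, walkErr error) error {
		if walkErr != nil {
			return walkErr
		}
		info, err := d.Info()
		if err != nil || !info.Mode().IsRegular() {
			return err // skip directories, symlinks etc. for brevity
		}
		hdr, err := tar.FileInfoHeader(info, "")
		if err != nil {
			return err
		}
		rel, err := filepath.Rel(root, path)
		if err != nil {
			return err
		}
		hdr.Name = filepath.ToSlash(rel)
		if err := tw.WriteHeader(hdr); err != nil {
			return err
		}
		f, err := os.Open(path)
		if err != nil {
			return err
		}
		defer f.Close()
		_, err = io.Copy(tw, f) // file bytes flow straight through, unbuffered
		return err
	})
	if err != nil {
		log.Fatal(err)
	}
}
```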
I will second it. My use case is exactly this: I need to create an archive from a remote folder. I tried to work around it with existing commands, without luck. My additional suggestion is to support outputting the archive to stdout if the target is not set, so something like `rclone archive remote:folder` would write the archive to stdout.
If this would be implemented in the way you currently have in mind, would the archiving happen on the fly? Let's say I have 300 GB of files on an external disk and I want to bundle them into a single archive and store it in S3. My PC does not have 300 GB of free space to archive them first and then upload. Would this backend be able to archive them on the fly, without needing free memory or disk space equal to the size of the archive? I also think that having the ability to bundle small files into larger archives should speed up uploads to storage like S3.
Archiver v4 has a very good, stream-oriented API that would be perfect for this. And yes, it can create archives on the fly in memory. https://github.com/mholt/archiver
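The usual Go pattern for archiving on the fly is an io.Pipe connecting the archive writer to the uploader, so the archive is produced and consumed concurrently and never lands on disk. A sketch, where uploadToS3 is a hypothetical stand-in for a real uploader:

```go
package main

import (
	"archive/tar"
	"io"
	"log"
)

// uploadToS3 is a hypothetical stand-in: it just drains the stream the way a
// real object-storage uploader would.
func uploadToS3(r io.Reader) error {
	_, err := io.Copy(io.Discard, r)
	return err
}

func main() {
	pr, pw := io.Pipe()

	go func() {
		tw := tar.NewWriter(pw)
		// ... add entries with tw.WriteHeader / io.Copy as files are read ...
		tw.Close()
		pw.Close() // signal EOF to the uploader
	}()

	// Consumes the archive as it is being produced; no 300 GB staging area.
	if err := uploadToS3(pr); err != nil {
		log.Fatal(err)
	}
}
```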
In v1.58.0-beta.5990.02faa6f05.zip-backend on branch zip-backend (uploaded in 15-30 mins) is an experiment I did a while back to make a zip backend using @mholt's archiver library. That should give you an idea of which interfaces rclone needs. Roughly, the interfaces rclone would want from archiver are: list the entries in an archive, stream the contents of an individual entry, append entries to a new archive one at a time, and close the archive properly when done. And the interfaces rclone would provide to archiver are its standard io streams (io.Reader/io.Writer, with limited seeking). Is this something archiver can do?
One important aspect of archiving is the final file size. Would the size of the resulting archive be known before it is fully written? From what I understand, we can only determine the final file size when there is no compression, i.e., store mode. But that may be desirable.
@ncw Awesome! That's an impressive amount of work. 😳
True. To clarify, my archiver package does give you control over compression. Regarding your interface questions:
Yep, see the Extract method. (Since archives can contain many, many entries, this is a Walk-style interface instead of returning a slice. But if you really want a slice, you can collect the entries in the callback.)
Yeah, files can be streamed, but does rclone even need that here? In any case, you'd probably want to open each entry's contents lazily, only when you actually need them.
Yes, but new archives only. I haven't seen any robust literature -- let alone Go implementations -- that suggests you can reliably append to the Zip archive format. I think even the standard library's archive/zip doesn't support appending to an existing archive.
If I understand correctly, then yeah, archiver can close out archives properly when you're done writing them. Again, it uses the standard library's writers under the hood.
Yes, archiver uses these types. It actually requires nothing beyond the standard io interfaces. Hope I understood your questions correctly. Let me know if there are more questions!
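For readers following along, the shapes under discussion look roughly like this (paraphrased from memory of the archiver v4 API; exact names may differ):

```go
package archdemo

import (
	"context"
	"io"
	"io/fs"
)

// File describes one archive entry; paraphrased from memory of archiver v4.
type File struct {
	fs.FileInfo                                 // size, mode and modtime of the entry
	NameInArchive string                        // path of the entry within the archive
	Open          func() (io.ReadCloser, error) // contents are opened lazily, on demand
}

// Extractor walks the archive, calling handleFile once per (matching) entry.
type Extractor interface {
	Extract(ctx context.Context, archive io.Reader, paths []string,
		handleFile func(ctx context.Context, f File) error) error
}

// Archiver takes the full list of files up front, which is the point of
// contention in the discussion that follows.
type Archiver interface {
	Archive(ctx context.Context, output io.Writer, files []File) error
}
```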
I think so.
That looks fine. I assume it doesn't actually read the data unless you ask for it?
Rclone will discard bytes read up to the seek point if the stream can't seek, which is inefficient, but works!
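In standard library terms, that fallback is just:

```go
package archdemo

import "io"

// discardTo emulates a forward seek on a non-seekable stream by reading and
// throwing away bytes up to the target offset (forward seeks only).
func discardTo(r io.Reader, offset int64) error {
	_, err := io.CopyN(io.Discard, r, offset)
	return err
}
```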
Yes, I'm assuming that we are only ever creating new archives, not updating old ones. Looking at your interfaces, I think the biggest problem is the Archive method.
This assumes that we know all the files up front. What rclone would like is to be able to archive files one at a time, so it can supply the file data and the file metadata at the same time. Only that way does rclone not have to buffer the files to disk. If the prototype was something like the sketch below, with the caller being expected to close each entry's writer:
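A hypothetical shape for such a prototype, with names invented for illustration (this is not an actual rclone or archiver API):

```go
package archdemo

import (
	"context"
	"io"
	"io/fs"
)

// ArchiveWriter is a hypothetical builder-style interface: the caller adds
// one entry at a time, writes its data, and closes it before adding the
// next, so nothing has to be buffered to disk.
type ArchiveWriter interface {
	// Add appends an entry with its metadata and returns a writer for its
	// data; the caller must Close the writer before calling Add again.
	Add(ctx context.Context, name string, info fs.FileInfo) (io.WriteCloser, error)
	// Close finalizes the archive (trailers, central directory, etc).
	Close() error
}
```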
Then rclone could supply the files one at a time, as it discovers them.
Correct. It won't even open the file until you call its Open function.
Good point! Well, rclone would have to at least iterate the list of files before calling Archive(). So yeah, if that won't work for you, I'm totally down for adding a "builder"-style interface that incrementally adds files as you discover them, kind of like the standard library's zip.Writer. I'll see if I can come up with some sort of prototype.
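For comparison, the standard library's zip.Writer already works in exactly that incremental style:

```go
package main

import (
	"archive/zip"
	"log"
	"os"
)

func main() {
	out, err := os.Create("example.zip")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	zw := zip.NewWriter(out)
	w, err := zw.Create("hello.txt") // add an entry as it is discovered
	if err != nil {
		log.Fatal(err)
	}
	if _, err := w.Write([]byte("hello, archive\n")); err != nil {
		log.Fatal(err)
	}
	if err := zw.Close(); err != nil { // finalizes the central directory
		log.Fatal(err)
	}
}
```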
@mholt wrote that rclone would have to at least iterate the list of files before calling Archive(). Alas, that won't work for rclone. Rclone doesn't really deal in files, only in streams. The internals of rclone expect the file to be fully uploaded once the upload call returns.
Yes, I like the idea of multi-archive-type support very much!
Either style would work for me :-)
@ncw I've implemented an ArchiveAsync method. It's an optional interface, but both Zip and Tar implement it, so you can just type-assert your Archiver value to use it.
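From the caller's side, that looks roughly like this (a sketch from memory of the v4 API; exact field and type names may differ):

```go
package archdemo

import (
	"context"
	"errors"
	"io"

	"github.com/mholt/archiver/v4"
)

// streamArchive feeds entries over a channel one at a time, so the caller
// never needs the full file list up front. Sketched from memory of the v4
// async interface; treat the names as approximate.
func streamArchive(ctx context.Context, output io.Writer, files <-chan archiver.File) error {
	var format archiver.Archiver = archiver.Zip{}
	async, ok := format.(archiver.ArchiverAsync) // optional interface
	if !ok {
		return errors.New("format does not support streaming archival")
	}

	jobs := make(chan archiver.ArchiveAsyncJob)
	go func() {
		defer close(jobs)
		for f := range files { // supplied one at a time, as discovered
			result := make(chan error, 1)
			jobs <- archiver.ArchiveAsyncJob{File: f, Result: result}
			if err := <-result; err != nil {
				return
			}
		}
	}()
	return async.ArchiveAsync(ctx, output, jobs)
}
```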
I came here exactly because of this reason. I have lots of folders with thousands of images that don't need to live outside a zip. Besides being a pain to synchronize, I'm near the limit on the number of files on gdrive.
@AllanVan - have a go with the binary I posted in this message: #2815 (comment)
I was just wondering about the use case where on the remote you have an archive that you want to extract directly to another location. Do I understand right that this use case would also be covered by the above issue?
Hi all, could someone please (maybe @ncw) confirm this feature is currently not available in rclone? I'm particularly interested in the use case where we store compressed tar.gz files remotely on S3 and stream and unarchive them onto the local filesystem without having to create a local copy first. Thank you.
I would be interested in this feature too. Very helpful for big files :)
What is your current rclone version (output from `rclone version`)?

1.45
What problem are you trying to solve?
It could be useful to download or upload files as an archive, without having to first download or upload the files and then create the archive, which can nearly double the space usage. In other words, rather than downloading the files, creating a zip or tar.gz archive, and then deleting the original downloads, it'd be nice to download the files directly as an archive, so that they don't use extra space on disk just to produce the archive.
How do you think rclone should be changed to solve that?
I could imagine a few ways to do it. Rclone could take a folder or archive file on one end, and spit out the opposite on the other end. For example:

- folder source -> archive destination creates an archive file, and
- archive source -> folder destination extracts it,

regardless of the backends being used. I'm not sure, though, if detecting whether the source or destination is an archive file is trivial. So maybe a flag, `--archive` or something?

cf: #675 (comment)
Thanks so much for your work on rclone!