
Streaming archive/unarchive capabilities #2815

Open
mholt opened this issue Dec 5, 2018 · 24 comments

@mholt
Contributor

mholt commented Dec 5, 2018

What is your current rclone version (output from rclone version)?

1.45

What problem are you trying to solve?

It could be useful to download or upload files as an archive, without having to first download or upload the files and then create the archive, which can nearly double the disk space used.

In other words, rather than downloading the files, creating a zip or tar.gz archive, and then deleting the original downloads, it'd be nice to download the files directly as an archive, so that they don't use so much extra space on disk just to make the archive.

How do you think rclone should be changed to solve that?

I could imagine a few ways to do it. Rclone could take a folder or archive file on one end, and spit out the opposite on the other end. For example: folder source -> archive destination creates an archive file, and archive source -> folder destination extracts it, regardless of backends being used.

I'm not sure, though, if detecting whether the source or destination is an archive file is trivial.

So maybe a flag? --archive or something.

cf: #675 (comment)

Thanks so much for your work on rclone!

@ncw ncw added the enhancement label Dec 5, 2018
@ncw ncw added this to the Help Wanted milestone Dec 5, 2018
@ncw
Member

ncw commented Dec 5, 2018

I think this is a great idea!

I think the best way to approach this would be a separate subcommand.

So I was imagining something like

rclone zip source destination.zip

where source could be any local or remote directory and destination could be any local or remote file.

so

rclone zip /path/to/directory googledrive:files.zip
rclone zip googledrive:files files.zip

With the analogous unzip.

However, having looked at your archive command, I think you are probably thinking a bit more ambitiously than just zip... so the command could be archive with sub-subcommands zip, unzip.

Or maybe detection of the file extension should be enough to work out what it should do

rclone archive source destination.zip -- zips source into destination.zip

and maybe with some flags (--zip, --unzip) to say unambiguously what is required, if for some reason the user didn't want to use the normal file extensions.

I'm not sure, though, if detecting whether the source or destination is an archive file is trivial.

How would you do the detection? With file name - that would be easy enough. Rclone can read mime types of things too.
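Extension-based detection along these lines is straightforward to sketch in Go. The helper below is purely illustrative (detectFormat is a hypothetical function, not rclone's or archiver's actual logic), showing why multi-part extensions like .tar.gz need to be checked before plain .tar:

```go
package main

import (
	"fmt"
	"strings"
)

// detectFormat guesses an archive format from the file name alone.
// The extension list is illustrative; longer suffixes must be
// matched first so "x.tar.gz" isn't mistaken for a bare tar.
func detectFormat(name string) string {
	lower := strings.ToLower(name)
	switch {
	case strings.HasSuffix(lower, ".tar.gz"), strings.HasSuffix(lower, ".tgz"):
		return "tar.gz"
	case strings.HasSuffix(lower, ".tar"):
		return "tar"
	case strings.HasSuffix(lower, ".zip"):
		return "zip"
	default:
		return "" // unknown: fall back to MIME type or a flag
	}
}

func main() {
	fmt.Println(detectFormat("files.zip"))
	fmt.Println(detectFormat("backup.tar.gz"))
}
```

MIME-type sniffing could serve as a fallback when the name is ambiguous, as suggested above.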

@ncw
Member

ncw commented Dec 5, 2018

Hmm, perhaps it would be better to deal with this using your archiver binary and rclone cat and rclone rcat...

@mholt
Contributor Author

mholt commented Dec 5, 2018

So, yeah -- the archiver package mostly encourages, and in some cases enforces, file extensions for detecting the file type. When reading archives using high-level functions that aren't specific to a certain format, it prefers the actual file header (but we need to add support for compressed tar.* files by header). When writing files, it requires that the extension match the format, to avoid the kind of confusion that once set me back by hours.

Anyway, file extension is a fine way to go generally, and people who want a different extension can rename before or after the archival operation.

Hmm, perhaps it would be better to deal with this using your archiver binary and rclone cat and rclone rcat...

How would that work? The main reason I didn't implement streaming archival operations in the binary/command (the library supports this, though) is because I wasn't sure which protocol to use to differentiate different files in the stream. We could of course read in a tar stream, but then the program outputting that stream might as well be writing the tar file itself! See what I mean? If the program emitting the stream can delineate the files and write them out, it might as well write the tar stream itself.

I agree, though, if there's a way for users to just glue existing commands together, that'd be better.

If this was added to rclone:

rclone archive source destination.zip

This is a good interface, I like it. Similarly:

rclone archive source.zip destination

would read the zip file and extract its contents into the destination. In these examples, source and destination could be any rclone storage system, and the archive format is determined by file extension. Nice.

I might even be able to submit a PR just for fun, if you think it's not feature creep. Let me know what you decide!

@ncw
Member

ncw commented Dec 6, 2018

How would that work? The main reason I didn't implement streaming archival operations in the binary/command (the library supports this, though) is because I wasn't sure which protocol to use to differentiate different files in the stream. We could of course read in a tar stream, but then the program outputting that stream might as well be writing the tar file itself! See what I mean? If the program emitting the stream can delineate the files and write them out, it might as well write the tar stream itself.

I was thinking that the files would be local and the compressed archive would be streamed in or out. That isn't as general purpose though.

It does give me an idea though: either the source or destination could be - meaning stdin or stdout. You'd need some flags to control the format though: --output-format .zip --input-format .tar.gz

I could see something like rclone archive source destination.zip being quite popular as zips in particular don't normally lend themselves to being streamed...

You can do tar zcvf - /path/to/dir | rclone rcat remote:file.tar.gz and rclone cat remote:file.tar.gz | tar zxvf - which is fine for unix users, but it leaves windows and zip users out in the cold...

I might even be able to submit a PR just for fun, if you think it's not feature creep. Let me know what you decide!

Would you use your archiver library for that? What formats would you support?

@mholt
Contributor Author

mholt commented Dec 6, 2018

I could see something like rclone archive source destination.zip being quite popular as zips in particular don't normally lend themselves to being streamed...

The only thing about streaming zip files is that when reading them (not writing), you need to know the length of the stream when you begin.
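The length requirement is visible directly in the standard library: archive/zip's reader takes an io.ReaderAt plus the total size up front, because the central directory sits at the end of the archive. A minimal self-contained illustration (makeZip and listZip are hypothetical helpers for demonstration):

```go
package main

import (
	"archive/zip"
	"bytes"
	"fmt"
)

// makeZip builds a tiny zip archive in memory.
func makeZip() []byte {
	var buf bytes.Buffer
	zw := zip.NewWriter(&buf)
	w, _ := zw.Create("hello.txt")
	w.Write([]byte("hi"))
	zw.Close()
	return buf.Bytes()
}

// listZip opens a zip from a byte slice. zip.NewReader needs an
// io.ReaderAt *and* the total size before reading can begin,
// since the central directory is located at the end of the file.
func listZip(data []byte) ([]string, error) {
	zr, err := zip.NewReader(bytes.NewReader(data), int64(len(data)))
	if err != nil {
		return nil, err
	}
	var names []string
	for _, f := range zr.File {
		names = append(names, f.Name)
	}
	return names, nil
}

func main() {
	names, err := listZip(makeZip())
	if err != nil {
		panic(err)
	}
	fmt.Println(names)
}
```

Writing a zip, by contrast, only needs an io.Writer, which is why creating zips streams fine while reading them does not.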

You can do tar zcvf - /path/to/dir | rclone rcat remote:file.tar.gz and rclone cat remote:file.tar.gz | tar zxvf - which is fine for unix users, but it leaves windows and zip users out in the cold...

True; it could be nice to make it cross-platform.

Would you use your archiver library for that? What formats would you support?

Yes, and probably all the formats that the archiver package supports (assuming we can infer format from file extension or a flag or something).

I guess I'll shelve this proposal for now until I have a more concrete understanding of how the API/interface will work, and to see if there's any other demand for it.

@ncw
Member

ncw commented Dec 6, 2018

The only thing about streaming zip files is that when reading them (not writing), you need to know the length of the stream when you begin.

Rclone objects know their size. You can seek them too but it is a bit clunky - you have to close them and re-open the streams.

I guess I'll shelve this proposal for now until I have a more concrete understanding of how the API/interface will work, and to see if there's any other demand for it.

:-)

@QuantumGhost

As I have said in #2891, I think one solution is like this:
We add a tar command that reads all files in a given path and outputs their information in a tar-compatible manner; we then pipe that output to tar, and pipe tar's output to rclone rcat to save it to another remote.
Thus we avoid saving the files to be compressed to disk.
Also, I think it's not a good idea to include a compress/decompress feature directly in rclone. I'd like to keep rclone simple and lean (it's also hard to support many formats).

There are other issues to consider if we want to take this approach, including but not limited to:

  • The tar format stores more information than what we have (for example, a file in tar has a user ID and group ID); how should we set those fields when writing tar streams?
  • The approach above works well for *nix users. However, it may not be that convenient for Windows users. (I don't know whether compression software on Windows supports this.)

@Ciantic

Ciantic commented Jul 5, 2019

I think the tar support should be written in Go.

I have done a PHP implementation that streams a directory as tar: https://github.com/Ciantic/archive-to-output-stream/blob/master/tardir.php (it just streams the whole directory as a tar file, so it should work for arbitrarily big directories). Ideally the Go implementation should have similar properties: it should stream, so that it doesn't gather up the files first, and apply no compression either.

@sntran
Contributor

sntran commented Dec 21, 2021

I will second it. My use case is exactly this: I need to create an archive from a remote folder. I tried rclone cat and piped the stream to tar | gzip, etc., but the content of all the files inside the folder is concatenated into one, so it's not very useful.

My additional suggestion is to support outputting the archive to stdout if the target is not set. So rclone zip remote:folder will send the archive content to stdout, similar to rclone cat. This way, the end user can decide to either rclone rcat to another remote, or pipe to other commands to process.

@m-radzikowski

If this were implemented the way you currently have in mind, would the archiving happen on the fly? Say I have 300 GB of files on an external disk and I want to bundle them into a single archive and store it in S3. My PC does not have 300 GB of free space to build the archive first and then upload it.

Would this backend be able to archive on the fly, without needing free memory/disk space equal to the size of the archive?

I also think that the ability to bundle small files into larger archives should speed up uploads to storage like S3.

@mholt
Contributor Author

mholt commented Feb 6, 2022

Archiver v4 has a very good, stream-oriented API that would be perfect for this. And yes, it can create archives on the fly in memory. https://github.com/mholt/archiver

@ncw
Member

ncw commented Feb 6, 2022

In v1.58.0-beta.5990.02faa6f05.zip-backend on branch zip-backend (uploaded in 15-30 mins) is an experiment I did a while back to make a zip backend using archive/zip. You use it as :zip:remote:path/to/file.zip and it can read from zips or write to them. It can't update zips though!

@mholt that should give you an idea of what interfaces rclone needs. archive/zip is nice because it also provides CRC32s, which rclone can check end to end.

Roughly the interfaces rclone would want from archiver are:

  • list the files in the archive
  • stream a file from the archive, ideally being able to seek to a given point in it, ie an io.ReadSeeker
  • write a file to the archive with an io.Writer
  • finalize the archive

And the interfaces rclone would provide to archiver are

  • a seekable (for read) interface to the archive so an io.ReadSeeker
  • a non-seekable (for write) interface to the archive, so an io.Writer

Is this something archiver can do?
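For concreteness, the read-side and write-side operations listed above might look like the following Go interfaces. All names here (Entry, ArchiveReader, ArchiveWriter) are hypothetical sketches of the shape being discussed, not rclone's or archiver's real API:

```go
package main

import (
	"context"
	"fmt"
	"io"
)

// Entry describes one member of an archive (hypothetical type).
type Entry struct {
	Name string
	Size int64
}

// ArchiveReader covers the read side: listing members and
// streaming one of them, ideally with seek support.
type ArchiveReader interface {
	List(ctx context.Context) ([]Entry, error)
	Open(ctx context.Context, name string) (io.ReadSeeker, error)
}

// ArchiveWriter covers the write side: adding members one at a
// time to an io.Writer-backed archive, then finalizing it.
type ArchiveWriter interface {
	Create(ctx context.Context, name string, size int64) (io.Writer, error)
	Close() error // finalize the archive
}

func main() {
	fmt.Println("interfaces compile")
}
```

Rclone would then hand the library an io.ReadSeeker over the archive for reading and a plain io.Writer for writing, matching the last two bullets.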

@sntran
Contributor

sntran commented Feb 6, 2022

One important aspect of archiving is the final file size. Would rclone size work with a zip?

From what I understand, we can only determine the final file size when there is no compression, i.e., store mode. But that may be desirable.

@mholt
Contributor Author

mholt commented Feb 7, 2022

@ncw Awesome! That's an impressive amount of work. 😳

archive/zip is nice because it also provides CRC32s, which rclone can check end to end.

True. To clarify, my archiver package does give you the *zip.Header when reading from zip archives if you type-assert the Header field: https://pkg.go.dev/github.com/mholt/archiver/v4#File.Header -- which should give you the CRC32 I believe. Is that all you'd need?

Regarding your interface questions:

Roughly the interfaces rclone would want from archiver are:

  • list the files in the archive

Yep, see Extract: https://pkg.go.dev/github.com/mholt/archiver/v4#Extractor

(Since archives can contain many many entries, this is a Walk-style interface instead of returning a slice. But if you really want a slice you can use the io/fs APIs: https://pkg.go.dev/github.com/mholt/archiver/v4#ArchiveFS)

  • stream a file from the archive, ideally being able to seek to a given point in it, ie an io.ReadSeeker

Yeah, files can be streamed, but does even archive/zip give you a ReadSeeker? The term Seek doesn't appear anywhere on the docs page for that package. I think it's just a ReadCloser. (mholt/archiver uses archive/zip under the hood.)

But, in any case, you'd probably want Extract: https://pkg.go.dev/github.com/mholt/archiver/v4#Zip.Extract

  • write a file to the archive with an io.Writer

Yes, but new archives only. I haven't seen any robust literature -- let alone Go implementations -- that suggest you can reliably append to the Zip archive format. I think even the zip command on Linux will create a new archive when you use -r.

  • finalize the archive

If I understand correctly, then yeah, Archiver can close out archives properly when you're done writing them. Again, it uses archive/zip. 👍

And the interfaces rclone would provide to archiver are

  • a seekable (for read) interface to the archive so an io.ReadSeeker
  • a non-seekable (for write) interface to the archive, so an io.Writer

Is this something archiver can do?

Yes, archiver uses these types. It actually requires ReadAt() and Seek() when reading from zip archives. (io.SectionReader can help here if needed.)

Hope I understood your questions correctly. Let me know if there are more questions!

@ncw
Member

ncw commented Feb 7, 2022

archive/zip is nice because it also provides CRC32s, which rclone can check end to end.

True. To clarify, my archiver package does give you the *zip.Header when reading from zip archives if you type-assert the Header field: https://pkg.go.dev/github.com/mholt/archiver/v4#File.Header -- which should give you the CRC32 I believe. Is that all you'd need?

I think so.

Regarding your interface questions:

Roughly the interfaces rclone would want from archiver are:

  • list the files in the archive

Yep, see Extract: https://pkg.go.dev/github.com/mholt/archiver/v4#Extractor

(Since archives can contain many many entries, this is a Walk-style interface instead of returning a slice. But if you really want a slice you can use the io/fs APIs: https://pkg.go.dev/github.com/mholt/archiver/v4#ArchiveFS)

That looks fine. I assume it doesn't actually read the data unless you ask for it?

  • stream a file from the archive, ideally being able to seek to a given point in it, ie an io.ReadSeeker

Yeah, files can be streamed, but does even archive/zip give you a ReadSeeker? The term Seek doesn't appear anywhere on the docs page for that package. I think it's just a ReadCloser. (mholt/archiver uses archive/zip under the hood.)

Rclone will discard bytes read up to the seek point if the stream can't seek, which is inefficient but works!

But, in any case, you'd probably want Extract: https://pkg.go.dev/github.com/mholt/archiver/v4#Zip.Extract

  • write a file to the archive with an io.Writer

Yes, but new archives only. I haven't seen any robust literature -- let alone Go implementations -- that suggest you can reliably append to the Zip archive format. I think even the zip command on Linux will create a new archive when you use -r.

Yes, I'm assuming that we are only ever creating new archives not updating old ones.

Looking at your interfaces, I think the biggest problem is Archiver.

Archive(ctx context.Context, output io.Writer, files []File) error

This assumes that we know all the files we are streaming in advance of calling this, which means rclone would need to buffer them on disk, etc which it doesn't have to with the zip prototype backend.

What rclone would like is to be able to archive files one at a time then it can supply the file data and the file metadata at the same time. Only that way does rclone not have to buffer the files to disk.

If the prototype were something like this, with the caller expected to close the files at the end:

Archive(ctx context.Context, output io.Writer, files <-chan File) error

then rclone could supply each File struct as it has them. This would require Archive to process each File completely before reading the next one, though.

@mholt
Contributor Author

mholt commented Feb 7, 2022

@ncw

I assume it doesn't actually read the data unless you ask for it?

Correct. It won't even open the file until you call Open().

This assumes that we know all the files we are streaming in advance of calling this, which means rclone would need to buffer them on disk, etc which it doesn't have to with the zip prototype backend.

Good point! Well, rclone would have to at least iterate the list of files before calling Archive(), but you wouldn't have to buffer all the files' contents too. Just keep a pointer to their Open() functions as you iterate. Of course, I'm not familiar with your exact constraints here. But in the "ordinary" case of just adding files from disk, it's simply a matter of appending to the slice as you walk each file: https://github.com/mholt/archiver/blob/a44c8d26e207192467f094777c1143024b505ae8/archiver.go#L111-L113 -- you don't need to buffer the files at all.

So yeah, if that won't work for you, I'm totally down for adding a "builder"-style interface, that incrementally adds files as you discover them, kind of like archive/zip (I also really like the channel approach you suggest). The benefit of still using mholt/archiver over archive/zip, though, is that with archiver, you get unified multi-format support. So you wouldn't be limited to just zip files.

I'll see if I can come up with some sort of files <-chan File prototype this week, or maybe I'll go with an Open(), Insert(), Close() approach. (I like the channel idea more, although the concurrency here might be meh...)

@ncw
Member

ncw commented Feb 8, 2022

@mholt wrote:

This assumes that we know all the files we are streaming in advance of calling this, which means rclone would need to buffer them on disk, etc which it doesn't have to with the zip prototype backend.

Good point! Well, rclone would have to at least iterate the list of files before calling Archive(), but you wouldn't have to buffer all the files' contents too. Just keep a pointer to their Open() functions as you iterate. Of course, I'm not familiar with your exact constraints here.

Alas, that won't work for rclone. Rclone doesn't really deal in files, only in streams. The internals of rclone expect the file to be uploaded once the Put or Update call returns.

But in the "ordinary" case of just adding files from disk, it's simply a matter of appending to the slice as you walk each file: https://github.com/mholt/archiver/blob/a44c8d26e207192467f094777c1143024b505ae8/archiver.go#L111-L113 -- you don't need to buffer the files at all.

So yeah, if that won't work for you, I'm totally down for adding a "builder"-style interface, that incrementally adds files as you discover them, kind of like archive/zip (I also really like the channel approach you suggest). The benefit of still using mholt/archiver over archive/zip, though, is that with archiver, you get unified multi-format support. So you wouldn't be limited to just zip files.

Yes I like the idea of multi archive type support very much!

I'll see if I can come up with some sort of files <-chan File prototype this week, or maybe I'll go with an Open(), Insert(), Close() approach. (I like the channel idea more, although the concurrency here might be meh...)

Either style would work for me :-)

@mholt
Contributor Author

mholt commented Feb 9, 2022

@ncw I've implemented ArchiveAsync() on this PR: mholt/archiver#320

It's an optional interface, but both Zip and Tar implement it, so you can just type-assert your Archiver to an ArchiverAsync to get access to the ArchiveAsync() method. I haven't tested it yet... but I hope it's what you need! Let me know what you think when you get around to it.

@AllanVan

AllanVan commented May 28, 2022

I will second it. My use case is exactly this: I need to create an archive from a remote folder. I tried rclone cat and piped the stream to tar | gzip, etc., but the content of all the files inside the folder is concatenated into one, so it's not very useful.

I came here for exactly this reason. I have lots of folders with thousands of images that don't need to live outside a zip. Besides being a pain to synchronize, I'm near the limit on the number of files on gdrive.
I'm not knowledgeable enough to help implement it, but it's great to see others are. I'll help with testing!

@ncw
Member

ncw commented May 30, 2022

@AllanVan - have a go with the binary I posted in this message: #2815 (comment)

@RafalSkolasinski

I was just wondering about the use case where the remote has a tar.gz file (containing files or folders) and you would like rclone copy ... to unarchive it, so that you get the tarball's contents locally straight away.

Do I understand correctly that this use case would also be covered by this issue?

@martinwang2002

What about tar files? Tar can append files to an existing archive, and that's already implemented by mholt/archiver.
https://github.com/mholt/archiver/blob/62ea3699423b5e2ac638af0c7dff408347e47777/tar.go#L106

@RickyDepop

RickyDepop commented Jun 11, 2024

Hi all,

could someone (maybe @ncw) please confirm that this feature is currently not available in rclone?

I'm particularly interested in the use case where we store compressed tar.gz files remotely on S3, and we stream and unarchive them onto the local filesystem without having to create a copy first.

Thank you.

@bert2002

Hi all,

could someone (maybe @ncw) please confirm that this feature is currently not available in rclone?

I'm particularly interested in the use case where we store compressed tar.gz files remotely on S3, and we stream and unarchive them onto the local filesystem without having to create a copy first.

Thank you.

I would be interested in this feature too. Very helpful for big files :)
