
add tarball merger #853

Merged: 15 commits merged from ct-add-tarball-merger into master on Jul 23, 2018

Conversation

tomkinsc
Member

Add a new utility function that merges separate tarballs into a single tarball; data can be piped in and/or out, and the contents can optionally be extracted to disk during the repack.

@dpark01
Member

dpark01 commented Jul 20, 2018

I'll try to look at this more tomorrow... is this a prerequisite for breaking out the tar-repack post-upload step into its own dx applet?

Part of me wonders if we should bite the bullet soon and switch our compression/decompression stuff to blosc. Maybe a separate PR. But we could shed all the binaries for pigz, bzip2, lz4 and just add the python library and gain a few other formats like zstd.

I also wonder whether we could implement a streaming tarcat standalone method that avoids the unpacking to disk by withholding the EOF zero markers for the first N-1 input streams... boggles my mind that no one has implemented it yet (GNU tar's --concatenate option modifies an input posix file; it doesn't work on pipes like cat does).

BTW, if you add a new top-level python file, there are various places it needs to be added, including the coveralls/py.test invocation scripts, the readthedocs/sphinx code, and possibly some other places I'm forgetting at the moment.

@tomkinsc
Member Author

tomkinsc commented Jul 20, 2018

The intent was to have tar repack functionality as part of viral-ngs, to support either a separate dx applet or repack-capable demux. @mlin now has a branch with a WIP yml-specified applet to perform the merge operation, so this may be a bit redundant, but including it as part of demux would save an extra download of the packed tarball onto the demux instance.

Switching (de)compression to blosc should probably be a separate PR since we call the various binaries in various places. Something I'm not sure of is whether some of the magic of blosc relies on compile-time optimizations targeting the instruction set extensions/CPU cache available, or if it compiles to include multiple code paths. We may lose some optimizations if it is installed from a source like conda. The best thing is probably just to try it and see if real-world performance is improved over what we have now.

The function included in this PR has a few different code paths. By default it acts as a streaming tarcat and file data never touches the disk; it reads from the untarred stream, buffered by Python's tarfile, and writes directly to the tarfile output stream (TarFile.extractfile() can return a stream file object). That's for the simple case of a repack.

A flag, --extractToDiskPath, is available for cases where we want to read the input tarballs once and do two things: extract the files to disk, and repack them. That's intended for use on a demux instance. In that case, the default read function is wrapped by a custom class that provides a file-like read() and writes out bytes to a file before returning them to be written to the output stream (the wrapper sets file attributes on close). The sequential nature of tarfiles is helpful here, and it's easy because the stream object of Python's tarfile pads 512-byte blocks and 20-block records as necessary when writing, so we don't need to do the bookkeeping ourselves.

In something of a hacky shortcut, non-file tarball members (directories, symlinks, etc.) are written to disk and then added from disk in the case where we are writing the output to disk anyway, so we do not need a complicated read() diverter for non-file data to create directories, etc. on disk. It adds a disk round trip, but the members are small so it's quick, and it avoids duplicating the (internal, private) logic within Python's tarfile to inspect and create the various member types with appropriate attributes. For the simple case where extracted output on disk is not desired, the small members are streamed along with everything else.
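For illustration, a minimal sketch of that diskless streaming path using only the public tarfile API might look like the following (a hypothetical stand-alone helper, not the actual function in this PR; compression is assumed to be handled by external (de)compressor processes on either side):

```python
import tarfile

def stream_repack(in_streams, out_stream):
    """Sketch: copy members from several input tar streams into one output
    tar stream without touching disk. Hypothetical helper, not this PR's code."""
    with tarfile.open(fileobj=out_stream, mode="w|") as tar_out:
        for in_stream in in_streams:
            with tarfile.open(fileobj=in_stream, mode="r|*") as tar_in:
                fileinfo = tar_in.next()
                while fileinfo is not None:
                    # extractfile() returns a file-like stream for regular files;
                    # non-file members (dirs, symlinks, ...) carry no data payload
                    fileobj = tar_in.extractfile(fileinfo) if fileinfo.isfile() else None
                    tar_out.addfile(fileinfo, fileobj)
                    fileinfo = tar_in.next()
```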

I had the same thought as you about concatenating tarballs in the old-school tape-drive way, by stripping the final two 512-byte zero blocks off each of the first N-1 tarballs. That would be fairly straightforward if the files were not compressed, since we could know the size in advance and pipe to dd (or even head with a negative offset) to truncate the tarfile content. Compression complicates it because the original size of the tarball can only be reliably determined after it is decompressed. The gzip format stores the length at the end of the file as four bytes encoding a 32-bit unsigned integer (the uncompressed size modulo 2^32), making it impossible to get the uncompressed size for streamed files without a second pass, or for files >4 GB in size. We can use pigz -l to get the decompressed size, but it seems to rely on the same trailer field, so it fails for files >4 GB and needs to read the entire file for streams.
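To make the gzip trailer limitation concrete, the size field can be read off the end of a seekable file like this (sketch; the file name is hypothetical):

```python
import struct

def gzip_isize(path):
    """Return the ISIZE field from a gzip trailer: the uncompressed size
    modulo 2**32, stored as a little-endian 32-bit unsigned integer.
    Only works on seekable files; ambiguous for data larger than 4 GB."""
    with open(path, "rb") as f:
        f.seek(-4, 2)  # 4 bytes before the end of the file
        return struct.unpack("<I", f.read(4))[0]

print(gzip_isize("flowcell.tar.gz"))  # hypothetical file
```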

We can, of course, simply cat the tar.gz files together if the resulting file is always read with tar -i to ignore zero blocks, but that seems dangerous since the consumer would have to know to apply -i or else the output would be truncated. The function in this PR tries to be a reasonable compromise that will work with streamed data as input and/or output.
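For example, reading such a naively concatenated archive requires the same escape hatch in Python's tarfile that tar -i provides on the command line (file names hypothetical):

```python
import tarfile

# ignore_zeros=True tolerates the interior EOF blocks left by simple concatenation
with tarfile.open("concatenated.tar.gz", mode="r:*", ignore_zeros=True) as tar_in:
    tar_in.extractall("extracted/")
```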

Something I'm not sure of yet is how dxWDL handles an Array[File]+ input with parameter_meta set to stream. Does it allow the selection of multiple files and stream all of them?

@dpark01
Member

dpark01 commented Jul 20, 2018

Fascinating. A few thoughts.

I think some of the speed magic of blosc isn't really about blosc, but the underlying algorithms that it implements; a lot of the newer ones are computationally simple enough to saturate the memory bandwidth of any machine. Agreed, though, that we should keep it separate.

dxWDL does not currently handle Array[File] streaming at all. Ohad had said there would be some implementation difficulty around that; personally, I think the non-localized / NIO approach that WDL 1.0 seems to advocate would require some greater rethinking of this anyway.

Anyway, the short of it is that you're saying the default execution behavior of this is to repack tarballs while both streaming the inputs and outputs and avoiding any disk I/O. That's great, and quite ideal actually! That should significantly speed up repacking of large tarballs (by avoiding the disk I/O). In my mind, a hacked-together tarcat would have always been streaming: instead of calculating file sizes and offsets, I would've just introduced a 1 kB buffer and dropped any EOF blocks from the stream except for the last one. But better to use an established Python library.
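For what it's worth, that buffer-and-drop approach on raw (uncompressed) tar streams might look roughly like this (hypothetical sketch; it relies on tar data being 512-byte-block aligned):

```python
BLOCK = 512
ZERO_BLOCK = b"\0" * BLOCK

def naive_tarcat(in_streams, out_stream):
    """Sketch: concatenate uncompressed tar streams, holding back all-zero
    blocks and emitting them only if more data follows, so the EOF markers
    of all but the last archive are dropped. Not part of this PR."""
    for stream in in_streams:
        pending = b""
        while True:
            block = stream.read(BLOCK)
            if not block:
                break  # end of this input; drop any held-back EOF blocks
            if block == ZERO_BLOCK:
                pending += block           # possibly an EOF marker; hold it back
            else:
                out_stream.write(pending)  # the zeros were interior data; flush them
                pending = b""
                out_stream.write(block)
    out_stream.write(ZERO_BLOCK * 2)       # terminate the combined archive
```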

Of the different code paths, how many of them are tested? Can you test both the --extractToDiskPath and default diskless behavior?

The advantage of @mlin's standalone applet for repack prior to demux is that the repacked tarball always emits first, regardless of whether demux fails, and the standalone applet doesn't need to waste 5 minutes pulling the viral-ngs docker image. The downside is that it hits disk. Since DNAnexus seems to rely exclusively on non-EBS-backed EC2 instances, the speed hit isn't too big of a deal, but it does mean that the instance sizes have to scale with the data size (since local disk has to be big enough for the uncompressed flowcell, which wouldn't be true with your repacker). I guess the best of both worlds would be to separate out your Python tarfile-based repacker into a minimalist Docker image that pulls faster...

@tomkinsc
Member Author

Ok, so it sounds like we'll be using @mlin's implementation for several reasons. I can add additional tests for this PR if you think it is useful enough to be included.

@dpark01
Member

dpark01 commented Jul 20, 2018

Yeah, I think we'd want this anyway; it does look quite useful.

util/file.py Outdated
        raise IOError("An input file of unknown type was provided: %s" % filepath)
    return return_obj

def create_containing_dirs(path):
Contributor

Is this not a mkdir_p style invocation?
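(For context, a mkdir_p-style call in the standard library is roughly the following; sketch only, and exist_ok requires Python 3.2+.)

```python
import os

# create the directory and any missing parents, tolerating pre-existing dirs
os.makedirs(path, exist_ok=True)
```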

util/file.py Outdated
    if out_compressed_tarball != "-":
        out_compress_ps = subprocess.Popen(choose_compressor(out_compressed_tarball)["compress_cmd"], stdout=None if out_compressed_tarball == "-" else outfile, stdin=subprocess.PIPE)
    else:
        assert out_compressed_tarball != '-' or pipe_hint, "cannot autodetect compression for stdout output unless pipeHint provided"
Contributor

Isn't out_compressed_tarball != '-' just checked?

Member

Also may want to avoid using assert outside of unit test code.
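For example, an explicit check would read roughly like this (sketch; asserts are stripped when Python runs with -O):

```python
if out_compressed_tarball == "-" and not pipe_hint:
    raise ValueError("cannot autodetect compression for stdout output unless pipeHint is provided")
```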

util/file.py Outdated
    if not os.path.exists(path) and len(path):
        os.mkdir(path)

    class FileDiverter(object):
Contributor

I'm not a huge fan of a class inside the function. This function is already very long.

Member Author

Yes, I agree, it feels ugly, but it's also not used beyond here, and it relies on the objects possessing TarInfo attributes, so its use is quite internal to this function.

util/file.py Outdated

    def __del__(self):
        self.written_mirror_file.flush()
        self.written_mirror_file.close()
Contributor

close calls flush

util/file.py Outdated

    fileinfo = tar_in.next()
    while fileinfo is not None:
        if extract_to_disk_path:
Contributor

Break out into subfunction?

util/file.py Outdated

    if avoid_disk_roundtrip:
        fileobj = tar_in.extractfile(fileinfo)
        #tar_out.addfile(fileinfo)
Contributor

delete

util/file.py Outdated
        out_compress_ps = subprocess.Popen(choose_compressor(out_compressed_tarball)["compress_cmd"], stdout=None if out_compressed_tarball == "-" else outfile, stdin=subprocess.PIPE)
    else:
        assert out_compressed_tarball != '-' or pipe_hint, "cannot autodetect compression for stdout output unless pipeHint provided"
        out_compress_ps = subprocess.Popen(choose_compressor(pipe_hint)["compress_cmd"], stdout=None if out_compressed_tarball == "-" else outfile, stdin=subprocess.PIPE)
Contributor

DRY this because only the choose_compressor call is different
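e.g., something along these lines (sketch of the suggested refactor, not the final code):

```python
# choose the string used for compressor detection once, then build the Popen call once
compressor_hint = out_compressed_tarball if out_compressed_tarball != "-" else pipe_hint
assert compressor_hint, "cannot autodetect compression for stdout output unless pipeHint provided"
out_compress_ps = subprocess.Popen(
    choose_compressor(compressor_hint)["compress_cmd"],
    stdout=None if out_compressed_tarball == "-" else outfile,
    stdin=subprocess.PIPE,
)
```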

file_utils.py Outdated
                        help='If specified, the tar contents will also be extracted to a local directory.')
    parser.add_argument('--pipeHint',
                        dest="pipe_hint",
                        default=".gz",
Contributor

Maybe a better UI is not to have the leading .

Member Author

Agreed, I was just trying to be consistent with util.file.extract_tarball() (which we can also change).

Member

To be clear, the pipeHint used in other places is actually meant to be any kind of file path or URI even, and the logic that matches on it isn't looking for strings that match .gz, but rather it's doing string.endswith() calls on it. The idea is that you could just supply a hint or you could be lazy and supply the whole filename, bucket path, or whatever.
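In other words, the matching works roughly like this (illustrative sketch, not the exact code in util.file; the command lists are assumptions):

```python
def choose_compressor(hint_or_path):
    # the hint can be a bare extension (".gz"), a filename, or a full URI;
    # matching is done with str.endswith(), so all of these select the same codec
    if hint_or_path.endswith(".gz"):
        return {"compress_cmd": ["pigz", "-c"], "decompress_cmd": ["pigz", "-dc"]}
    elif hint_or_path.endswith(".bz2"):
        return {"compress_cmd": ["bzip2", "-c"], "decompress_cmd": ["bzip2", "-dc"]}
    elif hint_or_path.endswith(".lz4"):
        return {"compress_cmd": ["lz4", "-c"], "decompress_cmd": ["lz4", "-dc"]}
    raise IOError("An input file of unknown type was provided: %s" % hint_or_path)

# ".gz", "flowcell.tar.gz", and "s3://bucket/run/flowcell.tar.gz" all select gzip
```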

file_utils.py Outdated
        'out_tarball',
        help='''output tarball (*.tar.gz|*.tar.lz4|*.tar.bz2|-)
                Note: if "-" is used, a gzip-compressed tarball
                will be written to stdout''')
Contributor

Add some comment about how output compression is inferred by the file extension.


        assert_equal_contents(self, inf, outf)

    def test_merge_with_extract(self):
Contributor

Test avoid_roundtrip = False as well?

@tomkinsc
Member Author

Thanks for the code review!

@tomkinsc merged commit 54c912b into master on Jul 23, 2018
@tomkinsc deleted the ct-add-tarball-merger branch on July 23, 2018 at 15:18
tomkinsc added a commit to broadinstitute/viral-core that referenced this pull request Jan 28, 2021
in tarball unpacking code, allow concatenated tarballs. Owing to tar's history as a way to create Tape ARchive backups, tar files can be joined by being concatenated together. The final block is padded with zeros though (indicating EOF), which can cause tar to terminate prematurely when concatenated tarballs are being unpacked unless it is told to tolerate these early stops. This adds the `--ignore-zeros` flag to make tarball extraction more permissive. Note: this applies only to uncompressed tarballs (including concatenated tarballs within compressed archives). Our tarball repacking code already tolerates such tarballs; background info here: broadinstitute/viral-ngs#853 (comment)