changes to support multiple files #1658
This feels overly complicated, but some of the changes are also down to improved error detection and identification of clashing options, such as region querying or rebgzip. However the bulk of the functionality could be as simple as a do while loop on optind:
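As a rough sketch of that loop shape (this is not the snippet originally posted here; `for_each_input`, `process_fn`, `tally_input` and `struct tally` are hypothetical names, and the real bgzip main does far more per file), a do-while over the remaining arguments might look like this. The do-while form is what lets the body run once even when no file names are given, in which case the input is treated as stdin:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical per-file callback: returns 0 on success, non-zero on failure. */
typedef int (*process_fn)(const char *fname, int is_stdin, void *ctx);

/* Hypothetical example callback that only counts what it was asked to process. */
struct tally { int files; int stdin_uses; };

static int tally_input(const char *fname, int is_stdin, void *ctx)
{
    struct tally *t = ctx;
    (void) fname;
    if (is_stdin) t->stdin_uses++;
    else t->files++;
    return 0;
}

/*
 * Visit every argument from first_arg (the value of optind after option
 * parsing) up to argc. With no file arguments the body still runs once
 * and reports stdin. Returns the number of inputs whose callback failed.
 */
static int for_each_input(int argc, char **argv, int first_arg,
                          process_fn process, void *ctx)
{
    int idx = first_arg;
    int failures = 0;

    assert(process != NULL);
    do {
        /* Same test as in the PR: no argument left, or an explicit "-". */
        int is_stdin = idx >= argc ? 1 : !strcmp("-", argv[idx]);
        const char *fname = is_stdin ? "-" : argv[idx];
        if (process(fname, is_stdin, ctx) != 0)
            failures++;
    } while (++idx < argc);

    return failures;
}
```

Called with `first_arg` equal to `argc`, the callback is invoked exactly once with `is_stdin` set; otherwise it is invoked once per remaining argument.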
Obviously that needs better handling of the rebgzip case and it's probably better to have a pair of do-while loops handling compress and decompress separately, but I'm not entirely sure the large scale changes are justified by this feature request. That said, some code restructuring isn't a bad idea given the length of main. It just makes it a little trickier to work out what's changed and whether all of the functionality has been kept. Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename. (We can add an implicit "-" in the case of no args and stdin isn't a tty.) Opening it in main and having custom code for it feels a bit strange and very tricky to follow. What am I missing here? Is it the
So if there are useful bgzip operations involving reading multiple streams from stdin (probably not) or writing multiple streams to stdout (maybe), they are currently going to have to do something special for stdin/stdout. Even if that is only not actually calling The code is now more complicated than I was expecting too. Perhaps all the casting could be avoided by separating
c4d04bb to e3a614d
Updated to avoid the complexities: uses the do..while framework and uses dup to open/close stdin/out as required.
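For readers unfamiliar with the dup approach, here is a minimal POSIX sketch of the idea, not the code from this PR: wrap a duplicate of the descriptor in a stream, so closing the stream leaves the original descriptor open for the next file. `write_via_dup` is a hypothetical helper, and the sketch uses stdio rather than hfile/BGZF purely to stay self-contained.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/*
 * Write text to fd through a temporary stdio stream built on a duplicate
 * of fd. fclose() flushes and closes only the duplicate, so fd itself
 * remains usable by later calls. Returns 0 on success, -1 on failure.
 */
static int write_via_dup(int fd, const char *text)
{
    assert(text != NULL);

    int copy = dup(fd);
    if (copy < 0)
        return -1;

    FILE *fp = fdopen(copy, "w");
    if (fp == NULL) {
        close(copy);
        return -1;
    }

    size_t len = strlen(text);
    int ok = fwrite(text, 1, len, fp) == len;
    if (fclose(fp) != 0)    /* closes the duplicate, not fd */
        ok = 0;

    return ok ? 0 : -1;
}
```

Calling `write_via_dup(STDOUT_FILENO, ...)` once per input file would emit each chunk in turn without ever closing the process's real stdout.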
0832d64 to 357c43f
Upon further reflection, I think stdin/stdout should just be dealt with by PR #1665 modifies
Updated to use the new shared stdin/out hfile. Uses the do/while loop as before and explicitly closes stdout at the end.
It seems to work bar one corner case.
Create two multi-block bgzip files:
head -10000 /usr/share/dict/words > _w1
tail -10000 /usr/share/dict/words > _w2
./bgzip -i -I _w1.gz.gzi -c _w1 > _w1.gz
./bgzip -i -I _w2.gz.gzi -c _w2 > _w2.gz
md5sum _w[12].gz.gzi
2f17308a50726f862258407527d50dad _w1.gz.gzi
c47ffdd4c264f34200370048ae62ebbc _w2.gz.gzi
Compare this to compressing both together.
./bgzip -i -I _w12.gz.gzi -c _w1 _w2 > _w12.gz
md5sum _w12.gz.gzi
c47ffdd4c264f34200370048ae62ebbc _w12.gz.gzi
The compression worked, with bgzip _w12.gz matching cat _w1 _w2, however the index file _w12.gz.gzi is the same as the index for the 2nd file.
Note we don't need -c and > file.gz for this to fail. Just compressing a whole bunch of files when using a named index (-I) will also fail with the same logic. This is just nonsensical. Technically it has done the right thing: built an index for each file we asked for and written it to the named file, overwriting the previous index. That's a case of asking for something stupid and getting something stupid.
However, file concatenation to stdout is where this could actually be used, as the concatenation works and we could then do bgzip -r _w12.gz to index it. Fixing it is hard though, as our main loop opens and closes the output file each time, in this case stdout. We could get it to not close when using stdout, so we only have one file and the index then works, but it would make the code more complex and it also changes functionality: we would no longer get the same output as concatenating _w1.gz and _w2.gz together.
My personal preference here is the simplest solution which is simply to spot poor usage and complain. If someone later comes along with a clear use case showing this behaviour must work, then we can reconsider.
Ie:
diff --git a/bgzip.c b/bgzip.c
index 829ebe8a..4b85272e 100644
--- a/bgzip.c
+++ b/bgzip.c
@@ -199,6 +199,10 @@ int main(int argc, char **argv)
         fprintf(stderr, "[bgzip] Index file name expected with rebgzip. See -I option.\n");
         return 1;
     }
+    if (index && index_fname && argc - optind > 1) {
+        fprintf(stderr, "[bgzip] Cannot specify index filename for more than one data file.\n");
+        return 1;
+    }
     do {
         isstdin = optind >= argc ? 1 : !strcmp("-", argv[optind]); //using stdin or not?
(Edit: Vasudeva verbally pointed out rebgzip can apply the same index to multiple files, so this check needs to be explicitly for writing indices)
Other than that it passes my tests, and actually fixes a whole bunch of long-standing bgzip bugs. E.g. a lot of the checks for stdin weren't working if the user specified stdin as "-" (giving us "-.gz.gzi" filenames for example).
Also, this sounded on paper like a trivial issue to resolve, but the myriad interactions between all the command line arguments show it's far trickier than I expected! Good going on navigating the minefield, especially fixing the long-standing issues caused by said minefield. :)
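To illustrate the "-" problem, a guard along these lines keeps an implicit index name from being derived for stdin/stdout. This is only a sketch under my own assumptions: `default_index_name` is a hypothetical helper, not a function in bgzip.c, and it takes the compressed file name as input.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Build the implicit index name "<out_name>.gzi" into buf.
 * Returns 0 on success, or -1 when out_name is NULL or "-" (there is no
 * sensible implicit name for a stream) or when buf is too small.
 */
static int default_index_name(const char *out_name, char *buf, size_t buflen)
{
    assert(buf != NULL && buflen > 0);

    if (out_name == NULL || strcmp(out_name, "-") == 0)
        return -1;

    int n = snprintf(buf, buflen, "%s.gzi", out_name);
    if (n < 0 || (size_t) n >= buflen)
        return -1;  /* encoding error or truncated name */

    return 0;
}
```

With this check, `default_index_name("_w1.gz", ...)` yields `_w1.gz.gzi`, while `"-"` is rejected instead of producing a file literally named after the dash.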
eac0ec4 to 504c532
Updated to reject an explicit index file with multiple input files, during index and reindex.
Reimplemented to reduce complexities.
Update to avoid test failure in windows.
Updated to work with shared stdin/out hfile.
504c532 to ba4823e
This PR contains the changes for #1642.
Multiple files are supported and the given operation is performed on all inputs, except with rebgzip.
Returns failure when multiple inputs are given with the -I option.
Returns failure when rebgzip and index are used together.
The -b / -s options, which require an index file, work with a single input when the index is explicitly provided using -I.
-b / -s can work with multiple inputs if implicit indexes, i.e. <compressed filename>.gzi, are in use (i.e. no -I option, and every input has its implicit index file alongside it).
Corrected the error message for rebgzip invocation without the -I option.
hfile.c/h and bgzf.c updated to add new methods which don't close stdin/stdout, to support multi-file processing.
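The option rules above can be summarised as a small validation routine. This is one reading of the description, written as a sketch rather than taken from bgzip.c: `struct opts`, `enum opt_check` and `check_options` are hypothetical names, and the order in which the conflicts are reported is my own choice.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of the parsed command line. */
struct opts {
    int n_inputs;             /* number of file arguments */
    int index;                /* index creation requested */
    int rebgzip;              /* rebgzip requested */
    const char *index_fname;  /* explicit -I name, or NULL for implicit */
};

enum opt_check {
    OPTS_OK = 0,
    ERR_REBGZIP_NEEDS_INDEX_NAME,  /* rebgzip without -I */
    ERR_REBGZIP_WITH_INDEX,        /* rebgzip and index used together */
    ERR_MULTI_WITH_INDEX_NAME      /* several inputs with one -I name */
};

/* Report the first conflicting combination, or OPTS_OK if none is found. */
static enum opt_check check_options(const struct opts *o)
{
    assert(o != NULL);

    if (o->rebgzip && o->index_fname == NULL)
        return ERR_REBGZIP_NEEDS_INDEX_NAME;
    if (o->rebgzip && o->index)
        return ERR_REBGZIP_WITH_INDEX;
    if (o->n_inputs > 1 && o->index_fname != NULL)
        return ERR_MULTI_WITH_INDEX_NAME;

    return OPTS_OK;
}
```

Multiple inputs with implicit index names pass the check, which matches the -b / -s case described above where each input has its own .gzi file next to it.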