Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changes to support multiple files #1658

Merged
merged 1 commit into from
Sep 14, 2023
Merged

Conversation

vasudeva8
Copy link
Contributor

@vasudeva8 vasudeva8 commented Aug 2, 2023

This PR contains the changes for #1642.

Multiple files are supported and given operation is performed on all inputs, except with rebgzip.
Returns failure when multiple inputs are given with -I option.
Returns failure with rebgzip and index used together.
-b / -s options, which requires index file, work with single input when index is explicitly providing using -I.
-b / -s can work with multiple inputs if implicit indexes, i.e. <compressed filename.gzi>, are in use (i.e. no -I option
and all having their implicit index file along with it).

Corrected the error message with rebgzip invocation without -I option.

hfile.c/h, bgzf.c updated to add new methods which doesn't close the stdin/out, to support multi file processing.

hfile.c Outdated Show resolved Hide resolved
@jkbonfield
Copy link
Contributor

This feels overly complicated, but some of the changes are also down to improved error detection and identification of clashing options, such as region querying or rebgzip.

However the bulk of the functionality could be as simple as a do while loop on optind:

diff --git a/bgzip.c b/bgzip.c
index 589f79f6..fe6eacb0 100644
--- a/bgzip.c
+++ b/bgzip.c
@@ -188,6 +188,7 @@ int main(int argc, char **argv)
         fprintf(stderr, "[bgzip] Illegal region: [%ld, %ld]\n", start, end);
         return 1;
     }
+    do {
     if (compress == 1) {
         hFILE* f_src = NULL;
         char out_mode[3] = "w\0";
@@ -369,7 +370,6 @@ int main(int argc, char **argv)
             error("Input close failed\n");
         if (argc > optind && !pstdout && !keep) unlink(argv[optind]);
         free(buffer);
-        return 0;
     }
     else if ( reindex )
     {
@@ -401,7 +401,6 @@ int main(int argc, char **argv)
         }
 
         if ( bgzf_close(fp)<0 ) error("Close failed: Error %d\n",fp->errcode);
-        return 0;
     }
     else
     {
@@ -494,6 +493,7 @@ int main(int argc, char **argv)
 #ifdef _WIN32
         _setmode(f_dst, O_BINARY);
 #endif
+        long start_reg = start, end_reg = end;
         while (1) {
             if (end < 0) c = bgzf_read(fp, buffer, WINDOW_SIZE);
             else c = bgzf_read(fp, buffer, (end - start > WINDOW_SIZE)? WINDOW_SIZE:(end - start));
@@ -508,9 +508,12 @@ int main(int argc, char **argv)
             }
             if (end >= 0 && start >= end) break;
         }
+        start = start_reg;
+        end = end_reg;
         free(buffer);
         if (bgzf_close(fp) < 0) error("Close failed: Error %d\n",fp->errcode);
         if (argc > optind && !pstdout && !test && !keep) unlink(argv[optind]);
-        return 0;
     }
+    } while (++optind < argc);
+    return 0;
 }

Obviously that needs better handling of the rebgzip case and it's probably better to have a pair of do-while loops handling compress and decompress separately, but I'm not entirely sure the large scale changes are justified by this feature request. That said, some code restructuring isn't a bad idea given the length of main. It just makes it a little trickier to work out what's changed and whether all of the functionality has been kept.

Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename. (We can add an implicit "-" in the case of no args and stdin isn't a tty.) Opening it in main and having custom code for it feels a bit strange and very tricky to follow. What am I missing here?

Is it the ./bgzip -cd _a.gz _b.gz case where we wish to output two files to the same stream (which my naive do-while didn't handle I expect)? If so we can probably resolve this by only closing stdout hts streams at the end of the program, so we can concatenate outputs.

@jmarshall
Copy link
Member

Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename.

hopen_fd_stdinout() just creates an hFILE_fd with an fd of 0 or 1. Hence in = hopen("-", "r") … hclose(in) … in2 = hopen("-", "r") would fail because fd 0 is no longer open or no longer refers to the original stdin. This could be complexified by having it use dup() instead of literally STDIN_FILENO (0) or STDOUT_FILENO (1) so hclose() wouldn't close the only reference to the original stdin/stdout stream, but really this has caused no significant problems in all the years since it was implemented so far — largely because reading two BAM files from stdin or writing two BAM files to stdout just isn't a useful operation.

So if there are useful bgzip operations involving reading multiple streams from stdin (probably not) or writing multiple streams to stdout (maybe), they are currently going to have to do something special for stdin/stdout. Even if that is only not actually calling hclose() except the last time.

The code is now more complicated than I was expecting too. Perhaps all the casting could be avoided by separating operate() into separate functions for the different modes?

@vasudeva8 vasudeva8 force-pushed the bgzip_upd branch 3 times, most recently from c4d04bb to e3a614d Compare August 14, 2023 15:19
@vasudeva8
Copy link
Contributor Author

Updated to avoid the complexities, uses the do..while framework and uses dup to open/close stdin/out as required.
-I is supported with multiple inputs as well - as we do rebgzip!

@jmarshall
Copy link
Member

Specifically the stdin/stdout code feels inherently wrong and I don't entirely understand why it's not just dealt with automatically as any other filename.

Upon further reflection, I think stdin/stdout should just be dealt with by hts_open/hts_close automatically just as any other filename, and @vasudeva8 should be able to repeatedly hts_open and hts_close stdin and stdout just as for any other filename.

PR #1665 modifies hopen_fd_stdinout() to enable that. If that approach is accepted by the maintainers, this PR can be rebased onto that once merged, and all the dup() gymnastics can be removed — which should enable some simplification of the code.

@vasudeva8
Copy link
Contributor Author

Updated and uses new shared stdin/out hfile. Uses the do/while as earlier and explicitly closes stdout at the end.
Moved a few validations to ensure they are made before output file update.

Copy link
Contributor

@jkbonfield jkbonfield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to work bar one corner case.

Create two multi-block bgzip files:

head -10000 /usr/share/dict/words > _w1                                                             
tail -10000 /usr/share/dict/words > _w2                                                             
./bgzip -i -I _w1.gz.gzi -c _w1 > _w1.gz
./bgzip -i -I _w2.gz.gzi -c _w2 > _w2.gz
md5sum _w[12].gz.gzi
2f17308a50726f862258407527d50dad  _w1.gz.gzi
c47ffdd4c264f34200370048ae62ebbc  _w2.gz.gzi

Compare this to compression both together.

./bgzip -i -I _w12.gz.gzi -c _w1 _w2 > _w12.gz                                                      
md5sum _w12.gz.gzi                                                                                  
c47ffdd4c264f34200370048ae62ebbc  _w12.gz.gzi                                                       

The compression worked, with bgzip _w12.gz matching cat _w1 _w2, however the index file _w12.gz.gzi is the same as the index for the 2nd file.

Note we don't need -c and > file.gz for this to fail. Just compressing a whole bunch of files when using a named index (-I) will also fail with the same logic. This is just nonsensical. Technoically it's done the right thing - build an index for each file we asked for and write it to the named file, overwriting the previous index. That's a case of asking for something stupid gets you something stupid.

However the file concatenation to stdout where this could actually be used. As the concatenation works and we could then do bgzip -r _w12.gz to index it. Fixing it though is hard as our main loop opens and closes the output file each time, in this case stdout. We could get it to not close when using stdout, so we only have one file and the index then works, but it would make the code more complex and it also changes functionality - we're not getting the same output as concatenating _w1.gz _w2.gz together.

My personal preference here is the simplest solution which is simply to spot poor usage and complain. If someone later comes along with a clear use case showing this behaviour must work, then we can reconsider.

Ie:

diff --git a/bgzip.c b/bgzip.c
index 829ebe8a..4b85272e 100644
--- a/bgzip.c
+++ b/bgzip.c
@@ -199,6 +199,10 @@ int main(int argc, char **argv)
         fprintf(stderr, "[bgzip] Index file name expected with rebgzip.  See -I option.\n");
         return 1;
     }
+    if (index && index_fname && argc - optind > 1) {
+        fprintf(stderr, "[bgzip] Cannot specify index filename for more than one data file.\n");
+        return 1;
+    }
 
     do {
         isstdin = optind >= argc ? 1 : !strcmp("-", argv[optind]);          //using stdin or not?

(Edit: Vasudeva verbally pointed out rebgzip can apply the same index to multiple files, so this check needs to be explicitly for writing indices)

Other than that it passes my tests, and actually fixes a whole bunch of long-standanding bgzip bugs. Eg a lot of the checks for stdin weren't working if the user specified stdin as "-" (giving us "-.gz.gzi" filenames for example).

bgzip.1 Outdated Show resolved Hide resolved
bgzip.c Outdated Show resolved Hide resolved
@jkbonfield
Copy link
Contributor

Also, this sounded on paper like a trivial issue to resolve, but the myriad of interaction between all the command line arguments shows it's far trickier than I expected! Good going on navigating the minefield, especially fixing the long-standing issues caused by said minefield. :)

@vasudeva8
Copy link
Contributor Author

Updated to avoid explicit index file with multiple input files, during index and reindex.
These options are not considered during decompress/read operation.

Reimplemented to reduce complexities.
Update to avoid test failure in windows
Updated to work with shared stdin/out hfile
@jkbonfield jkbonfield merged commit 1f0375c into samtools:develop Sep 14, 2023
8 checks passed
@vasudeva8 vasudeva8 deleted the bgzip_upd branch September 15, 2023 10:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants