-
Notifications
You must be signed in to change notification settings - Fork 245
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Exception is thrown when loading the BAM from a URL #1507
Comments
I've confirmed this and am working on a simple test case. Here is how IGV instantiates reader. It works if the bam and csi are local, fails if loading by URL
I've put test data here
|
@SergejN @jrobinso Thank you for reporting this and for the help reproducing it. I'm not surprised that there is a bad interaction somewhere with URLs and CSI. Both are sort of specially handled and there must be a bad interaction, probably in the SamFactory open method. I suspect it will be easy enough to track down now that you found the problem. |
Ok. I found the problem. It's because it's loading the index as a stream in this case. We never updated that path to check if it's bai or csi explicitly and it just guesses that it's probably bai. You can see the problem here:
|
great job @lbergelson ! I hope it's easy to fix |
@lbergelson I've confirmed that cram indexes (.crai) work by URL. Not that it was a question, but thought it worth checking. There is a related potential problem elsewhere where the file extension is used to determine index type. Actually not even the extension necessarily but the end of the string (urls can have parameters). This is problematic for certain URLs that are basically mangled hashes, such as Amazon signed URLs and Google Drive urls. Perhaps we can solve both problems by reading the first few bytes of an index to check its magic number. I can look into that if you want to go that direction. Alternatively, and perhaps less disruptively, we could add a parameter to tell the htsjdk explicitly what the index type is. |
In this case we can scan the first bytes and determine if it's csi/bai. That's probably what I'll do here. In the future I want to let you specify the type explicitly but that's a longer term change I think. |
@whaleberg @lbergelson @SergejN Just checking, where are we with this? |
bump |
any news? |
Hello, i am having the same issue and interested on its resolution as well, any idea when this will be solved ? |
Hi, wanted to update. I intend to patch this but I haven't gotten a chance. I've been running at much reduced capacity lately due to the pandemic and working at home. Sorry that it's taking so long! |
Any updates ? |
Hello |
@lbergelson @whaleberg Can I suggest something like this as a solution (in BAMFileReader)? The idea is to peek at the stream to see if its gzipped, if it is guess "CSI", which is probably correct. A more rigorous test would check the CSI magic number, but to do that you first have to see if its gzipped, and gunzip the first block. I think it would suffice, and be a big improvement, to just check for gzip and return "CSI" if it is. We aren't validating BAI magic numbers here. Relying on file index from URLs is problematic, and will for sure break for certain providers (e.g. Google Drive and potentially signed Amazon URLs).
|
@lbergelson @whaleberg Actually just determining type is not enough, getIndex() assumes there is an mIndexFile. Another proposal, provide a constructor or factory function that I can pass a BAMIndex to, and IGV will deal with determining the type and reading the index. Perhaps this exists, but I don't see it. I'm currently using SamInputResource methods to create readers in this class https://github.com/igvteam/igv/blob/master/src/main/java/org/broad/igv/sam/reader/SamReaderPool.java |
@lbergelson @whaleberg OK, 1 last attempt, see below. If not a file (mIndexFile == null) I do the magic bytes test on the stream. Actually for CSI I just check if its gzipped, so its a guess but a good one. Then when creating the index I also check if its a stream or file, in the current implementation it just assume that a type of "null" is a stream. This looks like it should work, but it fails later in CSIIndex checking the magic number, because it just reads the bytes from the supplied stream and does not gunzip (or bgunzip) first. Does that constructor assume a gzip stream? The supplied SeekableStream could be wrapped in a GUnzipStream but at this point I'm not sure how its supposed to work. Sorry for the spam, hope this is helpful.
|
thank you for all of this @jrobinso |
I have a fix that seems to be working. Needs some more tests. Also, I need to follow up since I think there might be some leaks in the CSI code - needs a little more investigation. |
Description of the issue:
Dear htsjdk team,
I encountered a strange bug when trying to display a BAM file in IGV. I work with the Mexican axolotl (genome size 32Gb). Therefore, I have to use CSI indices for the BAM files.
In IGV, if I specify the URL to the BAM file and provide an explicit URL to the CSI index, an exception is thrown in
htsjdk/src/main/java/htsjdk/samtools/AbstractBAMFileIndex.java
Line 372 in edb8c60
I tried my best to check what is going on there. It seems that the path/url is correct in IGV itself. It is also correct in htsjdk, because the error message (see screenshot) has the correct path in it.
I checked if the file is corrupted. If I download the files from the server and open them locally, everything works as expected. Next, I tested if the error occurs generally (suggesting a mis-configuration of the server) or is related to the CSI file. I extracted the reads from a locus chr10p:1-500000000 to make sure I can use the BAI index. With the BAI index everything works just fine.
From the code above it seems that htsjdk tries to load the index as BAI even if it's a CSI index.
Your environment:
Steps to reproduce
Start IGV 2.8 and load the reference and the BAM files locally and through the URL. The reads will be displayed if the BAM file is loaded locally and fail to be displayed if the file is fetched from the URL. Since the path seems to be correct in the error message, I guess the error occurs within htsjdk. However, to be on the safe side, I add the link to the IGV issue (igvteam/igv#843) I opened yesterday and also add @jrobinso to this issue.
Unfortunately, I cannot provide a path to our server since it is not accessible from the internet. But I put the relevant data to Dropbox. https://www.dropbox.com/sh/dsgj4fe0gpej4jm/AADWmPPzQnOGkSnL6SUx3Ceja?dl=0
Important note: yesterday, when I tried to get it to run, I checked if it worked if I'd rename the CSI file to BAI (it does with the UCSC browser, though, therefore, it wasn't a complete stupid assumption). It did not. However, when I created the test dataset for IGV bug report, I forgot to rename the file back to CSI. So, what you have in the archive is a CSI file with the wrong extension. It's not the reason for the bug reported here.
Expected behaviour
CSI index should be loaded correctly
Actual behaviour
htskdk interprets the index as BAI
thank you!
Best,
Sergej
The text was updated successfully, but these errors were encountered: