Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception is thrown when loading the BAM from a URL #1507

Closed
SergejN opened this issue Sep 2, 2020 · 18 comments · Fixed by #1595
Closed

Exception is thrown when loading the BAM from a URL #1507

SergejN opened this issue Sep 2, 2020 · 18 comments · Fixed by #1595

Comments

@SergejN
Copy link

SergejN commented Sep 2, 2020

Description of the issue:

Dear htsjdk team,

I encountered a strange bug when trying to display a BAM file in IGV. I work with the Mexican axolotl (genome size 32Gb). Therefore, I have to use CSI indices for the BAM files.
In IGV, if I specify the URL to the BAM file and provide an explicit URL to the CSI index, an exception is thrown in

protected void verifyIndexMagicNumber(final String sourceName) {
        // Verify the magic number.
        seek(0);
        final byte[] buffer = new byte[4];
        readBytes(buffer);
        if (!Arrays.equals(buffer, BAMFileConstants.BAI_INDEX_MAGIC)) {
            throw new RuntimeIOException("Invalid file header in BAM index " + sourceName +
                    ": " + new String(buffer));
        }
    }

protected void verifyIndexMagicNumber(final String sourceName) {

I tried my best to check what is going on there. It seems that the path/url is correct in IGV itself. It is also correct in htsjdk, because the error message (see screenshot) has the correct path in it.
image

I checked if the file is corrupted. If I download the files from the server and open them locally, everything works as expected. Next, I tested if the error occurs generally (suggesting a mis-configuration of the server) or is related to the CSI file. I extracted the reads from a locus chr10p:1-500000000 to make sure I can use the BAI index. With the BAI index everything works just fine.
From the code above it seems that htsjdk tries to load the index as BAI even if it's a CSI index.

Your environment:

  • version of htsjdk: 2.23.0 (although IGV uses an older version of the library, I replaced it in my custom IGV installation)
  • version of java: the one that is shipped with IGV - OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10)
  • which OS: MacOS 10.14.6

Steps to reproduce

Start IGV 2.8 and load the reference and the BAM files locally and through the URL. The reads will be displayed if the BAM file is loaded locally and fail to be displayed if the file is fetched from the URL. Since the path seems to be correct in the error message, I guess the error occurs within htsjdk. However, to be on the safe side, I add the link to the IGV issue (igvteam/igv#843) I opened yesterday and also add @jrobinso to this issue.
Unfortunately, I cannot provide a path to our server since it is not accessible from the internet. But I put the relevant data to Dropbox. https://www.dropbox.com/sh/dsgj4fe0gpej4jm/AADWmPPzQnOGkSnL6SUx3Ceja?dl=0

Important note: yesterday, when I tried to get it to run, I checked if it worked if I'd rename the CSI file to BAI (it does with the UCSC browser, though, therefore, it wasn't a complete stupid assumption). It did not. However, when I created the test dataset for IGV bug report, I forgot to rename the file back to CSI. So, what you have in the archive is a CSI file with the wrong extension. It's not the reason for the bug reported here.

Expected behaviour

CSI index should be loaded correctly

Actual behaviour

htskdk interprets the index as BAI

thank you!
Best,
Sergej

@SergejN SergejN changed the title CSI index fails when loading the BAM throiugh Exception is thrown when loading the BAM from a URL Sep 2, 2020
@jrobinso
Copy link
Contributor

jrobinso commented Sep 2, 2020

I've confirmed this and am working on a simple test case. Here is how IGV instantiates reader. It works if the bam and csi are local, fails if loading by URL

        final SamReaderFactory factory = SamReaderFactory.makeDefault().
                referenceSource(new IGVReferenceSource()).
                validationStringency(ValidationStringency.SILENT);
        SamInputResource resource = SamInputResource.of(IGVSeekableStreamFactory.getInstance().getStreamFor(url));
        SeekableStream indexStream = IGVSeekableStreamFactory.getInstance().getStreamFor(HttpUtils.createURL(indexPath));
        resource = resource.index(indexStream);
        return factory.open(resource);

I've put test data here

https://s3.amazonaws.com/igv.org.test/csi_test/chr10p.bam
https://s3.amazonaws.com/igv.org.test/csi_test/chr10p.bam.csi

@lbergelson
Copy link
Member

@SergejN @jrobinso Thank you for reporting this and for the help reproducing it. I'm not surprised that there is a bad interaction somewhere with URLs and CSI. Both are sort of specially handled and there must be a bad interaction, probably in the SamFactory open method. I suspect it will be easy enough to track down now that you found the problem.

@lbergelson
Copy link
Member

Ok. I found the problem. It's because it's loading the index as a stream in this case. We never updated that path to check if it's bai or csi explicitly and it just guesses that it's probably bai.

You can see the problem here:

   if (samIndex == null) {
                mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary())
                        : new DiskBasedBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary());
            } else if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping)
                            : new DiskBasedBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping);
            } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexFile, mEnableIndexMemoryMapping, getFileHeader().getSequenceDictionary());
            } else {
                throw new SAMFormatException("Unsupported BAM index file: " + mIndexFile.getName());
            }

@SergejN
Copy link
Author

SergejN commented Sep 2, 2020

great job @lbergelson ! I hope it's easy to fix

@jrobinso
Copy link
Contributor

jrobinso commented Sep 2, 2020

@lbergelson I've confirmed that cram indexes (.crai) work by URL. Not that it was a question, but thought it worth checking.

There is a related potential problem elsewhere where the file extension is used to determine index type. Actually not even the extension necessarily but the end of the string (urls can have parameters). This is problematic for certain URLs that are basically mangled hashes, such as Amazon signed URLs and Google Drive urls. Perhaps we can solve both problems by reading the first few bytes of an index to check its magic number. I can look into that if you want to go that direction. Alternatively, and perhaps less disruptively, we could add a parameter to tell the htsjdk explicitly what the index type is.

@whaleberg
Copy link

In this case we can scan the first bytes and determine if it's csi/bai. That's probably what I'll do here. In the future I want to let you specify the type explicitly but that's a longer term change I think.

@jrobinso
Copy link
Contributor

@whaleberg @lbergelson @SergejN Just checking, where are we with this?

@jrobinso
Copy link
Contributor

jrobinso commented Dec 5, 2020

bump

@SergejN
Copy link
Author

SergejN commented Dec 5, 2020

any news?

@tougai
Copy link

tougai commented Dec 10, 2020

Hello, i am having the same issue and interested on its resolution as well, any idea when this will be solved ?

@lbergelson
Copy link
Member

Hi, wanted to update. I intend to patch this but I haven't gotten a chance. I've been running at much reduced capacity lately due to the pandemic and working at home. Sorry that it's taking so long!

@tougai
Copy link

tougai commented Jan 22, 2021

Any updates ?

@FredericBGA
Copy link

Hello
Any news regarding this issue?
The last version on IGV seems to be still unable to open a BAM and its .csi index from an URL.
igvteam/igv#843
Thank you.

@jrobinso
Copy link
Contributor

jrobinso commented Sep 9, 2021

@lbergelson @whaleberg Can I suggest something like this as a solution (in BAMFileReader)? The idea is to peek at the stream to see if its gzipped, if it is guess "CSI", which is probably correct. A more rigorous test would check the CSI magic number, but to do that you first have to see if its gzipped, and gunzip the first block. I think it would suffice, and be a big improvement, to just check for gzip and return "CSI" if it is. We aren't validating BAI magic numbers here.

Relying on file index from URLs is problematic, and will for sure break for certain providers (e.g. Google Drive and potentially signed Amazon URLs).


    public SamIndexes getIndexType() {
        if (mIndexFile != null) {
             ...
        } 
           else if(mIndexStream != null) {
                  // Read the first 2 bytes from mIndexStream to see if its gzipped, if it is its probably CSI (and is definitely not BAI).
        }
 
        return null;
    }

@jrobinso
Copy link
Contributor

jrobinso commented Sep 9, 2021

@lbergelson @whaleberg Actually just determining type is not enough, getIndex() assumes there is an mIndexFile.

Another proposal, provide a constructor or factory function that I can pass a BAMIndex to, and IGV will deal with determining the type and reading the index. Perhaps this exists, but I don't see it. I'm currently using SamInputResource methods to create readers in this class https://github.com/igvteam/igv/blob/master/src/main/java/org/broad/igv/sam/reader/SamReaderPool.java

@jrobinso
Copy link
Contributor

jrobinso commented Sep 9, 2021

@lbergelson @whaleberg OK, 1 last attempt, see below. If not a file (mIndexFile == null) I do the magic bytes test on the stream. Actually for CSI I just check if its gzipped, so its a guess but a good one. Then when creating the index I also check if its a stream or file, in the current implementation it just assume that a type of "null" is a stream. This looks like it should work, but it fails later in CSIIndex checking the magic number, because it just reads the bytes from the supplied stream and does not gunzip (or bgunzip) first. Does that constructor assume a gzip stream? The supplied SeekableStream could be wrapped in a GUnzipStream but at this point I'm not sure how its supposed to work.

Sorry for the spam, hope this is helpful.


    /**
     * Retrieves the index for the given file type.  Ensure that the index is of the specified type.
     * @return An index of the given type.
     */
    @Override
    public BAMIndex getIndex() {
        if(!hasIndex())
            throw new SAMException("No index is available for this BAM file.");
        if(mIndex == null) {
            SamIndexes samIndex = getIndexType();
            if(mIndexFile != null) {
                if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping)
                            : new DiskBasedBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping);
                } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexFile, mEnableIndexMemoryMapping, getFileHeader().getSequenceDictionary());
                } else {
                    throw new SAMFormatException("Unsupported BAM index file: " + mIndexFile.getName());
                }
            } else if(mIndexStream != null) {
                if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = new CachingBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary());
                } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexStream,  getFileHeader().getSequenceDictionary());
                } else {
                    throw new SAMFormatException("Unsupported BAM index stream ");
                }
            }
        }

        return mIndex;
    }

    /**
     * Return the type of the BAM index, BAI or CSI.
     * @return one of {@link SamIndexes#BAI} or {@link SamIndexes#CSI} or null
     */
    public SamIndexes getIndexType() {
        if (mIndexFile != null) {
            if (mIndexFile.getName().toLowerCase().endsWith(FileExtensions.BAI_INDEX)) {
                return SamIndexes.BAI;
            } else if (mIndexFile.getName().toLowerCase().endsWith(FileExtensions.CSI)) {
                return SamIndexes.CSI;
            }

            throw new SAMFormatException("Unknown BAM index file type: " + mIndexFile.getName());
        }
        // NEW STUFF STARTS HERE
        else if(mIndexStream != null) {
            long pos = -1;
            try {
                pos = mIndexStream.position();

                byte [] magicBytes = new byte[4];
                mIndexStream.readFully(magicBytes);
                if(Arrays.equals(magicBytes, BAMFileConstants.BAI_INDEX_MAGIC)) {
                    return SamIndexes.BAI;
                }
                else if(magicBytes[0] ==  31 && magicBytes[1] == -117) {  // JUST CHECKING FOR GZIPPED,
                    return SamIndexes.CSI;
                } else {
                    throw new SAMFormatException("Unknown BAM index magic number: " + new String(magicBytes));
                }

            } catch (IOException e) {
                // TODO -- log ?
            }
            finally {
                try {
                    mIndexStream.seek(pos);   // reset stream
                } catch (IOException e) {

                }
            }
        }
        // NEW STUFF ENDS HERE

        return null;
    }

@FredericBGA
Copy link

FredericBGA commented Sep 9, 2021

thank you for all of this @jrobinso

@cmnbroad
Copy link
Collaborator

I have a fix that seems to be working. Needs some more tests. Also, I need to follow up since I think there might be some leaks in the CSI code - needs a little more investigation.

lbergelson pushed a commit that referenced this issue Apr 26, 2022
* Support loading of CSI from URLs/streams.
* fixes #1507
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants