Exception is thrown when loading the BAM from a URL #1507

SergejN · 2020-09-02T15:59:28Z

Description of the issue:

Dear htsjdk team,

I encountered a strange bug when trying to display a BAM file in IGV. I work with the Mexican axolotl (genome size 32Gb). Therefore, I have to use CSI indices for the BAM files.
In IGV, if I specify the URL to the BAM file and provide an explicit URL to the CSI index, an exception is thrown in

protected void verifyIndexMagicNumber(final String sourceName) {
        // Verify the magic number.
        seek(0);
        final byte[] buffer = new byte[4];
        readBytes(buffer);
        if (!Arrays.equals(buffer, BAMFileConstants.BAI_INDEX_MAGIC)) {
            throw new RuntimeIOException("Invalid file header in BAM index " + sourceName +
                    ": " + new String(buffer));
        }
    }

htsjdk/src/main/java/htsjdk/samtools/AbstractBAMFileIndex.java

Line 372 in edb8c60

protected void verifyIndexMagicNumber(final String sourceName) {

I tried my best to check what is going on there. It seems that the path/url is correct in IGV itself. It is also correct in htsjdk, because the error message (see screenshot) has the correct path in it.

I checked if the file is corrupted. If I download the files from the server and open them locally, everything works as expected. Next, I tested if the error occurs generally (suggesting a mis-configuration of the server) or is related to the CSI file. I extracted the reads from a locus chr10p:1-500000000 to make sure I can use the BAI index. With the BAI index everything works just fine.
From the code above it seems that htsjdk tries to load the index as BAI even if it's a CSI index.

Your environment:

version of htsjdk: 2.23.0 (although IGV uses an older version of the library, I replaced it in my custom IGV installation)
version of java: the one that is shipped with IGV - OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.7+10)
which OS: MacOS 10.14.6

Steps to reproduce

Start IGV 2.8 and load the reference and the BAM files locally and through the URL. The reads will be displayed if the BAM file is loaded locally and fail to be displayed if the file is fetched from the URL. Since the path seems to be correct in the error message, I guess the error occurs within htsjdk. However, to be on the safe side, I add the link to the IGV issue (igvteam/igv#843) I opened yesterday and also add @jrobinso to this issue.
Unfortunately, I cannot provide a path to our server since it is not accessible from the internet. But I put the relevant data to Dropbox. https://www.dropbox.com/sh/dsgj4fe0gpej4jm/AADWmPPzQnOGkSnL6SUx3Ceja?dl=0

Important note: yesterday, when I tried to get it to run, I checked if it worked if I'd rename the CSI file to BAI (it does with the UCSC browser, though, therefore, it wasn't a complete stupid assumption). It did not. However, when I created the test dataset for IGV bug report, I forgot to rename the file back to CSI. So, what you have in the archive is a CSI file with the wrong extension. It's not the reason for the bug reported here.

Expected behaviour

CSI index should be loaded correctly

Actual behaviour

htskdk interprets the index as BAI

thank you!
Best,
Sergej

The text was updated successfully, but these errors were encountered:

jrobinso · 2020-09-02T16:07:46Z

I've confirmed this and am working on a simple test case. Here is how IGV instantiates reader. It works if the bam and csi are local, fails if loading by URL

        final SamReaderFactory factory = SamReaderFactory.makeDefault().
                referenceSource(new IGVReferenceSource()).
                validationStringency(ValidationStringency.SILENT);
        SamInputResource resource = SamInputResource.of(IGVSeekableStreamFactory.getInstance().getStreamFor(url));
        SeekableStream indexStream = IGVSeekableStreamFactory.getInstance().getStreamFor(HttpUtils.createURL(indexPath));
        resource = resource.index(indexStream);
        return factory.open(resource);

I've put test data here

https://s3.amazonaws.com/igv.org.test/csi_test/chr10p.bam
https://s3.amazonaws.com/igv.org.test/csi_test/chr10p.bam.csi

lbergelson · 2020-09-02T20:21:31Z

@SergejN @jrobinso Thank you for reporting this and for the help reproducing it. I'm not surprised that there is a bad interaction somewhere with URLs and CSI. Both are sort of specially handled and there must be a bad interaction, probably in the SamFactory open method. I suspect it will be easy enough to track down now that you found the problem.

lbergelson · 2020-09-02T21:15:58Z

Ok. I found the problem. It's because it's loading the index as a stream in this case. We never updated that path to check if it's bai or csi explicitly and it just guesses that it's probably bai.

You can see the problem here:

   if (samIndex == null) {
                mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary())
                        : new DiskBasedBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary());
            } else if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping)
                            : new DiskBasedBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping);
            } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexFile, mEnableIndexMemoryMapping, getFileHeader().getSequenceDictionary());
            } else {
                throw new SAMFormatException("Unsupported BAM index file: " + mIndexFile.getName());
            }

SergejN · 2020-09-02T21:28:05Z

great job @lbergelson ! I hope it's easy to fix

jrobinso · 2020-09-02T23:39:02Z

@lbergelson I've confirmed that cram indexes (.crai) work by URL. Not that it was a question, but thought it worth checking.

There is a related potential problem elsewhere where the file extension is used to determine index type. Actually not even the extension necessarily but the end of the string (urls can have parameters). This is problematic for certain URLs that are basically mangled hashes, such as Amazon signed URLs and Google Drive urls. Perhaps we can solve both problems by reading the first few bytes of an index to check its magic number. I can look into that if you want to go that direction. Alternatively, and perhaps less disruptively, we could add a parameter to tell the htsjdk explicitly what the index type is.

whaleberg · 2020-09-03T03:55:46Z

In this case we can scan the first bytes and determine if it's csi/bai. That's probably what I'll do here. In the future I want to let you specify the type explicitly but that's a longer term change I think.

jrobinso · 2020-11-10T06:03:30Z

@whaleberg @lbergelson @SergejN Just checking, where are we with this?

jrobinso · 2020-12-05T05:03:51Z

bump

SergejN · 2020-12-05T09:22:39Z

any news?

tougai · 2020-12-10T15:31:20Z

Hello, i am having the same issue and interested on its resolution as well, any idea when this will be solved ?

lbergelson · 2020-12-21T18:03:24Z

Hi, wanted to update. I intend to patch this but I haven't gotten a chance. I've been running at much reduced capacity lately due to the pandemic and working at home. Sorry that it's taking so long!

tougai · 2021-01-22T15:17:18Z

Any updates ?

FredericBGA · 2021-09-07T14:30:22Z

Hello
Any news regarding this issue?
The last version on IGV seems to be still unable to open a BAM and its .csi index from an URL.
igvteam/igv#843
Thank you.

jrobinso · 2021-09-09T03:36:09Z

@lbergelson @whaleberg Can I suggest something like this as a solution (in BAMFileReader)? The idea is to peek at the stream to see if its gzipped, if it is guess "CSI", which is probably correct. A more rigorous test would check the CSI magic number, but to do that you first have to see if its gzipped, and gunzip the first block. I think it would suffice, and be a big improvement, to just check for gzip and return "CSI" if it is. We aren't validating BAI magic numbers here.

Relying on file index from URLs is problematic, and will for sure break for certain providers (e.g. Google Drive and potentially signed Amazon URLs).


    public SamIndexes getIndexType() {
        if (mIndexFile != null) {
             ...
        } 
           else if(mIndexStream != null) {
                  // Read the first 2 bytes from mIndexStream to see if its gzipped, if it is its probably CSI (and is definitely not BAI).
        }
 
        return null;
    }

jrobinso · 2021-09-09T14:39:53Z

@lbergelson @whaleberg Actually just determining type is not enough, getIndex() assumes there is an mIndexFile.

Another proposal, provide a constructor or factory function that I can pass a BAMIndex to, and IGV will deal with determining the type and reading the index. Perhaps this exists, but I don't see it. I'm currently using SamInputResource methods to create readers in this class https://github.com/igvteam/igv/blob/master/src/main/java/org/broad/igv/sam/reader/SamReaderPool.java

jrobinso · 2021-09-09T15:37:55Z

@lbergelson @whaleberg OK, 1 last attempt, see below. If not a file (mIndexFile == null) I do the magic bytes test on the stream. Actually for CSI I just check if its gzipped, so its a guess but a good one. Then when creating the index I also check if its a stream or file, in the current implementation it just assume that a type of "null" is a stream. This looks like it should work, but it fails later in CSIIndex checking the magic number, because it just reads the bytes from the supplied stream and does not gunzip (or bgunzip) first. Does that constructor assume a gzip stream? The supplied SeekableStream could be wrapped in a GUnzipStream but at this point I'm not sure how its supposed to work.

Sorry for the spam, hope this is helpful.


    /**
     * Retrieves the index for the given file type.  Ensure that the index is of the specified type.
     * @return An index of the given type.
     */
    @Override
    public BAMIndex getIndex() {
        if(!hasIndex())
            throw new SAMException("No index is available for this BAM file.");
        if(mIndex == null) {
            SamIndexes samIndex = getIndexType();
            if(mIndexFile != null) {
                if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = mEnableIndexCaching ? new CachingBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping)
                            : new DiskBasedBAMFileIndex(mIndexFile, getFileHeader().getSequenceDictionary(), mEnableIndexMemoryMapping);
                } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexFile, mEnableIndexMemoryMapping, getFileHeader().getSequenceDictionary());
                } else {
                    throw new SAMFormatException("Unsupported BAM index file: " + mIndexFile.getName());
                }
            } else if(mIndexStream != null) {
                if (samIndex.equals(SamIndexes.BAI)) {
                    mIndex = new CachingBAMFileIndex(mIndexStream, getFileHeader().getSequenceDictionary());
                } else if (samIndex.equals(SamIndexes.CSI)) {
                    mIndex = new CSIIndex(mIndexStream,  getFileHeader().getSequenceDictionary());
                } else {
                    throw new SAMFormatException("Unsupported BAM index stream ");
                }
            }
        }

        return mIndex;
    }

    /**
     * Return the type of the BAM index, BAI or CSI.
     * @return one of {@link SamIndexes#BAI} or {@link SamIndexes#CSI} or null
     */
    public SamIndexes getIndexType() {
        if (mIndexFile != null) {
            if (mIndexFile.getName().toLowerCase().endsWith(FileExtensions.BAI_INDEX)) {
                return SamIndexes.BAI;
            } else if (mIndexFile.getName().toLowerCase().endsWith(FileExtensions.CSI)) {
                return SamIndexes.CSI;
            }

            throw new SAMFormatException("Unknown BAM index file type: " + mIndexFile.getName());
        }
        // NEW STUFF STARTS HERE
        else if(mIndexStream != null) {
            long pos = -1;
            try {
                pos = mIndexStream.position();

                byte [] magicBytes = new byte[4];
                mIndexStream.readFully(magicBytes);
                if(Arrays.equals(magicBytes, BAMFileConstants.BAI_INDEX_MAGIC)) {
                    return SamIndexes.BAI;
                }
                else if(magicBytes[0] ==  31 && magicBytes[1] == -117) {  // JUST CHECKING FOR GZIPPED,
                    return SamIndexes.CSI;
                } else {
                    throw new SAMFormatException("Unknown BAM index magic number: " + new String(magicBytes));
                }

            } catch (IOException e) {
                // TODO -- log ?
            }
            finally {
                try {
                    mIndexStream.seek(pos);   // reset stream
                } catch (IOException e) {

                }
            }
        }
        // NEW STUFF ENDS HERE

        return null;
    }

FredericBGA · 2021-09-09T15:48:38Z

thank you for all of this @jrobinso

cmnbroad · 2022-02-23T22:39:29Z

I have a fix that seems to be working. Needs some more tests. Also, I need to follow up since I think there might be some leaks in the CSI code - needs a little more investigation.

* Support loading of CSI from URLs/streams. * fixes #1507

SergejN changed the title ~~CSI index fails when loading the BAM throiugh~~ Exception is thrown when loading the BAM from a URL Sep 2, 2020

jrobinso mentioned this issue May 13, 2021

New version of htsjdk igvteam/igv#843

Closed

cmnbroad mentioned this issue Feb 28, 2022

Support loading of CSI from URLs/streams. #1595

Merged

lbergelson closed this as completed in #1595 Apr 26, 2022

lbergelson pushed a commit that referenced this issue Apr 26, 2022

Support loading of CSI from URLs/streams. #1507 (#1595)

1449dec

* Support loading of CSI from URLs/streams. * fixes #1507

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exception is thrown when loading the BAM from a URL #1507

Exception is thrown when loading the BAM from a URL #1507

SergejN commented Sep 2, 2020

jrobinso commented Sep 2, 2020 •

edited

Loading

lbergelson commented Sep 2, 2020

lbergelson commented Sep 2, 2020

SergejN commented Sep 2, 2020

jrobinso commented Sep 2, 2020

whaleberg commented Sep 3, 2020

jrobinso commented Nov 10, 2020

jrobinso commented Dec 5, 2020

SergejN commented Dec 5, 2020

tougai commented Dec 10, 2020 •

edited

Loading

lbergelson commented Dec 21, 2020

tougai commented Jan 22, 2021

FredericBGA commented Sep 7, 2021

jrobinso commented Sep 9, 2021 •

edited

Loading

jrobinso commented Sep 9, 2021

jrobinso commented Sep 9, 2021

FredericBGA commented Sep 9, 2021 •

edited

Loading

cmnbroad commented Feb 23, 2022

Exception is thrown when loading the BAM from a URL #1507

Exception is thrown when loading the BAM from a URL #1507

Comments

SergejN commented Sep 2, 2020

Description of the issue:

Your environment:

Steps to reproduce

Expected behaviour

Actual behaviour

jrobinso commented Sep 2, 2020 • edited Loading

lbergelson commented Sep 2, 2020

lbergelson commented Sep 2, 2020

SergejN commented Sep 2, 2020

jrobinso commented Sep 2, 2020

whaleberg commented Sep 3, 2020

jrobinso commented Nov 10, 2020

jrobinso commented Dec 5, 2020

SergejN commented Dec 5, 2020

tougai commented Dec 10, 2020 • edited Loading

lbergelson commented Dec 21, 2020

tougai commented Jan 22, 2021

FredericBGA commented Sep 7, 2021

jrobinso commented Sep 9, 2021 • edited Loading

jrobinso commented Sep 9, 2021

jrobinso commented Sep 9, 2021

FredericBGA commented Sep 9, 2021 • edited Loading

cmnbroad commented Feb 23, 2022

jrobinso commented Sep 2, 2020 •

edited

Loading

tougai commented Dec 10, 2020 •

edited

Loading

jrobinso commented Sep 9, 2021 •

edited

Loading

FredericBGA commented Sep 9, 2021 •

edited

Loading