HTSJDK performance on CRAM #722

Closed
jpaul6 opened this issue Oct 10, 2016 · 17 comments

@jpaul6

jpaul6 commented Oct 10, 2016

I've been trying to use recent builds of IGV on CRAM files. Performance is very poor, especially when using the Broad-hosted reference files. I've traced this back to three (semi-independent) issues with the interface that HTSJDK provides/implements.

  1. The CRAMReferenceSource interface has a single function that demands a byte[] for the entire chromosome. This takes an age to load over a slow internet connection.
  2. HTSJDK checks the MD5 of each slice, and in sparse regions, slices can encompass a very large (>50Mb) region.
  3. HTSJDK, even when a relatively small reference range of the CRAM file is requested, iterates over (and inflates) CRAM blocks that are entirely non-overlapping with that reference range. I suspect this is related to using a BAI rather than a CRAI for the index?

I realize these are all related; the reason I've separated them out is that I'd like to propose the following solutions (all of which must be realized to make IGV performant):

  1. Give the CRAMReferenceSource interface a method that requests a chromosomal range, and leave caching up to the implementation. In most cases even a very simple cache would require 4-5 orders of magnitude less data to be fetched (see the sketch below).
  2. Move the MD5 checking behind an interface (at least optionally) so that clients aren't forced to do this checking, or can do it in a different way (e.g. with caching).
  3. If no CRAMReferenceSource is provided, do not fully inflate the record (i.e. do what's been requested here: Allow CRAM to be read if reference sequences are missing. #719), and provide a mechanism for the client to inflate the record. This doesn't directly address the issue, but it gives the client a lot more flexibility to ignore, without incurring substantial computational expense, those records that aren't relevant to the current query.

I've PoC'd much of this, and it is all relatively straightforward. I'd be happy to clean things up and submit a PR if that's useful. Looking forward to the discussion. Many thanks.
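
For reference, here is a minimal sketch of what the range-based method in proposal 1 might look like. The interface and method names below are hypothetical, not the current HTSJDK API:

import htsjdk.samtools.SAMSequenceRecord;

/**
 * Hypothetical range-aware variant of CRAMReferenceSource. The whole-contig
 * method mirrors the existing interface; the region method is the proposal.
 */
public interface RangedCramReferenceSource {

    /** Existing-style accessor: all bases for a contig. */
    byte[] getReferenceBases(SAMSequenceRecord record, boolean tryNameVariants);

    /**
     * Proposed accessor: only the bases in [start, start + length) of the
     * contig (0-based start). Caching is left entirely to the implementation.
     */
    byte[] getReferenceBasesByRegion(SAMSequenceRecord record, int start, int length);
}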

@cmnbroad
Collaborator

@jrobinso @vadimzalunin Thoughts on this? @jpaul6 Have you PoC'd #3? I'd be curious to see what that looks like.

@jpaul6
Author

jpaul6 commented Oct 11, 2016

@cmnbroad I have not, and I agree that's probably among the stickier parts.

@tfenne
Member

tfenne commented Oct 11, 2016

I've also been experimenting with IGV and CRAM files, and it's pretty painful. For my own personal use I tend to load a local genome when using CRAM, since the reference sequence fetching is then fast and in general things are only 1-2X slower than working with a BAM file. The downside, of course, is that all the standard annotations in the server-hosted genomes (genes, repeats, etc.) are no longer at my fingertips.

To echo @jpaul6's comments, it seems that the real problem is that the CRAMReferenceSource API is specified in a way that works well for many common use cases in data processing and archival pipelines, but is at odds with the way IGV works when displaying files. I could imagine a couple of ways to solve this pretty easily:

  1. Change the API to return a CRAMReference or ReferenceSequence object instead of a byte[], which would then expose access to sub-sequences through its API. The existing implementation could cache the whole chromosome and serve up regions from it, but alternative implementations could do something different.
  2. Add a default method to the CRAMReferenceSource API that fetches just the desired region; prefer that method when it is implemented, and otherwise fall back to loading the entire chromosome.

Either of these approaches would allow a separate IGVCramReferenceSource that does things in a way more appropriate for an interactive viewer with a remote genome reference.
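
As a rough illustration of what such an IGVCramReferenceSource could do, here is a sketch of a padded-window cache. Everything below (the class name, fetchRemoteBases, the 1 Mb pad) is hypothetical and stands in for whatever remote sequence service IGV actually uses:

import java.util.Arrays;

/**
 * Sketch of a viewer-friendly reference source that caches a padded window
 * around each requested region instead of whole chromosomes.
 * Illustrative only; this is not the HTSJDK or IGV API.
 */
public abstract class WindowedReferenceSource {

    private static final int PAD = 1_000_000; // fetch 1 Mb extra on each side

    private String cachedContig;
    private int cachedStart;      // 0-based start of the cached window
    private byte[] cachedBases;

    /** Fetch bases [start, end) of a contig from the remote sequence server. */
    protected abstract byte[] fetchRemoteBases(String contig, int start, int end);

    /** Return bases [start, start + length) of the contig, using the window cache. */
    public synchronized byte[] getBases(final String contig, final int start, final int length) {
        final int end = start + length;
        final boolean miss = cachedBases == null
                || !contig.equals(cachedContig)
                || start < cachedStart
                || end > cachedStart + cachedBases.length;
        if (miss) {
            cachedContig = contig;
            cachedStart = Math.max(0, start - PAD);
            cachedBases = fetchRemoteBases(contig, cachedStart, end + PAD);
        }
        final int offset = start - cachedStart;
        return Arrays.copyOfRange(cachedBases, offset, Math.min(cachedBases.length, offset + length));
    }
}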

@jrobinso
Contributor

I like Tim's approach of returning an object instead of a byte[] array, but IIRC this might be a substantial undertaking on the HTSJDK side.

Tim, IIRC when you first load a server-hosted genome you can check a box that says download sequence (or fasta). The intent of that was to allow you to work with a local fasta. To be honest I didn't implement that, and don't know if it even works, but if it doesn't it could be made to work. This might be an acceptable workaround.

Before all that, however, has anyone profiled this to confirm that CRAMReferenceSource is the problem? It likely is, particularly if working with a local fasta fixes the problem.


@tfenne
Member

tfenne commented Oct 11, 2016

@jrobinso I haven't profiled, but the performance is night-and-day different using the hosted genome vs. a local genome. I had meant to say in my comment that while there are workarounds (use a local genome, the checkbox you suggest), those are not great workarounds for less experienced users. For example, I support a number of labs where there's a plethora of lab techs and mol-bio people using IGV, and getting everyone set up to use a local genome or similar is likely to be quite challenging.

@jrobinso
Contributor

OK. Since disk space is not usually a problem these days, perhaps IGV could notice you are using a CRAM and cache entire sequences on disk when first accessed.
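
A minimal sketch of that idea, assuming the bases still have to be downloaded once per genome/contig; the class, file layout, and SequenceDownloader hook below are all hypothetical, not IGV code:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Sketch of a disk-backed cache for whole reference sequences: the first
 * access pays the download cost, later sessions read from the local file.
 * Illustrative only; not IGV's actual implementation.
 */
public class OnDiskSequenceCache {

    private final Path cacheDir;

    public OnDiskSequenceCache(final Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    public byte[] getOrDownload(final String genomeId, final String contig,
                                final SequenceDownloader downloader) {
        final Path cached = cacheDir.resolve(genomeId + "_" + contig + ".bases");
        try {
            if (Files.exists(cached)) {
                return Files.readAllBytes(cached);            // cache hit: local read
            }
            final byte[] bases = downloader.download(contig); // cache miss: fetch once
            Files.createDirectories(cacheDir);
            Files.write(cached, bases);
            return bases;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for whatever IGV uses to pull sequence from the hosted genome. */
    public interface SequenceDownloader {
        byte[] download(String contig);
    }
}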


@jrobinso
Contributor

BTW I think the best solution is to change the CRAMReferenceSource API to not require entire sequences, but I don't know if that's feasible, and I definitely would not want to lose the MD5 sequence check.


@tfenne
Member

tfenne commented Oct 11, 2016

I agree on not losing the MD5 check... but perhaps the API can require that the returned object yield the MD5 of the full sequence, and then also allow querying of sub-sequences? I don't think it's necessary for the CRAM code to fetch the full sequence and compute the MD5; if, e.g., there's a sequence dictionary or similar file available with the reference genome, it should be sufficient to check the pre-computed reference MD5s against those in the CRAM header.
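
To make that concrete, here is a sketch of such a check, assuming the reference's sequence dictionary (e.g. from a .dict file) and the CRAM's header dictionary have already been loaded as htsjdk SAMSequenceDictionary objects; the class itself is illustrative, not an existing HTSJDK utility:

import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.samtools.SAMSequenceRecord;

/**
 * Sketch: compare pre-computed reference MD5s (e.g. from a .dict file) against
 * the M5 values recorded in the CRAM's header, without touching any bases.
 */
public final class DictionaryMd5Check {

    private DictionaryMd5Check() {}

    /** Returns true if every contig in the CRAM header has a matching M5 in the reference dict. */
    public static boolean md5sMatch(final SAMSequenceDictionary referenceDict,
                                    final SAMSequenceDictionary cramHeaderDict) {
        for (final SAMSequenceRecord cramSeq : cramHeaderDict.getSequences()) {
            final String cramMd5 = cramSeq.getAttribute(SAMSequenceRecord.MD5_TAG);
            final SAMSequenceRecord refSeq = referenceDict.getSequence(cramSeq.getSequenceName());
            if (refSeq == null || cramMd5 == null
                    || !cramMd5.equalsIgnoreCase(refSeq.getAttribute(SAMSequenceRecord.MD5_TAG))) {
                return false;
            }
        }
        return true;
    }
}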

@jpaul6
Author

jpaul6 commented Oct 11, 2016

Presently the MD5 check is implemented only at the level of the slice in HTSJDK (as far as I can tell). My personal preference would be to allow the client code to decide whether and how to check at the genome and/or contig and/or slice level, or not at all (i.e. to enable something like #719).

@jrobinso to your question re: profiling, I did not profile, but I did implement the kind of CRAMReferenceSource interface @tfenne and I mentioned, and it does help enormously. Unfortunately, the other problem, whereby HTSJDK returns far more blocks/records than span the queried range (number 3 in my initial post), still requires querying a rather large amount of sequence. I'm still not certain that's an index issue, as it could conceivably be something to do with, e.g., fetching mate pairs. I'm hoping someone with a better command of how the indexing works (BAI versus CRAI, etc.) can weigh in on that.

@cmnbroad cmnbroad added the cram label Oct 19, 2016
@dbolser-ebi

Sorry for my ignorance, but is an MD5 checksum of each reference sequence part of the CRAM specification?

We're having difficulty processing CRAMs with ~1G reference sequences (an assembly, believe it or not). It takes ages (about 5 hours per file) to check the given MD5SUMs against the CRAM. I'm not sure if this issue is related or if I should discuss it on a separate thread.

Thanks,

@cmnbroad
Collaborator

I believe it is part of the spec (because CRAM is reference-compressed, without this check it's possible to silently get completely bogus results):

"2. All CRAM reader implementations are expected to check for reference MD5 checksums and report any missing or mismatching entries. Consequently, all writer implementations are expected to ensure that all checksums are injected or checked during compression time."

We had originally discussed an "opt-out" arg, but I don't think it was implemented. What app are you seeing this with? IGV?

@jrobinso
Contributor

jrobinso commented Jan 19, 2017 via email

@tfenne
Member

tfenne commented Jan 19, 2017

@jrobinso I want to believe you're being sarcastic, because the answer is that it should take nowhere near that long. On my ~3-year-old Mac laptop it takes ~8s to compute the MD5 of the entire hs38DH reference. You can try it for yourself.

Calculate the MD5 checksum of the whole human genome:

# Mac
time md5 hs38DH.fa
# Linux
time md5sum hs38DH.fa

@jrobinso
Contributor

jrobinso commented Jan 19, 2017 via email

@dbolser-ebi

Sorry for the confusion: in this case there are 735,943 reference sequences (scaffolds) for a total of approximately 13 Gbp. http://plants.ensembl.org/Triticum_aestivum/Info/Annotation

The problem isn't generating the MD5SUMs; it comes from the fact that ~1G sequences need to be checked against the CRAM Reference Registry for a few thousand CRAM files. Because we're using HTSJDK in our processing pipeline, we'd like to skip this step (for example, once we have checked one file exhaustively and computed the MD5SUM of all the MD5SUMs, we could quickly confirm that the next file has identical sequences).

Is this reasonable?
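
For what it's worth, the aggregate check described above could be as simple as hashing the per-sequence MD5s in dictionary order. A sketch, assuming those per-sequence MD5s are already known (e.g. from a .dict file); the class name is made up:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/**
 * Sketch of an "MD5 of the MD5s" fingerprint: hash the per-sequence MD5s in
 * dictionary order so two files with identical reference sets can be compared
 * with a single string comparison.
 */
public final class ReferenceSetFingerprint {

    private ReferenceSetFingerprint() {}

    public static String of(final List<String> perSequenceMd5sInDictionaryOrder) {
        try {
            final MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (final String seqMd5 : perSequenceMd5sInDictionaryOrder) {
                md5.update(seqMd5.getBytes(StandardCharsets.UTF_8));
            }
            final StringBuilder hex = new StringBuilder();
            for (final byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e); // never happens on a standard JRE
        }
    }
}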

@jrobinso
Contributor

Hi all, @tfenne, @jpaul6, I recently pushed some changes that should help with performance, sometimes by quite a lot. Please check the latest release of 3.0 beta or a nightly snapshot and comment here, or open an IGV discussion on what you find. I've cycled back to the CRAM problem and want to fix whatever can be fixed in IGV. As long as the interface requires the entire sequence (chromosome) to decode records, the initial delay will remain, but with the latest release further browsing on the same chromosome should be faster.

Also @jpaul6 you mentioned an alternative implementation of CRAMReferenceSource that sped things up. Is that something you can share?

@jrobinso
Contributor

Bump, any hope of movement on this issue? I've optimized all I can in IGV without an improved CRAMReferenceSource API.
