HTSJDK performance on CRAM #722

Closed
jpaul6 opened this issue Oct 10, 2016 · 17 comments

@jpaul6

jpaul6 commented Oct 10, 2016

I've been trying to use recent builds of IGV on CRAM files. Performance is very poor, especially when using the Broad-hosted reference files. I've traced this back to three (semi-independent) issues with the interface that HTSJDK provides/implements.

  1. The CRAMReferenceSource interface has a single function that demands a byte[] for the entire chromosome. This takes an age to load over a slow internet connection.
  2. HTSJDK checks the MD5 of each slice, and in sparse regions, slices can encompass a very large (>50Mb) region.
  3. HTSJDK, even when a relatively small reference range of the CRAM file is requested, iterates over (and inflates) CRAM blocks that are entirely non-overlapping with that reference range. I suspect this is related to using a BAI rather than a CRAI for the index?

I realize these are all related; the reason I've separated them out is that I'd like to propose the following solutions (all of which must be realized to make IGV performant):

  1. Give the CRAMReferenceSource interface a method that requests a chromosomal range, and leave caching up to the implementation. In most cases even a very simple cache would require 4-5 orders of magnitude less data to be fetched (see the sketch below).
  2. Move the MD5 checking behind an interface (at least optionally) so that clients aren't forced to do this checking, or can do it in a different way (e.g. with caching).
  3. If no CRAMReferenceSource is provided, do not fully inflate the record (i.e. do what's been requested here: Allow CRAM to be read if reference sequences are missing. #719), and provide a mechanism for the client to inflate the record. This doesn't directly address the issue, but it gives the client a lot more flexibility to ignore, without incurring substantial computational expense, those records that aren't relevant to the current query.

I've PoC'd much of this, and it is all relatively straightforward. I'd be happy to clean things up and submit a PR if that's useful. Looking forward to the discussion. Many thanks.
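
For reference, here is a minimal sketch of what the range-based method in proposal 1 might look like. The interface and method names below are hypothetical, not the current HTSJDK API:

import htsjdk.samtools.SAMSequenceRecord;

/**
 * Hypothetical range-aware variant of CRAMReferenceSource. The whole-contig
 * method mirrors the existing interface; the region method is the proposal.
 */
public interface RangedCramReferenceSource {

    /** Existing-style accessor: all bases for a contig. */
    byte[] getReferenceBases(SAMSequenceRecord record, boolean tryNameVariants);

    /**
     * Proposed accessor: only the bases in [start, start + length) of the
     * contig (0-based start). Caching is left entirely to the implementation.
     */
    byte[] getReferenceBasesByRegion(SAMSequenceRecord record, int start, int length);
}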

@cmnbroad
Collaborator

@jrobinso @vadimzalunin Thoughts on this? @jpaul6 Have you PoC'd #3? I'd be curious to see what that looks like.

@jpaul6
Author

jpaul6 commented Oct 11, 2016

@cmnbroad I have not, and I agree that's probably among the stickier parts.

@tfenne
Member

tfenne commented Oct 11, 2016

I've also been experimenting with IGV and CRAM files, and it's pretty painful. For my own personal use I tend to load a local genome when using CRAM, since the reference sequence fetching is then fast and in general things are only 1-2X slower than working with a BAM file. The downside, of course, is that all the standard annotations in the server-hosted genomes (genes, repeats, etc.) are no longer at my fingertips.

To echo @jpaul6's comments, it seems that the real problem is that the CRAMReferenceSource API is specified in a way that works well for many common use cases in data processing and archival pipelines, but is at odds with the way IGV works when displaying files. I could imagine a couple of ways to solve this pretty easily:

  1. Change the API to return a CRAMReference or ReferenceSequence object instead of a byte[], which would then expose access to sub-sequences through its API. The existing implementation could cache the whole chromosome and serve up regions from it, but alternative implementations could do something different.
  2. Add a default method to the CRAMReferenceSource API that fetches just the desired region; prefer that method when it is implemented, and otherwise fall back to loading the entire chromosome.

Either of these approaches would allow a separate IGVCramReferenceSource that does things in a way more appropriate for an interactive viewer with a remote genome reference.
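
As a rough illustration of what such an IGVCramReferenceSource could do, here is a sketch of a padded-window cache. Everything below (the class name, fetchRemoteBases, the 1 Mb pad) is hypothetical and stands in for whatever remote sequence service IGV actually uses:

import java.util.Arrays;

/**
 * Sketch of a viewer-friendly reference source that caches a padded window
 * around each requested region instead of whole chromosomes.
 * Illustrative only; this is not the HTSJDK or IGV API.
 */
public abstract class WindowedReferenceSource {

    private static final int PAD = 1_000_000; // fetch 1 Mb extra on each side

    private String cachedContig;
    private int cachedStart;      // 0-based start of the cached window
    private byte[] cachedBases;

    /** Fetch bases [start, end) of a contig from the remote sequence server. */
    protected abstract byte[] fetchRemoteBases(String contig, int start, int end);

    /** Return bases [start, start + length) of the contig, using the window cache. */
    public synchronized byte[] getBases(final String contig, final int start, final int length) {
        final int end = start + length;
        final boolean miss = cachedBases == null
                || !contig.equals(cachedContig)
                || start < cachedStart
                || end > cachedStart + cachedBases.length;
        if (miss) {
            cachedContig = contig;
            cachedStart = Math.max(0, start - PAD);
            cachedBases = fetchRemoteBases(contig, cachedStart, end + PAD);
        }
        final int offset = start - cachedStart;
        return Arrays.copyOfRange(cachedBases, offset, Math.min(cachedBases.length, offset + length));
    }
}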

@jrobinso
Contributor

I like Tim's approach of returning an object instead of a byte[] array, but IIRC this might be a substantial undertaking on the HTSJDK side.

Tim, IIRC when you first load a server-hosted genome you can check a box that says download sequence (or fasta). The intent of that was to allow you to work with a local fasta. To be honest I didn't implement that, and don't know if it even works, but if it doesn't it could be made to work. This might be an acceptable workaround.

Before all that, however, has anyone profiled this to confirm that CRAMReferenceSource is the problem? It likely is, particularly if working with a local fasta fixes the problem.


@tfenne
Member

tfenne commented Oct 11, 2016

@jrobinso I haven't profiled, but the performance is night-and-day different using the hosted genome vs. a local genome. I had meant to say in my comment that while there are workarounds (use a local genome, the checkbox you suggest), those are not great workarounds for less experienced users. For example, I support a number of labs where there's a plethora of lab techs and mol-bio people using IGV, and getting everyone set up to use a local genome or similar is likely to be quite challenging.

@jrobinso
Contributor

OK. Since disk space is not usually a problem these days, perhaps IGV could notice you are using a CRAM and cache entire sequences on disk when first accessed.
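
A minimal sketch of that idea, assuming the bases still have to be downloaded once per genome/contig; the class, file layout, and SequenceDownloader hook below are all hypothetical, not IGV code:

import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/**
 * Sketch of a disk-backed cache for whole reference sequences: the first
 * access pays the download cost, later sessions read from the local file.
 * Illustrative only; not IGV's actual implementation.
 */
public class OnDiskSequenceCache {

    private final Path cacheDir;

    public OnDiskSequenceCache(final Path cacheDir) {
        this.cacheDir = cacheDir;
    }

    public byte[] getOrDownload(final String genomeId, final String contig,
                                final SequenceDownloader downloader) {
        final Path cached = cacheDir.resolve(genomeId + "_" + contig + ".bases");
        try {
            if (Files.exists(cached)) {
                return Files.readAllBytes(cached);            // cache hit: local read
            }
            final byte[] bases = downloader.download(contig); // cache miss: fetch once
            Files.createDirectories(cacheDir);
            Files.write(cached, bases);
            return bases;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Stand-in for whatever IGV uses to pull sequence from the hosted genome. */
    public interface SequenceDownloader {
        byte[] download(String contig);
    }
}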


@jrobinso
Contributor

BTW I think the best solution is to change the CRAMReferenceSource API to not require entire sequences, but I don't know if that's feasible, and I definitely would not want to lose the MD5 sequence check.


@tfenne
Member

tfenne commented Oct 11, 2016

I agree on not losing the MD5 check... but perhaps the API can require that the returned object yield the MD5 of the full sequence, and then also allow querying of sub-sequences? I don't think it's necessary for the CRAM code to fetch the full sequence and compute the MD5; if, e.g., there's a sequence dictionary or similar file available with the reference genome, it should be sufficient to check the pre-computed reference MD5s against those in the CRAM header.
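
To make that concrete, here is a sketch of such a check, assuming the reference's sequence dictionary (e.g. from a .dict file) and the CRAM's header dictionary have already been loaded as htsjdk SAMSequenceDictionary objects; the class itself is illustrative, not an existing HTSJDK utility:

import htsjdk.samtools.SAMSequenceDictionary;
import htsjdk.samtools.SAMSequenceRecord;

/**
 * Sketch: compare pre-computed reference MD5s (e.g. from a .dict file) against
 * the M5 values recorded in the CRAM's header, without touching any bases.
 */
public final class DictionaryMd5Check {

    private DictionaryMd5Check() {}

    /** Returns true if every contig in the CRAM header has a matching M5 in the reference dict. */
    public static boolean md5sMatch(final SAMSequenceDictionary referenceDict,
                                    final SAMSequenceDictionary cramHeaderDict) {
        for (final SAMSequenceRecord cramSeq : cramHeaderDict.getSequences()) {
            final String cramMd5 = cramSeq.getAttribute(SAMSequenceRecord.MD5_TAG);
            final SAMSequenceRecord refSeq = referenceDict.getSequence(cramSeq.getSequenceName());
            if (refSeq == null || cramMd5 == null
                    || !cramMd5.equalsIgnoreCase(refSeq.getAttribute(SAMSequenceRecord.MD5_TAG))) {
                return false;
            }
        }
        return true;
    }
}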

@jpaul6
Author

jpaul6 commented Oct 11, 2016

Presently the MD5 check is implemented only at the level of the slice in HTSJDK (as far as I can tell). My personal preference would be to allow the client code to decide whether and how to check at the genome and/or contig and/or slice level, or not at all (i.e. to enable something like #719).

@jrobinso to your question re: profiling, I did not profile, but I did implement the kind of CRAMReferenceSource interface @tfenne and I mentioned, and it does help enormously. Unfortunately, the other problem, whereby HTSJDK returns far more blocks/records than span the queried range (number 3 in my initial post), still requires querying a rather large amount of sequence. I'm still not certain that's an index issue, as it could conceivably be something to do with, e.g., fetching mate pairs. I'm hoping someone with a better command of how the indexing works (BAI versus CRAI, etc.) can weigh in on that.

@cmnbroad cmnbroad added the cram label Oct 19, 2016
@dbolser-ebi

Sorry for my ignorance, but is an MD5 checksum of each reference sequence part of the CRAM specification?

We're having difficulty processing CRAMs with ~1G reference sequences (an assembly, believe it or not). It takes ages (about 5 hours per file) to check the given MD5SUMs against the CRAM. I'm not sure if this issue is related or if I should discuss it on a separate thread.

Thanks,

@cmnbroad
Collaborator

I believe it is part of the spec (because CRAM is reference-compressed, without this check it's possible to silently get completely bogus results):

"2. All CRAM reader implementations are expected to check for reference MD5 checksums and report any missing or mismatching entries. Consequently, all writer implementations are expected to ensure that all checksums are injected or checked during compression time."

We had originally discussed an "opt-out" arg, but I don't think it was implemented. What app are you seeing this with? IGV?

@jrobinso
Contributor

jrobinso commented Jan 19, 2017 via email

@tfenne
Member

tfenne commented Jan 19, 2017

@jrobinso I want to believe you're being sarcastic, because the answer is that it should take nowhere near that long. On my ~3-year-old Mac laptop it takes ~8s to compute the MD5 of the entire hs38DH reference. You can try it for yourself.

Calculate the MD5 checksum of the whole human genome:

# Mac
time md5 hs38DH.fa
# Linux
time md5sum hs38DH.fa

@jrobinso
Contributor

jrobinso commented Jan 19, 2017 via email

@dbolser-ebi

Sorry for the confusion: in this case there are 735,943 reference sequences (scaffolds) for a total of approximately 13 Gbp. http://plants.ensembl.org/Triticum_aestivum/Info/Annotation

The problem isn't generating the MD5SUMs; it comes from the fact that ~1G sequences need to be checked against the CRAM Reference Registry for a few thousand CRAM files. Because we're using HTSJDK in our processing pipeline, we'd like to skip this step (for example, once we have checked one file exhaustively and computed the MD5SUM of all the MD5SUMs, we could quickly confirm that the next file has identical sequences).

Is this reasonable?
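
For what it's worth, the aggregate check described above could be as simple as hashing the per-sequence MD5s in dictionary order. A sketch, assuming those per-sequence MD5s are already known (e.g. from a .dict file); the class name is made up:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

/**
 * Sketch of an "MD5 of the MD5s" fingerprint: hash the per-sequence MD5s in
 * dictionary order so two files with identical reference sets can be compared
 * with a single string comparison.
 */
public final class ReferenceSetFingerprint {

    private ReferenceSetFingerprint() {}

    public static String of(final List<String> perSequenceMd5sInDictionaryOrder) {
        try {
            final MessageDigest md5 = MessageDigest.getInstance("MD5");
            for (final String seqMd5 : perSequenceMd5sInDictionaryOrder) {
                md5.update(seqMd5.getBytes(StandardCharsets.UTF_8));
            }
            final StringBuilder hex = new StringBuilder();
            for (final byte b : md5.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e); // never happens on a standard JRE
        }
    }
}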

@jrobinso
Contributor

Hi all, @tfenne, @jpaul6, I recently pushed some changes that should help with performance, sometimes by quite a lot. Please check the latest release of 3.0 beta or a nightly snapshot and comment here, or open an IGV discussion on what you find. I've cycled back to the CRAM problem and want to fix whatever can be fixed in IGV. As long as the interface requires the entire sequence (chromosome) to decode records, the initial delay will remain, but with the latest release further browsing on the same chromosome should be faster.

Also @jpaul6 you mentioned an alternative implementation of CRAMReferenceSource that sped things up. Is that something you can share?

@jrobinso
Contributor

Bump, any hope of movement on this issue? I've optimized all I can in IGV without an improved CRAMReferenceSource API.
