HTSJDK performance on CRAM #722
@jrobinso @vadimzalunin Thoughts on this? @jpaul6 Have you PoC'd #3? I'd be curious to see what that looks like.
@cmnbroad I have not, and I agree that's probably among the stickier parts.
I've also been experimenting with IGV and CRAM files, and it's pretty painful. For my own personal use I tend to load a local genome when using CRAM, since the reference sequence fetching is then fast, and in general things are only 1-2X slower than working with a BAM file. The downside, of course, is that all the standard annotations in the server-hosted genomes (genes, repeats, etc.) are no longer at my fingertips. To echo @jpaul6's comments, it seems that the real problem is that the
Either of these approaches would allow a separate
I like Tim's approach of returning an object instead of a byte[] array. Tim, IIRC when you first load a server-hosted genome you can check a box
Before all that, however, has anyone profiled this to confirm that
@jrobinso I haven't profiled, but the performance is night and day different using the hosted genome vs. a local genome. I had meant to say in my comment that while there are workarounds (use a local genome, the checkbox you suggest), those are not great workarounds for less experienced users. For example, I support a number of labs where there's a plethora of lab techs and mol-bio people using IGV; getting everyone set up to use a local genome or similar is likely to be quite challenging.
OK. Since disk space is not usually a problem these days, perhaps IGV
BTW I think the best solution is to change the CRAMReferenceSource API to
I agree on not losing the MD5 check, but perhaps the API can require that the returned object yield up the MD5 of the full sequence, and then also allow querying of sub-sequences. I don't think it's necessary for the CRAM code to fetch the full sequence and compute the MD5; if, e.g., there's a sequence dictionary or similar file available with the reference genome, I think it should be sufficient to check the pre-computed reference MD5s against those in the CRAM header.
Presently the MD5 check is implemented only at the level of the slice in HTSJDK (as far as I can tell). My personal preference would be to allow the client code to decide whether and how to check at the genome and/or contig and/or slice level, or not at all (i.e. to enable something like #719). @jrobinso, to your question re: profiling, I did not profile, but I did implement the CRAMReferenceSource interface @tfenne and I mentioned, and it does help enormously. Unfortunately, the other problem whereby HTSJDK returns far more blocks/records than span the queried range (number 3 in my initial post) still requires querying a rather large amount of sequence. I'm still not certain that's an index issue, as it could conceivably be something to do with, e.g., fetching mate pairs. I'm hoping someone with a better command of how indexing works (bai versus crai, etc.) can weigh in on that.
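A minimal sketch of the kind of interface being discussed here, in Java since HTSJDK is Java. The names `RegionCapableReferenceSource`, `getBases`, and `getSequenceMD5` are hypothetical illustrations, not part of HTSJDK; the idea is that a client can fetch just a sub-range of a contig while the whole-sequence MD5 check is satisfied from precomputed values (e.g. a sequence dictionary):

```java
import java.util.Arrays;
import java.util.Map;

// Hypothetical interface: fetch a sub-range of a contig without
// materializing the whole sequence, while still exposing the
// precomputed whole-sequence MD5 for the CRAM header check.
interface RegionCapableReferenceSource {
    /** 1-based, inclusive coordinates, as in SAM/CRAM. */
    byte[] getBases(String contig, int start, int end);

    /** MD5 of the full contig sequence, e.g. read from a .dict file. */
    String getSequenceMD5(String contig);
}

// Toy in-memory implementation for illustration only; a real one would
// back getBases() with an indexed FASTA or a remote range request.
class InMemoryReferenceSource implements RegionCapableReferenceSource {
    private final Map<String, byte[]> seqs;
    private final Map<String, String> md5s;

    InMemoryReferenceSource(Map<String, byte[]> seqs, Map<String, String> md5s) {
        this.seqs = seqs;
        this.md5s = md5s;
    }

    @Override
    public byte[] getBases(String contig, int start, int end) {
        byte[] full = seqs.get(contig);
        return Arrays.copyOfRange(full, start - 1, end); // 1-based -> 0-based
    }

    @Override
    public String getSequenceMD5(String contig) {
        return md5s.get(contig);
    }
}
```

With a shape like this, a slice decoder could request only the reference span it needs, and the genome-level MD5 check could compare dictionary values against the CRAM header instead of hashing fetched bases.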
Sorry for my ignorance, but is an MD5SUM of each reference sequence part of the CRAM specification? We're having difficulty processing CRAMs with ~1G reference sequences (an assembly, believe it or not). It takes ages (about 5 hours per file) to check the given MD5SUM against the CRAM. Not sure if this issue is related or if I should discuss it on a separate thread. Thanks.
I believe it is part of the spec (because CRAM is reference-compressed, without this check it's possible to silently get completely bogus results): "2. All CRAM reader implementations are expected to check for reference MD5 checksums and report any missing or mismatching entries. Consequently, all writer implementations are expected to ensure that all checksums are injected or checked during compression time." We had originally discussed an "opt-out" arg, but I don't think it was implemented. What app are you seeing this with - IGV?
Should it take 5 hours to compute MD5 on a 1G sequence? Sorry, I'm not familiar with the algorithm or how it scales.
@jrobinso I want to believe you're being sarcastic, because the answer is that it should take nowhere near that long. On my ~3-year-old Mac laptop it takes ~8s to compute the MD5 of the entire hs38DH reference. You can try for yourself; calculate the MD5 checksum of the whole human genome:

```shell
# Mac
time md5 hs38DH.fa

# Linux
time md5sum hs38DH.fa
```
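For context on why this is fast: MD5 is a single linear pass over the input, so hashing time scales with sequence length and memory use can stay constant. A streaming computation in Java (HTSJDK's language) looks roughly like this; `StreamingMd5` is an illustrative class name, not anything from HTSJDK:

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class StreamingMd5 {
    // Hash an arbitrary-length stream in fixed-size chunks, so memory use
    // stays constant no matter how large the reference sequence is.
    public static String md5Hex(InputStream in)
            throws IOException, NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        byte[] buf = new byte[1 << 16];
        int n;
        while ((n = in.read(buf)) != -1) {
            md.update(buf, 0, n);
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

A 13 Gbp reference is a few minutes of hashing at typical disk speeds, so a 5-hour runtime points at something other than the digest itself.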
Tim, not being sarcastic at all; the OP stated that it was taking 5 hours. So I don't want us off on a tangent (e.g. turning off MD5 checking) if that's not the problem.
Sorry for the confusion: in this case there are 735,943 reference sequences (scaffolds) for a total of approximately 13 Gbp. http://plants.ensembl.org/Triticum_aestivum/Info/Annotation The problem isn't generating the MD5SUMs; it comes from the fact that ~1G sequences need to be checked against the CRAM Reference Registry for a few thousand CRAM files. Because we're using HTSJDK in our processing pipeline, we'd like to skip this step (for example, once we have checked one file exhaustively and computed the MD5SUM of all MD5SUMs, we can quickly confirm the next file has identical sequences). Is this reasonable?
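The "MD5SUM of all MD5SUMs" shortcut described above can be sketched as follows; this is an illustrative helper (the class and method names are made up, not HTSJDK API), assuming the per-sequence MD5s are already available, e.g. from a sequence dictionary:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.List;

public class ReferenceFingerprint {
    // Combine per-sequence MD5 digests (hex strings) into one fingerprint.
    // Two references with the same sequences in the same order produce the
    // same fingerprint, so files sharing a verified reference can skip the
    // exhaustive per-sequence check.
    public static String fingerprint(List<String> perSequenceMd5s)
            throws NoSuchAlgorithmException {
        MessageDigest md = MessageDigest.getInstance("MD5");
        for (String h : perSequenceMd5s) {
            md.update(h.getBytes(StandardCharsets.UTF_8));
        }
        StringBuilder hex = new StringBuilder();
        for (byte b : md.digest()) {
            hex.append(String.format("%02x", b));
        }
        return hex.toString();
    }
}
```

Comparing one such fingerprint per file is O(number of sequences) in hex digests rather than O(total bases), which is the difference between milliseconds and hours at 735,943 scaffolds.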
Hi all, @tfenne, @jpaul6: I recently pushed some changes that should help with performance, sometimes by quite a lot. Please check the latest release of 3.0 beta or a nightly snapshot and comment here, or open an IGV discussion, on what you find. I've cycled back to the CRAM problem and want to fix whatever can be fixed in IGV. As long as the interface requires the entire sequence (chromosome) to decode records, the initial delay will remain, but with the latest release further browsing on the same chromosome should be faster. Also, @jpaul6, you mentioned an alternative implementation of CRAMReferenceSource that sped things up. Is that something you can share?
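One plausible shape for the kind of speedup described above is a memoizing wrapper that pays the (slow, possibly remote) fetch cost once per contig and serves repeat queries from memory. This is a sketch only; `Function<String, byte[]>` stands in for the real `CRAMReferenceSource` call, and the class name is hypothetical:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

public class CachingReferenceSource {
    // Underlying fetcher, e.g. a slow network fetch of a full contig;
    // Function<String, byte[]> is a stand-in for the real reference source.
    private final Function<String, byte[]> fetcher;
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();

    public CachingReferenceSource(Function<String, byte[]> fetcher) {
        this.fetcher = fetcher;
    }

    // The first query for a contig pays the fetch cost; later queries are
    // served from memory, which is where most of the browsing speedup
    // would come from.
    public byte[] getReferenceBases(String contig) {
        return cache.computeIfAbsent(contig, fetcher);
    }
}
```

The trade-off is memory: caching whole chromosomes costs up to a few hundred MB for human-sized contigs, which is usually acceptable for a desktop browser like IGV.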
Bump, any hope of movement on this issue? I've optimized all I can in IGV without an improved CRAMReferenceSource API.
I've recently been trying to use recent builds of IGV on CRAM files. Especially when using the Broad-hosted reference files, performance is very poor. I've traced this back to three (semi-independent) issues with the interface that HTSJDK provides/implements.
I see that these are all related, and the reason that I'd separated them out is that I'd like to propose the following solutions (which must all be realized in order to make IGV performant):
I've PoC'd much of this, and it is all relatively straightforward. I'd be happy to clean things up and submit a PR if that's useful. Looking forward to the discussion. Many thanks.