Htsget support #850
co-authored-by: Florian Reisinger <florian.reisinger@unimelb.edu.au>
This is supported in igv.js, it didn't require much code: https://github.com/igvteam/igv.js/blob/master/js/bam/htsgetReader.js I don't know of anyone who actually uses it though. There was initially some interest from a few users but they abandoned it. I take it you are actively using it there? |
There's certainly a chicken-and-egg question here... there are 3-4 national genomics initiatives implementing the protocol, although naturally the endpoints wouldn't be available publicly, so it's a bit under the radar. I've got to get round to standing one up for the newer resequenced 1000G datasets. It goes without saying that IGV support would be a terrific addition to all this! Glad to see an application of the new htsjdk client too |
I'm happy to support it but can't really until I have public htsget servers to test against. I'm not personally aware of any active initiatives in the US or Europe, but there could be some of course. In any event please let me know if public test servers become available. |
@jrobinso Genomics England are using htsget to provide virtual panel data to their curation teams; Cineca are using it as part of their no-filesystem approach. But really good point on needing a public-facing reference implementation. The Sanger one got shut down due to a lack of funding. I'll try to raise this at the next GA4GH steering committee meeting, let's see what we can do here. |
That would be very helpful, thanks, a "certified" public test server
would be very useful. By certified I mean something believed by those
behind this to implement the spec correctly, as opposed to something I
might hack up based on my understanding of it.
igv.js has an htsget implementation for alignments that I believe to be
correct, but nothing for variants.
|
Hi All, the GA4GH technical team has stood up a public, open source, longstanding htsget test server at: Please feel free to use it for testing. If interested, I can provide more info about the datasets/ids it's connected to. |
You were the next person I was going to reach out to, Jeremy - thank you! |
@jb-adams Can I take you up on the offer to describe the accessible data sets? Happy to do offline / via email; we might want to contribute a basic human BAM/VCF to it that is properly consented. |
@jb-adams Could you provide some dataset ids for the server at https://htsget.ga4gh.org? |
@brainstorm Rather than try to detect htsget URLs, I suggest considering just adding a new explicit menu item, "load from htsget" or similar. |
Thanks @jrobinso for the suggestion. When we started implementing, we considered your suggestion for a moment but concluded that it would unnecessarily clutter the File-> menu and that Load from URL was more transparent to the user (better UX)... but we'll do as you say if you think it's best. |
@brainstorm I agree with your points and concerns, but how will we determine if a URL is to an htsget service otherwise? Searching for /reads/ in the URL would not be accepted, too much chance of a false positive, and keeping an explicit list is not maintainable or even possible for private services. On a related topic we should rename our "Load from Server..." , perhaps to "Load Hosted Tracks...". |
@jb-adams What is the preferred approach for this - pinging the server and checking if https://github.com/ga4gh-discovery/ga4gh-service-info exists, then checking if there's an htsget server at that URL? Feels slightly excessive. |
@ohofmann Just a thought, can some regex be constructed that will identify an htsget server url with high reliability? Something more than just matching /reads/? |
I think @brainstorm already had a quick look at the specs and there does not seem to be a reliable/enforced URL pattern. The spec does mention that responses SHOULD include a Content-Type header that is specific to htsget, so a HEAD request or generic GET request that would yield minimal payload could be used. However, that header is not mandatory and I don't know if this could be applied to all supported URLs. I wonder if that use case ever came up in the htsget community and what the thoughts were. |
@reisingerf I would not be in favor of any pinging, HEAD or GET, to detect htsget since it would need to be applied to all URLs, 99% of which will not be htsget requests, at least in the foreseeable future. If there were a regex to identify likely htsget requests, then a ping could be used for those, but that is moot if there is not a reliable URL pattern. So I think an explicit UI entry of some sort is the most robust solution, since the user would presumably know. Alternatively we could use a list of known htsget servers as you do in the prototype code, with an option for an end user to add to the list through the preferences. In the meantime I have working code for alignments in igv.js, but no server to talk to. |
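To make the "known servers plus user additions" option concrete, here is a minimal JavaScript sketch of detection without any pinging. An explicit `htsget://` scheme is treated as unambiguous, and plain http(s) URLs match only against a host list the user can extend. The names `isHtsgetUrl`, `DEFAULT_HTSGET_HOSTS` and `userHosts` are hypothetical and not taken from IGV or igv.js.

```javascript
// Hypothetical default list; a real client would ship its own.
const DEFAULT_HTSGET_HOSTS = ["htsget.ga4gh.org"];

// Returns true if the URL should be treated as an htsget endpoint.
// userHosts is an assumed, user-editable list (e.g. from preferences).
function isHtsgetUrl(urlString, userHosts = []) {
    let url;
    try {
        url = new URL(urlString);
    } catch (e) {
        return false; // not a parseable absolute URL
    }
    // An explicit htsget:// scheme needs no guessing.
    if (url.protocol === "htsget:") {
        return true;
    }
    const known = new Set(
        [...DEFAULT_HTSGET_HOSTS, ...userHosts].map((h) => h.toLowerCase())
    );
    return known.has(url.hostname.toLowerCase());
}
```

No network request is made, so the 99% of URLs that are not htsget cost nothing, at the price of private servers needing to be added by hand.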
Give us two weeks or so; Jeremy is in the process of adding Genome in a Bottle reference samples (exome BAM, VCF) to the reference server. |
Yes, feels excessive, especially since not all servers will implement
checking for Another solution could be to use GA4GH service registry for a listing of htsget services. IGV could ping the service registry, asking for only htsget services, and build a client-side cache out of the results. This ping would not need to be performed for every request, only for rebuilding the cache. But again, this would require htsget services to be registered, potentially excluding private and/or non-registered services.
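The cache-building step of the registry idea could look roughly like the JavaScript sketch below. It assumes the registry returns a list of entries shaped like `{type: {artifact: "htsget"}, url: "https://..."}`; that shape is my reading of the GA4GH service-registry and service-info conventions and should be checked against the actual registry. The function name `htsgetHostsFromRegistry` is hypothetical.

```javascript
// Build a set of htsget host names from an already-fetched registry listing.
// Assumed entry shape: { type: { artifact: "htsget" }, url: "https://host/..." }.
function htsgetHostsFromRegistry(services) {
    const hosts = new Set();
    for (const svc of services) {
        const artifact = svc && svc.type && svc.type.artifact;
        if (artifact !== "htsget" || typeof svc.url !== "string") {
            continue; // not an htsget service, or no usable URL
        }
        try {
            hosts.add(new URL(svc.url).hostname);
        } catch (e) {
            // Skip entries whose url is not a parseable absolute URL.
        }
    }
    return hosts;
}
```

The resulting set would only be rebuilt when the cache is refreshed, not on every track load, and it still misses private or unregistered services.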
Yes, currently working on a new build of the reference server, which will involve new data sources (and better descriptions of them) as well as some new features. |
@brainstorm @ohofmann @jrobinso @mlin Alright, feature and dataset updates have been pushed to the reference server. Please take a look at the docs for an explanation of the different datasets available. In particular Roman and Oliver, you will probably be interested in the section Genome in a Bottle NA12878/HG001 BAMs under Reads Datasets. Please give some of those IDs a shot and let me know your feedback. I've added the experimental Please reach out about any questions or issues you encounter |
Thanks @jb-adams! Trying to quickly tilt this up (as a user without delving into the Go classes... yet):
And using the following NA12878 config:
{
  "htsgetconfig": {
    "props": {
      "port": "3000",
      "host": "http://localhost:3000/"
    },
    "reads": {
      "enabled": true,
      "dataSourceRegistry": {
        "sources": [
          {
            "pattern": "^(?P<accession>NA12878)$",
            "path": "./data/gcp/gatk-test-data/wgs_bam/{accession}.bam"
          }
        ]
      }
    },
    "variants": {
      "enabled": true
    }
  }
}
The according to the docs pointed:
But I'm not getting it:
I'm fairly sure I'm missing something crucial and trivial, but cannot pinpoint it right now :/ /cc @victorskl |
Hi @brainstorm , a number of things:
Maps a single ID (NA12878) to a single, local file (located at ./data/gcp/gatk-test-data/wgs_bam/NA12878.bam). When you run the server, do you have this file available locally? Given the config, the following IDs won't work: In particular, these 2 registrations point IDs like
|
Moving the htsget-ref discussion over to ga4gh/htsget-refserver#15 (comment) since it doesn't belong here. |
…loading ranges properly 🤷♂️ co-authored-by: Florian Reisinger <florian.reisinger@unimelb.edu.au>
cc @lbergelson re htsjdk's release schedule |
@brainstorm #983 is addressed, or as addressed as it's going to be. To be honest the thread for this PR is much too long to study; perhaps you can summarize the outstanding issues, if any, here. |
On it Jim, let me merge it with my branch and I'll give it a spin, happy to close this PR fast ;) |
…=header' URL addition in HtsgetReader.getReader() returns errors igvteam#983 (comment), needs more work, I'm surprised it worked for @jrobinso at all unless the merge introduced other artifacts :-S
Btw, @jrobinso instead of using your own
... it took me a bit of time to reconcile the (big) refactoring of PicardIterator and other non-rebased changes :-S |
@brainstorm Thanks. It wasn't actually my intent to work on this PR; I intended to merge the variant htsget support independently, but if this PR is close to ready I will wait. The merge would have to happen eventually. BTW the reason the "class=header" parameter is important in the initial probing request (the test to see if the URL is an htsget server) is that without it you are requesting a ticket for an entire file. It is legal to return the contents of the entire file in the ticket as a data URI, hence the parameter to request the header only. The initial round of server implementations did in fact use data URIs for BAM data; that doesn't seem to be the fashion now, but it is legal. |
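As an illustration of why data URIs matter to a client, here is a minimal JavaScript (Node) sketch that walks the `urls` array of an htsget ticket and separates inline data URI blocks from remote blocks that still need fetching. The ticket shape (`htsget.urls[].url`, optional `headers`) follows the htsget specification as I understand it; the function names `decodeBase64DataUri` and `planTicket` are hypothetical, and only base64 data URIs are handled.

```javascript
// Decode a base64 data: URI into bytes. Other encodings are rejected
// in this sketch rather than guessed at.
function decodeBase64DataUri(uri) {
    const comma = uri.indexOf(",");
    if (!uri.startsWith("data:") || comma < 0) {
        throw new Error("not a data URI");
    }
    const meta = uri.slice(5, comma);
    if (!meta.endsWith(";base64")) {
        throw new Error("only base64 data URIs are handled in this sketch");
    }
    return Buffer.from(uri.slice(comma + 1), "base64");
}

// Turn a ticket into an ordered list of blocks; the final file is the
// concatenation of all blocks in this order.
function planTicket(ticket) {
    return ticket.htsget.urls.map((entry) =>
        entry.url.startsWith("data:")
            ? { kind: "inline", bytes: decodeBase64DataUri(entry.url) }
            : { kind: "remote", url: entry.url, headers: entry.headers || {} }
    );
}
```

A server is free to put the whole file into inline blocks, which is why an unrestricted probing request can be far more expensive than it looks.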
Yes, I understand the header mechanism, but in its current form in this PR, it seems like the So what I meant is that the following happens if the other parameters remain (as they seem to do now), from
{
"htsget": {
"error": "InvalidRange",
"message": "referenceName incompatible with header-only request"
}
}
But anyhow, I need to debug and clean things up a bit more; the fault might very well be on our side... although if you want to check out this PR and give it a quick try to see what I mean, that'd help too ;) |
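One way to avoid the InvalidRange error above is to strip the range parameters before adding `class=header` to the probing request. The JavaScript sketch below shows the idea with the standard `URL` API. The function name `headerProbeUrl` is hypothetical, and `start` and `end` are the htsget range parameters I am assuming accompany `referenceName`; this is not the code in the PR or in the htsget branch.

```javascript
// Build a header-only probe URL from any htsget URL, with or without
// existing query parameters.
function headerProbeUrl(urlString) {
    const url = new URL(urlString);
    // Range parameters are incompatible with a header-only request.
    for (const name of ["referenceName", "start", "end"]) {
        url.searchParams.delete(name);
    }
    url.searchParams.set("class", "header");
    return url.toString();
}
```

Other parameters, such as `format`, are left untouched.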
@brainstorm I can't speak for what's in the PR, but in the htsget branch it works as expected. |
Alright, that's good to know, will re-review tomorrow, the BAM support seems to be broken as well as a result of the merge, so that's that too. |
Look at the unit tests for the pattern. I'm concentrating on other
issues with igv.js right now.
|
@brainstorm Let me look at this tomorrow. There's something not right: an Alignment source should not implement FeatureSource. I would expect the only new class required to be an implementation of BAMReader, which should just plug into BamSource in the switch statement below. Note that the long-deprecated GA4GH reader plugs in here. Nothing else should be needed, other than a little code to detect that the server is an htsget endpoint. Anyway, let me spend an hour looking at it tomorrow before you do anything else.
    if ("ga4gh" === config.sourceType) {
        this.bamReader = new Ga4ghAlignmentReader(config, genome);
    } else if ("pysam" === config.sourceType) {
        this.bamReader = new BamWebserviceReader(config, genome);
    } else if ("htsget" === config.sourceType) {
        this.bamReader = new HtsgetBamReader(config, genome);
    } else if ("shardedBam" === config.sourceType) {
        this.bamReader = new ShardedBamReader(config, genome);
    } else if ("cram" === config.format) {
        this.bamReader = new CramReader(config, genome, browser);
    } else {
        if (this.config.indexed === false) {
            this.bamReader = new BamReaderNonIndexed(config, genome);
        } else {
            this.bamReader = new BamReader(config, genome);
        }
    } |
Oh scratch that, that's javascript. I confuse myself like this often. The equivalent in Java is in AlignmentReaderFactory. Anyway the principle remains, we just need a BAMReader implementation to plug in, no other code should be touched. I think a fresh start in the htsget branch might be more productive than this PR branch. I will look at it tomorrow. |
Jim, @reisingerf and I were planning to take a look at this tomorrow, so don't drop this PR just yet ;) Yes, the BAMReader impl got scratched during the merge because a few interfaces changed, but we have code to restore it as well. |
OK, well I wasn't going to drop it, just suggesting that a fresh start might
be easier. I don't know when tomorrow is for you or me; tomorrow for me
is in 30 minutes, but I was going to look at it tomorrow afternoon PT. I
will wait another day so we aren't stepping on each other. In the end I
would expect to see just a BAMReader implementation, really nothing else.
The HtsgetReader class (too many "readers") will probably not be helpful as
you are using the htsjdk; it should probably be ignored for your BAM
implementation, but I'm speaking without knowing what htsjdk provides.
|
… working with the UI, things do not work as expected, namely: 1) The URL parameters do not get correctly concatenated with the base URL, leading to htsget server errors as the ones described in the UMCCR htsget PR: igvteam#850 (comment) 2) There should be a clear separation between the ?class=header JSON payload and the actual header bytes from the underlying format.
I beg to differ, based on the experiments @reisingerf and I ran, see umccr@886bfc4. By "works", did you mean it passes the tests? We couldn't make it work end to end, from loading the initial example htsget VCF URL to interacting with the UI, since base URLs were getting longer (and incorrect) at every UI operation (zoom into region, change chromosome, etc.). I'll be off for the next 3 days, but happy to come back and continue fixing this until we can ship both VCF and BAM support in htsget-IGV, at least. Thanks Jim for the patience! ;) |
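The "URLs getting longer at every UI operation" symptom is typical of appending query strings to a previously built URL. A minimal JavaScript sketch of the alternative is to derive every range request from the URL with `searchParams.set`, which overwrites instead of appending. The function name `rangeQueryUrl` is hypothetical and this is not the code in either branch; overwriting coordinates already present in the loaded URL is a design choice of the sketch, not something the thread agreed on.

```javascript
// Build the ticket request URL for one genomic range. Calling it again on
// its own output replaces the range instead of growing the query string.
function rangeQueryUrl(baseUrl, referenceName, start, end) {
    const url = new URL(baseUrl); // fresh object, baseUrl itself is untouched
    url.searchParams.delete("class"); // a range query is not header-only
    url.searchParams.set("referenceName", referenceName);
    url.searchParams.set("start", String(start));
    url.searchParams.set("end", String(end));
    return url.toString();
}
```

With this approach a zoom or chromosome change yields a URL of the same shape every time, whether the starting point was a bare endpoint or an already parametrised URL.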
@brainstorm @reisingerf No, I meant the UI. You will need to give me URLs and steps to reproduce. Here's what I do: (1) load the following by URL: https://htsget.ga4gh.org/variants/giab.NA12878 |
The "htsget" branch now supports BAM (and presumably CRAM?) file formats through the htsjdk, and variants in VCF format through the new IGV classes. I think this PR can be closed; any issues we find in the htsget branch can be dealt with as bugs. There are a couple of items missing in the htsget branch implementation, but I think they can wait for variant support in the htsjdk. Specifically (1) BCF support, and (2) support for data URIs in query ticket responses for variant endpoints. The latter could be implemented pretty easily; the place to do so is in HtsgetReader.loadURLs, line 63, if anyone wants to have a go. I have tested against these endpoints (only). If you find issues, include test URLs and steps to reproduce.
Variants: https://htsget.ga4gh.org/variants/giab.NA12878
Alignments: htsget://htsget.ga4gh.org/reads/giab.NA12878.NIST7086.1 |
Jim, I see, have you tried entering the URL *with all parameters* from step
1 instead of a base URL without any parameters?
That's what we tested first because it's the kind of URL we will load:
already parametrised by other systems (i.e. our internal data portal).
|
@brainstorm I don't understand what you mean by "URL with all parameters". |
Aah, no, that would be weird. Particularly the genome coordinates. What if the user zooms out, or goes somewhere else? |
I don't anticipate this being a normal use case; we would have to make up special rules and determine what the genome coordinates mean in this context. There are enough special rules in IGV already. We could keep the "format" parameter; it's perhaps useful if the same endpoint can support multiple formats. OTOH I'm not sure why we would want to impose a format (e.g. what does it matter if the variants are fetched as VCF, BCF, or some new, as yet unspecified variant format)? |
I've merged my htsget branch with support for variants (VCF), BAM, and CRAM. It's usable now as is with the public servers, which is all I have to test with. Let's discuss further work via new git issues, as this conversation has gotten too long. I think we can close this PR? |
Sure, thanks a lot Jim, feel free to close ;)
|
Work in progress, needs more debugging.
The current strategy is using
Load From URL->
as if htsget endpoints were any other http(s) endpoint.
/cc @reisingerf @andersleung @lbergelson @ohofmann @jb-adams @mlin