Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Htsget support #850

Closed
wants to merge 15 commits into from
Closed

Htsget support #850

wants to merge 15 commits into from

Conversation

brainstorm
Copy link
Contributor

Work in progress, needs more debugging.

The current strategy is using Load From URL-> as if htsget endpoints were any other http(s) endpoint.

/cc @reisingerf @andersleung @lbergelson @ohofmann @jb-adams @mlin

@jrobinso
Copy link
Contributor

This is supported in igv.js, it didn't require much code: https://github.com/igvteam/igv.js/blob/master/js/bam/htsgetReader.js

I don't know of anyone who actually uses it though. There was initially some interest from a few users but they abandoned it. I take it you are actively using it there?

@mlin
Copy link

mlin commented Sep 24, 2020

There's certainly a chicken-and-egg question here...there are 3-4 of the national genomics initiatives implementing the protocol, although naturally the endpoints wouldn't be available publicly, so it's a bit under the radar. I've got to get round to standing one up for the newer resequenced 1000G datasets. It goes without saying that IGV support would be a terrific addition to all this! glad to see an application of the the new htsjdk client too

@jrobinso
Copy link
Contributor

I'm happy to support it but can't really until I have public htsget servers to test against. I'm not personally aware of any active initiatives in the US or Europe, but there could be some of course. In any event please let me know if public test servers become available.

@ohofmann
Copy link

@jrobinso Genomics England are using htsget to provide virtual panel data to their curation teams; Cineca are using it as part of their no-filesystem approach. But really good point on needing a public-facing reference implementation. The Sanger one got shut down due to a lack of funding. I'll try to raise this at the next GA4GH steering committee meeting, let's see what we can do here.

@jrobinso
Copy link
Contributor

jrobinso commented Sep 30, 2020 via email

@jb-adams
Copy link

Hi All, the GA4GH technical team has stood up a public, open source, longstanding htsget test server at:
https://htsget.ga4gh.org. It is actively being worked on to support new features and public datasets, but the core htsget functionality is working as expected.

Please feel free to use it for testing. If interested, I can provide more info about the datasets/ids it's connected to.

@ohofmann
Copy link

You were the next person I was going to reach out to, Jeremy - thank you!

@ohofmann
Copy link

ohofmann commented Oct 1, 2020

@jb-adams Can I take you up on the offer to describe the accessible data sets? Happy to do offline / via email; we might want to contribute a basic human BAM/VCF to it that is properly consented.

@jrobinso
Copy link
Contributor

@jb-adams Could you provide some dataset ids for the server at https://hgsget.ga4gh.org?

@jrobinso
Copy link
Contributor

@brainstorm Rather than try to detect htsget URLs, I suggest considering just adding a new explicit menu item, "load from htsget" or similar.

@brainstorm
Copy link
Contributor Author

Thanks @jrobinso for the suggestion. When we started implementing, we considered your suggestion for a moment but concluded that it would unnecessarily clutter the File-> menu and that Load from URL was more transparent to the user (better UX)... but we'll do as you say if you think it's best.

@jrobinso
Copy link
Contributor

@brainstorm I agree with your points and concerns, but how will we determine if a URL is to an htsget service otherwise? Searching for /reads/ in the URL would not be accepted, too much chance of a false positive, and keeping an explicit list is not maintainable or even possible for private services.

On a related topic we should rename our "Load from Server..." , perhaps to "Load Hosted Tracks...".

@ohofmann
Copy link

@jb-adams What is the preferred approach for this - pinging the server and checking if https://github.com/ga4gh-discovery/ga4gh-service-info exists, then checking if there's an htsget server at that URL? Feels slightly excessive.

@jrobinso
Copy link
Contributor

@ohofmann Just a thought, can some regex be constructed that will identify an htsget server url with high reliability? Something more than just matching /reads/?

@reisingerf
Copy link

I think @brainstorm had already a quick look at the specs and there does not seem to be a reliable/enforced URL pattern.

It does mention that responses SHOULD include a Content-Type header that is specific to htsget, so a HEAD request or generic GET request that would yield minimal payload could be used. However, that header is not mandatory and I don't know if this could be applied to all supported URLs.

I wonder if that use case ever came up in the htsget community and what the thought were.

@jrobinso
Copy link
Contributor

@reisingerf I would not be in favor of any pinging, HEAD or GET, to detect htsget since it would need to be applied to all URLs, 99% of which will not be htsget requests. At least in the foreseeable future. If there was a regex to identify likely htsget requests, then a ping could be used for those, but that is mute if there is not a reliable URL pattern.

So I think an explicit UI entry of some sort is the most robust solution, since the user would presumably know. Alternatively we could use a list of known htsget servers as you do in the prototype code, with an option for an end user to add to the list through the preferences.

In the meantime I have working code for alignments in igv.js, but no server to talk to.

@ohofmann
Copy link

Give us two weeks or so; Jeremy is in the process of adding Genome in a Bottle reference samples (exome BAM, VCF) to the reference server.

@jb-adams
Copy link

@jb-adams What is the preferred approach for this - pinging the server and checking if https://github.com/ga4gh-discovery/ga4gh-service-info exists, then checking if there's an htsget server at that URL? Feels slightly excessive.

Yes, feels excessive, especially since not all servers will implement service-info. In GATK we are experimenting with an htsget:// url scheme for determining whether the data sits behind an htsget server or not. However, this approach is not standardized. If we have multiple use cases where client software will need to seamlessly differentiate between htsget and non-htsget sources, then maybe it's something we can build into the spec.

@ohofmann Just a thought, can some regex be constructed that will identify an htsget server url with high reliability? Something more than just matching /reads/?

checking for /reads/ as a regex won't work, especially since the htsget spec currently has non-prescriptive endpoints (ie. implementing at the /reads/ endpoint is suggested but not required to be a valid htsget server).

Another solution could be to use GA4GH service registry for a listing of htsget services. IGV could ping the service registry, asking for only htsget services, and build a client-side cache out of the results. This ping would not need to be performed for every request, only for rebuilding the cache. But again, this would require htsget services to be registered, potentially excluding private and/or non-registered services.

Give us two weeks or so; Jeremy is in the process of adding Genome in a Bottle reference samples (exome BAM, VCF) to the reference server.

Yes, currently working on a new build of the reference server, which will involve new data sources (and better descriptions of them) as well as some new features.

@jb-adams
Copy link

@brainstorm @ohofmann @jrobinso @mlin

Alright, feature and dataset updates have been pushed to the reference server. Please take a look at the docs for an explanation of the different datasets available. In particular Roman and Oliver, you will probably be interested in the section Genome in a Bottle NA12878/HG001 BAMs under Reads Datasets. Please give some of those IDs a shot and let me know your feedback.

I've added the experimental POST request functionality for the reads endpoint, feel free to use it for multi-region requests. As it's a new feature I haven't added much sanitization to the client input (ie. region sorting and collapsing overlaps), but as long as regions are pre-sorted it should work ok.

Please reach out about any questions or issues you encounter

@brainstorm
Copy link
Contributor Author

brainstorm commented Oct 26, 2020

Thanks @jb-adams! Trying to quickly tilt this up (as a user without delving into the Go classes... yet):

% pwd
/Users/rvalls/dev/umccr/htsget/htsget-refserver
% git branch -a
* master
% git pull
Already up to date.
% go build -o ./htsget-refserver ./cmd
%

And using the following NA12878 config:

{
  "htsgetconfig": {
    "props": {
      "port": "3000",
      "host": "http://localhost:3000/"
    },
    "reads": {
      "enabled": true,
	"dataSourceRegistry": {
	  "sources": [
		{
		  "pattern": "^(?P<accession>NA12878)$",
		  "path": "./data/gcp/gatk-test-data/wgs_bam/{accession}.bam"
		}
	  ]
	},
    "variants": {
      "enabled": true
    }
  }
}
% ./htsget-refserver -config data/config/NA12878-local.json
Server started on port 3000!

The according to the docs pointed:

Any of the following IDs may be provided to the /reads/{id} endpoint to stream the different lane-level BAMs: giab.NA12878.NIST7035.1, giab.NA12878.NIST7035.2, giab.NA12878.NIST7086.1, giab.NA12878.NIST7086.2,

But I'm not getting it:

$ curl -s http://localhost:3000/reads/NA12878 | jq .
{
  "htsget": {
    "error": "NotFound",
    "message": "The requested resource could not be associated with a registered data source"
  }
}
$ curl -s http://localhost:3000/reads/giab.NA12878.NIST7035.1 | jq .
{
  "htsget": {
    "error": "NotFound",
    "message": "The requested resource could not be associated with a registered data source"
  }
}
$ curl -s http://localhost:3000/reads/giab.NA12878.NIST7035.2 | jq .
{
  "htsget": {
    "error": "NotFound",
    "message": "The requested resource could not be associated with a registered data source"
  }
}
$ curl -s http://localhost:3000/reads/gatk.NA12878 | jq .
{
  "htsget": {
    "error": "NotFound",
    "message": "The requested resource could not be associated with a registered data source"
  }
}

I'm fairly sure I'm missing something crucial and trivial, but cannot pinpoint it right now :/

/cc @victorskl

@jb-adams
Copy link

Hi @brainstorm , a number of things:

  1. In your config, change htsgetconfig to htsgetConfig

  2. The single data source in the JSON:

{
  "pattern": "^(?P<accession>NA12878)$",
  "path": "./data/gcp/gatk-test-data/wgs_bam/{accession}.bam"
}

Maps a single ID (NA12878) to a single, local file (located at ./data/gcp/gatk-test-data/wgs_bam/NA12878.bam). When you run the server, do you have this file available locally?

Given the config, the following IDs won't work: giab.NA12878.NIST7035.1, giab.NA12878.NIST7035.2 because they don't conform to the above regex pattern. To pull in these files to the htsget server, you would need to add more data sources. Check out this file, which is how I configure the server when testing locally.

In particular, these 2 registrations point IDs like giab.NA12878.NIST7035.1 to the corresponding data on the GIAB AWS S3 bucket.

{
  "pattern": "^giab\\.NA12878\\.NIST7035\\.(?P<lane>.*)$",
  "path": "https://giab.s3.amazonaws.com/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/project.NIST_NIST7035_H7AP8ADXX_TAAGGCGA_{lane}_NA12878.bwa.markDuplicates.bam"
},
{
  "pattern": "^giab\\.NA12878\\.NIST7086\\.(?P<lane>.*)$",
  "path": "https://giab.s3.amazonaws.com/data/NA12878/Garvan_NA12878_HG001_HiSeq_Exome/project.NIST_NIST7086_H7AP8ADXX_CGTACTAG_{lane}_NA12878.bwa.markDuplicates.bam"
},

@brainstorm
Copy link
Contributor Author

brainstorm commented Oct 27, 2020

Moving the htsget-ref discussion over to ga4gh/htsget-refserver#15 (comment) since it doesn't belong here.

…loading ranges properly

🤷‍♂️

co-authored-by: Florian Reisinger <florian.reisinger@unimelb.edu.au>
@brainstorm
Copy link
Contributor Author

To any htsjdk devs listening to this thread... when should we expect the next release (including htsget support) so that we don't need to ship the .jar along? /cc @mlin @jb-adams?

@ohofmann
Copy link

cc @lbergelson re htsjdk's release schedule

@jrobinso
Copy link
Contributor

@brainstorm #983 is addressed, or as addressed as its going to be. To be honest the thread for this PR is much too long to study, perhaps you can summarize the outstanding issues if any here.

@brainstorm
Copy link
Contributor Author

brainstorm commented Jul 15, 2021

On it Jim, let me merge it with my branch and I'll give it a spin, happy to close this PR fast ;)

…=header' URL addition in HtsgetReader.getReader() returns errors igvteam#983 (comment), needs more work, I'm surprised it worked for @jrobinso at all unless the merge introduced other artifacts :-S
@brainstorm
Copy link
Contributor Author

Btw, @jrobinso instead of using your own htsget branch, you can use the GitHub CLI to work on an active/open PR directly like so (i.e this one):

gh pr checkout 850

... it took me a bit of time to reconcile the (big) refactoring of PicardIterator and other non-rebased changes :-S

@jrobinso
Copy link
Contributor

@brainstorm Thanks. It wasn't my intent actually to work on this PR, I intended to merge the variant htsget support independently, but if this PR is close to ready I will wait. The merge would have to happen eventually.

BTW the reason the "class=header" parameter is important in the first initial probing request (the test to see if URL is an htsget server) is because without it you are requesting a ticket for an entire file. It is legal to return the contents of the entire file in the ticket as a data URI, thus the parameter to request header only. The initial round of server implementations in fact did use data URIs for bam data, that doesn't seem to be the fashion now, but it is legal.

@brainstorm
Copy link
Contributor Author

brainstorm commented Jul 19, 2021

BTW the reason the "class=header" parameter is important in the first initial probing request (the test to see if URL is an htsget server) is because without it you are requesting a ticket for an entire file. It is legal to return the contents of the entire file in the ticket as a data URI, thus the parameter to request header only. The initial round of server implementations in fact did use data URIs for bam data, that doesn't seem to be the fashion now, but it is legal.

Yes, I understand the header mechanism, but in its current form in this PR, it seems like the class=header parameter gets appended to the initial input URL, i.e, the one you pointed earlier:

https://htsget.ga4gh.org/variants/1000genomes.phase1.chr8?format=VCF&referenceName=8&start=128732400&end=128770475

So what I meant is that the following happens if the other parameters remain (as they seem to do now), from HtsgetReader.java:29:

org.broad.igv.exceptions.HttpResponseException: Bad Request <br> 
{
  "htsget": {
    "error": "InvalidRange",
    "message": "referenceName incompatible with header-only request"
  }
}

But anyhow, I need to debug and cleanup things a bit more, the fault might very well be on our side... although if you want to checkout this PR and give it a quick try and see what I mean, that'd help too ;)

@jrobinso
Copy link
Contributor

@brainstorm I can't speak for what's in the PR, but in the htsget branch it works as expected.

@brainstorm
Copy link
Contributor Author

Alright, that's good to know, will re-review tomorrow, the BAM support seems to be broken as well as a result of the merge, so that's that too.

@jrobinso
Copy link
Contributor

jrobinso commented Jul 19, 2021 via email

@jrobinso
Copy link
Contributor

@brainstorm Let me look at this tommorrow. There's something not right, an Alignment source should not implement FeatureSource. I would expect the only new class required to be an implementation of BAMReader, which should just plug into BamSource in the switch statement below. Note that the long deprecated GA4GH reader plugins in here. Nothing else should be needed, other than a little code to detect that the server is an hstget endpoint. Anyway, let me spend an hour looking at it tomorrow before you do anything else.

        if ("ga4gh" === config.sourceType) {
            this.bamReader = new Ga4ghAlignmentReader(config, genome);
        } else if ("pysam" === config.sourceType) {
            this.bamReader = new BamWebserviceReader(config, genome)
        } else if ("htsget" === config.sourceType) {
            this.bamReader = new HtsgetBamReader(config, genome);
        } else if ("shardedBam" === config.sourceType) {
            this.bamReader = new ShardedBamReader(config, genome);
        } else if ("cram" === config.format) {
            this.bamReader = new CramReader(config, genome, browser);
        } else {
            if (this.config.indexed === false) {
                this.bamReader = new BamReaderNonIndexed(config, genome);
            } else {
                this.bamReader = new BamReader(config, genome);
            }
        }

@jrobinso
Copy link
Contributor

Oh scratch that, that's javascript. I confuse myself like this often. The equivalent in Java is in AlignmentReaderFactory. Anyway the principle remains, we just need a BAMReader implementation to plug in, no other code should be touched. I think a fresh start in the htsget branch might be more productive than this PR branch. I will look at it tomorrow.

@brainstorm
Copy link
Contributor Author

Jim, we were planning to take a peek at this me and @reisingerf tomorrow, don't drop this PR just yet ;)

Yes, the BAMReader impl got scratched during the merge because a few interfaces changed, but we have code to restore it and well.

@jrobinso
Copy link
Contributor

jrobinso commented Jul 19, 2021 via email

brainstorm added a commit to umccr/igv that referenced this pull request Jul 20, 2021
… working with the UI, things do not work as expected, namely:

1) The URL parameters do not get correctly concatenated with the base URL, leading to htsget server errors as the ones described in the UMCCR htsget PR: igvteam#850 (comment)
2) There should be a clear separation between the ?class=header JSON payload and the actual header bytes from the underlying format.
@brainstorm
Copy link
Contributor Author

@brainstorm I can't speak for what's in the PR, but in the htsget branch it works as expected.

I beg to differ according to @reisingerf and my experiments, see umccr@886bfc4

By works you meant it passes the tests? We couldn't make it work from loading the initial example data htsget VCF url to interacting with the UI since base URLs were getting longer (and incorrect) at every UI operation (zoom into region, change chromosome, etc...)

I'll be off in the next 3 days, but happy to come back and continue fixing this until we can ship both VCF and BAM support in htsget-IGV, at least. Thanks Jim for the patience! ;)

@jrobinso
Copy link
Contributor

@brainstorm @reisingerf No I meant the UI. You will need to give me URLs and steps to reproduce. Here's what I do

(1)load the following by URL: https://htsget.ga4gh.org/variants/giab.NA12878
(2) jump to "Myc", I see 1 variant
(3) jump to chr7:55,131,949-55,205,656, I see the following screenshot
(3) jump around some more, I see more variants. No errors

Screen Shot 2021-07-20 at 9 22 31 AM

@jrobinso
Copy link
Contributor

jrobinso commented Jul 20, 2021

The "htsget" branch now supports BAM (and presumably CRAM?) file formats through the htsjdk, and variants in VCF format through the new IGV classes. I think this PR can be closed, any issues we find in the htsget branch can be dealt with as bugs. There are a couple of items missing in the htsget branch implementation, but I think they can wait for variant support in the htsjdk. Specifically (1) BCF support, and (2) support for data URIs in query ticket responses for variant endpoints. The latter could be implemented pretty easily, the place to do so is in HtsgetReader.loadURLs, line 63, if anyone wants to have a go.

I have tested against these endpoints (only). If you find issues include test URLs and steps to reproduce.

Variants: https://htsget.ga4gh.org/variants/giab.NA12878
Alignments: https://htsget.ga4gh.org/reads/giab.NA12878.NIST7086.1

@brainstorm
Copy link
Contributor Author

brainstorm commented Jul 20, 2021 via email

@jrobinso
Copy link
Contributor

@brainstorm I don't understand what you mean by. "URL with all parameters".

@brainstorm
Copy link
Contributor Author

brainstorm commented Jul 20, 2021 via email

@jrobinso
Copy link
Contributor

Aah, no, that would be weird. Particularly the genome coordinates. What if the user zooms out, or goes somewhere else?

@jrobinso
Copy link
Contributor

I don't anticipate this being a normal use case, we would have to make up special rules and determine what the genome coordinates mean in this context. There are enough special rules in IGV already. We could keep the "format" parameter, its perhaps useful if the same endpoint can support multiple formats. OTOH I'm not sure why we would want to impose a format (e.g. what does it matter if the variants are fetched with VCF, BCF, or some new as yet unspecified variant format)?

@jrobinso
Copy link
Contributor

jrobinso commented Jul 23, 2021

I've merged my htsget branch with support for variants (VCF), BAM, and CRAM. Its usable now as is with the public servers, which is all I have to test with. Let's discuss further work via new git issues as this conversation has gotten too long. I think we can close this PR?

@brainstorm
Copy link
Contributor Author

brainstorm commented Jul 23, 2021 via email

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants