Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

added /service-info endpoint to htsget openapi specification #493

Merged
merged 14 commits into from
Mar 1, 2021

Conversation

jb-adams
Copy link
Contributor

addition of /service-info endpoint to htsget OpenAPI specifcation

@jb-adams
Copy link
Contributor Author

This PR aims to incorporate the GA4GH service info specification into htsget. Service info will enable htsget services to inform clients about what features they support. There are 3 main aspects to the proposed htsget service-info response.

  1. Include ALL base Service attributes in the htsgetServiceInfo JSON response payload
  2. The artifact attribute of the Service's type object MUST be htsget
  3. The service-info JSON response payload contains an optional htsget attribute, which contains a nested object indicating supported features. Through this object, the server can specify:
  • Whether the service serves reads and/or variants
  • The API path to access reads and variants (if path differs from the default)
  • Read file formats supported by the server
  • Variant file formats supported by the server
  • Whether BAM/CRAM field exclusion is supported by the server
  • Whether BAM/CRAM tag inclusion/exclusion is supported by the server
  • The supported BAM/CRAM tags that the client can ask to include/exclude

The proposed spec can be viewed via Swagger Editor by pasting the new openAPI yaml into the editor.

@brainstorm
Copy link
Contributor

This is fantastic @jb-adams!

I've just added your version on SwaggerHub where I originally developed the previous specs so that people can interact with it directly (and also compare with previous releases):

https://app.swaggerhub.com/apis/brainkod/htsget/1.3.0

As for the content of the PR itself, LGTM ;)

@mlin mlin added the htsget label May 12, 2020
@mlin
Copy link
Member

mlin commented May 13, 2020

@jb-adams following your lead here I've just added (as a PR onto your branch which is in turn PRed here) a markdown description of service-info. First, rough draft, needs work for sure.

https://github.com/jb-adams/hts-specs/pull/1/files

@jmarshall
Copy link
Member

This draft implements this at a /service-info endpoint. The advice from the service-info authors suggests (for foobar=htsget, foo=reads, and bar=variants) that their expectation would be for two endpoints, /reads/service-info and /variants/service-info.

The question of what endpoint to implement a service-info response at has been discussed at some length on the htsget mailing list. This question would need to be resolved and that resolution reflected in this PR before htsget service-info can progress.

@jmarshall
Copy link
Member

  • Why have endpointSupported? Wouldn't it be more parsimonious to use the simple presence of a reads or variants item in the htsget object determine whether the endpoint is supported?

  • supportedFormats (or just formats?) is a useful hint as to what formats are liked by a server, but why is the default in its absence not “no assumptions about formats returned can be made ahead of actually making a query”?

  • the fields and tags parameters are always going to be ‘supported’ by a server; the question is whether they will actually be acted upon and actually make the fields disappear in the returned file. So either this hint is unnecessary or it would be better named ‘…Effective‘ or similar.

  • Many SAM tags are predefined in SAMtags.pdf but others and locally-defined ones are equally valid. So any list of which tags can be given to tags/notags is going to be voluminous and/or incomplete. What is gained by having a field here to list them?

@jb-adams
Copy link
Contributor Author

@mlin @jmarshall

I pushed some changes following the May 13th discussion:

  • There are now 2 service-info endpoints, one for each htsget API: /reads/service-info and /variants/service-info. This promotes encapsulation and prevents service-info endpoint collision when a web service implements multiple GA4GH API specs (e.g. htsget + refget).
  • Keeping track of non-default url paths is considered out of band for service-info, it can be handled by the url property of a Service within a service registry. As such, the endpointSupported and endpoint properties have been removed.
  • In keeping with decoupling the reads and variants APIs (a service can implement either or both), they have different proposed canonical artifact strings: htsget-reads and htsget-variants.
  • supportedFormats renamed to formats
  • fieldsParameterSupported renamed to fieldsParameterEffective
  • tagsParametersSupported renamed to tagsParametersEffective
  • tags listing removed

@mlin
Copy link
Member

mlin commented May 14, 2020

I also updated https://github.com/jb-adams/hts-specs/pull/1/files although with some intentional divergences; in particular, wondering if we can indeed get away with one response schema description even though there are slight differences between them for reads vs variants.

In particular, the question will arise, does the schema have to capture with full fidelity, the constraint that fieldsParameterEffective and tagsParametersEffective can only come from reads endpoints and never from variants endpoints? I would like to argue this isn't a big deal and we get a nice simplification by not expressing that, but open to feedback ofc.

@brainstorm
Copy link
Contributor

@mlin, I hear you. Would it make sense to keep those parameters common to all endpoints and return nothing or null when and if they don't make much sense for a particular file format?... we then have a more uniform, predictable interface, IMHO. Also discussed/covered in PR #495 (comment).

@jb-adams
Copy link
Contributor Author

@mlin @jmarshall @brainstorm @daviesrob

To clarify from the last htsget meeting, we decided on tackling and closing these issues relating to the htsget spec sequentially:

  1. integration of /service-info (PR added /service-info endpoint to htsget openapi specification #493 )
  2. fix OpenAPI repetition (issue htsget OpenAPI endpoint repetition #489, PR fix htsget OpenAPI repetition #495 )
  3. further refinement to better state what is meant by fields and tags in reads and variants

@mlin is this correct? I think the first order of business is to resolve the divergences between the OpenAPI and MD incorporating /service-info, ie. this PR

@mlin
Copy link
Member

mlin commented Jun 23, 2020

@jb-adams Yes, excited for both #493 and #495 and I no longer think one has to precede the other except to fit our collective bandwidth/attention limitations

@jb-adams
Copy link
Contributor Author

jb-adams commented Jul 3, 2020

@mlin 's markdown changes and my OpenAPI changes have been consolidated and merged. What is the next step?

@mlin
Copy link
Member

mlin commented Jul 6, 2020

I think we're in good shape for merging this, pending any further comments from others

htsget.md Outdated Show resolved Hide resolved
@jmarshall
Copy link
Member

jmarshall commented Jul 7, 2020

  • formats is a useful hint as to what formats are liked by a server, but why is the default in its absence not “no assumptions about formats returned can be made ahead of actually making a query”?

  • The separate artifact values htsget-reads and htsget-variants are a bit silly. What happens when we invent a third category to serve via htsget? IMHO it would be better to have a single "artifact": "htsget" value in the generic part of the service-info response, and a required "htsget: { "datatype": "reads|variants" } entry in the htsget-specific part.

  • Why is the default for fieldsParameterEffective and tagsParametersEffective not “If absent, clients cannot make assumptions about whether fields/tags/notags will be effective”?

  • Whether the appropriate version number bump here is to 1.2.1 or 1.3, it needs to be bumped in both htsget.md and htsget-openapi.yaml.

  • There is trailing whitespace inadvertently added in both files, and it would be easier to review htsget-openapi.yaml if the PR didn't make unrelated opinion-based editorial changes (blank line removal) at the same time.

@jmarshall
Copy link
Member

Previously in #493 (comment) I wrote:

The question of what endpoint to implement a service-info response at has been discussed at some length on the htsget mailing list. This question would need to be resolved and that resolution reflected in this PR before htsget service-info can progress.

This has still not been resolved. As service-info is not for service discovery, it seems to me that we would have the option of implementing service-info at URLs that did not conflict with our IDs, e.g. /reads?serviceinfo=serviceinfo and then listing those URLs as the service-info URLs in the server's service registry response at /services. Ideally such URLs would thereafter become blessed in the service-info spec, and this would IMHO be a very reasonable improvement to that spec.

@mlin
Copy link
Member

mlin commented Jul 7, 2020

Re what "artifact" should look like, summarizing prior side discussion between @jb-adams and myself. First the discrepancy was noted,

In the MD, there are 2 type properties, one in the root object that says htsget, and one within the htsget property that states either reads or variants. In the OpenAPI, there is only 1 type property that can have a value of htsget-reads or htsget-variants.

Then I wrote along similar lines as @jmarshall above,

I've thought it possible for htsget to eventually provide query access for numerous additional datatypes (e.g. BED, wiggle/bigWig, future graph formats) and thus my instinct is to minimize explicit special-casing for reads & variants to the extent possible. However, this is admittedly an idealistic vision, so if the bifurcation provides more-practical benefits (such as easier OpenAPI capture) then I'm alright with it.

And for now we snapped the markdown to the distinct artifacts approach, but certainly open for further discussion:

regarding (1), I think the goal with the GA4GH service info endpoint is to provide the client with the API's identity from just the ServiceType attribute (ie. group, artifact, version). If htsget expands to include additional types at different routes, we could include those on the TASC register at that time. I'm curious to hear what the group thinks.

@jb-adams
Copy link
Contributor Author

jb-adams commented Jul 13, 2020

  • formats is a useful hint as to what formats are liked by a server, but why is the default in its absence not “no assumptions about formats returned can be made ahead of actually making a query”?

Because the spec states that the default format when not specified in the request is BAM, this is in keeping with that. There's a similar critique of fieldsParameterEffective and tagsParametersEffective below, I've addressed them all in more depth there.

  • The separate artifact values htsget-reads and htsget-variants are a bit silly. What happens when we invent a third category to serve via htsget? IMHO it would be better to have a single "artifact": "htsget" value in the generic part of the service-info response, and a required "htsget: { "datatype": "reads|variants" } entry in the htsget-specific part.

Why is it silly? The canonical values need to be registered somewhere, either in 1 attribute and 1 registry, or across 2 attributes and 2 registries. Why should we invent a separate htsget datatype registry when the TASC service info registry is already doing this? When we invent a third htsget type we will need to add a new value whether we use artifact only or artifact and datatype, I don't see how additional data types means putting everything in artifact won't work.

  • Why is the default for fieldsParameterEffective and tagsParametersEffective not “If absent, clients cannot make assumptions about whether fields/tags/notags will be effective”?

My understanding of service-info is that it's a way for data providers to let clients know about what features are implemented by their service. htsget is already quite flexible in that we don't require the service to implement multiple formats, field or tag inclusion/exclusion, but there should be a mechanism by which the service can let the client know what's implemented and what's not. If we say "no assumptions can be made" about formats, fields, tags, notags in all these circumstances where the corresponding service-info attribute isn't provided, then service-info just becomes a nice thing to have, but ultimately not the de facto way of letting clients know what they can request. If no assumptions can be made, the client must do all the leg work in figuring out the exact implemented features for each service. Why don't we use service-info to put the onus on data providers to be explicit about what features they've implemented, rather than putting the onus on clients? The data provider will know better than the client what they've implemented, and it's simpler for them to relay this via service-info rather than making the client keep a record of what works and what doesn't for each service.

  • Whether the appropriate version number bump here is to 1.2.1 or 1.3, it needs to be bumped in both htsget.md and htsget-openapi.yaml.

Noted. My understanding is that since service info is a new feature, that constitutes a minor version bump.

  • There is trailing whitespace inadvertently added in both files, and it would be easier to review htsget-openapi.yaml if the PR didn't make unrelated opinion-based editorial changes (blank line removal) at the same time.

Noted

@jmarshall

@jmarshall
Copy link
Member

The separate artifact values htsget-reads and htsget-variants are a bit silly. What happens when we invent a third category to serve via htsget? IMHO it would be better to have a single "artifact": "htsget" value in the generic part of the service-info response, and a required "htsget: { "datatype": "reads|variants" } entry in the htsget-specific part.

Why is it silly?

At heart, because it is not normalised (in the relational database sense).

The canonical values need to be registered somewhere, either in 1 attribute and 1 registry, or across 2 attributes and 2 registries. Why should we invent a separate htsget datatype registry when the TASC service info registry is already doing this? When we invent a third htsget type we will need to add a new value whether we use artifact only or artifact and datatype, I don't see how additional data types means putting everything in artifact won't work.

A registry is what you have when you have multiple groups collaborating in a shared namespace. If we have a separate htsget datatype as suggested, it would just be an enum within the htsget spec — no registry or external coordination is needed, because it's entirely controlled by a single group: us, htsget.

So if we invent a third htsget datatype, the choice is between going cap in hand to TASC and asking for htsget-foo to be added — or just writing foo alongside reads and variants in our htsget spec. So logistically the latter is obviously much easier, but that is only one aspect of which is “correct” in some sense.

The other aspect of this is: what exactly is a service-info artifact supposed to represent? There is not a lot of guidance in the service-info spec:

artifact:
  type: string
  description: 'Name of the API or GA4GH specification implemented. […]'
  example: 'beacon'

The name of the GA4GH specification implemented here is htsget. There is also the discussion of chained specifications (so htsget-reads could be considered to play the role of foo), but it is a stretch to call this a case of “multiple specifications”.

We can also be guided by what artifact values other groups have listed in TASC's registry.
For example, RNAget provides a number of endpoints returning different sorts of things (projects, studies, expressions, …), somewhat similarly to our reads and variants. They have listed a single rnaseq artifact value.
For example, DRS returns arbitrary types of data (perhaps with an identifying mime_type) and has listed a single drs artifact value.

regarding (1), I think the goal with the GA4GH service info endpoint is to provide the client with the API's identity from just the ServiceType attribute (ie. group, artifact, version).

I guess the real question a client hopes to answer by looking at an artifact value is “can I speak this protocol”. Regardless of whether an endpoint is returning reads or variants, our basic request is the same (“basic” meaning when datatype-specific fields like fields or tags are not used), the ticket response is identical, and the core mechanic is identical. So a client can speak the protocol regardless of whether reads or variants are being returned. Hence it would be appropriate for htsget to use a single artifact value too.


ServiceInfo:
type: object
'$ref': https://raw.githubusercontent.com/ga4gh-discovery/ga4gh-service-info/develop/service-info.yaml#/components/schemas/Service
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
'$ref': https://raw.githubusercontent.com/ga4gh-discovery/ga4gh-service-info/develop/service-info.yaml#/components/schemas/Service
'$ref': https://raw.githubusercontent.com/ga4gh-discovery/ga4gh-service-info/v1.0.0/service-info.yaml#/components/schemas/Service

@jmarshall
Copy link
Member

service-info is […] a way for data providers to let clients know about what features are implemented by their service. htsget is already quite flexible in that we don't require the service to implement multiple formats, field or tag inclusion/exclusion, but there should be a mechanism by which the service can let the client know what's implemented and what's not. […] Why don't we use service-info to put the onus on data providers to be explicit about what features they've implemented, rather than putting the onus on clients? The data provider will know better than the client what they've implemented, and it's simpler for them to relay this via service-info rather than making the client keep a record of what works and what doesn't for each service.

I completely agree. IMHO the best way to encourage data providers to fill in these service-info fields is by not providing a default.

@jb-adams
Copy link
Contributor Author

jb-adams commented Aug 5, 2020

I guess the real question a client hopes to answer by looking at an artifact value is “can I speak this protocol”. Regardless of whether an endpoint is returning reads or variants, our basic request is the same (“basic” meaning when datatype-specific fields like fields or tags are not used), the ticket response is identical, and the core mechanic is identical. So a client can speak the protocol regardless of whether reads or variants are being returned. Hence it would be appropriate for htsget to use a single artifact value too.

Ok, I'm good with this.

Previously in #493 (comment) I wrote:

The question of what endpoint to implement a service-info response at has been discussed at some length on the htsget mailing list. This question would need to be resolved and that resolution reflected in this PR before htsget service-info can progress.

This has still not been resolved. As service-info is not for service discovery, it seems to me that we would have the option of implementing service-info at URLs that did not conflict with our IDs, e.g. /reads?serviceinfo=serviceinfo and then listing those URLs as the service-info URLs in the server's service registry response at /services. Ideally such URLs would thereafter become blessed in the service-info spec, and this would IMHO be a very reasonable improvement to that spec.

Perhaps relating to the question of artifact values is how htsget service URLs will be stored in service registry. You mention here that service registry could list the htsget service's service-info endpoint (if the service-info endpoint is not prescribed to be /reads/service-info and /variants/service-info). But how would this work practically? How would a client be able to get the service's reads endpoint and variants endpoint? Will a single htsget service need to be listed multiple times in the registry? (ie. one entry for reads and one for variants)?

I brought this up to the Discovery Networks group. I can paraphrase the conversation in greater detail later, but the prevailing sentiment is that it will be very difficult to link htsget into a federated network (via a service registry) if the endpoints are not prescriptive. The endpoints for all other GA4GH API specs have prescriptive endpoints, therefore by knowing the base URL and the service type (indicated by artifact and version), a client can know what to request of that service. With htsget, it seems by not prescribing the /reads, /variants, or either of the service-info endpoints, it makes it difficult to represent this in service registry.

@jb-adams
Copy link
Contributor Author

Pushed new commit to address the following:

@jmarshall

IMHO it would be better to have a single "artifact": "htsget" value in the generic part of the service-info response, and a required "htsget: { "datatype": "reads|variants" } entry in the htsget-specific part.

Made this replacement in both yaml and markdown

  • formats is a useful hint as to what formats are liked by a server, but why is the default in its absence not “no assumptions about formats returned can be made ahead of actually making a query”?

Removed the default assumption, stating that clients cannot assume formats ahead of making the query without the params being provided

  • Why is the default for fieldsParameterEffective and tagsParametersEffective not “If absent, clients cannot make assumptions about whether fields/tags/notags will be effective”?

Removed the default assumptions, stating that clients cannot assume fieldsParameterEffective or tagsParametersEffective ahead of making the query without the params being provided.

Bumped reference URL to Service Info YAML from develop branch to v1.0.0 release

This has still not been resolved. As service-info is not for service discovery, it seems to me that we would have the option of implementing service-info at URLs that did not conflict with our IDs, e.g. /reads?serviceinfo=serviceinfo and then listing those URLs as the service-info URLs in the server's service registry response at /services. Ideally such URLs would thereafter become blessed in the service-info spec, and this would IMHO be a very reasonable improvement to that spec.

I've added a "GA4GH Service Registry" section in the Markdown (below GA4GH service-info), outlining what has been tentatively discussed in this group about how to register htgset services within a registry. It outlines that reads and variants should be registered distinctly, and that the value of the url property should be the service-info endpoint.

@brainstorm
Copy link
Contributor

I'm with @jb-adams analysis and feedback from the discovery group: having concrete endpoints is simpler and in the event of coming up with other endpoints I don't see that much overhead on explicitly defining it.

@mlin
Copy link
Member

mlin commented Sep 1, 2020

@brainstorm We had some back-and-forth on that point above & gravitated toward the one artifact, two datatype (for now) approach. I'm a bit reluctant to reverse at this point, but let us know if you feel strongly about it (I don't think anyone was forcefully arguing one way or the other). Thx

htsget.md Outdated
The [GA4GH Service Registry API specification](https://github.com/ga4gh-discovery/ga4gh-service-registry) allows information about GA4GH-compliant web services, including htsget services, to be aggregated into registries and made available via a standard API. The following considerations SHOULD be followed when registering htsget services within a service registry.

* Endpoints for different htsget data types should be registered as separate entities within the registry. If an htsget service provides both `reads` and `variants` data, both endpoints should be registered.
* The `url` property should reference the API's `service-info` endpoint for a single data type (i.e. an htsget reads API registration should provide the URL to the reads `service-info` endpoint, an htsget variants API registration should provide the URL to the variants `service-info` endpoint). Clients should be able to assume that by replacing the URL's "service-info" string with an object id, they will hit the corresponding `reads`/`variants` endpoint.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are we happy with this assumption? I might be missing something but:

  • If we're reserving /service-info so that it's not allowed as a sample ID, then it seems it would be fine to leave that off instead of making the client snip it.
  • If we're saying we could use this to declare the service info endpoint as something else like /service-info2 so that the server can still have a sample ID=service-info, we should specifically advise that client is supposed to strip whatever the last URL component is, and on the whole it seems unfortunate to make this an htsget-specific caveat)

My opinion is still that the first approach is a simple and basically acceptable way forward. If in future we motivate a change to service registry so that it admits a query string for service-info then I think we could relax the reservation constraint at that time.

(comments welcome from all, tagging @jmarshall @brainstorm @jb-adams)

Copy link
Contributor Author

@jb-adams jb-adams Sep 1, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm +1 for the first approach. Registering the base url (ie. without "service-info") is much more in line with the service info spec and the networks team's plans for federating services of mixed type. Requiring that "service-info" be reserved for this endpoint is simple and will lead to better interoperability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I totally second Jeremy.

mlin and others added 9 commits November 4, 2020 10:30
- collapsed htsgetReadsServiceInfo and htsgetVariantsServiceInfo into htsgetServiceInfo
…sEffective

- removed explicit statement saying it has no effect in variants endpoint
Co-authored-by: John Marshall <John.W.Marshall@glasgow.ac.uk>
…stry considerations

- removed 'htsget-reads' and 'htsget-variants', instead using a 'datatype' parameter to indicate 'reads' or 'variants'
- removed default assumptions when service-info parameters not provided
@jb-adams
Copy link
Contributor Author

jb-adams commented Nov 4, 2020

following the Oct 27th htsget meeting, we closed #495 . The above commits involved rebasing jb-adams:htsget/service-info onto samtools:master after the merge

@mlin
Copy link
Member

mlin commented Nov 25, 2020

Thanks @jb-adams, I'm 👍 to merging this

htsget.md Outdated
"type": {
"group": "org.ga4gh",
"artifact": "htsget",
"version": "1.2.0"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will this (and the second one a dozen lines below) need to be bumped to 1.2.1?

@jb-adams
Copy link
Contributor Author

I believe I got all references to the htsget spec version and changed them to 1.2.1

@mlin mlin merged commit 33dbb91 into samtools:master Mar 1, 2021
@jb-adams jb-adams deleted the htsget/service-info branch April 23, 2021 14:33
jmarshall added a commit to jmarshall/TASC that referenced this pull request Jul 30, 2021
Htsget implemented service-info in PR samtools/hts-specs#493.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants