Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reference Retrieval API Specification Merge #327

Merged
merged 1 commit into from
Aug 15, 2018
Merged

Reference Retrieval API Specification Merge #327

merged 1 commit into from
Aug 15, 2018

Conversation

andrewyatz
Copy link
Contributor

Adding the version 0.2 RR API specification. This spec defines how to query for a sequence by a checksum (/sequence/:id), get information about that checksum (/sequence/:id/metadata) and information about the capabilities service provides (/sequence/service-info).

This service is backwards compatible with the cram reference registry and can be used by CRAM and htslib without modification to either spec.

@andrewyatz
Copy link
Contributor Author

Intention is to squash/merge this request on Friday.

## Checksum calculation
The supported checksum algorithms are `MD5` and a SHA-512 based system called `TRUNC512` (see later for details). Servers MUST support sequence retrieval by one or more of these algorithms, and are encouraged to support all to maximize interoperability. To provide CRAM Reference Registry compatibility an implementation must support MD5.

When calculating the checksum for a sequence, all non-base symbols (\n, spaces, etc) must be removed and then uppercase the rest. The allowed alphabet for checksum calculation is uppercase ASCII (0x41-0x5A).
Copy link
Contributor

@jkbonfield jkbonfield Jul 31, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is different to the SAM specification of how MD5 is computed; see SAM spec "Reference MD5 calculation" section.

Specifically it strips out any character of 32 (space) or below and 127 (delete) and above, and then uppercases. This means punctuation remains as-is.

I don't know whether that is particularly sensible, but it's what the code actually did so the spec was written to match it! The practical situation is that it makes no difference for most data, but occasionally there may be something else.

I wonder if we also need to consider the methylation work being done elsewhere which uses 1,2,3,4 and 7 characters as well as lowercase letters (c vs C differ). This breaks things for everyone, but the API could be set up with an additional parameter that requests case sensitivity? (As long as there is room to add this, it doesn't need to go in now.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We did discuss this on the call with @daviesrob and from re-reading the minutes I think these are the pertinent points.

  • The original htslib code was pointed out but that for the most part those other characters are treated as N. The sum result was restrict to just A-Z and have encodings treat this as N (unspecified in the spec).
  • Main source of data for the RR API service are INSDC submitted sequences, which restrict to (?i)^([ACGTUMRWSYKVHDBN]+)\\*?$". Far more restrictive than what we've said here in the spec
  • The VMC spec, of which TRUNC512 was based on restricts to just IUPAC nucleotide or amino acid seqs
  • All source algorithms agreed on the uppercasing of sequence before being digested

That resulted in the decision to allow only upper-cased A-Z and this maintained API compatibility with CRAM RR not with htslib.

Given the other examples I'm not sure how some of the more obscure sequence would make it into the ENA implementation of this but I am somewhat worried now. @lairdm thoughts?

As for the methylation that really begins to mess with the ideas behind what sequence is :). I'd nearly be tempted to suggest this is an alternative checksum algorithm. What do others think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a fair comment - if the existing reference server has a set of additional restrictions over and above the SAM specification, then the difference is largely irrelevant.

Yes we can find data where the M5 tag in SAM mismatches the MD5 calculation by this API, but such references couldn't be used externally due to other restrictions. Seems fine then.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you feel it's worth adding a condensed version of this into the specification? I know the VMC does this to explain key design decisions to inform future implementors/maintainers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think a brief rationale could help.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brief rationale added

@jmarshall
Copy link
Member

Also needs to be added to index.md and README.md — compare ff9df33 or I can help with this (though I'm on holiday this week).

Has there been any consideration of giving this thing a catchy name, like when GA4GH Retrieval API became htsget?

@andrewyatz
Copy link
Contributor Author

@jmarshall added to index.md and README.md. There's been some thoughts but nothing has stuck yet. Any suggestions would be appreciated but I agree something catchier would be nice.

@andrewyatz
Copy link
Contributor Author

How does htsref strike people as a name?

@jkbonfield
Copy link
Contributor

I was going to suggest that infact, given that the htslib implementation caches data it downloads from EBI to .cache/hts-ref already. (The hyphen should be removed in the name for consistency with htsget IMO.)

@andrewyatz
Copy link
Contributor Author

Switched the name to htsref and updated the docs

@jmarshall
Copy link
Member

I've taken the liberty of adding a commit that formats the tables so that they are rendered correctly when converted to HTML as per the eventual https://samtools.github.io/hts-specs/refget.html. (You can preview this locally with jekyll serve -w though to be sure installing Jekyll is not unpainful.)

Also a couple of editorial items (described in the last paragraph of the commit message) that you may wish to pick and choose from.

@andrewyatz
Copy link
Contributor Author

Garrgghh stupid unicode quotes. Thank you all the edits look fantastic from my POV.

@jmarshall
Copy link
Member

jmarshall commented Aug 7, 2018

Further comments on the specification, which is generally looking in pretty good shape IMHO.

  • [§ Protocol essentials] Talks specifically about 302 redirection. Are other 30x response codes acceptable?

  • [§ Security] “The reference sequence retrieval API can enable […]” survived the search-&-replace to refget. Are reference chromosomes really “potentially sensitive genomic data” or could this be worded better?

  • [§ CORS] Link is to a Google Docs document, should at least be a view URL, not an edit URL surely?!

  • [§ Method: get sequence by ID] Second paragraph talks about 200 / 206 / 302. If a request with a Range header gets redirected, presumably the client should also put the Range header on the subsequent request? (As opposed to the server being allowed to encode the range into the Location URL. The answer here wasn't immediately obvious in the underlying RFCs, but I'm guessing it's this way around and possibly this spec should spell it out.)

  • [§ Query parameters] What happens if start == end? Is it different for a circular chromosome?

  • [§ Request parameters :: Range] Third sentence clearer as “The server MUST respond with a Bad Request error if both a Range header and start or end query parameters are specified.”

  • [§ Response] “with no carriage returns” can be interpreted as forbidding Windows-style CR-LF. Clearer as “with no line terminators or other formatting characters”.

  • [§ Method: Get known metadata for an id :: Response]

    • This is JSON, so length doesn't need to say “decimal”.
    • Is there a controlled vocabulary for naming_authority? Even just a few well-known ones whose canonical spelling/capitalisation could be listed here?
  • [§ Method: Fetch information on the service :: Response]

    • For subsequence_limit: if no limit is imposed, can the server just omit this object field, or must it be present with a value of null (so say integer or null as its type)?
  • [§ Range headers] “GA4GH has a standard of 0-based, half-open coordinates” (‘based’; hyphenate) is the usual way to describe this.

  • I didn't examine the Python or Perl code.

  • [§ Rationale]

What is refget-openapi.yaml? Is it a sample of some document that servers will want to provide? I'd suggest putting it down in the pub/ subdirectory and describing and linking to it in the spec.

The MAINTAINERS.md addition still calls it Reference Retrieval API.

README.md says “Refget.md enables bulk access to reference sequences […]”, but refget.md itself enables no such thing — it's just a document. You might consider something like the phrasing in the htsget.md paragraph.

@andrewyatz
Copy link
Contributor Author

andrewyatz commented Aug 7, 2018 via email

@andrewyatz
Copy link
Contributor Author

andrewyatz commented Aug 7, 2018

Responses inline

[§ Protocol essentials] Talks specifically about 302 redirection. Are other 30x response codes acceptable?

We've changed this to 303. It looks more suitable

[§ Security] “The reference sequence retrieval API can enable […]” survived the search-&-replace to refget. Are reference chromosomes really “potentially sensitive genomic data” or could this be worded better?

We've attempted an improved section

[§ CORS] Link is to a Google Docs document, should at least be a view URL, not an edit URL surely?!

They seem to come through as an edit link no matter what

[§ Method: get sequence by ID] Second paragraph talks about 200 / 206 / 302. If a request with a Range header gets redirected, presumably the client should also put the Range header on the subsequent request? (As opposed to the server being allowed to encode the range into the Location URL. The answer here wasn't immediately obvious in the underlying RFCs, but I'm guessing it's this way around and possibly this spec should spell it out.)

Indeed, Matt dug through the RFCs for the answer and it's not explicitly spelled out. There are discussion board posts and the implementation of every client I've played with is to resend the query with the same headers, just to the new location. It also seems like the intuitive thing to do, why would you drop/change the headers when simply changing the request location. I'm not sure it needs explicit spelling out as it feels redundant to what http client libraries already seem to do.

[§ Query parameters] What happens if start == end? Is it different for a circular chromosome?

An empty string is returned. Clarified in the text

[§ Request parameters :: Range] Third sentence clearer as “The server MUST respond with a Bad Request error if both a Range header and start or end query parameters are specified.”
[§ Response] “with no carriage returns” can be interpreted as forbidding Windows-style CR-LF. > Clearer as “with no line terminators or other formatting characters”.

Both changed accordingly

[§ Method: Get known metadata for an id :: Response]

This is JSON, so length doesn't need to say “decimal”.

JSON spec says number so we'll go with that

Is there a controlled vocabulary for naming_authority? Even just a few well-known ones whose canonical spelling/capitalisation could be listed here?

Added an appendix to solve this

[§ Method: Fetch information on the service :: Response]
For subsequence_limit: if no limit is imposed, can the server just omit this object field, or must it be present with a value of null (so say integer or null as its type)?

It should be null (so the type has been changed)

[§ Range headers] “GA4GH has a standard of 0-based, half-open coordinates” (‘based’; hyphenate) is the usual way to describe this.

Changed

I didn't examine the Python or Perl code.
Okay
[§ Rationale]
What is VMC?

Added a short description and links

The rationale on 0-based coordinates is too weak. This is a formally promulgated GA4GH policy. See threads from https://twitter.com/jomarnz/status/920259187057324032.

Tried to improve

What is refget-openapi.yaml? Is it a sample of some document that servers will want to provide? I'd suggest putting it down in the pub/ subdirectory and describing and linking to it in the spec.

It's the OpenAPI 3 version of the specification. It's not meant for servers to provide it but to be a reference description that is complementary to refget.md. I can move if you'd prefer it to be moved

The MAINTAINERS.md addition still calls it Reference Retrieval API.

Done

README.md says “Refget.md enables bulk access to reference sequences […]”, but refget.md itself enables no such thing — it's just a document. You might consider something like the phrasing in the htsget.md paragraph.

Altered by removing the word bulk

@andrewyatz
Copy link
Contributor Author

Okay assuming that people are happy with these changes then I'll go for the merge later on today (probably close to 5pm BST).

A big thank you to everyone's comments and help on getting the specification to this point.

@jmarshall
Copy link
Member

Thanks. I've pushed another commit for your consideration that fixes typos, makes the openapi link relative, and removes the link to my tweet which I would prefer not to be linked from a specification.

Thanks for moving that file down to pub/, which is for ancillary files that need to be jekylled into the https://samtools.github.io/hts-specs/ website but don't want to clutter the top-level repo directory.

Responses to some of your responses, for what they're worth:

[§ Protocol essentials] Talks specifically about 302 redirection. Are other 30x response codes acceptable?

We've changed this to 303. It looks more suitable

I don't really understand why refget would specify this. Isn't the choice of redirect flavour up to the server? And I would have thought 302/307 would be the usual ones. Why not just leave it open by writing as 30x or the equivalent?

[On start == end] An empty string is returned.

FYI htsget considers this an error.

JSON spec says number so we'll go with that

A JSON number is floating point. You probably do want to stick with integer, which is also a named thing in the JSON grammar (though spelt as int there).

rationale on 0-based coordinates

IMHO this is now worse than before. The reason for using 0-based half-open coordinates is that that is the sensible representation for doing arithmetic, as discussed at ga4gh/ga4gh-schemas#121 and elsewhere, and that is why it was GA4GH policy at the time the htsget spec was written. Apparently you folks disagree, but unless you have evidence that this policy changed, “Whilst this is not a GA4GH standard” seems unjustified.

And I would prefer that the specification did not contain a link to a tweet of mine.

README.md says “Refget.md enables bulk access to reference sequences […]”, but refget.md itself enables no such thing — it's just a document. You might consider something like the phrasing in the htsget.md paragraph.

I wasn't objecting to “bulk” 😄 The (perhaps esoteric) point was that the sentence says “The spec enables bulk access to […]” but what it wants to say is “The protocol described in the spec enables bulk access to […]”.

@andrewyatz
Copy link
Contributor Author

Thanks. I've pushed another commit for your consideration that fixes typos, makes the openapi link relative, and removes the link to my tweet which I would prefer not to be linked from a specification.

Okay no problems

Responses to some of your responses, for what they're worth:

[§ Protocol essentials] Talks specifically about 302 redirection. Are other 30x response codes acceptable?

We've changed this to 303. It looks more suitable

I don't really understand why refget would specify this. Isn't the choice of redirect flavour up to the server? And I would have thought 302/307 would be the usual ones. Why not just leave it open by writing as 30x or the equivalent?

We did consider that but reading through the HTTP status codes 303 is the right one. Since in neither situation is the resources 302 Found or 307 Temporary Redirect. 303 See Other seemed to be more correct.

[On start == end] An empty string is returned.
FYI htsget considers this an error.

Okay. @lairdm what do you think

JSON spec says number so we'll go with that
A JSON number is floating point. You probably do want to stick with integer, which is also a named thing in the JSON grammar (though spelt as int there).

Okay I can see from https://www.json.org/ the hierarchy. Yes we want int. Will change this.

rationale on 0-based coordinates
IMHO this is now worse than before. The reason for using 0-based half-open coordinates is that that is the sensible representation for doing arithmetic, as discussed at ga4gh/ga4gh-schemas#121 and elsewhere, and that is why it was GA4GH policy at the time the htsget spec was written. Apparently you folks disagree, but unless you have evidence that this policy changed, “Whilst this is not a GA4GH standard” seems unjustified.
And I would prefer that the specification did not contain a link to a tweet of mine.

Okay this is a mis-understanding of mine. For some reason I thought Beacon was 1-based fully-closed but that's not the case. Also though where has Global Alliance said this. I'm not starting a fight, actually far from it, just I'd prefer to reference that. In-fact if this is a GA4GH standard then why not delete that section. There's no rationale to give to the choice. It just is

README.md says “Refget.md enables bulk access to reference sequences […]”, but refget.md itself enables no such thing — it's just a document. You might consider something like the phrasing in the htsget.md paragraph.
I wasn't objecting to “bulk” 😄 The (perhaps esoteric) point was that the sentence says “The spec enables bulk access to […]” but what it wants to say is “The protocol described in the spec enables bulk access to […]”.

Okay

@jmarshall
Copy link
Member

We did consider [30x] but reading through the HTTP status codes 303 is the right one

I don't see that at all (RFC 7231 §6.4.4/3 says 303 results in something merely similar and hopefully useful), but I'm happy to revisit this as an issue after this PR has been merged.

[Re rationale on 0-based coordinates] Also though where has Global Alliance said this

Good question. The last big discussion was in ga4gh/ga4gh-schemas#121; previously there was ga4gh/ga4gh-schemas#49 in which after great agonising the (now legacy) schemas were converted to 0-based half-open. As I recall, at the time, this was called a policy so that we wouldn't have to rehash the discussion again — and that's what the note at the top of the htsget spec reflects. I am sure this appeared in slides etc at the time, but the marketing ga4gh.org web site is not the sort of site that archives historical technical information… so who knows. Perhaps someone else recalls something.

@andrewyatz
Copy link
Contributor Author

Okay I'll review this later. As for the 0-based rationale for the moment I'll just replicate what htsget has said. I also don't think we need to re-discuss this but GA4GH really needs to have this kind of decision record in an easy to reference place (whilst slowly looking in the mirror and realising it'll be him who pushes for it) ...

@jmarshall
Copy link
Member

I imagine there's going to be a bit of squashing before this gets merged in? (Hopefully merging master into here won't have complicated squashing too much…)

@andrewyatz
Copy link
Contributor Author

My PR has just been rebased against hts-specs/master and squashed into a single commit. Just about to push it into my repo now so you should see the changes very soon

@andrewyatz
Copy link
Contributor Author

Okay this is now ready to be merged in. Happy to hit the button myself or let someone else do the deed

@jmarshall
Copy link
Member

Confirmed that the tables etc are still okay when jekylled into HTML as it will appear on http://samtools.github.io/hts-specs.

My final pedantic comment is that maybe you could summarise the commit message. Most of the intermediate paragraphs (update this; update that; fix typo; remove tweet) are irrelevant now that the intermediate states have been squashed out of existence.

@andrewyatz
Copy link
Contributor Author

andrewyatz commented Aug 14, 2018 via email

@andrewyatz
Copy link
Contributor Author

Okay very verbose messages have now been squashed into a single message

@jmarshall
Copy link
Member

Excellent, thanks.

You know I hate to complain yet again, but: thanks for putting in as discussed in PR#327, but to make GitHub autolink it the # needs to follow a space: …in PR #327.

(And the original commit message noted that this came from previous text in a Google doc, which seems worth mentioning. Compare 5c25669.)

Text as exported from previous Google Docs document:
https://docs.google.com/document/d/1q2ZE9YewJTpaqQg82Nrz_jVy8KsDpKoG1T8RvCAAsbI
as of 27th July 2018 (last edit was made 25th June 2018). Format was
changed into Markdown with HTML.

This is version 0.2 of the refget specification. Refget supports the
retrieval of sequences using an identifier generated by a hash digest of
the full length underlying sequence. It also supports retrieval of
metadata concerning this identifier including sequence length and
aliases known to the server.
@andrewyatz
Copy link
Contributor Author

Okay done the edits

@jmarshall
Copy link
Member

Thanks. As the complainer-in-chief for today 😄, I have now hit the merge button.

Welcome to the fold! The refget specification is now available at http://samtools.github.io/hts-specs/refget.html.

@andrewyatz
Copy link
Contributor Author

Thank you for all the help as well! I do feel the specification can only benefit from this scrutiny

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants