-
Notifications
You must be signed in to change notification settings - Fork 174
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reference Retrieval API Specification Merge #327
Conversation
Intention is to squash/merge this request on Friday. |
reference_retrieval_api.md
Outdated
## Checksum calculation | ||
The supported checksum algorithms are `MD5` and a SHA-512 based system called `TRUNC512` (see later for details). Servers MUST support sequence retrieval by one or more of these algorithms, and are encouraged to support all to maximize interoperability. To provide CRAM Reference Registry compatibility an implementation must support MD5. | ||
|
||
When calculating the checksum for a sequence, all non-base symbols (\n, spaces, etc) must be removed and then uppercase the rest. The allowed alphabet for checksum calculation is uppercase ASCII (0x41-0x5A). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is different to the SAM specification of how MD5 is computed; see SAM spec "Reference MD5 calculation" section.
Specifically it strips out any character of 32 (space) or below and 127 (delete) and above, and then uppercases. This means punctuation remains as-is.
I don't know whether that is particularly sensible, but it's what the code actually did so the spec was written to match it! The practical situation is that it makes no difference for most data, but occasionally there may be something else.
I wonder if we also need to consider the methylation work being done elsewhere which uses 1,2,3,4 and 7 characters as well as lowercase letters (c vs C differ). This breaks things for everyone, but the API could be set up with an additional parameter that requests case sensitivity? (As long as there is room to add this, it doesn't need to go in now.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We did discuss this on the call with @daviesrob and from re-reading the minutes I think these are the pertinent points.
- The original htslib code was pointed out but that for the most part those other characters are treated as
N
. The sum result was restrict to just A-Z and have encodings treat this as N (unspecified in the spec). - Main source of data for the RR API service are INSDC submitted sequences, which restrict to
(?i)^([ACGTUMRWSYKVHDBN]+)\\*?$"
. Far more restrictive than what we've said here in the spec - The VMC spec, of which
TRUNC512
was based on restricts to just IUPAC nucleotide or amino acid seqs - All source algorithms agreed on the uppercasing of sequence before being digested
That resulted in the decision to allow only upper-cased A-Z
and this maintained API compatibility with CRAM RR not with htslib.
Given the other examples I'm not sure how some of the more obscure sequence would make it into the ENA implementation of this but I am somewhat worried now. @lairdm thoughts?
As for the methylation that really begins to mess with the ideas behind what sequence is :). I'd nearly be tempted to suggest this is an alternative checksum algorithm. What do others think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a fair comment - if the existing reference server has a set of additional restrictions over and above the SAM specification, then the difference is largely irrelevant.
Yes we can find data where the M5 tag in SAM mismatches the MD5 calculation by this API, but such references couldn't be used externally due to other restrictions. Seems fine then.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you feel it's worth adding a condensed version of this into the specification? I know the VMC does this to explain key design decisions to inform future implementors/maintainers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a brief rationale could help.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Brief rationale added
Also needs to be added to index.md and README.md — compare ff9df33 or I can help with this (though I'm on holiday this week). Has there been any consideration of giving this thing a catchy name, like when GA4GH Retrieval API became htsget? |
@jmarshall added to index.md and README.md. There's been some thoughts but nothing has stuck yet. Any suggestions would be appreciated but I agree something catchier would be nice. |
How does |
I was going to suggest that infact, given that the htslib implementation caches data it downloads from EBI to |
Switched the name to htsref and updated the docs |
I've taken the liberty of adding a commit that formats the tables so that they are rendered correctly when converted to HTML as per the eventual https://samtools.github.io/hts-specs/refget.html. (You can preview this locally with Also a couple of editorial items (described in the last paragraph of the commit message) that you may wish to pick and choose from. |
Garrgghh stupid unicode quotes. Thank you all the edits look fantastic from my POV. |
Further comments on the specification, which is generally looking in pretty good shape IMHO.
What is refget-openapi.yaml? Is it a sample of some document that servers will want to provide? I'd suggest putting it down in the pub/ subdirectory and describing and linking to it in the spec. The MAINTAINERS.md addition still calls it Reference Retrieval API. README.md says “Refget.md enables bulk access to reference sequences […]”, but refget.md itself enables no such thing — it's just a document. You might consider something like the phrasing in the htsget.md paragraph. |
I'll start working through these and will come back with any
clarifications. Thank you
…On Tue, 7 Aug 2018 at 12:45, John Marshall ***@***.***> wrote:
Further comments on the specification, which is generally looking in
pretty good shape IMHO.
-
[§ Protocol essentials] Talks specifically about 302 redirection. Are
other 30x response codes acceptable?
-
[§ Security] “The reference sequence retrieval API can enable […]”
survived the search-&-replace to refget. Are reference chromosomes really
“potentially sensitive genomic data” or could this be worded better?
-
[§ CORS] Link is to a Google Docs document, should at least be a view
URL, not an edit URL surely?!
-
[§ Method: get sequence by ID] Second paragraph talks about 200 / 206
/ 302. If a request with a Range header gets redirected, presumably the
client should also put the Range header on the subsequent request? (As
opposed to the server being allowed to encode the range into the
Location URL. The answer here wasn't immediately obvious in the
underlying RFCs, but I'm guessing it's this way around and possibly this
spec should spell it out.)
-
[§ Query parameters] What happens if start == end? Is it different for
a circular chromosome?
-
[§ Request parameters :: Range] Third sentence clearer as “The server
MUST respond with *a* Bad Request error if *both a* Range *header* and
start or end *query parameters* are specified.”
-
[§ Response] “with no carriage returns” can be interpreted as
forbidding Windows-style CR-LF. Clearer as “with no line terminators or
other formatting characters”.
-
[§ Method: Get known metadata for an id :: Response]
- This is JSON, so length doesn't need to say “decimal”.
- Is there a controlled vocabulary for naming_authority? Even just
a few well-known ones whose canonical spelling/capitalisation could be
listed here?
-
[§ Method: Fetch information on the service :: Response]
- For subsequence_limit: if no limit is imposed, can the server just
omit this object field, or must it be present with a value of null
(so say *integer or null* as its type)?
-
[§ Range headers] “GA4GH has a standard of *0-based, half-open*
coordinates” (‘based’; hyphenate) is the usual way to describe this.
-
I didn't examine the Python or Perl code.
-
[§ Rationale]
- What is VMC?
- The rationale on 0-based coordinates is too weak. This is a
formally promulgated GA4GH policy. See threads from
https://twitter.com/jomarnz/status/920259187057324032.
What is *htsref-openapi.yaml*? Is it a sample of some document that
servers will want to provide? I'd suggest putting it down in the *pub/*
subdirectory and describing and linking to it in the spec.
The *MAINTAINERS.md* addition still calls it Reference Retrieval API.
*README.md* says “Refget.md enables bulk access to reference sequences
[…]”, but *refget.md* itself enables no such thing — it's just a
document. You might consider something like the phrasing in the htsget.md
paragraph.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#327 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAK5GQ15B9Hq0Cjea7qMudzbf-5Lu5_Gks5uOX3UgaJpZM4VhZJb>
.
|
Responses inline
We've changed this to 303. It looks more suitable
We've attempted an improved section
They seem to come through as an edit link no matter what
Indeed, Matt dug through the RFCs for the answer and it's not explicitly spelled out. There are discussion board posts and the implementation of every client I've played with is to resend the query with the same headers, just to the new location. It also seems like the intuitive thing to do, why would you drop/change the headers when simply changing the request location. I'm not sure it needs explicit spelling out as it feels redundant to what http client libraries already seem to do.
An empty string is returned. Clarified in the text
Both changed accordingly
JSON spec says number so we'll go with that
Added an appendix to solve this
It should be null (so the type has been changed)
Changed
Added a short description and links
Tried to improve
It's the OpenAPI 3 version of the specification. It's not meant for servers to provide it but to be a reference description that is complementary to refget.md. I can move if you'd prefer it to be moved
Done
Altered by removing the word bulk |
Okay assuming that people are happy with these changes then I'll go for the merge later on today (probably close to 5pm BST). A big thank you to everyone's comments and help on getting the specification to this point. |
Thanks. I've pushed another commit for your consideration that fixes typos, makes the openapi link relative, and removes the link to my tweet which I would prefer not to be linked from a specification. Thanks for moving that file down to pub/, which is for ancillary files that need to be jekylled into the https://samtools.github.io/hts-specs/ website but don't want to clutter the top-level repo directory. Responses to some of your responses, for what they're worth:
I don't really understand why refget would specify this. Isn't the choice of redirect flavour up to the server? And I would have thought 302/307 would be the usual ones. Why not just leave it open by writing as 30x or the equivalent?
FYI htsget considers this an error.
A JSON
IMHO this is now worse than before. The reason for using 0-based half-open coordinates is that that is the sensible representation for doing arithmetic, as discussed at ga4gh/ga4gh-schemas#121 and elsewhere, and that is why it was GA4GH policy at the time the htsget spec was written. Apparently you folks disagree, but unless you have evidence that this policy changed, “Whilst this is not a GA4GH standard” seems unjustified. And I would prefer that the specification did not contain a link to a tweet of mine.
I wasn't objecting to “bulk” 😄 The (perhaps esoteric) point was that the sentence says “The spec enables bulk access to […]” but what it wants to say is “The protocol described in the spec enables bulk access to […]”. |
Okay no problems
We did consider that but reading through the HTTP status codes 303 is the right one. Since in neither situation is the resources
Okay. @lairdm what do you think
Okay I can see from https://www.json.org/ the hierarchy. Yes we want
Okay this is a mis-understanding of mine. For some reason I thought Beacon was 1-based fully-closed but that's not the case. Also though where has Global Alliance said this. I'm not starting a fight, actually far from it, just I'd prefer to reference that. In-fact if this is a GA4GH standard then why not delete that section. There's no rationale to give to the choice. It just is
Okay |
I don't see that at all (RFC 7231 §6.4.4/3 says 303 results in something merely similar and hopefully useful), but I'm happy to revisit this as an issue after this PR has been merged.
Good question. The last big discussion was in ga4gh/ga4gh-schemas#121; previously there was ga4gh/ga4gh-schemas#49 in which after great agonising the (now legacy) schemas were converted to 0-based half-open. As I recall, at the time, this was called a policy so that we wouldn't have to rehash the discussion again — and that's what the note at the top of the htsget spec reflects. I am sure this appeared in slides etc at the time, but the marketing ga4gh.org web site is not the sort of site that archives historical technical information… so who knows. Perhaps someone else recalls something. |
Okay I'll review this later. As for the 0-based rationale for the moment I'll just replicate what htsget has said. I also don't think we need to re-discuss this but GA4GH really needs to have this kind of decision record in an easy to reference place (whilst slowly looking in the mirror and realising it'll be him who pushes for it) ... |
I imagine there's going to be a bit of squashing before this gets merged in? (Hopefully merging master into here won't have complicated squashing too much…) |
My PR has just been rebased against hts-specs/master and squashed into a single commit. Just about to push it into my repo now so you should see the changes very soon |
Okay this is now ready to be merged in. Happy to hit the button myself or let someone else do the deed |
Confirmed that the tables etc are still okay when jekylled into HTML as it will appear on http://samtools.github.io/hts-specs. My final pedantic comment is that maybe you could summarise the commit message. Most of the intermediate paragraphs (update this; update that; fix typo; remove tweet) are irrelevant now that the intermediate states have been squashed out of existence. |
Okay :). Will do this ASAP
…On Tue, 14 Aug 2018 at 15:06, John Marshall ***@***.***> wrote:
Confirmed that the tables etc are still okay when jekylled into HTML as it
will appear on http://samtools.github.io/hts-specs.
My final pedantic comment is that maybe you could summarise the commit
message. Most of the intermediate paragraphs (update this; update that; fix
typo; remove tweet) are irrelevant now that the intermediate states have
been squashed out of existence.
—
You are receiving this because you authored the thread.
Reply to this email directly, view it on GitHub
<#327 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAK5GSer1on_tHIE6W6RNzmIo-YMVhz2ks5uQtjvgaJpZM4VhZJb>
.
|
Okay very verbose messages have now been squashed into a single message |
Excellent, thanks. You know I hate to complain yet again, but: thanks for putting in (And the original commit message noted that this came from previous text in a Google doc, which seems worth mentioning. Compare 5c25669.) |
Text as exported from previous Google Docs document: https://docs.google.com/document/d/1q2ZE9YewJTpaqQg82Nrz_jVy8KsDpKoG1T8RvCAAsbI as of 27th July 2018 (last edit was made 25th June 2018). Format was changed into Markdown with HTML. This is version 0.2 of the refget specification. Refget supports the retrieval of sequences using an identifier generated by a hash digest of the full length underlying sequence. It also supports retrieval of metadata concerning this identifier including sequence length and aliases known to the server.
Okay done the edits |
Thanks. As the complainer-in-chief for today 😄, I have now hit the merge button. Welcome to the fold! The refget specification is now available at http://samtools.github.io/hts-specs/refget.html. |
Thank you for all the help as well! I do feel the specification can only benefit from this scrutiny |
Adding the version 0.2 RR API specification. This spec defines how to query for a sequence by a checksum (
/sequence/:id
), get information about that checksum (/sequence/:id/metadata
) and information about the capabilities service provides (/sequence/service-info
).This service is backwards compatible with the cram reference registry and can be used by CRAM and htslib without modification to either spec.