Header and data boolean flags for htsget. #311
Does the proposal imply that (BAM, VCF, etc) headers are included in each data block by default, and excluded via a flag? If so, I think it could be easier to separate the header and the data URLs in the htsget response. That way repeated downloads would only occur if the client retrieves the contents of the header URL more than once.
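A minimal sketch of how a client might consume such a response, assuming each URL object in the ticket carried a tag saying whether it is a header or a data payload. The `class` field name, its values, and the example.org URLs are illustrative assumptions, not wording taken from the proposal:

```python
import json

# Hypothetical ticket: the "class" field on each URL object is an assumption
# made for illustration, as are the example.org URLs.
TICKET = json.loads("""
{
  "htsget": {
    "format": "BAM",
    "urls": [
      {"url": "https://example.org/data/header", "class": "header"},
      {"url": "https://example.org/data/block1", "class": "body"},
      {"url": "https://example.org/data/block2", "class": "body"}
    ]
  }
}
""")


def split_urls(ticket):
    """Partition ticket URLs into (header_urls, body_urls), preserving order."""
    urls = ticket["htsget"]["urls"]
    header = [u["url"] for u in urls if u.get("class") == "header"]
    body = [u["url"] for u in urls if u.get("class") != "header"]
    return header, body


header_urls, body_urls = split_urls(TICKET)
```

With this layout a client that already holds the header would fetch only `body_urls`, and a client that wants just the header would fetch only `header_urls`.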
Maybe 'metadata' instead of 'headers', and 'body' instead of 'data'? Specifying 'HTTP headers' instead of just 'headers' could help. |
Terminology-wise, within the files we often talk about headers vs. records. So:
|
The major problem with the body-only part of this is, as I think we've discussed in the past, a file with no headers is not itself a valid BAM file; similarly for VCF and BCF files. “Headers-only” is just a BAM/CRAM/BCF file that has no records, which is trivially valid. So that's easy. The converse is not true, which is why we've punted on it in the past. It may finally be time to have that conversation, but
|
I feel the potential gain outweighs the purity of data-only not being well formed. The PR already states that normally both will be present for a well formed file, and gives an example where requesting body without header may be an appropriate thing to do (i.e. we already have the header and it's a sizeable overhead if we have a small query region). So I don't think this is a major issue. It's one we proposed very early on as I recall (I remember it from back when I was part of the htsget group). I don't think we need any more for your part 2 other than to state that parsers may need to prepend the data stream with some previously downloaded headers. (I say "may" as the alternative is a parser that can save state and restart from just after the headers have been parsed.) I would add however that we need to be clear which headers are involved. CRAM has multiple tiers - the SAM header, the container header, a compression header and a slice header. It's only the first which makes sense to omit. |
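The "prepend previously downloaded headers" step could look roughly like the sketch below. It assumes the server's header payload and body payloads are designed to be concatenated byte-for-byte into a well-formed stream; the function names and the placeholder byte strings are hypothetical:

```python
import io


def with_cached_header(header_bytes, body_chunks):
    """Yield the cached header once, then each body chunk, as one byte stream."""
    yield header_bytes
    for chunk in body_chunks:
        yield chunk


def as_file(header_bytes, body_chunks):
    """Join header and body into a file-like object a parser can read from."""
    return io.BytesIO(b"".join(with_cached_header(header_bytes, body_chunks)))


# Placeholder payloads standing in for real header and record blocks.
cached_header = b"<header>"
body = [b"<records-1>", b"<records-2>"]
stream = as_file(cached_header, body)
```

The header is downloaded once and reused for every later body-only query against the same resource.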
This is a neat idea... I guess it comes down to whether we want to add a flag to the URL objects to indicate whether payloads are header or data objects. It puts some restrictions on servers though I guess, since they now have to be sure to separate the two. Also, in the case where a client just wants to get the header from some big file, the server is going to spend a lot of effort encoding hundreds of URLs into the JSON response which are just going to be ignored by the client. The client saying 'all I want is the header' up front seems a bit cleaner.
@jmarshall: This is true, but I think I agree with @jkbonfield here. If the client explicitly asks for body-only, then I think it's fair enough to assume that the client either knows how to consume the raw stream or knows that it should prepend an earlier downloaded header to the stream before passing it on to someone else. It's a valid point though about whether we should bother with the body-only bit of this right now... Re terminology: I don't have strong opinions about what the terms should be, and happy to go with what others think. |
Pinging @jrobinso: would you mind taking a look at this proposal and seeing if it meets your requirements re fetching only headers or body data? I was mostly thinking about you as the potential user for this, so it'd be great to get your thoughts on it. |
The major need here is to be able to fetch the header before doing a query, since that is the only way to know what sequences are present in the resource. The equivalent of a -H flag in samtools. WRT what to return for CRAM files, really only the "SAM" format type header is required. I haven't tested samtools -H with a CRAM file but I imagine that is what is returned. Fetching only the body would be a performance improvement, not strictly required. I would point out that fetching just the "data" via a query is the standard behavior for samtools; you have to specifically ask for the header with a -h flag. I'm sure Picard has the equivalent. Again, this is a well and long established pattern; nothing new needs to be invented here. WRT what to return, if you decide to implement this part I would return exactly what samtools does, since many pipelines and tools are already written that use samtools to do queries. |
To refine and clarify, IGV really just needs the sequence dictionary. I suggested the entire header because that is what samtools does, but I don't use the rest of it. But for the sake of tools built around samtools I think you should do what it (samtools) does. |
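For illustration, here is a minimal sketch of pulling the sequence dictionary out of a SAM text header once a header-only response has been decoded to text. It relies only on the `@SQ` lines with their `SN` and `LN` tags; the function name and the example header are hypothetical:

```python
def sequence_dictionary(sam_header_text):
    """Return [(name, length), ...] from the @SQ lines of a SAM text header."""
    seqs = []
    for line in sam_header_text.splitlines():
        if not line.startswith("@SQ\t"):
            continue
        # Each remaining field is TAG:VALUE; split on the first colon only.
        fields = dict(f.split(":", 1) for f in line.split("\t")[1:] if ":" in f)
        seqs.append((fields["SN"], int(fields["LN"])))
    return seqs


EXAMPLE_HEADER = (
    "@HD\tVN:1.6\tSO:coordinate\n"
    "@SQ\tSN:chr1\tLN:248956422\n"
    "@SQ\tSN:chr2\tLN:242193529\n"
)
seqs = sequence_dictionary(EXAMPLE_HEADER)
```

A client such as a genome browser could use the resulting names and lengths to decide which regions are valid to query.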
Thanks @jrobinso, this is very helpful. |
@cyenyxe: That's a rather nice alternative approach. Another alternative approach for the headers-only request is to follow the lead of our reference-retrieval colleagues, who have separate endpoints for sequences and just the metadata:
The exact analogue is more difficult for us because we don't constrain our Edited to add: To avoid ambiguities due to our full-path ids, we could have endpoints like
but this is really no more satisfying than this PR's proposed query parameter approach. |
In the meeting on 16/05 we decided to reduce the scope of this back to requesting header data only. I'll update the PR soon. |
e5888af to 5358ff0
I've updated this to have a Once that's done, adding this |
Rather than Then we would be able to reuse this parameter for other purposes too. For example PR #320 could use this as |
Seems like a good idea to me @jmarshall. The word |
It doesn't feel quite right to me either, so let's all keep ruminating… 😄 If @cyenyxe's upcoming data-only PR adds to the JSON ticket in the way I'm vaguely anticipating, we could reuse its wording: I think that'll be class=header which might not feel quite right for this either… |
Any of these suggestions would work for me, but I suggest you consider
supporting the samtools options, specifically -H for this instance.
Many tools are built on samtools and/or pysam. I imagine htsget
implementations would be built around these as well.
…On Wed, Jun 6, 2018 at 12:42 PM, John Marshall ***@***.***> wrote:
It doesn't feel quite right to me either, so let's all keep ruminating… 😄
If @cyenyxe <https://github.com/cyenyxe>'s upcoming data-only PR adds to
the JSON ticket in the way I'm vaguely anticipating, we could reuse its
wording: I think that'll be class=header which might not feel quite right
for this either…
|
Sorry I hit send prematurely. My point is that supporting samtools flag
equivalents would satisfy the needs of tools built on samtools (or pysam)
by definition, and should be trivial to implement for htsget
implementations built on samtools or htslib. This is a large number of
potential htsget clients. I suppose you could make the same argument for
Picard flags, I am just less familiar with those.
However, for IGV purposes any option that would return the sequence
dictionary in some form would suffice.
|
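This is about how to specify these things in the underlying HTTP-based protocol. When samtools is being used as a htsget client, it would be able to internally translate -H being specified into adding &class=header or whatever onto the end of the URL it was requesting and so take advantage of only transmitting the headers. (At the moment it wouldn't do that because it wouldn't know ahead of time that a given URL was an htsget URL, but there could be ways of telling it that, such as a command-line option.) Similarly Picard could translate its options into adding the appropriate bits and pieces onto htsget URLs when htsget is in use. |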
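As a rough sketch of that client-side translation, the helper below appends a header-only query parameter to an htsget request URL. The `class=header` name and value follow the wording floated in this thread ("or whatever") and were not settled at this point; the function name and the example.org URL are hypothetical:

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_header_only(url, name="class", value="header"):
    """Return url with name=value set in its query string, replacing any old value."""
    parts = urlsplit(url)
    query = [
        (k, v)
        for k, v in parse_qsl(parts.query, keep_blank_values=True)
        if k != name
    ]
    query.append((name, value))
    return urlunsplit(parts._replace(query=urlencode(query)))


header_url = with_header_only("https://example.org/reads/NA12878?format=BAM")
```

A tool that wraps an option like -H would call such a helper before issuing the request, so only the header is transferred.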
I wasn't thinking about samtools or Picard as the client, but other tools
built on them that already know, for example, how to process the results
returned from "-H". To ease adoption of htsget it would be helpful if it
had an equivalent flag that returned the same things. That's all, nothing
complicated, and probably not a big deal either way.
…On Thu, Jun 7, 2018 at 10:31 AM, John Marshall ***@***.***> wrote:
This is about how to specify these things in the underlying HTTP-based
protocol.
When samtools is being used as a htsget client, it would be able to
internally translate -H being specified into adding &class=header or
whatever onto the end of the URL it was requesting and so take advantage of
only transmitting the headers. (At the moment it wouldn't do that because
it wouldn't know ahead of time that a given URL was an htsget URL, but
there could be ways of telling it that — such as command-line option.)
Similarly Picard could translate its options into adding the appropriate
bits and pieces onto htsget URLs when htsget is in use.
|
In the interest of simplicity we're closing this PR in favour of #322. |
This is a rough draft of the proposed change to allow htsget to conditionally exclude either header or data body information. This solves two immediate problems:
Issues: