[DRAFT] Cache control for re-using previously-downloaded headers #325
Thanks for the additions here and in #322. Practical question: I am trying to implement a POC htsget server that splits up BAM files into header and body pieces. I'm at a loss for how to actually create a header BAM file and a body BAM file, such that the header BAM is valid by itself but can be concatenated with the body BAM to create the final BAM. In pseudo-python, the way I want my client to interact with the server is:
I realize that I can use samtools to split my BAM file into header-only and body-only by first converting to SAM, splitting into header-only SAM and body-only SAM, and then converting both of those back into BAM. But concatenating those two files does not create a valid BAM. I guess I could just write out the body BAM and then use samtools reheader to add the header, but that's quite slow for large BAM files. Any other suggestions?
You need to find the boundary file offset between the header and the body, which requires understanding the format in a way that a general-purpose read-the-records API won't provide. So for BAM, you need to
At that point, you'll have the header-body boundary (in “compressed space”) that you're looking for. Note that this assumes that a new BGZF block is started for the first body data record (i.e. the header-body boundary is also at a BGZF block boundary). This has never been stated in the SAM specification, but is something that the main implementations have done for BAM since 2010 (see #300). It seems to me that implementing htsget requires BAM files to have this property. (And similarly for BCF files, but AFAIK the main implementations don't do this for them!) In practice, you'd more likely find this boundary by looking in a BAI/etc index for the virtual file offset of the first body data record, e.g. (presumably) by finding the smallest
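As a rough illustration of the boundary search described above, here is a minimal Python sketch that scans the BGZF blocks directly instead of consulting an index. It assumes the property discussed in this comment, namely that the first body record starts a new BGZF block, and raises an error otherwise. The function names (`read_bgzf_blocks`, `bam_header_length`, `header_body_boundary`, `make_bgzf_block`) are hypothetical and not part of samtools, htslib, or htsget; `make_bgzf_block` exists only so the sketch can be tried without a real BAM file.

```python
import io
import struct
import zlib


def read_bgzf_blocks(stream):
    """Yield (compressed_offset, compressed_size, uncompressed_bytes) per BGZF block."""
    offset = 0
    while True:
        fixed = stream.read(12)
        if len(fixed) < 12:
            return
        id1, id2, cm, flg, _mtime, _xfl, _os, xlen = struct.unpack("<BBBBIBBH", fixed)
        if (id1, id2, cm) != (31, 139, 8) or not flg & 4:
            raise ValueError("not a BGZF block")
        extra = stream.read(xlen)
        bsize, pos = None, 0
        while pos + 4 <= len(extra):
            si1, si2, slen = struct.unpack_from("<BBH", extra, pos)
            if (si1, si2, slen) == (66, 67, 2):  # the 'BC' subfield holds BSIZE
                bsize = struct.unpack_from("<H", extra, pos + 4)[0]
            pos += 4 + slen
        if bsize is None:
            raise ValueError("gzip member without a BGZF 'BC' subfield")
        block_size = bsize + 1  # BSIZE is the total block size minus one
        cdata = stream.read(block_size - 12 - xlen - 8)
        stream.read(8)  # skip CRC32 and ISIZE
        yield offset, block_size, zlib.decompress(cdata, -15)
        offset += block_size


def bam_header_length(data):
    """Return the uncompressed BAM header length, or None if more data is needed."""
    if len(data) < 4:
        return None
    if data[:4] != b"BAM\x01":
        raise ValueError("missing BAM magic")
    try:
        (l_text,) = struct.unpack_from("<i", data, 4)
        pos = 8 + l_text
        (n_ref,) = struct.unpack_from("<i", data, pos)
        pos += 4
        for _ in range(n_ref):
            (l_name,) = struct.unpack_from("<i", data, pos)
            pos += 4 + l_name + 4  # l_name, name, l_ref
    except struct.error:
        return None
    return pos if pos <= len(data) else None


def header_body_boundary(stream):
    """Return the compressed file offset at which the first body block starts."""
    buf = bytearray()
    for offset, size, data in read_bgzf_blocks(stream):
        buf += data
        header_len = bam_header_length(bytes(buf))
        if header_len is None:
            continue  # header continues into the next block
        if header_len == len(buf):
            return offset + size
        raise ValueError("header does not end on a BGZF block boundary")
    raise ValueError("file ended before the BAM header was complete")


def make_bgzf_block(payload):
    """Demo helper only: wrap `payload` in a single BGZF block."""
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)
    cdata = comp.compress(payload) + comp.flush()
    bsize = 12 + 6 + len(cdata) + 8 - 1
    head = struct.pack("<BBBBIBBH", 31, 139, 8, 4, 0, 0, 255, 6)
    head += struct.pack("<BBHH", 66, 67, 2, bsize)
    return head + cdata + struct.pack("<II", zlib.crc32(payload), len(payload))
```

With the boundary offset in hand, the bytes before it could be served as the header piece and the bytes from it onward as the body piece. For large files, looking up the first record's virtual offset in an index, as suggested above, would avoid decompressing the header blocks.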
@jmarshall that makes sense, thanks for the explanation. I have a Python library for parsing index files I've been meaning to release for a while - seems like that will be useful here. I'll work on putting together a library and command line tool that can be used to split up BAM/BCF/etc files for serving by htsget.
This proposal is a follow-up to #322. It will require rebasing etc as #322 develops, so I don't anticipate updating or polishing this until after the `class` proposal has landed in master.

However, if that facility is to be used by clients to enable re-using previously-downloaded headers and this is to be done safely, then I think using HTTP cache control will be the natural way to make it safe and extrapolating ETag/etc to the htsget ticket is a natural extension. So if enabling this safety is considered important, I think this followup will also need to be considered soon after `class`.

But this is somewhat moot in the absence of implementations, hence this separate PR.
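To make the cache-control idea concrete, here is a minimal client-side sketch of safe header re-use built on standard HTTP conditional requests (`ETag`, `If-None-Match`, and `304 Not Modified`). It is only an illustration under assumptions: the `HeaderCache` class and its methods are hypothetical, and how an htsget ticket would actually carry an ETag is left to the proposal itself rather than shown here.

```python
class HeaderCache:
    """Hypothetical client-side cache of downloaded header blocks, keyed by URL."""

    def __init__(self):
        self._entries = {}  # url -> (etag, body)

    def request_headers(self, url):
        """Return the extra HTTP request headers to send when fetching `url`."""
        entry = self._entries.get(url)
        return {"If-None-Match": entry[0]} if entry else {}

    def resolve(self, url, status, etag=None, body=None):
        """Return the header bytes to use, given the server's response."""
        if status == 304:
            # The server confirmed that the cached copy is still current.
            if url not in self._entries:
                raise ValueError("304 received for a URL that was never cached")
            return self._entries[url][1]
        if status == 200:
            # Fresh copy; remember it only if the server supplied a validator.
            if etag is not None:
                self._entries[url] = (etag, body)
            return body
        raise ValueError(f"unexpected status {status}")
```

A client would send `request_headers(url)` with each header fetch and pass the response status, ETag, and body to `resolve`, so a previously downloaded header is only re-used after the server has revalidated it.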