-
Notifications
You must be signed in to change notification settings - Fork 446
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Proposal for an aux tagged field iterator API #1354
Conversation
(See jmarshall/samtools@aefbf50 for a draft of the samtools |
7453e99
to
326e3d2
Compare
I'll take complete silence as signalling approval of the basic approach 😄 The new iterator-oriented field deletion functions are now @daviesrob et al: This is now ready for review. |
htslib/sam.h
Outdated
no fields are removed. | ||
*/ | ||
HTSLIB_EXPORT | ||
uint8_t *bam_aux_remove2(bam1_t *b, uint8_t *s, uint8_t *e); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the use case for bam_aux_remove2
?
I can see a case for specifying a list of aux tags we wish to remove and it doing the appropriate logic to remove them all within a single pass and avoid potentially O(N^2) complexity on extremely large sets of aux tags. However unless I'm misunderstanding this, it looks like it's only removing aux tags that are adjacent to each other which is rather tenuous. I guess the only use I can see is trivial removal of all tags.
Although if it's a general list, then it doesn't entirely remove the samtools view -x
code as that has both keep-list and remove-list functionality. If we were sufficiently bothered to replicate it here, then the list would need a negation flag to indicate the keep-or-reject status.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's akin to C++ iterators, which have a two-argument erase()
to remove a consecutive range of items from a sequence. However as you don't directly control the ordering of aux fields, it's not as obviously applicable here. The motivating use case was noted in the first comment above:
I also have a version with an additional
bam_aux_erase_range()
bam_aux_remove2()
function that deletes several aux fields at once (so could be used to directly reimplement the way James did the deleting in samtools/samtools#516, which probably reduces the number of memmove() calls so is 0.1% more efficient), but not totally sure the slight complication is worth it.
There's a draft of that at jmarshall/samtools@aefbf50 though I think it needs minor alterations. (That commit implements remove_tag
the obvious way, but attempts the optimised way for keep_tag
.) The idea is you can implement keep_tag
by iterating over a run of fields you don't want until you see one that you do want to keep, and then nuke the whole run of don't-wants at once with bam_aux_remove2()
; then repeat that until you're at the end of the aux fields.
Essentially bam_aux_remove2()
falls out of the implementation so might as well be public. Useful for nuke-all-tagged-fields and as a building block for fancy stuff like keep_tag
.
Or I'm happy to inline the aux_remove()
helper back into bam_aux_remove()
again, so it goes away. 🤷
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem is that if we're removing X, Y and Z from tags ABCXYDEFZGHI then the first XY
removal moves DEFZGHI down 2 places, and then the Z moves GHI down 1 place again. This is because bam_aux_remove
always moves up to the end (data + l_data
) and not up to the next point of interest. Hence GHI has been moved twice.
The efficient algorithm however is to move DEF down 2 and then GHI down 3. Hence your reimplementation is not the same complexity as the original.
In general the algorithm is O(N^2) complexity, although in practice it's probably irrelevant as N is usually small. There may be some cases though, such as the large PacBio tags, where N becomes quite significant.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Such is the cost of abstraction. The raw pointer version has the liberty of leaving a gap while it is working; the API version must maintain the invariant that the bam1_t
has a contiguous vector of valid tag fields between API calls.
Perhaps what would be more useful would be a bam_aux_remove_if(…)
function that took a callback saying whether to remove a particular aux field, and iterated over the whole list itself within the one API call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regarding abstraction, that's why I said I felt the bam_aux_remove2
function to be of questionable benefit. Perhaps some special corner cases may work, such as removing all bar the first tag, but I don't really see the benefit of extending the API with something so limited. If we wanted to have an API which always left bam1_t in a consistent manner, then it should be as I said before - a remove function that takes a list of tag names rather than a single one (and perhaps a keep / remove flag to negate it). Plus we already have this implemented in samtools view.
We could perhaps do it with bam_aux_remove_if
callbacks, but it seems like a lot of additional faff to use over a basic array argument.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've removed bam_aux_remove2()
, which was amongst other things a pain to use, and reunified bam_aux_remove()
accordingly. (See force push to 99954c1 below.)
I'll add something else more useful for implementing stuff like keep_tag
/remove_tag
in its place.
I like this feature. Bar a few niggles, it looks solid. Thanks |
Bumping this again to see where we're at.
I haven't refreshed my mind of the discussions and code yet, but was this something you were planning on doing still, and if so is it worth waiting for it or would you view it as part 2 of the API, to arrive in a subsequent PR? (If so I assume it's a purely additional thing and not something that would need changing of the existing API.) |
|
Oops, actually it is #1361 that has been substantially refactored from the current state of its PR. I have an in-progress commit adding I may have time to finish that up in the next few days, so maybe put this one on your list to have another look at on Wednesday 😄 |
Thanks. I did check if it was draft before commenting, as I had a recollection it was, but as you noted it's the samtools side that is the draft bit. I'm refreshing my memory of what this does anyway as it'll help get my head in the zone for the upcoming changes. |
99954c1
to
ac3beaa
Compare
(That force-push was just to bring this up to current develop unchanged. Further pushes will make actual changes.) |
Add new API functions for iterating through a BAM record's aux fields, inline accessor methods for field tag and type (or code can continue to use s-2 and *s), and a variant of bam_aux_del() that returns the (updated) iterator to the following field (for use in iterator-based loops that delete fields). Add test cases for the new API functions.
ac3beaa
to
10d63e1
Compare
I've addressed the review comments to I've added the proposed API for Hence I've turned this PR back to draft, but feel free to (squash/)merge the first two commits. |
Thanks. I'll review the first couple commits with a view to merging them in isolation. |
6ef7fd2
to
78ee824
Compare
Gah, I'll remove that last commit and bring So you can review this part 1 as a normal PR… |
Hah beat me to it. I was going to squash and merge outside of this PR so as not to confuse your own branch (after testing it in anger with another samtools branch). Are you happy for me to squash and push over the top of your branch so this tidies itself up automagically on merge? (Or do it yourself - I don't care either way) |
I think I know which other samtools branch that is 😄 Feel free to (force-)push to this PR's branch. Or use the “Squash and merge” UI button in due course, which will turn the PR happily purple/merged too. |
Yeah the draft sanitize function. We also discussed whether it belongs in htslib. My preference would be there and maybe enabled by an But for now it's a useful test of an external tool against this API too. :) |
Ah frogger-it... meant squash and merge rather than just rebase. Oh well, it is as it is :) |
As suggested in #1319, this adds new API functions for iterating through a BAM record's aux fields, inline accessor methods for field tag and type (or code can continue to use
s-2
and*s
), and a variant ofbam_aux_del()
that returns the (updated) iterator to the following field (for use in iterator-based loops that delete fields). IMHO there's no need for an additionalbam_tag_iter_t
type, as theuint8_t*
pointer to aux fields is already all over this part of the API and there's no need for more iterator state anyway.This draft rewrites
bam_aux_get()
in terms of the new iterator functions. It's slightly debatable whether to do that (and remove the old version) or to leave it alone.Tossing up between using the C++-style
bam_aux_erase()
nomenclature or calling the new deletion functionbam_aux_del1()
or so. I also have a version with an additionalbam_aux_erase_range()
function that deletes several aux fields at once (so could be used to directly reimplement the way James did the deleting in samtools/samtools#516, which probably reduces the number ofmemmove()
calls so is 0.1% more efficient), but not totally sure the slight complication is worth it.This somewhat parallels PR #1351, but it's a separate part of the API (aux fields vs. headers) so might as well be a separate PR. Needs some test cases once the basic approach has basic approval.
Closes #1319.