Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proposal for an aux tagged field iterator API #1354

Merged
merged 2 commits into from
Aug 30, 2022

Conversation

jmarshall
Copy link
Member

@jmarshall jmarshall commented Nov 9, 2021

As suggested in #1319, this adds new API functions for iterating through a BAM record's aux fields, inline accessor methods for field tag and type (or code can continue to use s-2 and *s), and a variant of bam_aux_del() that returns the (updated) iterator to the following field (for use in iterator-based loops that delete fields). IMHO there's no need for an additional bam_tag_iter_t type, as the uint8_t* pointer to aux fields is already all over this part of the API and there's no need for more iterator state anyway.

This draft rewrites bam_aux_get() in terms of the new iterator functions. It's slightly debatable whether to do that (and remove the old version) or to leave it alone.

Tossing up between using the C++-style bam_aux_erase() nomenclature or calling the new deletion function bam_aux_del1() or so. I also have a version with an additional bam_aux_erase_range() function that deletes several aux fields at once (so could be used to directly reimplement the way James did the deleting in samtools/samtools#516, which probably reduces the number of memmove() calls so is 0.1% more efficient), but not totally sure the slight complication is worth it.

This somewhat parallels PR #1351, but it's a separate part of the API (aux fields vs. headers) so might as well be a separate PR. Needs some test cases once the basic approach has basic approval.

Closes #1319.

@jmarshall
Copy link
Member Author

jmarshall commented Nov 9, 2021

(See jmarshall/samtools@aefbf50 for a draft of the samtools --keep-tag/--remove-tag functionality re-implemented via this API.)

@jmarshall
Copy link
Member Author

I'll take complete silence as signalling approval of the basic approach 😄

The new iterator-oriented field deletion functions are now bam_aux_remove() and bam_aux_remove2(), as …remove is fairly widespread in HTSlib API function names. Test cases exercising the new API functions added.

@daviesrob et al: This is now ready for review.

@jmarshall jmarshall marked this pull request as ready for review April 5, 2022 12:47
sam.c Show resolved Hide resolved
htslib/sam.h Outdated
no fields are removed.
*/
HTSLIB_EXPORT
uint8_t *bam_aux_remove2(bam1_t *b, uint8_t *s, uint8_t *e);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the use case for bam_aux_remove2?

I can see a case for specifying a list of aux tags we wish to remove and it doing the appropriate logic to remove them all within a single pass and avoid potentially O(N^2) complexity on extremely large sets of aux tags. However unless I'm misunderstanding this, it looks like it's only removing aux tags that are adjacent to each other which is rather tenuous. I guess the only use I can see is trivial removal of all tags.

Although if it's a general list, then it doesn't entirely remove the samtools view -x code as that has both keep-list and remove-list functionality. If we were sufficiently bothered to replicate it here, then the list would need a negation flag to indicate the keep-or-reject status.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's akin to C++ iterators, which have a two-argument erase() to remove a consecutive range of items from a sequence. However as you don't directly control the ordering of aux fields, it's not as obviously applicable here. The motivating use case was noted in the first comment above:

I also have a version with an additional bam_aux_erase_range()bam_aux_remove2() function that deletes several aux fields at once (so could be used to directly reimplement the way James did the deleting in samtools/samtools#516, which probably reduces the number of memmove() calls so is 0.1% more efficient), but not totally sure the slight complication is worth it.

There's a draft of that at jmarshall/samtools@aefbf50 though I think it needs minor alterations. (That commit implements remove_tag the obvious way, but attempts the optimised way for keep_tag.) The idea is you can implement keep_tag by iterating over a run of fields you don't want until you see one that you do want to keep, and then nuke the whole run of don't-wants at once with bam_aux_remove2(); then repeat that until you're at the end of the aux fields.

Essentially bam_aux_remove2() falls out of the implementation so might as well be public. Useful for nuke-all-tagged-fields and as a building block for fancy stuff like keep_tag.

Or I'm happy to inline the aux_remove() helper back into bam_aux_remove() again, so it goes away. 🤷

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that if we're removing X, Y and Z from tags ABCXYDEFZGHI then the first XY removal moves DEFZGHI down 2 places, and then the Z moves GHI down 1 place again. This is because bam_aux_remove always moves up to the end (data + l_data) and not up to the next point of interest. Hence GHI has been moved twice.

The efficient algorithm however is to move DEF down 2 and then GHI down 3. Hence your reimplementation is not the same complexity as the original.

In general the algorithm is O(N^2) complexity, although in practice it's probably irrelevant as N is usually small. There may be some cases though, such as the large PacBio tags, where N becomes quite significant.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Such is the cost of abstraction. The raw pointer version has the liberty of leaving a gap while it is working; the API version must maintain the invariant that the bam1_t has a contiguous vector of valid tag fields between API calls.

Perhaps what would be more useful would be a bam_aux_remove_if(…) function that took a callback saying whether to remove a particular aux field, and iterated over the whole list itself within the one API call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding abstraction, that's why I said I felt the bam_aux_remove2 function to be of questionable benefit. Perhaps some special corner cases may work, such as removing all bar the first tag, but I don't really see the benefit of extending the API with something so limited. If we wanted to have an API which always left bam1_t in a consistent manner, then it should be as I said before - a remove function that takes a list of tag names rather than a single one (and perhaps a keep / remove flag to negate it). Plus we already have this implemented in samtools view.

We could perhaps do it with bam_aux_remove_if callbacks, but it seems like a lot of additional faff to use over a basic array argument.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've removed bam_aux_remove2(), which was amongst other things a pain to use, and reunified bam_aux_remove() accordingly. (See force push to 99954c1 below.)

I'll add something else more useful for implementing stuff like keep_tag/remove_tag in its place.

@jkbonfield
Copy link
Contributor

I like this feature. Bar a few niggles, it looks solid.

Thanks

@jkbonfield
Copy link
Contributor

Bumping this again to see where we're at.

I've removed bam_aux_remove2(), which was amongst other things a pain to use, and reunified bam_aux_remove() accordingly. (See force push to 99954c1 below.)

I'll add something else more useful for implementing stuff like keep_tag/remove_tag in its place.

I haven't refreshed my mind of the discussions and code yet, but was this something you were planning on doing still, and if so is it worth waiting for it or would you view it as part 2 of the API, to arrive in a subsequent PR? (If so I assume it's a purely additional thing and not something that would need changing of the existing API.)

@jmarshall jmarshall marked this pull request as draft August 5, 2022 15:43
@jmarshall
Copy link
Member Author

jmarshall commented Aug 5, 2022

Don't refresh your memory of the code as it stands in this PR. Back in April, I got most of the way through refactoring it as discussed. I may have time to finish that up in the next few days.

@jmarshall
Copy link
Member Author

jmarshall commented Aug 5, 2022

Oops, actually it is #1361 that has been substantially refactored from the current state of its PR.

I have an in-progress commit adding bam_aux_remove_if() to the existing PR. I was planning on adding it to this PR (as there's some risk it might involve tweaks to the existing proposed addition), or it probably could be done as a Part II.

I may have time to finish that up in the next few days, so maybe put this one on your list to have another look at on Wednesday 😄

@jmarshall jmarshall marked this pull request as ready for review August 5, 2022 16:00
@jkbonfield
Copy link
Contributor

Thanks. I did check if it was draft before commenting, as I had a recollection it was, but as you noted it's the samtools side that is the draft bit.

I'm refreshing my memory of what this does anyway as it'll help get my head in the zone for the upcoming changes.

sam.c Show resolved Hide resolved
sam.c Show resolved Hide resolved
@jmarshall
Copy link
Member Author

(That force-push was just to bring this up to current develop unchanged. Further pushes will make actual changes.)

Add new API functions for iterating through a BAM record's aux fields,
inline accessor methods for field tag and type (or code can continue
to use s-2 and *s), and a variant of bam_aux_del() that returns the
(updated) iterator to the following field (for use in iterator-based
loops that delete fields).

Add test cases for the new API functions.
@jmarshall
Copy link
Member Author

jmarshall commented Aug 30, 2022

I've addressed the review comments to bam_aux_remove() and think that function should be ready for merging.

I've added the proposed API for bam_aux_remove_if(), which facilitates removing several aux fields at once, for review. This third commit is still in draft form, as the implementation needs testing and test cases; it's currently implemented similarly to James's versions over in samtools/sam_view.c, though I've been trying to implement a version that memmoves several kept fields at once.

Hence I've turned this PR back to draft, but feel free to (squash/)merge the first two commits.

@jkbonfield
Copy link
Contributor

Thanks. I'll review the first couple commits with a view to merging them in isolation.

@jmarshall jmarshall marked this pull request as ready for review August 30, 2022 14:32
@jmarshall
Copy link
Member Author

Gah, I'll remove that last commit and bring bam_aux_remove_if() back as a separate PR later — this one has been hanging around quite long enough!

So you can review this part 1 as a normal PR…

@jkbonfield
Copy link
Contributor

Hah beat me to it. I was going to squash and merge outside of this PR so as not to confuse your own branch (after testing it in anger with another samtools branch). Are you happy for me to squash and push over the top of your branch so this tidies itself up automagically on merge? (Or do it yourself - I don't care either way)

@jmarshall
Copy link
Member Author

jmarshall commented Aug 30, 2022

I think I know which other samtools branch that is 😄

Feel free to (force-)push to this PR's branch. Or use the “Squash and merge” UI button in due course, which will turn the PR happily purple/merged too.

@jkbonfield
Copy link
Contributor

Yeah the draft sanitize function. We also discussed whether it belongs in htslib. My preference would be there and maybe enabled by an hts_set_opt call so any program can read *AM records with some basic aligner (cough bwa) fix-ups applied and be sure it doesn't have to worry about silly things like bases aligning off the ends of chromosomes or unmapped data with silly mapping qualities and non-sensical cigar strings.

But for now it's a useful test of an external tool against this API too. :)

@jkbonfield jkbonfield merged commit 2ff03d3 into samtools:develop Aug 30, 2022
@jkbonfield
Copy link
Contributor

Ah frogger-it... meant squash and merge rather than just rebase. Oh well, it is as it is :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Wish list: aux tag iterator API
2 participants