Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix indexing bug with "placed" unmapped reads. #1352

Merged
merged 1 commit into from
Nov 12, 2021

Conversation

jkbonfield
Copy link
Contributor

Unmapped-but-placed (having REF/POS) reads are not included in the
index. Hence if an placed unmapped is the first record in a bin, then
it may not be returned. Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1. While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it. It appears that picard also indexes placed reads in this
manner.

Reported by Petr Danecek

@jkbonfield
Copy link
Contributor Author

An example file:

@HD	VN:1.4	SO:coordinate
@SQ	SN:1	LN:249250621	M5:1b22b98cdeb4a9304cb5d48026a85128
@SQ	SN:2	LN:243199373	M5:a0d9851da00400dec1098a9255ac712e
um1	69	1	1000000	0	*	*	0	0	AAAAAAAAAA	*
um1	137	1	1000000	44	10M	*	0	0	AAAAAAAAAA	*
um2	69	1	2000000	0	*	*	0	0	AAAAAAAAAA	*
um2	137	1	2000000	44	10M	*	0	0	AAAAAAAAAA	*
mu1	137	2	1000000	44	10M	*	0	0	AAAAAAAAAA	*
mu1	69	2	1000000	0	*	*	0	0	AAAAAAAAAA	*
mu2	137	2	2000000	44	10M	*	0	0	AAAAAAAAAA	*
mu2	69	2	2000000	0	*	*	0	0	AAAAAAAAAA	*

Build indices, using an older samtools binary and this PR:

$ samtools113 view -o _113.bam --write-index _.sam
$ samtools view -o _dev.bam --write-index _.sam

Then query them:

$ samtools view _113.bam 1:1000000-1000000
um1	137	1	1000000	44	10M	*	0	0	AAAAAAAAAA	*
$ samtools view _dev.bam 1:1000000-1000000
um1	69	1	1000000	0	*	*	0	0	AAAAAAAAAA	*
um1	137	1	1000000	44	10M	*	0	0	AAAAAAAAAA	*

$ samtools view _113.bam 2:1000000-1000000
mu1	137	2	1000000	44	10M	*	0	0	AAAAAAAAAA	*
mu1	69	2	1000000	0	*	*	0	0	AAAAAAAAAA	*
$ samtools view _dev.bam 2:1000000-1000000
mu1	137	2	1000000	44	10M	*	0	0	AAAAAAAAAA	*
mu1	69	2	1000000	0	*	*	0	0	AAAAAAAAAA	*

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Nov 3, 2021

I've verified that picard's BuildBamIndex files are also compatible, and they act more like this PR does (with samtools able to query them correctly).

Note this is related to #1152, specifically implementing what was previously suggested in #1152 (comment). In theory it should fix this problem, but does not appear to do so. However the two issues are orthogonal:

  • Creating correct indices
  • Working with incorrect indices.

@jkbonfield
Copy link
Contributor Author

Looking back at old versions, it appears it's once again the linear index which broke the ability to cope with placed reads. From 45715e4 onwards we don't return the first record with 1:1000000-100000, but that ~1 does. I haven't figured out why #1152 doesn't cure it yet, as that was certainly the intention of it.

@jmarshall
Copy link
Member

  1. This seems like the third or fourth htslib PR attempting to address these indexing vagaries. Has anyone considered adding some test cases to the test suite exercising these corner cases?

  2. As you say, this implements my suggestion from Correct the linear index to include unmapped placed reads #1152 (comment), which I also identified as an issue in Unmapped reads are not added to the BAI linear index #1142 on 23 September 2020. Does your commit message imply that Petr reported the issue earlier than this?

@jkbonfield
Copy link
Contributor Author

Actually I was just looking at the test suite, and then ran away screaming. I have a total mental block in understanding that hideous perl script (hence preferring the simpler bourne shell equivalent when I can get away with it). It does need something though I'll agree.

As for who reported it, I was simply going on the most recent comment (via email) from yesterday. At the time I wrote the commit message I didn't realise you had already identified this particular bug. Although I could recall there being prior work, but recollection (wrong) was that it was related to the linear index and not the main index. Frankly the entire indexing strategy is also something that causes a mental block. As soon as I stop working on it it goes out my head again. :/

@jmarshall
Copy link
Member

For HTSlib (and testing of its API functions in particular), I'm quite a fan of the precision available from writing tests directly in C…

@jkbonfield jkbonfield force-pushed the unmapped_index branch 3 times, most recently from f5d11f9 to 2b99b13 Compare November 3, 2021 17:21
@jkbonfield
Copy link
Contributor Author

Hmm, not sure why linux can auto-index when writing a BAM but not Mac or Windows. I could change it to two commands (test_view and test_index), but this seems to be hinting at a bug somewhere so I'd rather not paper over it.

I have msys locally so can test it tomorrow.

Unmapped-but-placed (having REF/POS) reads are not included in the
index.  Hence if an placed unmapped is the first record in a bin, then
it may not be returned.  Note most aligners write out mapped followed
by unmapped which does not trigger this problem.

The SAM spec states that all unmapped placed reads should be
considered as having an alignment length of 1.  While it doesn't seem
to explicitly state these must therefore be in the index, it does
imply it.  It appears that picard also indexes placed reads in this
manner.

Originally reported by John Marshall.  Fixes samtools#1142
@jkbonfield
Copy link
Contributor Author

jkbonfield commented Nov 4, 2021

Hmm, not sure why linux can auto-index when writing a BAM but not Mac or Windows. I could change it to two commands (test_view and test_index), but this seems to be hinting at a bug somewhere so I'd rather not paper over it.

I have msys locally so can test it tomorrow.

Things I learnt today - getopt differs between linux, macos and windows. The linux one can do "cmd arg -x opt" fine, but windows/macos must be in the form "cmd -x opt arg". Something to unlearn then. (I'm probably just too used to getopt_long, but test_view is old fashioned getopt instead.)

@jmarshall
Copy link
Member

Things I learnt today - getopt differs between linux, macos and windows.

See also this rant and the linked archaeology, and the first paragraph of GNU Coding Standards §4.8. This is one of my personal hobby horses and I'm surprised I haven't tweeted about it more often! 😄

pd3 added a commit to pd3/samtools that referenced this pull request Nov 6, 2021
Proof of principle implementation that allows to pull reads from
regions including their mates, even when they fall outside the regions
or are unmapped.

Note
- so far this was tested on small data only
- tests need to be added
- everything is stored in memory, with many regions an external
  temporary storage may be needed
- some unmapped reads may be missed, this depends on the success of
  samtools/htslib#1352 pull request
@daviesrob daviesrob self-assigned this Nov 9, 2021
@daviesrob daviesrob merged commit 1d79f44 into samtools:develop Nov 12, 2021
@daviesrob
Copy link
Member

I suspect that after applying this PR, bins_t::loff will always end up, via a very roundabout route, the same as bins_t::list[0].u. If that's true, we can both simplify the loff calculation, and also fix-up indexes made before this change landed by using bins_t::list[0].u (where available) in place of loff when getting the minimum offset for a bin.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants