Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

queryMate method throws exception for CRAM records with the same read name #1065

Closed
kpepper opened this issue Jan 9, 2018 · 19 comments
Closed
Assignees

Comments

@kpepper
Copy link

kpepper commented Jan 9, 2018

Subject of the issue

Occasionally I am getting the following exception when calling the SamReader queryMate method for a CRAM file…..

htsjdk.samtools.SAMFormatException: Multiple SAMRecord with read name IL29_4505:7:93:5878:1091#2 for first end.
at htsjdk.samtools.SamReader$PrimitiveSamReaderToSamReaderAdapter.queryMate(SamReader.java:451)

I don’t see the same thing for the equivalent BAM file.

Your environment

  • version of htsjdk = 2.12.0
  • version of java = 1.8.151
  • which OS = Mac Sierra

Steps to reproduce

e.g. the CRAM file has the following reads, that seem to be causing the issue:

IL29_4505:7:93:5878:1091#2 163 Mito 267 60 108M = 631 399 TTAATTAATTAATAATAAAAATATAATTATAAATAATATAAATATTATTCTTTATTAATAAATATATATTTATATATTATAAAAGTATCTTAATTAATAAAAATAAAC BBBBBBB@BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB@BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB>BBBBBBBBBBB=BBBBBBBBB@?9A=A:/8 AS:i:108 XS:i:0 MD:Z:108 NM:i:0 RG:Z:lane1
IL29_4505:7:93:5878:1091#2 83 Mito 631 60 73S35M = 267 -399 AATGAAAATTTTATAAAAAAAGTAATGAATATTCCTTTTTAAAAATACCTTCTTAATTAAATAAAATAAAATAAAAAGGGGACTTTATATTTATAAAGTAATTATATT <=8=;BCBBABBBBBBBBBC?BDDDBBBBB0DDBBBBBCBBBBBBAAABB=CCCCCCBBBBBBBBBBBBBBBBBBDBBBDDCBBBBBBBBBBBBBDDDBDBBBBBDCD AS:i:35 XS:i:0 SA:Z:Mito,44163,+,37S39M32S,35,1; MD:Z:35 NM:i:0 RG:Z:lane1
IL29_4505:7:93:5878:1091#2 323 Mito 44163 35 37H39M32H = 267 -43897 TTTTATTTTATTTAATTAAGAAGGTATTTTTAAAAAGGA BBBBBBBBBBBBBBCCCCCC=BBAAABBBBBBCBBBBBD AS:i:34 XS:i:27 SA:Z:Mito,631,-,73S35M,60,0; MD:Z:30A8 NM:i:1 RG:Z:lane1

Similarly for the equivalent BAM file (which does not seem to cause an issue):

IL29_4505:7:93:5878:1091#2 163 Mito 267 60 108M = 631 399 TTAATTAATTAATAATAAAAATATAATTATAAATAATATAAATATTATTCTTTATTAATAAATATATATTTATATATTATAAAAGTATCTTAATTAATAAAAATAAAC BBBBBBB@BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB@BBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBBB>BBBBBBBBBBB=BBBBBBBBB@?9A=A:/8 NM:i:0 MD:Z:108 AS:i:108 XS:i:0 RG:Z:lane1
IL29_4505:7:93:5878:1091#2 83 Mito 631 60 73S35M = 267 -399 AATGAAAATTTTATAAAAAAAGTAATGAATATTCCTTTTTAAAAATACCTTCTTAATTAAATAAAATAAAATAAAAAGGGGACTTTATATTTATAAAGTAATTATATT <=8=;BCBBABBBBBBBBBC?BDDDBBBBB0DDBBBBBCBBBBBBAAABB=CCCCCCBBBBBBBBBBBBBBBBBBDBBBDDCBBBBBBBBBBBBBDDDBDBBBBBDCD NM:i:0 MD:Z:35 AS:i:35 XS:i:0 RG:Z:lane1 SA:Z:Mito,44163,+,37S39M32S,35,1;
IL29_4505:7:93:5878:1091#2 323 Mito 44163 35 37H39M32H = 267 -43897 TTTTATTTTATTTAATTAAGAAGGTATTTTTAAAAAGGA BBBBBBBBBBBBBBCCCCCC=BBAAABBBBBBCBBBBBD NM:i:1 MD:Z:30A8 AS:i:34 XS:i:27 RG:Z:lane1 SA:Z:Mito,631,-,73S35M,60,0;

Each read has the same name but a different flag value (one is a non-primary alignment).

The exception originates from this piece of code in the queryMate() method……

try {

            SAMRecord mateRec = null;

            while (it.hasNext()) {

                final SAMRecord next = it.next();

                if (!next.getReadPairedFlag()) {

                    if (rec.getReadName().equals(next.getReadName())) {

                        throw new SAMFormatException("Paired and unpaired reads with same name: " + rec.getReadName());

                    }

                    continue;

                }

                if (firstOfPair) {

                    if (next.getFirstOfPairFlag()) continue;

                } else {

                    if (next.getSecondOfPairFlag()) continue;

                }

                if (rec.getReadName().equals(next.getReadName())) {

                    if (mateRec != null) {

                        throw new SAMFormatException("Multiple SAMRecord with read name " + rec.getReadName() +

                                " for " + (firstOfPair ? "second" : "first") + " end.");

                    }

                    mateRec = next;

                }

            }

            return mateRec;

        } finally {

            it.close();

        }

Anybody able to explain the difference in behaviour between the BAM and CRAM?
Is this a bug or expected in some way?

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jan 9, 2018

@kpepper Do know how the bam was converted to cram (was the conversion done by htsjdk ?). This seems somewhat similar to #802, but not quite the same since there are no supplementary reads here, only secondary.

@kpepper
Copy link
Author

kpepper commented Jan 10, 2018

Yes, it was created with samtools 1.6 command line tool (i.e. the C implementation)....

samtools view -T reference.fa -C -o out.cram in.sorted.bam

@kpepper
Copy link
Author

kpepper commented Jan 11, 2018

Hi Chris, when do you think this issue and #802 may get addressed?
Thanks.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jan 11, 2018

@kpepper It probably will depend a lot on what the problem turns out to be. One thing that would definitely speed things up is to have (preferably minimally small) test files that reproduce the problem (bam, cram and .fasta).

@kpepper
Copy link
Author

kpepper commented Jan 12, 2018

He's some test files.
Running queryMate() on [e.g.] reads IL29_4505:7:2:12262:7582#2 and IL29_4505:7:52:15583:12027#2 will generate a "htsjdk.samtools.SAMFormatException: Multiple SAMRecord with read name .... for second end." exception. Reads IL29_4505:7:33:16799:10386#2 and IL29_4505:7:30:11521:4492#2 will cause a "htsjdk.samtools.SAMFormatException: Multiple SAMRecord with read name .... for first end." exception. So, there seems to be two different exceptions thrown.

I'm now using this via the latest Picard jar, which was recently built with the up to date htsjdk. It's for Artemis development.

htsjdk_test.zip

@cmnbroad
Copy link
Collaborator

@kpepper Thanks for the files, that will help a lot. I'll take a look.

@kpepper
Copy link
Author

kpepper commented Jan 19, 2018

@cmnbroad Thanks.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jan 24, 2018

@kpepper Update: I'm glad you reported this issue - it looks like its due to a bug in the CRAM implementation of queryAlignmentStart, which is used by queryMate. Calling queryAlignmentStart("Mito", 631), for example, returns 24 reads on the BAM, but ~22788 reads on the CRAM, which I think represents the entire rest of the file. I don't yet have a fix, but wanted to update the ticket.

One other random observation. ValidateSamFile reports a bunch of mate-pair errors for the BAM, but not the CRAM (though it does report some other issues with NM). If I use htsjdk to convert the BAM to CRAM, ValidateSamFile reports exactly the same mate pair issues on the CRAM as it does on the original BAM.

One question - I'd like to be able to use this BAM as a test file in htsjdk once I have a fix, but it would live in the public repository. Let me know if that would be acceptable.

@cmnbroad cmnbroad self-assigned this Jan 24, 2018
@kpepper
Copy link
Author

kpepper commented Jan 26, 2018

@cmnbroad Thanks for the update. That explains the apparent BAM/CRAM speed difference then. I'll get back to you about the test data.

@kpepper
Copy link
Author

kpepper commented Jan 31, 2018

@cmnbroad Yes, it's okay to use that file for test data on GitHub.

@kpepper
Copy link
Author

kpepper commented Mar 5, 2018

Hi @cmnbroad, just wondered how the fix is going for this issue(s)?
Thanks.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Mar 5, 2018

Hi @kpepper. I haven' been able to spend any time on this recently but haven't forgotten about it. I will try to get to it in the next week or so.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Apr 2, 2018

@kpepper Update: Sorry for the long delay. I've finally gotten around to this and understand the underlying issue and have a proposed fix. I still need to add some new tests since it required a little bit of refactoring.

@kpepper
Copy link
Author

kpepper commented Apr 3, 2018

@cmnbroad no problem, look forward to trying the fix out when ready.

@kpepper
Copy link
Author

kpepper commented Jun 22, 2018

Has this been fixed now or still in progress?

@cmnbroad
Copy link
Collaborator

@kpepper Still in progress - I expect to submit a pull request with a fix within the next week.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jul 12, 2018

@kpepper I submitted a PR with a fix for the CRAM implementation of queryAlignmentStart/queryMate. Sorry for the long delay. I'm not sure how long it will take to get this through code review. In the meantime, if you're so inclined, feel free to pull the branch and try it out. The behavior wrt/validations being triggered via queryMate should be the same now between BAM and CRAM.

@kpepper
Copy link
Author

kpepper commented Jul 18, 2018

@cmnbroad - Thanks. I use htsjdk via Picard so will probably need to wait for the fix to be pulled in.

@cmnbroad
Copy link
Collaborator

cmnbroad commented Jun 25, 2019

@kpepper you may have lost interest by now (sorry for the insanely long turnaround time on this), but FYI #1164 is merged now, which fixes the queryMate/queryAlignmentStart issue that, IIRC, was the underlying problem that resulted in the original issue reported here. I know you need it in Picard but hopefully this will make its way there in the near term. Also thanks for the test data.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants