-
Notifications
You must be signed in to change notification settings - Fork 243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix CRAMReferenceRegion updating. #1708
Changes from 6 commits
2d5b20f
969bdf1
4f465ab
f276317
64180f8
19be0c7
669fd7e
ca12df0
7c473de
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -113,9 +113,14 @@ public void fetchReferenceBases(final int referenceIndex) { | |
|
||
// Re-resolve the reference bases if we don't have a current region or if the region we have | ||
// doesn't span the *entire* contig requested. | ||
final SAMSequenceRecord newSequenceRecord = sequenceDictionary.getSequence(referenceIndex); | ||
if (newSequenceRecord == null) { | ||
throw new IllegalArgumentException( | ||
String.format("Requested reference sequence index %d not found", referenceIndex)); | ||
} | ||
if ((referenceIndex != this.referenceIndex) || | ||
regionStart != 0 || | ||
(regionLength < referenceBases.length)) { | ||
(regionLength != newSequenceRecord.getSequenceLength())) { | ||
setCurrentSequence(referenceIndex); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a comment that the There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Latter is done. |
||
referenceBases = referenceSource.getReferenceBases(sequenceRecord, true); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. There is still a potential order-of-operations issue here: since the call to |
||
if (referenceBases == null) { | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,32 +21,72 @@ public class CRAMAllEncodingStrategiesTest extends HtsjdkTest { | |
private static final File TEST_DATA_DIR = new File("src/test/resources/htsjdk/samtools/cram"); | ||
private final CompressorCache compressorCache = new CompressorCache(); | ||
|
||
@DataProvider(name="roundTripTestFiles") | ||
public Object[][] roundTripTestFiles() { | ||
@DataProvider(name="defaultStrategyRoundTripTestFiles") | ||
public Object[][] defaultStrategyRoundTripTestFiles() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you explain why some of these test cases have to use lenient equality, while others don't? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. CRAMs don't always roundtrip with complete fidelity, especially when roundtripping through samtools, which resurrects NM/MD tags whether they were present in the original or not, unless they are specifically excluded. So in some cases when roundtripping, you can't use full SAMRecord comparisons. In those cases we use lenient equality to test only a subset of the SAMRecord. I've added this as a comment. |
||
return new Object[][] { | ||
{ new File(TEST_DATA_DIR, "NA12878.20.21.1-100.100-SeqsPerSlice.500-unMapped.cram"), | ||
new File(TEST_DATA_DIR, "human_g1k_v37.20.21.1-100.fasta") }, | ||
new File(TEST_DATA_DIR, "human_g1k_v37.20.21.1-100.fasta"), | ||
false, false }, | ||
{ new File(TEST_DATA_DIR, "CEUTrio.HiSeq.WGS.b37.NA12878.20.21.10m-10m100.cram"), | ||
new File("src/test/resources/htsjdk/samtools/reference/human_g1k_v37.20.21.fasta.gz"), | ||
false, false }, | ||
{ new File(TEST_DATA_DIR, "NA12878.unmapped.cram"), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you add one-line comments explaining the provenance of these three crams, and what they are testing? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done with as much specificity as I have about these, since they've been in the repo for a while. |
||
new File(TEST_DATA_DIR, "human_g1k_v37.20.21.1-100.fasta"), | ||
false, false }, | ||
// generated with samtools 1.19 from the gatk bam file CEUTrio.HiSeq.WGS.b37.NA12878.20.21.bam | ||
{ new File(TEST_DATA_DIR, "CEUTrio.HiSeq.WGS.b37.NA12878.20.21.v3.0.samtools.cram"), | ||
new File("src/test/resources/htsjdk/samtools/reference/human_g1k_v37.20.21.fasta.gz"), | ||
true, false }, | ||
// these tests use lenient equality to only validate read names, bases and qual scores | ||
{ new File(TEST_DATA_DIR, "mitoAlignmentStartTestGATKGen.cram"), | ||
new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false }, | ||
{ new File(TEST_DATA_DIR, "mitoAlignmentStartTest.cram"), | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add comments explaining what these files are as well There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done with as much specificity as I have about these, since they've been in the repo for a while. |
||
new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false }, | ||
// files created by rewriting the htsjdk test file src/test/resources/htsjdk/samtools/cram/mitoAlignmentStartTest.cram | ||
// using code that replicates the first read (which is aligned to position 1 of the mito contig) either | ||
// 10,000 or 20,000 times, to create a file with 2 or 3 containers, respectively, that have reads aligned to | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Did you confirm via direct inspection of the file that it did in fact create 2 or 3 containers with reads mapped to position 1? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes. You can also see this in the GATK PR, which uses files that were created using the same code that created these files, to test the detector on files with multiple bad containers. Alternatively, you can run GATK |
||
// position 1 of the contig | ||
{ new File(TEST_DATA_DIR, "mitoAlignmentStartTest_2_containers_aligned_to_pos_1.cram"), | ||
new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false }, | ||
{ new File(TEST_DATA_DIR, "mitoAlignmentStartTest_3_containers_aligned_to_pos_1.cram"), | ||
new File(TEST_DATA_DIR, "mitoAlignmentStartTest.fa"), true, false } | ||
}; | ||
} | ||
|
||
@Test(dataProvider = "roundTripTestFiles") | ||
public final void testRoundTripDefaultEncodingStrategy(final File sourceFile, final File referenceFile) throws IOException { | ||
@Test(dataProvider = "defaultStrategyRoundTripTestFiles") | ||
public final void testRoundTripDefaultEncodingStrategy( | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a comment before this test method describing the purpose of the test (since we're not just testing "encoding strategies" here....) There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This is testing the default encoding strategy. Comment added. |
||
final File sourceFile, | ||
final File referenceFile, | ||
final boolean lenientEquality, | ||
final boolean emitDetail) throws IOException { | ||
final CRAMEncodingStrategy testStrategy = new CRAMEncodingStrategy(); | ||
final File tempOutCRAM = File.createTempFile("testRoundTrip", ".cram"); | ||
tempOutCRAM.deleteOnExit(); | ||
CRAMTestUtils.writeToCRAMWithEncodingStrategy(testStrategy, sourceFile, tempOutCRAM, referenceFile); | ||
assertRoundTripFidelity(sourceFile, tempOutCRAM, referenceFile, false); | ||
assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile); | ||
assertRoundTripFidelity(sourceFile, tempOutCRAM, referenceFile, lenientEquality, emitDetail); | ||
assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile, lenientEquality, emitDetail); | ||
} | ||
|
||
@DataProvider(name="encodingStrategiesTestFiles") | ||
public Object[][] encodingStrategiesTestFiles() { | ||
return new Object[][] { | ||
{ new File(TEST_DATA_DIR, "NA12878.20.21.1-100.100-SeqsPerSlice.500-unMapped.cram"), | ||
new File(TEST_DATA_DIR, "human_g1k_v37.20.21.1-100.fasta"), false, false }, | ||
}; | ||
} | ||
|
||
@Test(dataProvider = "roundTripTestFiles") | ||
public final void testAllEncodingStrategyCombinations(final File cramSourceFile, final File referenceFile) throws IOException { | ||
@Test(dataProvider = "encodingStrategiesTestFiles") | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Would there be value in having this There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Perhaps, but there are currently 81 encoding strategies, so using all of those files would make this test take a really long time, and I would guess, dominate the CI test time, especially now that we have the large CRAM test file in that list. So instead I separated them this way. |
||
public final void testAllEncodingStrategyCombinations( | ||
final File cramSourceFile, | ||
final File referenceFile, | ||
final boolean lenientEquality, | ||
final boolean emitDetail) throws IOException { | ||
for (final Tuple<String, CRAMEncodingStrategy> testStrategy : getAllEncodingStrategies()) { | ||
final File tempOutCRAM = File.createTempFile("allEncodingStrategyCombinations", ".cram"); | ||
tempOutCRAM.deleteOnExit(); | ||
CRAMTestUtils.writeToCRAMWithEncodingStrategy(testStrategy.b, cramSourceFile, tempOutCRAM, referenceFile); | ||
assertRoundTripFidelity(cramSourceFile, tempOutCRAM, referenceFile, false); | ||
assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile); | ||
assertRoundTripFidelity(cramSourceFile, tempOutCRAM, referenceFile, lenientEquality, emitDetail); | ||
assertRoundtripFidelityWithSamtools(tempOutCRAM, referenceFile, lenientEquality, emitDetail); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a comment explaining why we also do a test with samtools There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Its basically testing interoperability with samtools for all of the different encodings, to make sure both tool sets agree on the data. Added a comment saying its for interop testing. |
||
} | ||
} | ||
|
||
|
@@ -82,6 +122,7 @@ public void assertRoundTripFidelity( | |
final File sourceFile, | ||
final File targetCRAMFile, | ||
final File referenceFile, | ||
final boolean lenientEquality, | ||
final boolean emitDetail) throws IOException { | ||
try (final SamReader sourceReader = SamReaderFactory.makeDefault() | ||
.referenceSequence(referenceFile) | ||
|
@@ -91,7 +132,13 @@ public void assertRoundTripFidelity( | |
final SAMRecordIterator sourceIterator = sourceReader.iterator(); | ||
final SAMRecordIterator targetIterator = copyReader.getIterator(); | ||
while (sourceIterator.hasNext() && targetIterator.hasNext()) { | ||
if (emitDetail) { | ||
if (lenientEquality) { | ||
final SAMRecord sourceRec = sourceIterator.next(); | ||
final SAMRecord targetRec = targetIterator.next(); | ||
Assert.assertEquals(targetRec.getReadName(), sourceRec.getReadName()); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can lenient mode also check the start/end positions of the read? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, added. |
||
Assert.assertEquals(targetRec.getReadBases(), sourceRec.getReadBases()); | ||
Assert.assertEquals(targetRec.getBaseQualities(), sourceRec.getBaseQualities()); | ||
} else if (emitDetail) { | ||
final SAMRecord sourceRec = sourceIterator.next(); | ||
final SAMRecord targetRec = targetIterator.next(); | ||
if (!sourceRec.equals(targetRec)) { | ||
|
@@ -108,13 +155,17 @@ public void assertRoundTripFidelity( | |
} | ||
} | ||
|
||
private void assertRoundtripFidelityWithSamtools(final File sourceCRAM, final File referenceFile) throws IOException { | ||
private void assertRoundtripFidelityWithSamtools( | ||
final File sourceCRAM, | ||
final File referenceFile, | ||
final boolean lenientEquality, | ||
final boolean emitDetail) throws IOException { | ||
if (SamtoolsTestUtils.isSamtoolsAvailable()) { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. If samtools is not available, the test probably shouldn't just silently do nothing and succeed. Maybe throw a TestNG There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Makes sense. Done. |
||
final File samtoolsOutFile = SamtoolsTestUtils.convertToCRAM( | ||
sourceCRAM, | ||
referenceFile, | ||
"--input-fmt-option decode_md=0 --output-fmt-option store_md=0 --output-fmt-option store_nm=0"); | ||
assertRoundTripFidelity(sourceCRAM, samtoolsOutFile, referenceFile, false); | ||
assertRoundTripFidelity(sourceCRAM, samtoolsOutFile, referenceFile, lenientEquality, emitDetail); | ||
} | ||
} | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -3,6 +3,7 @@ | |
import htsjdk.HtsjdkTest; | ||
import htsjdk.samtools.SAMSequenceRecord; | ||
import htsjdk.samtools.cram.build.CRAMReferenceRegion; | ||
import htsjdk.samtools.cram.structure.AlignmentContext; | ||
import htsjdk.samtools.cram.structure.CRAMStructureTestHelper; | ||
import htsjdk.samtools.reference.InMemoryReferenceSequenceFile; | ||
import org.testng.Assert; | ||
|
@@ -167,6 +168,39 @@ public void testGetReferenceBasesByRegionPositive( | |
Assert.assertEquals(bases, Arrays.copyOfRange(fullContigBases, requestedOffset, requestedOffset + requestedLength)); | ||
} | ||
|
||
// simulate the state transitions that occur when writing a CRAM file | ||
@Test | ||
public void testSerialStateTransitions() { | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Add a comment here explaining and referencing the bug that prompted us to write this test. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Done. |
||
// start with an entire reference sequence | ||
final CRAMReferenceRegion cramReferenceRegion = getAlternatingReferenceRegion(); | ||
cramReferenceRegion.fetchReferenceBases(CRAMStructureTestHelper.REFERENCE_SEQUENCE_ZERO); | ||
final long fullRegionFragmentLength = cramReferenceRegion.getRegionLength(); | ||
Assert.assertEquals(fullRegionFragmentLength, CRAMStructureTestHelper.REFERENCE_CONTIG_LENGTH); | ||
|
||
// transition to a shorter reference fragment using fetchReferenceBasesByRegion, then back to the full region | ||
final int SHORT_FRAGMENT_LENGTH = 5; | ||
Assert.assertTrue(SHORT_FRAGMENT_LENGTH < fullRegionFragmentLength); | ||
cramReferenceRegion.fetchReferenceBasesByRegion(CRAMStructureTestHelper.REFERENCE_SEQUENCE_ZERO, 0, SHORT_FRAGMENT_LENGTH); | ||
Assert.assertEquals(cramReferenceRegion.getRegionLength(), SHORT_FRAGMENT_LENGTH); | ||
|
||
// now transition back to the full sequence | ||
cramReferenceRegion.fetchReferenceBases(CRAMStructureTestHelper.REFERENCE_SEQUENCE_ZERO); | ||
Assert.assertEquals(cramReferenceRegion.getRegionLength(), fullRegionFragmentLength); | ||
|
||
// transition to a shorter region fragment length using fetchReferenceBasesByRegion(AlignmentContext), then back to the full region | ||
Assert.assertTrue(SHORT_FRAGMENT_LENGTH < fullRegionFragmentLength); | ||
cramReferenceRegion.fetchReferenceBasesByRegion( | ||
new AlignmentContext( | ||
new ReferenceContext(CRAMStructureTestHelper.REFERENCE_SEQUENCE_ZERO), | ||
1, | ||
SHORT_FRAGMENT_LENGTH)); | ||
Assert.assertEquals(cramReferenceRegion.getRegionLength(), SHORT_FRAGMENT_LENGTH); | ||
|
||
// now transition back to the full sequence | ||
cramReferenceRegion.fetchReferenceBases(CRAMStructureTestHelper.REFERENCE_SEQUENCE_ZERO); | ||
Assert.assertEquals(cramReferenceRegion.getRegionLength(), fullRegionFragmentLength); | ||
} | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Is There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. It is currently not used anywhere across threads. As you noted, it's not thread-safe. I guess I could add There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Added a comment saying it's not thread-safe. |
||
|
||
@Test | ||
public void testGetReferenceBasesByRegionExceedsContigLength() { | ||
int regionStart = CRAMStructureTestHelper.REFERENCE_CONTIG_LENGTH / 2; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you call the existing
getSAMSequenceRecord()
method here? It does exactly the same thing.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yeah, done.