
block compressed interval streams #7142

Merged: 6 commits into master, Jul 20, 2021

Conversation

@tedsharpe (Contributor):

No description provided.

@tedsharpe requested review from mwalker174 and droazen on March 15, 2021.
@mwalker174 (Contributor) left a comment:

Thank you @tedsharpe, this is looking great. Before I do a deep dive on this, I was having trouble getting some basic test cases to work with the existing tools. Can you take a shot at my comments below? I realize the Feature class design might make some of this complicated to implement. If headerless files aren't feasible, we could look into forking gatk right now so we can use LocalizeSVEvidence in the current gatk-sv pipeline.

@@ -286,12 +290,13 @@ public FeatureDataSource(final FeatureInput<T> featureInput, final int queryLook
BucketUtils.getPrefetchingWrapper(cloudIndexPrefetchBuffer),
genomicsDBOptions);

if (IOUtils.isGenomicsDBPath(featureInput)) {
if (IOUtils.isGenomicsDBPath(featureInput) ||
featureInput.getFeaturePath().endsWith(BCI_FILE_EXTENSION)) {
Contributor:

Let's make all extension checks case-insensitive (also in the *BCICodec classes)
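For illustration, a minimal sketch of a case-insensitive check (hasBciExtension is a hypothetical helper name, not code from this PR):

// Hypothetical helper: lower-case the path before testing the extension so
// "foo.BCI" and "foo.bci" are both recognized.
static boolean hasBciExtension( final String featurePath ) {
    return featurePath.toLowerCase(Locale.ROOT).endsWith(BCI_FILE_EXTENSION.toLowerCase(Locale.ROOT));
}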

Comment on lines 177 to 184
private static BufferedWriter createOutputFile( final String fileName ) {
final GATKPath path = new GATKPath(fileName);
if ( fileName.endsWith(".gz") ) {
return new BufferedWriter(
new OutputStreamWriter(
new BlockCompressedOutputStream(path.getOutputStream(), path.toPath(), 6)));
}
return new BufferedWriter(new OutputStreamWriter(path.getOutputStream()));
Contributor:

Let's go ahead and add bci output functionality here

}

@VisibleForTesting
public final static class LocusDepth implements Feature {
Contributor:

Can we make this a public external class? That way other tools can use it, just like with SplitReadEvidence, DiscordantPairEvidence, etc.

Comment on lines 647 to 648
private final int refIdx;
private final int altIdx;
Contributor:

Add comment that these are the ordinal values from Nucleotide

}

public String toString() {
return getContig() + "\t" + getStart() + "\t" + "ACGT".charAt(refIdx) + "\t" +
Contributor:

Maybe change "ACGT".charAt(refIdx) -> Nucleotide.values()[refIdx]? In the unlikely event that the ordering of Nucleotide changes.


public String toString() {
return getContig() + "\t" + getStart() + "\t" + "ACGT".charAt(refIdx) + "\t" +
"ACGT".charAt(altIdx) + "\t" + totalDepth + "\t" + altDepth;
Contributor:

Same here
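A sketch of the suggested change, assuming refIdx/altIdx hold Nucleotide ordinals as noted above:

public String toString() {
    // Look the bases up through the enum instead of a hard-coded "ACGT" string,
    // so the output stays consistent with Nucleotide's ordering.
    return getContig() + "\t" + getStart() + "\t" + Nucleotide.values()[refIdx] + "\t" +
            Nucleotide.values()[altIdx] + "\t" + totalDepth + "\t" + altDepth;
}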

}

private void writeDictionary( final SAMSequenceDictionary dictionary ) throws IOException {
dos.writeInt(dictionary.size());
Contributor:

When converting an evidence file from text to bci format using PrintSVEvidence, dictionary is null here, and I am also getting null errors for the sample metadata. Can you add an optional header to the text formats?

For the current version of the pipeline, we still need to be able to consume a headerless text format with PrintSVEvidence (writing to headerless text). For the next version, we will require headers. In other words, we need to have the following conversion functionality:

  1. headered text -> headered text, bci
  2. unheadered text -> unheadered text
  3. bci -> headered text, bci

Contributor Author:

Let's leave the text files alone, and not create versions with headers. To make PrintSVEvidence work with headerless text files, you simply supply a reference with -R.

oneLineSummary = "Prints allele counts.",
programGroup = StructuralVariantDiscoveryProgramGroup.class
)
public class PrintAlleleCounts extends FeatureWalker<LocusDepth> {
Contributor:

Can't remember, but I may have asked for this and wasn't clear about what I meant. Instead of this tool, can you add functionality to PrintSVEvidence to convert between locus depth bci and text formats? (consistent with the other evidence types)

Comment on lines 232 to 240
final String strandA = r.isReadReverseStrand() ? "-" : "+";
final String strandB = r.isMateReverseStrand() ? "-" : "+";

final DiscordantPairEvidence discordantPair = new DiscordantPairEvidence(sampleName, r.getContig(), r.getStart(), strandA, r.getMateContig(), r.getMateStart(), strandB);
peWriter.add(discordantPair);
try {
// subtract 1 from positions to match pysam output
peWriter.write(r.getContig() + "\t" + (r.getStart() - 1) + "\t" + strandA + "\t" + r.getMateContig() + "\t" + (r.getMateStart() - 1) + "\t" + strandB + "\t" + sampleName + "\n");
} catch (IOException e) {
throw new GATKException("Could not write to PE file", e);
}
Contributor:

This looks like recently changed code from commit b68eb8778d65f84c46f2e1fe44683e861bfd67d2 on master (and elsewhere in this file). Can you check for accidental reversions?

@tedsharpe force-pushed the tws_BlockCompressedIntervalStreams branch from 304d73f to c2ef909 on April 29, 2021.
@lbergelson (Member):

> could look into forking gatk

This seems problematic. How can we avoid this?

@lbergelson (Member) left a comment:

@tedsharpe This is cool and useful. Self-indexing files are great. I think there might be a few issues with the changes to FeatureWalker, but I don't remember the details of why it was set up the way it was, so I'm summoning @droazen, who probably does. I also despair at the introduction of yet another copy of IntervalTree (and the addition of yet another interval class). Can we avoid this? Should we refactor the IntervalTree class in htsjdk to allow it to be parameterized with a comparator? It would be useful if the new classes had some javadoc explaining how they differ from the existing, similarly named classes.

import java.io.IOException;
import java.util.function.Supplier;

public class CollatingInterval implements Feature, Comparable<CollatingInterval> {
Member:

These classes need javadoc explaining what these are and how they're different from existing classes of similar name/function.

Member:

Am I correct in my understanding that this is just an interval that uses the contig order in the sequence dictionary to sort? We do already have IntervalUtils.getDictionaryOrderComparator() to do this sort of thing. Do we really need yet another interval class?
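For reference, a minimal sketch of the existing alternative mentioned here, assuming the standard GATK IntervalUtils and SimpleInterval APIs:

// Sort plain SimpleIntervals in dictionary (reference) order without a dedicated interval class.
static void sortByDictionaryOrder( final List<SimpleInterval> intervals,
                                   final SAMSequenceDictionary dictionary ) {
    intervals.sort(IntervalUtils.getDictionaryOrderComparator(dictionary));
}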

Contributor Author:

I've backed all but a couple of trivial changes out of FeatureWalker (as you'll see at the next check-in).

Contributor Author:

Sorry to introduce another copy of IntervalTree. But this is the good one. Let's get rid of the others. ;-)

drivingFeatures = new FeatureDataSource<>(new FeatureInput<>(drivingPath), FeatureDataSource.DEFAULT_QUERY_LOOKAHEAD_BASES, null, cloudPrefetchBuffer, cloudIndexPrefetchBuffer, referenceArguments.getReferencePath());

final FeatureInput<F> drivingFeaturesInput = new FeatureInput<>(drivingPath, "drivingFeatureFile");
drivingFeaturesInput = new FeatureInput<>(drivingPath, "drivingFeatureFile");
Member:

Does this no longer respect cloud settings and pass through a reference?

progressMeter.update(feature);
});
final Iterator<F> featureItr =
features.getFeatureIterator(drivingFeaturesInput, userIntervals);
Member:

It seems like this change breaks some existing features like setting the appropriate caching behavior and cloud lookahead settings.

I believe there was an explicit reason to use a separate FeatureDataSource for the driving features instead of the one in the feature manager, but I don't remember what the issue is exactly. @droazen can you comment?

Contributor:

@lbergelson I've added a comment in a separate review below.

import java.util.NoSuchElementException;
import java.util.Objects;

/**
Member:

It's depressing that we need a 3rd copy of this code. Can we abstract it slightly so we can have 1 instead of 3?

IntervalTree<Long> index;
boolean usedByIterator;

public Reader( final FeatureInput<T> inputDescriptor, final FeatureCodec<T, Reader<T>> codec ) {
Member:

It would be nice if these could all route through 1 copy of the setup code instead of 3 slight variants but maybe that's not feasible.

this.codec = codec;
final SeekableStream ss;
try {
ss = SeekableStreamFactory.getInstance().getStreamFor(path);
Member:

@cmnbroad HtsPath -> string -> SeekableStream seems very roundabout. Do we have a preferred way of doing this? I didn't see one other than HtsStream -> Path -> SeekablePathStream.

@tedsharpe (Contributor Author):

@lbergelson I think the idea of the BlockCompressedIntervalStream is a sound one, and will be quite useful. The code around it -- trying to integrate it with the Feature system -- is a stinky mess, and I'm trying to clean it up.
I'm close to having a major update. Let's think about it more at that point.

@lbergelson (Member):

@tedsharpe Yes, the Feature / codec system is indeed a stinky mess.

@gatk-bot commented May 13, 2021:

Travis reported job failures from build 34176
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 34176.12 logs
integration openjdk8 34176.2 logs

@droazen (Contributor) left a comment:

@tedsharpe I've commented on the FeatureWalker changes, as requested. I think we'll need to adjust them slightly, but I'm sure we can work something out that won't be cumbersome. I just need some more info from you before I can give you my suggestion.

@@ -20,7 +21,7 @@
*/
public abstract class FeatureWalker<F extends Feature> extends WalkerBase {

private FeatureDataSource<F> drivingFeatures;
private FeatureInput<F> drivingFeaturesInput;
Contributor:

@tedsharpe @lbergelson The "historical reason" why we create a special separate additional FeatureDataSource for the driving input is twofold:

  • There are some tools that need to view the records from the primary input in two ways simultaneously: record-by-record as part of the main traversal, and via windowed queries around the current locus against the FeatureContext. This is why we create both a standalone drivingFeatures data source, and also add an additional FeatureInput for the driving features to features (so it will also show up in the FeatureContext).

  • To get reasonable performance in the above use case, we need different caching settings for the drivingFeatures for the main traversal vs. the FeatureInput added to the feature sources.

Now, that's the historical reason. The truth is that this only affects certain VariantWalkers (and if you read the source for VariantWalker, you'll find some comments there on this issue). But FeatureWalker was intended to be a generalization of VariantWalker, so it would be good if we could keep them consistent if possible.

@tedsharpe Could you briefly explain why you needed to make these changes to FeatureWalker? Once I understand your goals here better, I can suggest a way forward. I'm sure we can easily work something out that would require minimal changes on your end!
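For reference, a minimal sketch of the two-view setup described above, pieced together from the snippets quoted in this review (argument lists beyond what is quoted are assumptions):

// 1) Standalone data source that drives the linear traversal; lookahead caching helps here.
drivingFeatures = new FeatureDataSource<>(new FeatureInput<>(drivingPath),
        FeatureDataSource.DEFAULT_QUERY_LOOKAHEAD_BASES, null, cloudPrefetchBuffer,
        cloudIndexPrefetchBuffer, referenceArguments.getReferencePath());

// 2) The same file registered as a FeatureInput so it also appears in the FeatureContext;
//    lookahead is 0 because windowed queries interact poorly with forward-only caching.
final FeatureInput<F> drivingFeaturesInput = new FeatureInput<>(drivingPath, "drivingFeatureFile");
features.addToFeatureSources(drivingFeaturesInput,
        new FeatureDataSource<>(drivingFeaturesInput, 0, null, cloudPrefetchBuffer,
                cloudIndexPrefetchBuffer, referenceArguments.getReferencePath()));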

Contributor Author:

@droazen I've been able to back out almost all my changes to FeatureWalker. Once I saw that the Tabix reader implemented the famous fake-iterable pattern, I understood what you documented above.
I'm going to rip through the engine classes that I've changed and see what else I can revert.

@tedsharpe (Contributor Author):

@mwalker174 I think I've addressed almost all of your comments. I have one little piece of work to go, as I mentioned at standup.

@tedsharpe (Contributor Author):

@lbergelson I'm going to try to convince you that implementing Intervals (and therefore Features) by holding a SAMSequenceRecord rather than a String contig name is a really good idea. Comparability. Speed. Memory use. More to follow...

@lbergelson (Member):

@tedsharpe You make good points in theory, but in practice we intern the chromosome names, so we only ever end up holding one copy of each string and comparison mostly short-circuits to object comparison. Previous iterations held a contig index instead, which was a horribly bug-ridden mess. Your solution seems fine, but I'm not sure you're actually going to see much in the way of benefits in practice. It also introduces some new potential issues when dealing with intervals that come from different sources with slightly different references that still represent the same data.

@lbergelson (Member):

If you show me that your code actually wastes a lot of time and effort on contig name comparison, I'll be happy to eat my words...

@gatk-bot commented May 18, 2021:

Travis reported job failures from build 34203
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 34203.12 logs
integration openjdk8 34203.2 logs

@tedsharpe (Contributor Author):

CollatingInterval does the equivalent of intern() without the overhead: It always returns the dictionary's copy of the contig name. (And it does this perforce. I think we're actually pretty inconsistent about interning.)
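As an illustration of the idea, a minimal sketch of the comparison CollatingInterval makes possible (the field names sequenceRecord/start/end are assumed, not the actual implementation):

// Dictionary order falls out of the sequence index; no contig-name String comparison needed.
@Override
public int compareTo( final CollatingInterval that ) {
    int result = Integer.compare(this.sequenceRecord.getSequenceIndex(),
                                 that.sequenceRecord.getSequenceIndex());
    if ( result == 0 ) result = Integer.compare(this.start, that.start);
    if ( result == 0 ) result = Integer.compare(this.end, that.end);
    return result;
}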

I've tested sorting (which I would expect to be dominated by the cost of interval comparison) with 1) CollatingIntervals, and then, using the DictionaryOrderComparator, 2) SimpleIntervals with effectively interned contig Strings, and 3) SimpleIntervals with novel Strings. CollatingInterval is (disappointingly to me) only twice as fast with interned Strings, and three times as fast with novel Strings on my pathetically ancient Ubuntu machine. So is 2x and 3x "a lot of time and effort"?

Seems to me like the potential issues with slightly different references are already a can of worms, and I don't see that this creates new worms.

You know what? It really isn't worth the struggle. It seemed kind of cool and innovative to me, but nobody but me seems to find it compelling. I'll get rid of it all.

@lbergelson (Member):

@tedsharpe It's not a bad idea. I'm just very resistant to change... 2x speedup is significant.

It sort of just has different worms than the existing solution does, which might catch people out who aren't expecting them.

@tedsharpe (Contributor Author):

I'm supposed to be providing block-compressed streams for interval-oriented data. This is a side-trip. I'll rip it back to the bare minimum and we can have another look.
Thanks very much for your time in reviewing.

@tedsharpe (Contributor Author):

@lbergelson @mwalker174 I think this is ready for a re-review, if you could, please.

@gatk-bot commented May 24, 2021:

Travis reported job failures from build 34279
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 34279.12 logs
integration openjdk8 34279.2 logs

@gatk-bot commented May 25, 2021:

Travis reported job failures from build 34296
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 34296.12 logs
integration openjdk8 34296.2 logs

@mwalker174 (Contributor) left a comment:

Thank you @tedsharpe, this is looking great. I have comments mainly pertaining to documentation and usability. Also, what would you think about moving some of the SV utilities (SVIntervalTree, SVInterval) to the engine/htsjdk? I assume they perform better than using OverlapDetector/IntervalTree and SimpleInterval.

@@ -69,6 +73,10 @@
public static final String PAIRED_END_FILE_ARGUMENT_LONG_NAME = "pe-file";
public static final String SPLIT_READ_FILE_ARGUMENT_SHORT_NAME = "SR";
public static final String SPLIT_READ_FILE_ARGUMENT_LONG_NAME = "sr-file";
public static final String ALLELE_COUNT_OUTPUT_ARGUMENT_SHORT_NAME = "AC";
Contributor:

Need to add tool documentation for this above

Contributor:

I think we need to rename the tool since it does AC as well now - I would suggest CollectSVEvidence. @cwhelan any opinions on that?

final List<String> sampleNames = Collections.singletonList(sampleName);
peWriter = createPEWriter();
srWriter = createSRWriter();
if ( alleleCountInputFilename != null && alleleCountOutputFilename != null ) {
Contributor:

Should throw an exception if one is specified but not the other.

private FeatureOutputStream<DepthEvidence> rdStream;
private FeatureCodec<? extends Feature, ?> featureCodec;
private Class<? extends Feature> evidenceClass;
@Argument(doc = "List of sample names", fullName = "sample-names", optional = true)
Contributor:

Can you make this into a sample list file? Also need documentation for when this is required (seems to be when ingesting a headerless feature file)

Contributor Author:

If the argument is of type List, Barclay will automatically read a file named "*.list" to populate it. It's now documented that this is necessary when the input is tab-delimited text, except for read depth.
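For illustration, a sketch of the kind of declaration this relies on (the field name is assumed; the annotation mirrors the one quoted above):

// Barclay fills a List-typed @Argument from a "*.list" file (one entry per line)
// when such a file is passed on the command line, e.g. --sample-names samples.list
@Argument(doc = "List of sample names", fullName = "sample-names", optional = true)
private List<String> sampleNames = new ArrayList<>();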

@@ -59,7 +58,7 @@
)
@ExperimentalFeature
@DocumentedFeature
public final class PrintSVEvidence extends FeatureWalker<Feature> {
public final class PrintSVEvidence <F extends Feature> extends FeatureWalker<F> {
Contributor:

Will need documentation about what inputs are required when the input feature files don't have headers

throw new UserException("vcf contains a SNP with a non-standard reference base " +
refCall + " at locus " + snp.getContig() + ":" + snp.getStart());
}
final byte[] altSeq = snp.getAlternateAllele(0).getBases();
Contributor:

We only want to collect bi-allelic sites for gatk-sv, so at a minimum multi-allelics should be skipped. It would be great, if it's not too much trouble, to have the option to collect multi-allelics on the allele with the highest count (as with the CN tools) to allow for some experimentation.

Contributor Author:

Omitting multi-allelics for now.
Would you like me to change the output format to always show counts for A, C, G, and T? I.e., change columns from
contig, position, refAllele, altAllele, totalCount, altCount
to
contig, position, refAllele, aCount, cCount, gCount, tCount

Contributor:

Actually that might be a very good solution. Any file size or performance hits for that? Let's discuss at standup.

Comment on lines +21 to +27
if ( !versionChecked ) {
if ( !DepthEvidence.BCI_VERSION.equals(reader.getVersion()) ) {
throw new UserException("baf.bci file has wrong version: expected " +
DepthEvidence.BCI_VERSION + " but found " + reader.getVersion());
}
versionChecked = true;
}
Contributor:

We will have tools that consume multiple evidence files, and it looks like this would only check the first one. It may be better to leave it up to the implementer to check versions. Maybe add a checkVersion(String version) function to the FeatureOutputCodec interface?
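A rough sketch of what such an addition could look like; the interface shape, getVersion(), and the message text are assumptions, not existing GATK API:

public interface VersionedFeatureCodec {
    // Hypothetical accessor for the version string the codec expects to read.
    String getVersion();

    // Hypothetical default hook each consumer could call once per file.
    default void checkVersion( final String fileVersion ) {
        if ( !getVersion().equals(fileVersion) ) {
            throw new UserException("wrong file version: expected " + getVersion() +
                    " but found " + fileVersion);
        }
    }
}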

@tedsharpe (Contributor Author) commented Jul 6, 2021:

Dealing with codecs is all completely hidden from the user in the engine classes. However, the engine classes do the right thing and create a new codec instance for each FeatureReader. So this will be fine: Each bci file read will have its version checked just once.

Comment on lines +17 to +21
this.className = className;
this.version = version;
Contributor:

Should check that these aren't null
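A minimal sketch of the suggested checks, using the usual GATK Utils.nonNull idiom (the messages are illustrative):

// Fail fast with a clear message if either constructor argument is null.
this.className = Utils.nonNull(className, "className must not be null");
this.version = Utils.nonNull(version, "version must not be null");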

final DataInputStream dis;
final FeaturesHeader header;
final long dataFilePointer;
SVIntervalTree<Long> index;
Contributor:

I think it would be ideal to move this and SVInterval out of the SV packages and into htsjdk or the engine utils.

Comment on lines +94 to +95
return contig + "\t" + position + "\t" + (char)refCall + "\t" +
depths[0] + "\t" + depths[1] + "\t" + depths[2] + "\t" + depths[3];
Contributor:

Need to update the tool documentation with the new columns

Comment on lines 149 to 203
private void dumpHeader( final FeaturesHeader header ) {
System.out.println("Dictionary:");
for ( final SAMSequenceRecord rec : header.getDictionary().getSequences() ) {
System.out.println(rec.getSAMString());
}
System.out.println();
System.out.println("Samples:");
for ( final String sampleName : header.getSampleNames() ) {
System.out.println(sampleName);
}
}
Contributor:

I should have been more specific about this in my review. Can you make --sequence-dict-dump and --sample-list-dump arguments (or the like) that specify file to write the header info to? This is just to be a little more machine-friendly.

@mwalker174 (Contributor) left a comment:

This is great! Looks good to me!

@tedsharpe force-pushed the tws_BlockCompressedIntervalStreams branch from 6782b95 to acf83ba on July 13, 2021.
@tedsharpe (Contributor Author):

@droazen I think this has addressed your concerns and is ready to merge. Could you take a quick look?

@droazen (Contributor) commented Jul 13, 2021:

@tedsharpe Great, will take a last look now!

@droazen (Contributor) left a comment:

@tedsharpe Re-reviewed the changes to the classes in the engine package. Looks fine overall, but I did have one or two minor comments that will be easy to resolve (and I provided a suggested patch for one of them). Once these are addressed, I think this can be merged if the other reviewers have also approved.

if ( intervals != null ) {
dataSource.setIntervalsForTraversal(intervals);
}
return dataSource.iterator();
Contributor:

As currently written, this method has the side effect of permanently setting the traversal intervals on the underlying data source, which will affect future calls to getFeatureIterator() as well as other methods. I recommend that you get rid of this side effect by clearing the traversal intervals on the data source after you obtain the iterator, e.g.:

if ( intervals != null ) {
    dataSource.setIntervalsForTraversal(intervals);
}
final Iterator<T> iter = dataSource.iterator();
dataSource.setIntervalsForTraversal(null);
return iter;

Would be less clunky if there were an iterator(List<SimpleInterval>) method we could use here, but there isn't, unfortunately.

Contributor Author:

I opted to always setIntervalsForTraversal before getting the iterator, and always set them to null afterwards. Fewer assumptions about the state of the source. OK?
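A minimal sketch of the approach described, assuming setIntervalsForTraversal tolerates a null argument (as in the suggested patch above):

// Always set the traversal intervals (possibly null), grab the iterator, then clear them
// so no traversal state leaks into later calls on the data source.
dataSource.setIntervalsForTraversal(intervals);
final Iterator<T> featureIterator = dataSource.iterator();
dataSource.setIntervalsForTraversal(null);
return featureIterator;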

final FeatureInput<F> featureInput = new FeatureInput<>(drivingPath, "drivingFeatureFile");
featureInput.setFeatureCodecClass((Class<FeatureCodec<F, ?>>)codec.getClass());
features.addToFeatureSources(featureInput,
new FeatureDataSource<>(featureInput, FeatureDataSource.DEFAULT_QUERY_LOOKAHEAD_BASES, null,
Contributor:

The previous implementation passed in 0 here for queryLookaheadBases rather than FeatureDataSource.DEFAULT_QUERY_LOOKAHEAD_BASES, and used FeatureDataSource.DEFAULT_QUERY_LOOKAHEAD_BASES only for drivingFeatures itself (above). This was because we know that drivingFeatures is going to be used for a full linear single-pass traversal, so the caching makes sense, whereas the secondary "drivingFeatureFile" FeatureInput might be used for things like windowed queries around individual loci that would interact poorly with the very simplistic forward-looking caching scheme.

Now, I don't object to turning the caching on in both places if you've done some benchmarking and concluded that doing so will result in better performance for typical FeatureWalkers -- is that in fact the case?

Contributor Author:

Nah. Good catch. The change was inadvertent.

import java.nio.file.Path;
import java.util.*;

public class BlockCompressedIntervalStream {
Contributor:

Shouldn't BlockCompressedIntervalStream be covered by direct unit tests?

Contributor Author:

Yes. I thought the indirect tests covered everything, but they didn't. Added a test for the query feature (as well as reduplicated the round-trip test). Thanks for catching this, too.

@tedsharpe (Contributor Author):

I think maybe we're ready to go this time?

@gatk-bot commented Jul 14, 2021:

Travis reported job failures from build 34955
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 34955.12 logs
integration openjdk8 34955.2 logs

@droazen (Contributor) commented Jul 15, 2021:

@tedsharpe Looks like you have some test failures in FuncotateSegmentsIntegrationTest (see latest gatk-bot comment above for the test logs). Could you resolve these before I re-review?

@gatk-bot commented Jul 16, 2021:

Travis reported job failures from build 34987
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 34987.1 logs
cloud openjdk11 34987.14 logs

@droazen (Contributor) commented Jul 16, 2021:

@tedsharpe Some additional failures in VariantWalkerGCSSupportIntegrationTest -- are these related to this branch?

@tedsharpe (Contributor Author):

Tests all passing, now.

@droazen (Contributor) left a comment:

@tedsharpe Two quick remaining TODOs for you, then we can hit merge

final FeatureInput<F> featureInput = new FeatureInput<>(drivingPath, "drivingFeatureFile");
featureInput.setFeatureCodecClass((Class<FeatureCodec<F, ?>>)codec.getClass());
features.addToFeatureSources(featureInput,
new FeatureDataSource<>(featureInput, 0, null,
Contributor:

The third argument to this constructor should be codec.getFeatureType(), not null -- unless you changed it to null deliberately?

import java.util.ArrayList;
import java.util.Collections;

public class BlockCompressIntervalStreamUnitTests {
Contributor:

Unit test classes should extend GATKBaseTest, and the class name should end in *UnitTest, not *UnitTests -- I believe this matters when we parallelize the test suite by test type in Travis.
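A minimal sketch of the suggested rename and base class (test bodies elided):

// Singular "*UnitTest" suffix plus the GATKBaseTest base class, per GATK test conventions.
public class BlockCompressIntervalStreamUnitTest extends GATKBaseTest {
    // ... test methods ...
}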

@droazen (Contributor) left a comment:

👍

@droazen merged commit c042e8f into master on Jul 20, 2021.
@droazen deleted the tws_BlockCompressedIntervalStreams branch on July 20, 2021.