-
Notifications
You must be signed in to change notification settings - Fork 596
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Move/rename some stray Spark datasource classes, and delete BaseRecalibratorSparkSharded #5192
Conversation
…ibratorSparkSharded -Move the Spark reference datasource classes from the engine.datasources package into the engine.spark.datasources package, and rename them to make it clear that they are for use on Spark. This fixes a longstanding problem where they were getting confused with the walker ReferenceDataSource class. -Delete the unused/unmaintained experimental tool BaseRecalibratorSparkSharded, which has fallen out-of-date relative to BaseRecalibratorSpark. -Delete an extra "VariantSource" class that is now unused (note: this is not the same as VariantSparkSource, which is used extensively and retained here)
@jamesemery please review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest you move the non-spark datasources into org.broadinstitute.hellbender.engine
-> org.broadinstitute.hellbender.engine.datasources
for consistency sake.
Also AddContextDataToReadSpark
is orphaned
@@ -21,33 +20,33 @@ | |||
* | |||
* This class needs to subclassed by test code, so it cannot be declared final. | |||
*/ | |||
public class ReferenceMultiSource implements ReferenceSource, Serializable { | |||
public class ReferenceMultiSparkSource implements ReferenceSparkSource, Serializable { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like this name, this should really be ReferenceSparkSource
as the multi is confusing (are there multiple sparks?"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't like it either, but ReferenceSparkSource
is taken already as the interface name. I think it's better to have a consistent naming convention across these classes for now. We can improve the names if we ever merge the walker and Spark datasource classes.
|
||
Assert.assertTrue(CollectionUtils.isEqualCollection(rddParallelVariants.collect(), variantsList)); | ||
} | ||
|
||
/** | ||
* getVariantsListAs grabs the variants from local files (or perhaps eventually buckets), applies |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need javadocs on a private method used by a handful of tests in this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really, but it doesn't hurt.
@@ -10,15 +9,15 @@ | |||
/** | |||
* Internal interface to load a reference sequence. | |||
*/ | |||
public interface ReferenceSource { | |||
public interface ReferenceSparkSource { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SparkReferenceSource
or ReferenceDataSparkSource
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These seem inconsistent with the convention already in use across the engine.spark.datasources
package. Again, I think it's better to stick to a consistent naming convention for now, since the goal of this PR is to disambiguate these classes vs. their walker counterparts.
* Move/rename ReferenceMultiSparkSourceUnitTest * Delete unused AddContextDataToReadSparkOptimized
Added a second commit to delete the now-unused |
* AddContextDataToReadSpark (and AddContextDataToReadSparkUnitTest) implemented the different JoinStrategy options for BQSR; has been replaced with the Spark Files mecahnism (see #5127) * BroadcastJoinReadsWithRefBases and JoinReadsWithRefBasesSparkUnitTest were only used by AddContextDataToReadSpark * BroadcastJoinReadsWithVariants and JoinReadsWithVariantsSparkUnitTest were only used by AddContextDataToReadSpark * ShuffleJoinReadsWithRefBases and ShuffleJoinReadsWithVariants were only used by AddContextDataToReadSpark * JoinStrategy was only used for BQSR (HC always uses overlaps partitioner), but is no longer used since #5127 * KnownSitesCache was replaced with Spark Files * ReferenceMultiSourceAdapter in HaplotypeCallerSpark was replaced with the regular ReferenceDataSource * BaseRecalibratorEngineSparkWrapper was only used by BaseRecalibratorSparkSharded, which was removed in #5192
* AddContextDataToReadSpark (and AddContextDataToReadSparkUnitTest) implemented the different JoinStrategy options for BQSR; has been replaced with the Spark Files mecahnism (see #5127) * BroadcastJoinReadsWithRefBases and JoinReadsWithRefBasesSparkUnitTest were only used by AddContextDataToReadSpark * BroadcastJoinReadsWithVariants and JoinReadsWithVariantsSparkUnitTest were only used by AddContextDataToReadSpark * ShuffleJoinReadsWithRefBases and ShuffleJoinReadsWithVariants were only used by AddContextDataToReadSpark * JoinStrategy was only used for BQSR (HC always uses overlaps partitioner), but is no longer used since #5127 * KnownSitesCache was replaced with Spark Files * ReferenceMultiSourceAdapter in HaplotypeCallerSpark was replaced with the regular ReferenceDataSource * BaseRecalibratorEngineSparkWrapper was only used by BaseRecalibratorSparkSharded, which was removed in #5192
* AddContextDataToReadSpark (and AddContextDataToReadSparkUnitTest) implemented the different JoinStrategy options for BQSR; has been replaced with the Spark Files mecahnism (see #5127) * BroadcastJoinReadsWithRefBases and JoinReadsWithRefBasesSparkUnitTest were only used by AddContextDataToReadSpark * BroadcastJoinReadsWithVariants and JoinReadsWithVariantsSparkUnitTest were only used by AddContextDataToReadSpark * ShuffleJoinReadsWithRefBases and ShuffleJoinReadsWithVariants were only used by AddContextDataToReadSpark * JoinStrategy was only used for BQSR (HC always uses overlaps partitioner), but is no longer used since #5127 * KnownSitesCache was replaced with Spark Files * ReferenceMultiSourceAdapter in HaplotypeCallerSpark was replaced with the regular ReferenceDataSource * BaseRecalibratorEngineSparkWrapper was only used by BaseRecalibratorSparkSharded, which was removed in #5192
-Move the Spark reference datasource classes from the engine.datasources package into the
engine.spark.datasources package, and rename them to make it clear that they are for use
on Spark. This fixes a longstanding problem where they were getting confused with the
walker ReferenceDataSource/ReferenceFileSource classes.
-Delete the unused/unmaintained experimental tool BaseRecalibratorSparkSharded, which has
fallen out-of-date relative to BaseRecalibratorSpark, as well as its unused companion
AddContextDataToReadsSparkOptimized
.-Delete an extra "VariantSource" class that is now unused (note: this is not the same as
VariantSparkSource, which is used extensively and retained here)