Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Move/rename some stray Spark datasource classes, and delete BaseRecalibratorSparkSharded #5192

Merged
merged 2 commits into from
Sep 14, 2018

Conversation

droazen
Copy link
Contributor

@droazen droazen commented Sep 14, 2018

-Move the Spark reference datasource classes from the engine.datasources package into the
engine.spark.datasources package, and rename them to make it clear that they are for use
on Spark. This fixes a longstanding problem where they were getting confused with the
walker ReferenceDataSource/ReferenceFileSource classes.

-Delete the unused/unmaintained experimental tool BaseRecalibratorSparkSharded, which has
fallen out-of-date relative to BaseRecalibratorSpark, as well as its unused companion AddContextDataToReadsSparkOptimized.

-Delete an extra "VariantSource" class that is now unused (note: this is not the same as
VariantSparkSource, which is used extensively and retained here)

…ibratorSparkSharded

-Move the Spark reference datasource classes from the engine.datasources package into the
 engine.spark.datasources package, and rename them to make it clear that they are for use
 on Spark. This fixes a longstanding problem where they were getting confused with the
 walker ReferenceDataSource class.

-Delete the unused/unmaintained experimental tool BaseRecalibratorSparkSharded, which has
 fallen out-of-date relative to BaseRecalibratorSpark.

-Delete an extra "VariantSource" class that is now unused (note: this is not the same as
 VariantSparkSource, which is used extensively and retained here)
@droazen
Copy link
Contributor Author

droazen commented Sep 14, 2018

@jamesemery please review

Copy link
Collaborator

@jamesemery jamesemery left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest you move the non-spark datasources into org.broadinstitute.hellbender.engine -> org.broadinstitute.hellbender.engine.datasources for consistency sake.

Also AddContextDataToReadSpark is orphaned

@@ -21,33 +20,33 @@
*
* This class needs to subclassed by test code, so it cannot be declared final.
*/
public class ReferenceMultiSource implements ReferenceSource, Serializable {
public class ReferenceMultiSparkSource implements ReferenceSparkSource, Serializable {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like this name, this should really be ReferenceSparkSource as the multi is confusing (are there multiple sparks?"

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like it either, but ReferenceSparkSource is taken already as the interface name. I think it's better to have a consistent naming convention across these classes for now. We can improve the names if we ever merge the walker and Spark datasource classes.


Assert.assertTrue(CollectionUtils.isEqualCollection(rddParallelVariants.collect(), variantsList));
}

/**
* getVariantsListAs grabs the variants from local files (or perhaps eventually buckets), applies
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need javadocs on a private method used by a handful of tests in this class?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really, but it doesn't hurt.

@@ -10,15 +9,15 @@
/**
* Internal interface to load a reference sequence.
*/
public interface ReferenceSource {
public interface ReferenceSparkSource {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SparkReferenceSource or ReferenceDataSparkSource

Copy link
Contributor Author

@droazen droazen Sep 14, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These seem inconsistent with the convention already in use across the engine.spark.datasources package. Again, I think it's better to stick to a consistent naming convention for now, since the goal of this PR is to disambiguate these classes vs. their walker counterparts.

* Move/rename ReferenceMultiSparkSourceUnitTest
* Delete unused AddContextDataToReadSparkOptimized
@droazen droazen assigned droazen and unassigned jamesemery Sep 14, 2018
@droazen
Copy link
Contributor Author

droazen commented Sep 14, 2018

Added a second commit to delete the now-unused AddContextDataToReadSparkOptimized

@droazen droazen merged commit c935dc7 into master Sep 14, 2018
@droazen droazen deleted the dr_move_spark_datasources branch September 14, 2018 22:45
tomwhite added a commit that referenced this pull request Oct 8, 2018
* AddContextDataToReadSpark (and AddContextDataToReadSparkUnitTest) implemented the different JoinStrategy options for
  BQSR; has been replaced with the Spark Files mecahnism (see #5127)
* BroadcastJoinReadsWithRefBases and JoinReadsWithRefBasesSparkUnitTest were only used by AddContextDataToReadSpark
* BroadcastJoinReadsWithVariants and JoinReadsWithVariantsSparkUnitTest were only used by AddContextDataToReadSpark
* ShuffleJoinReadsWithRefBases and ShuffleJoinReadsWithVariants were only used by AddContextDataToReadSpark
* JoinStrategy was only used for BQSR (HC always uses overlaps partitioner), but is no longer used since #5127
* KnownSitesCache was replaced with Spark Files
* ReferenceMultiSourceAdapter in HaplotypeCallerSpark was replaced with the regular ReferenceDataSource
* BaseRecalibratorEngineSparkWrapper was only used by BaseRecalibratorSparkSharded, which was removed in #5192
jamesemery pushed a commit that referenced this pull request Jan 11, 2019
* AddContextDataToReadSpark (and AddContextDataToReadSparkUnitTest) implemented the different JoinStrategy options for
  BQSR; has been replaced with the Spark Files mecahnism (see #5127)
* BroadcastJoinReadsWithRefBases and JoinReadsWithRefBasesSparkUnitTest were only used by AddContextDataToReadSpark
* BroadcastJoinReadsWithVariants and JoinReadsWithVariantsSparkUnitTest were only used by AddContextDataToReadSpark
* ShuffleJoinReadsWithRefBases and ShuffleJoinReadsWithVariants were only used by AddContextDataToReadSpark
* JoinStrategy was only used for BQSR (HC always uses overlaps partitioner), but is no longer used since #5127
* KnownSitesCache was replaced with Spark Files
* ReferenceMultiSourceAdapter in HaplotypeCallerSpark was replaced with the regular ReferenceDataSource
* BaseRecalibratorEngineSparkWrapper was only used by BaseRecalibratorSparkSharded, which was removed in #5192
jamesemery pushed a commit that referenced this pull request Jan 11, 2019
* AddContextDataToReadSpark (and AddContextDataToReadSparkUnitTest) implemented the different JoinStrategy options for
  BQSR; has been replaced with the Spark Files mecahnism (see #5127)
* BroadcastJoinReadsWithRefBases and JoinReadsWithRefBasesSparkUnitTest were only used by AddContextDataToReadSpark
* BroadcastJoinReadsWithVariants and JoinReadsWithVariantsSparkUnitTest were only used by AddContextDataToReadSpark
* ShuffleJoinReadsWithRefBases and ShuffleJoinReadsWithVariants were only used by AddContextDataToReadSpark
* JoinStrategy was only used for BQSR (HC always uses overlaps partitioner), but is no longer used since #5127
* KnownSitesCache was replaced with Spark Files
* ReferenceMultiSourceAdapter in HaplotypeCallerSpark was replaced with the regular ReferenceDataSource
* BaseRecalibratorEngineSparkWrapper was only used by BaseRecalibratorSparkSharded, which was removed in #5192
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants