Allow for hdfs and gcs URI's to be passed to GenomicsDB #5017

Closed
wants to merge 20 commits into from

Conversation

nalinigans
Collaborator

@nalinigans nalinigans commented Jul 16, 2018

Currently, only POSIX filesystem paths can be passed as workspaces and arrays to GenomicsDB via GenomicsDBImport and SelectVariants. This PR adds support for hdfs and gcs (and emrfs/s3) URIs as well.
Examples

./gatk GenomicsDBImport -V /vcfs/sample.vcf.gz --genomicsdb-workspace-path hdfs://master:9000/gdb_ws -L 1:500-10000
./gatk GenomicsDBImport -V /vcfs/sample.vcf.gz --genomicsdb-workspace-path gs://my_bucket/gdb_ws -L 1:500-10000
./gatk SelectVariants -V gendb.hdfs://master:9000/gdb_ws -R hs37d5.fa -O out.vcf
./gatk SelectVariants -V gendb.gs://my_bucket/gdb_ws -R hs37d5.fa -O out.vcf

GenomicsDB supports GCS via the Cloud Storage Connector. Set the environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the GCS service account JSON file.

@codecov-io

codecov-io commented Jul 16, 2018

Codecov Report

Merging #5017 into master will decrease coverage by 0.311%.
The diff coverage is 70.395%.

@@               Coverage Diff               @@
##              master     #5017       +/-   ##
===============================================
- Coverage     86.743%   86.433%   -0.311%     
+ Complexity     29470     29350      -120     
===============================================
  Files           1818      1813        -5     
  Lines         136436    136481       +45     
  Branches       15125     15038       -87     
===============================================
- Hits          118349    117964      -385     
- Misses         12647     13106      +459     
+ Partials        5440      5411       -29
Impacted Files Coverage Δ Complexity Δ
...titute/hellbender/engine/FeatureInputUnitTest.java 89.32% <ø> (ø) 23 <0> (ø) ⬇️
...e/hellbender/engine/FeatureDataSourceUnitTest.java 91.022% <ø> (ø) 41 <0> (ø) ⬇️
...broadinstitute/hellbender/engine/FeatureInput.java 94.366% <100%> (ø) 18 <3> (ø) ⬇️
...nstitute/hellbender/utils/gcs/BucketUtilsTest.java 56.303% <38.889%> (-3.103%) 13 <1> (+1)
...ls/genomicsdb/GenomicsDBImportIntegrationTest.java 86.842% <40%> (-2.541%) 73 <2> (ø)
...institute/hellbender/utils/io/IOUtilsUnitTest.java 81.373% <65.217%> (-6.347%) 45 <3> (+13)
...oadinstitute/hellbender/utils/gcs/BucketUtils.java 58.537% <72.727%> (-21.983%) 33 <2> (-7)
...institute/hellbender/engine/FeatureDataSource.java 77.483% <76.19%> (+0.413%) 44 <5> (-2) ⬇️
.../hellbender/tools/genomicsdb/GenomicsDBImport.java 82.128% <86.667%> (+3.029%) 52 <3> (ø) ⬇️
...rg/broadinstitute/hellbender/utils/io/IOUtils.java 71.519% <90.476%> (+0.505%) 89 <11> (+35) ⬆️
... and 49 more

@kgururaj
Collaborator

I think you should use version 0.10.0-proto-3.0.0-beta-1+d392491bafcac337 for GenomicsDB

Contributor

@droazen droazen left a comment

First-pass review complete, back to @nalinigans

GenomicsDBConstants.DEFAULT_VIDMAP_FILE_NAME + "," + GenomicsDBConstants.DEFAULT_CALLSETMAP_FILE_NAME + "," +
GenomicsDBConstants.DEFAULT_VCFHEADER_FILE_NAME + ") could not be read from GenomicsDB workspace " + workspace.getAbsolutePath(), e);
}
final String workspace = path.replace(GENOMIC_DB_URI_SCHEME, "");
Contributor

Regex should have an anchor tying it to the start of the String.

Collaborator Author

Agreed, we are using Pattern/Matcher to check valid gendb names, e.g. gendb://file, gendb.hdfs://node:portnum/file and gendb.gs://bucket/file.
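The anchored Pattern/Matcher check described above could look roughly like the following. This is a minimal sketch, not the code from this PR: the class name, the method name `toWorkspacePath`, and the exact regular expression are assumptions made for illustration, and the literal "gendb" stands in for GENOMIC_DB_URI_SCHEME.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GenDbPathSketch {
    // Hypothetical pattern: "gendb", an optional ".<scheme>" suffix, then "://" and the rest.
    // The leading ^ ties the match to the start of the string, so a "gendb://" that
    // appears in the middle of a path is not treated as a GenomicsDB URI.
    private static final Pattern GENDB_PATTERN =
            Pattern.compile("^gendb(?:\\.([A-Za-z][A-Za-z0-9+.-]*))?://(.+)");

    /** Returns the path with the gendb prefix stripped, or null if the input is not a gendb path. */
    public static String toWorkspacePath(final String path) {
        if (path == null) {
            return null;
        }
        final Matcher matcher = GENDB_PATTERN.matcher(path);
        if (!matcher.find()) {
            return null;
        }
        final String scheme = matcher.group(1); // null for plain gendb://
        final String rest = matcher.group(2);
        return scheme == null ? rest : scheme + "://" + rest;
    }

    public static void main(final String[] args) {
        System.out.println(toWorkspacePath("gendb://file"));                // file
        System.out.println(toWorkspacePath("gendb.hdfs://node:9000/file")); // hdfs://node:9000/file
        System.out.println(toWorkspacePath("gendb.gs://bucket/file"));      // gs://bucket/file
        System.out.println(toWorkspacePath("/data/gendb://file"));          // null
    }
}
```

Unlike a plain `String.replace`, which rewrites every occurrence of the scheme wherever it appears, the anchored match only accepts inputs that begin with it.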

final String workspace = path.replace(GENOMIC_DB_URI_SCHEME, "");
final String callsetJson = BucketUtils.appendPathToDir(workspace, GenomicsDBConstants.DEFAULT_CALLSETMAP_FILE_NAME);
final String vidmapJson = BucketUtils.appendPathToDir(workspace, GenomicsDBConstants.DEFAULT_VIDMAP_FILE_NAME);
final String vcfHeader = BucketUtils.appendPathToDir(workspace, GenomicsDBConstants.DEFAULT_VCFHEADER_FILE_NAME);
Contributor

This method would be more at home in IOUtils

Collaborator Author

Are you talking about the BucketUtils.appendPathToDir() method? I was thinking it would belong in IOUtils too, but was following the example of BucketUtils.makeFilePathAbsolute(). I will move appendPathToDir() to IOUtils however. Thanks.

Collaborator Author

Moved appendPathToDir to IOUtils.

* @param path the path relative to dir
* @return the appended path as a String.
*/
public static String appendPathToDir(String dir, String path) {
Contributor

Can you convert the String to a Path, and use resolve() instead of adding this method?

Collaborator Author

Sure, will do.
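For reference, the Path.resolve() approach suggested above can be sketched as follows. This is an illustration for local paths only, using the default filesystem; resolving hdfs:// or gs:// locations this way would need a Path obtained from a matching NIO filesystem provider, which this sketch does not cover. The class name is hypothetical.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

public class ResolveSketch {
    /** Appends a relative path to a directory using Path.resolve() instead of string concatenation. */
    public static String appendPathToDir(final String dir, final String path) {
        final Path dirPath = Paths.get(dir);
        return dirPath.resolve(path).toString();
    }

    public static void main(final String[] args) {
        // On a Unix-like system this prints /path/to/dir/anotherdir/file
        System.out.println(appendPathToDir("/path/to/dir", "anotherdir/file"));
    }
}
```

One behavior to keep in mind: resolve() returns its argument unchanged when that argument is itself an absolute path, so callers that expect the result to stay under dir should pass relative paths only.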

.setVidMappingFile(vidmapJson.getAbsolutePath())
.setCallsetMappingFile(callsetJson.getAbsolutePath())
.setVcfHeaderFilename(vcfHeader.getAbsolutePath())
.setVidMappingFile(vidmapJson)
Contributor

Are these guaranteed to be absolute at this point? (the previous code enforced this)

Collaborator Author

I have made sure that all these paths are absolute in the changed code where we are allowing for gendb://, gendb.hdfs:// and gendb.gs:// schemes.

}
private String overwriteOrCreateWorkspace() {
String workspaceDir = BucketUtils.makeFilePathAbsolute(workspace);
if (GenomicsDBUtils.createTileDBWorkspace(workspaceDir, overwriteExistingWorkspace) < 0) {
Contributor

If the workspace already exists and overwriteExistingWorkspace is false, we want an UnableToCreateGenomicsDBWorkspace exception to be thrown, as in the previous implementation.
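The behavior requested here can be illustrated with a small local-filesystem sketch. It is an assumption-laden stand-in: `WorkspaceCreationException` is a hypothetical substitute for GATK's UnableToCreateGenomicsDBWorkspace, and java.nio.file.Files replaces the GenomicsDB workspace-creation call, so it shows only the control flow and not the real implementation.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class WorkspaceSketch {
    /** Hypothetical stand-in for GATK's UnableToCreateGenomicsDBWorkspace exception. */
    public static class WorkspaceCreationException extends RuntimeException {
        public WorkspaceCreationException(final String message) {
            super(message);
        }
    }

    /**
     * Creates the workspace directory, refusing to touch an existing one
     * unless the caller explicitly asked to overwrite it.
     */
    public static Path overwriteOrCreateWorkspace(final Path workspace, final boolean overwrite)
            throws IOException {
        if (Files.exists(workspace) && !overwrite) {
            throw new WorkspaceCreationException(
                    "Workspace already exists and overwriting was not requested: " + workspace);
        }
        // A real implementation would also clear the old contents when overwriting;
        // this sketch only makes sure the directory is present.
        return Files.createDirectories(workspace);
    }
}
```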

@@ -680,13 +666,6 @@ private File overwriteOrCreateWorkspace() {
}
}

private static void checkIfValidWorkspace(final File workspaceDir) {
final File tempFile = new File(workspaceDir.getAbsolutePath() + "/__tiledb_workspace.tdb");
Contributor

Is this check performed internally now?

Collaborator Author

@nalinigans nalinigans Aug 15, 2018

Yes, this check is performed internally in GenomicsDB now. gatk should not have to worry about internal data structures.

Assert.assertEquals(BucketUtils.appendPathToDir("hdfs://namenode:9000/dir", "file"), "hdfs://namenode:9000/dir/file");
Assert.assertEquals(BucketUtils.appendPathToDir("gs://abucket/dir", "file"), "gs://abucket/dir/file");
}

Contributor

I think we need some tests covering the new functionality introduced by this PR (i.e., GCS support). Let us know if you need help setting this up -- there are examples of tests that access data in GCS buckets in GCSNioIntegrationTest.

Collaborator Author

@droazen, yes, I need some help setting up tests for using hdfs and gs URIs with GenomicsDB. I am taking a look at GCSNioIntegrationTest, but I am still not sure about setting up a test bucket with a folder, etc. on GCS and getting authentication set up via GOOGLE_APPLICATION_CREDENTIALS for GenomicsDB.

importConfig.setOutputCallsetmapJsonFile(callsetMapJSONFile.getAbsolutePath());
importConfig.setOutputVidmapJsonFile(vidMapJSONFile.getAbsolutePath());
importConfig.setOutputVcfHeaderFile(vcfHeaderFile.getAbsolutePath());
importConfig.setOutputCallsetmapJsonFile(callsetMapJSONFile);
Contributor

Are these guaranteed to be absolute at this point? (the previous implementation guaranteed this)

Collaborator Author

Yes. These are generated with respect to the workspace directory that is made absolute - see line 655 : String workspaceDir = BucketUtils.makeFilePathAbsolute(workspace);
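To make the "absolute at this point" guarantee concrete, here is a rough sketch of the idea of leaving fully qualified URIs alone while making local paths absolute. It is not the actual BucketUtils.makeFilePathAbsolute() implementation; the class name and the simple "://" check are assumptions for illustration.

```java
import java.nio.file.Paths;

public class AbsolutePathSketch {
    /**
     * Returns the path unchanged if it already carries a URI scheme (for example hdfs:// or gs://);
     * otherwise resolves it against the current working directory.
     */
    public static String makeAbsolute(final String path) {
        if (path.contains("://")) {
            return path; // assumed to be a fully qualified URI already
        }
        return Paths.get(path).toAbsolutePath().toString();
    }

    public static void main(final String[] args) {
        System.out.println(makeAbsolute("gs://my_bucket/gdb_ws")); // unchanged
        System.out.println(makeAbsolute("gdb_ws"));                // absolute local path
    }
}
```

Files derived from the workspace (callset map, vid map, VCF header) then inherit absoluteness from the workspace directory they are resolved against.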

@@ -369,29 +370,13 @@ public static boolean isGenomicsDBPath(final String path) {
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from a non-GenomicsDB input");
}

Contributor

Instead of gendb://gs:// as the URI convention, can you introduce a new URI scheme for "gendb on gcs"? E.g., gendbgs:// (and still also accept gendb:// for local gendb instances).

This way, the URIs will at least be legal...

Collaborator Author

For the URI schemes, would you consider gendb://, gendb.gs://, gendb.hdfs://, etc.?
According to Wikipedia, these are valid schemes: a scheme is a non-empty component followed by a colon (:), consisting of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (+), period (.), or hyphen (-).

Contributor

If a period in the scheme is legal, then these would be fine. I'd check to make sure that Java's URI, URL, and java.nio.file.Path classes accept such schemes, though.

Collaborator Author

Yes, URI and Path recognize period in their schemes.
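A quick way to check the java.net.URI side of this claim is sketched below. The class and method names are hypothetical. The sketch covers only java.net.URI parsing; obtaining a java.nio.file.Path from such a URI additionally requires an installed filesystem provider for that scheme, which is outside what this example demonstrates.

```java
import java.net.URI;

public class SchemeSketch {
    /** Parses the string as a URI and returns its scheme component. */
    public static String schemeOf(final String uri) {
        return URI.create(uri).getScheme();
    }

    public static void main(final String[] args) {
        final URI hdfs = URI.create("gendb.hdfs://master:9000/gdb_ws");
        System.out.println(hdfs.getScheme()); // gendb.hdfs
        System.out.println(hdfs.getHost());   // master
        System.out.println(hdfs.getPort());   // 9000
        System.out.println(hdfs.getPath());   // /gdb_ws

        System.out.println(schemeOf("gendb.gs://my_bucket/gdb_ws")); // gendb.gs
    }
}
```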

@nalinigans
Collaborator Author

nalinigans commented Aug 24, 2018

@droazen, I think I have pushed most of the changes requested -

  • Moved out appendPathToDir from BucketUtils to IOUtils
  • appendPathToDir now uses Path.resolve() to append a given path to dir
  • If a workspace already exists and overwriteExistingWorkspace is false, an UnableToCreateGenomicsDBWorkspace exception is thrown while creating a GenomicsDB workspace.
  • Made sure all paths passed to GenomicsDB are absolute.
  • Introduced gendb.hdfs: and gendb.gs: URI schemes in addition to the existing gendb: scheme for identifying Cloud paths in GenomicsDB with unit testing for these new schemes.
  • Added unit tests that write a GenomicsDB workspace/arrays to GCS and then read/query the same GenomicsDB instance from GCS, with validation.

@droazen droazen assigned droazen and unassigned nalinigans Aug 24, 2018
Contributor

@droazen droazen left a comment

Second-pass review complete, back to @nalinigans with some mostly minor refactoring requests.

/**
* Patterns identifying GenomicsDB paths
*/
private static final Pattern genomicsdb_uri_pattern = Pattern.compile("^" + GENOMIC_DB_URI_SCHEME + "(.?)(.*)(://)(.*)");
Contributor

Write names of constants IN_THIS_STYLE (ie., GENOMICSDB_URI_PATTERN)

Contributor

Also, I think that you need to escape the period in the first capture group of your expression. Ie., (\\.?) instead of (.?)
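The difference the escape makes can be seen in a small experiment. This sketch assumes GENOMIC_DB_URI_SCHEME is the literal "gendb" (as the later "starts with gendb" check suggests) and uses hypothetical class and method names; it compares what the first capture group holds under each version of the pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EscapedPeriodSketch {
    // Unescaped: "." matches ANY single character, so group 1 may capture a letter.
    static final Pattern UNESCAPED = Pattern.compile("^gendb(.?)(.*)(://)(.*)");
    // Escaped: "\\." matches only a literal period.
    static final Pattern ESCAPED = Pattern.compile("^gendb(\\.?)(.*)(://)(.*)");

    /** Returns the text captured by group 1, or null if the pattern does not match. */
    static String firstGroup(final Pattern pattern, final String input) {
        final Matcher matcher = pattern.matcher(input);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(final String[] args) {
        System.out.println("[" + firstGroup(UNESCAPED, "gendbX://foo") + "]");         // [X]
        System.out.println("[" + firstGroup(ESCAPED, "gendbX://foo") + "]");           // []
        System.out.println("[" + firstGroup(ESCAPED, "gendb.gs://bucket/file") + "]"); // [.]
    }
}
```

With the escaped form, group 1 is either empty or exactly ".", which is why a separate equals(".") check on that group becomes redundant.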

}

public static String getAbsolutePathWithGenDBScheme(final String path) {
String gendb_path = FeatureDataSource.getGenomicsDBAbsolutePath(path);
Contributor

Use camel-case for variable names (ie., gendbPath rather than gendb_path)

Collaborator Author

@nalinigans nalinigans Sep 1, 2018

I meant to have them all in camel case; I guess my C/C++ habits are showing.

* gendb.s3://my_bucket/my_folder
* @return GenomicsDB acceptable path or null
*/
public static String getGenomicsDBPath(final String path) {
Contributor

Instead of cluttering FeatureDataSource with these URI-parsing methods, could you move them all to IOUtils? I think that the isGenomicsDBPath(), getAbsolutePathWithGenDBScheme(), getGenomicsDBAbsolutePath(), and getGenomicsDBPath() methods, plus GENOMIC_DB_URI_SCHEME and genomicsdb_uri_pattern should all be moved -- there's no good reason for them to be embedded in this class.

Collaborator Author

@nalinigans nalinigans Sep 1, 2018

Agreed. I was just following the existing pattern. @droazen, would you rather have a GenomicsDBUtils class in the org.broadinstitute.hellbender.tools.genomicsdb package instead? That might be cleaner.

return getGenomicsDBPath(path) != null;
}

public static String getAbsolutePathWithGenDBScheme(final String path) {
Contributor

Add javadoc for these new methods.

if (path != null && path.startsWith(GENOMIC_DB_URI_SCHEME)) { // Check if path starts with "gendb"
Matcher matcher = genomicsdb_uri_pattern.matcher(path);
if (matcher.find() && !matcher.group(3).isEmpty()) { // path contains "://"
if (!matcher.group(1).isEmpty() && matcher.group(1).equals(".")) { // path has a period after gendb, so it is a URI
Contributor

If you escape the period in your regex for the first capture group, as suggested above, you shouldn't need the equals(".") here (though I guess it doesn't hurt...)

}

@Test(dataProvider = "GenomicsDBTestPathData")
public void testGenomicsDBPathParsing(String path, String expectedPath, String gendbExpectedAbsolutePath, boolean expectedComparison){
Contributor

Move tests to IOUtilsUnitTest after you move the underlying methods.

String workspace = BucketUtils.randomRemotePath(getGCPTestInputPath(), "","");
try {
Assert.assertNotNull(getGoogleServiceAccountKeyPath());
System.gc();
Contributor

Why do you need to invoke the garbage collector manually here? Can you remove this call?


@Test(groups = {"bucket"})
public void testWriteToAndQueryFromGCS() throws IOException {
String workspace = BucketUtils.randomRemotePath(getGCPTestInputPath(), "","");
Contributor

Can you use BucketUtils.getTempFilePath(getGCPTestStaging(), "") instead (here and below)? This will use the staging bucket instead of the test data bucket, and will also mark the temp URI for delete on JVM exit.

@@ -779,4 +780,56 @@ public void testYouCantWriteIntoAnExistingDirectory(){
final String workspace = createTempDir("workspace").getAbsolutePath();
writeToGenomicsDB(LOCAL_GVCFS, INTERVAL, workspace, 0, false, 0, 1);
}

private void cleanupGCSFolder(String path) {
Contributor

Instead of implementing this cleanup method here, could you add a BucketUtils.deleteRecursively(Path) method, and use the slightly cleaner approach below:

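public static void deleteRecursively(final Path rootPath) throws IOException {
    final List<Path> pathsToDelete;
    // Files.walk() holds directory handles open, so close the stream once the listing is collected.
    try (Stream<Path> paths = Files.walk(rootPath)) {
        // Reverse order puts children before their parent directories.
        pathsToDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
    }
    for (final Path path : pathsToDelete) {
        Files.deleteIfExists(path);
    }
}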

Then modify BucketUtils.deleteOnExit() to call into deleteRecursively() instead of deleteFile(). The advantage of this approach is that if you switch to calling BucketUtils.getTempFilePath() in your test methods below, the returned temp paths will be automatically scheduled for deletion on JVM exit, and you can get rid of the finally blocks and the cleanupGCSFolder() calls in your test code.
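As a rough, self-contained illustration of this suggestion, the sketch below pairs a recursive delete with a JVM shutdown hook. It runs against the local filesystem only; the class name and the hook-based `deleteOnExit` are assumptions for illustration and are not how BucketUtils is actually implemented.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class DeleteRecursivelySketch {
    /** Deletes rootPath and everything beneath it, children before parents. */
    public static void deleteRecursively(final Path rootPath) throws IOException {
        final List<Path> pathsToDelete;
        try (Stream<Path> paths = Files.walk(rootPath)) {
            pathsToDelete = paths.sorted(Comparator.reverseOrder()).collect(Collectors.toList());
        }
        for (final Path path : pathsToDelete) {
            Files.deleteIfExists(path);
        }
    }

    /** Hypothetical delete-on-exit registration: schedules a recursive delete at JVM shutdown. */
    public static void deleteOnExit(final Path rootPath) {
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                if (Files.exists(rootPath)) {
                    deleteRecursively(rootPath);
                }
            } catch (final IOException e) {
                System.err.println("Could not delete " + rootPath + ": " + e.getMessage());
            }
        }));
    }

    public static void main(final String[] args) throws IOException {
        final Path workspace = Files.createTempDirectory("gdb_ws");
        Files.createFile(Files.createDirectories(workspace.resolve("array")).resolve("data.tdb"));
        deleteOnExit(workspace); // no finally block or manual cleanup needed in the caller
    }
}
```

With cleanup registered at the point where the temporary path is created, test methods no longer need their own finally blocks.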

@droazen droazen assigned nalinigans and unassigned droazen Aug 31, 2018
@droazen droazen self-assigned this Sep 10, 2018
Contributor

@droazen droazen left a comment

Final review complete -- only trivial comments remaining. Since they are so trivial I'll address them myself in a separate copy of this branch. We need to make a copy anyway to make sure that tests pass, since the cloud tests that you added don't actually get run by travis for PRs from a fork.

private static FeatureReader<VariantContext> getGenomicsDBFeatureReader(final String path, final File reference) {
if( !isGenomicsDBPath(path) ) {
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from a non-GenomicsDB input");
private static void verifyPathsAreReadable(final String ... paths) {
Contributor

Move this method to IOUtils as well.

Assert.assertEquals(IOUtils.appendPathToDir("/path/to/dir", "anotherdir/file"), "/path/to/dir/anotherdir/file");

// hdfs: URI
Path tempPath = IOUtils.getPath(MiniClusterUtils.getWorkingDir(MiniClusterUtils.getMiniCluster()).toUri().toString());
Contributor

@droazen droazen Sep 18, 2018

Need to stop the cluster when done.

public void testGenomicsDBPathParsing(String path, String expectedPath, String gendbExpectedAbsolutePath, boolean expectedComparison) {
Assert.assertEquals(IOUtils.getGenomicsDBPath(path), expectedPath, "Got 1 "+IOUtils.getGenomicsDBPath(path));
Assert.assertEquals(IOUtils.getAbsolutePathWithGenDBScheme(path), gendbExpectedAbsolutePath);
Assert.assertEquals(IOUtils.isGenomicsDBPath(path), expectedComparison, "Got 3 " + IOUtils.isGenomicsDBPath(path));
Contributor

Assertion failure messages here could be better worded.

@droazen
Contributor

droazen commented Sep 18, 2018

I've opened #5197 with the last few comments addressed. We'll see if the travis cloud tests pass (since they were never actually run on this PR), and if they do pass we can merge.

@nalinigans
Collaborator Author

@droazen, I have pushed a debugging test. It just prints out all the keys, plus the values that are neither private nor id values, in the service account JSON file pointed to by the GOOGLE_APPLICATION_CREDENTIALS environment variable. Would it be possible to accept this into the nalinigans_genomicsdb_uri_support branch, so I can browse through the stdout for that test on Travis? By the way, the failures in the build seem to be unrelated to my change. Thanks.

@droazen
Contributor

droazen commented Oct 3, 2018

Closing this PR in favor of #5197

@droazen droazen closed this Oct 3, 2018