-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow for hdfs and gcs URI's to be passed to GenomicsDB #5017
Conversation
Codecov Report
@@ Coverage Diff @@
## master #5017 +/- ##
===============================================
- Coverage 86.743% 86.433% -0.311%
+ Complexity 29470 29350 -120
===============================================
Files 1818 1813 -5
Lines 136436 136481 +45
Branches 15125 15038 -87
===============================================
- Hits 118349 117964 -385
- Misses 12647 13106 +459
+ Partials 5440 5411 -29
|
I think you should use version 0.10.0-proto-3.0.0-beta-1+d392491bafcac337 for GenomicsDB |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First-pass review complete, back to @nalinigans
GenomicsDBConstants.DEFAULT_VIDMAP_FILE_NAME + "," + GenomicsDBConstants.DEFAULT_CALLSETMAP_FILE_NAME + "," + | ||
GenomicsDBConstants.DEFAULT_VCFHEADER_FILE_NAME + ") could not be read from GenomicsDB workspace " + workspace.getAbsolutePath(), e); | ||
} | ||
final String workspace = path.replace(GENOMIC_DB_URI_SCHEME, ""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Regex should have an anchor tying it to the start of the String.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, we are using Pattern/Matcher to check valid gendb names, e.g. gendb://file, gendb.hdfs://node:portnum/file and gendb.gs://bucket/file.
final String workspace = path.replace(GENOMIC_DB_URI_SCHEME, ""); | ||
final String callsetJson = BucketUtils.appendPathToDir(workspace, GenomicsDBConstants.DEFAULT_CALLSETMAP_FILE_NAME); | ||
final String vidmapJson = BucketUtils.appendPathToDir(workspace, GenomicsDBConstants.DEFAULT_VIDMAP_FILE_NAME); | ||
final String vcfHeader = BucketUtils.appendPathToDir(workspace, GenomicsDBConstants.DEFAULT_VCFHEADER_FILE_NAME); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This method would be more at home in IOUtils
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about the BucketUtils.appendPathToDir() method? I was thinking it would belong in IOUtils too, but was following the example of BucketUtils.makeFilePathAbsolute(). I will move appendPathToDir() to IOUtils however. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Moved appendPathToDir to IOUtils.
* @param path the path relative to dir | ||
* @return the appended path as a String. | ||
*/ | ||
public static String appendPathToDir(String dir, String path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you convert the String to a Path
, and use resolve()
instead of adding this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, will do.
.setVidMappingFile(vidmapJson.getAbsolutePath()) | ||
.setCallsetMappingFile(callsetJson.getAbsolutePath()) | ||
.setVcfHeaderFilename(vcfHeader.getAbsolutePath()) | ||
.setVidMappingFile(vidmapJson) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these guaranteed to be absolute at this point? (the previous code enforced this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have made sure that all these paths are absolute in the changed code where we are allowing for gendb://, gendb.hdfs:// and gendb.gs:// schemes.
} | ||
private String overwriteOrCreateWorkspace() { | ||
String workspaceDir = BucketUtils.makeFilePathAbsolute(workspace); | ||
if (GenomicsDBUtils.createTileDBWorkspace(workspaceDir, overwriteExistingWorkspace) < 0) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the workspace already exists and overExistingWorkspace
is false, we want an UnableToCreateGenomicsDBWorkspace
exception to be thrown, as in the previous implementation.
@@ -680,13 +666,6 @@ private File overwriteOrCreateWorkspace() { | |||
} | |||
} | |||
|
|||
private static void checkIfValidWorkspace(final File workspaceDir) { | |||
final File tempFile = new File(workspaceDir.getAbsolutePath() + "/__tiledb_workspace.tdb"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this check performed internally now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this check is performed internally in GenomicsDB now. gatk should not have to worry about internal data structures.
Assert.assertEquals(BucketUtils.appendPathToDir("hdfs://namenode:9000/dir", "file"), "hdfs://namenode:9000/dir/file"); | ||
Assert.assertEquals(BucketUtils.appendPathToDir("gs://abucket/dir", "file"), "gs://abucket/dir/file"); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need some tests covering the new functionality introduced by this PR (ie., GCS support). Let us know if you need help setting this up -- there are examples of tests that access data in GCS buckets in GCSNioIntegrationTest
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@droazen, yes, I need some help setting up tests for using hdfs and gs URIs with GenomicsDB. I am taking a look at GCSNioIntegrationTest
, but still not sure on setting up a test bucket with folder, etc. on GCS and getting authentication setup via GOOGLE_APPLICATIONS_CREDENTIALS for GenomicsDB.
importConfig.setOutputCallsetmapJsonFile(callsetMapJSONFile.getAbsolutePath()); | ||
importConfig.setOutputVidmapJsonFile(vidMapJSONFile.getAbsolutePath()); | ||
importConfig.setOutputVcfHeaderFile(vcfHeaderFile.getAbsolutePath()); | ||
importConfig.setOutputCallsetmapJsonFile(callsetMapJSONFile); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these guaranteed to be absolute at this point? (the previous implementation guaranteed this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. These are generated with respect to the workspace directory that is made absolute - see line 655 : String workspaceDir = BucketUtils.makeFilePathAbsolute(workspace);
@@ -369,29 +370,13 @@ public static boolean isGenomicsDBPath(final String path) { | |||
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from a non-GenomicsDB input"); | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of gendb://gs://
as the URI convention, can you introduce a new URI scheme for "gendb on gcs"? Eg., gendbgs://
(and still also accept gendb://
for local gendb instances).
This way, the URIs will at least be legal...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the URI schemes, would you consider gendb://, gendb.gs:// gendb.hdfs://, etc.?
According to wikipedia, these are valid schemes - A non-empty scheme component followed by a colon (:), consisting of a sequence of characters beginning with a letter and followed by any combination of letters, digits, plus (+), period (.), or hyphen (-).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a period in the scheme is legal, then these would be fine. I'd check to make sure that Java's URI, URL, and java.nio.file.Path
classes accept such schemes, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, URI and Path recognize period in their schemes.
…ndb.hdfs:// schemes
@droazen, I think I have pushed most of the changes requested -
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Second-pass review complete, back to @nalinigans with some mostly minor refactoring requests.
/** | ||
* Patterns identifying GenomicsDB paths | ||
*/ | ||
private static final Pattern genomicsdb_uri_pattern = Pattern.compile("^" + GENOMIC_DB_URI_SCHEME + "(.?)(.*)(://)(.*)"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Write names of constants IN_THIS_STYLE (ie., GENOMICSDB_URI_PATTERN
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I think that you need to escape the period in the first capture group of your expression. Ie., (\\.?)
instead of (.?)
} | ||
|
||
public static String getAbsolutePathWithGenDBScheme(final String path) { | ||
String gendb_path = FeatureDataSource.getGenomicsDBAbsolutePath(path); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use camel-case for variable names (ie., gendbPath
rather than gendb_path
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Meant to have them all camel-case, I guess I show my C/C++ habits.
* gendb.s3://my_bucket/my_folder | ||
* @return GenomicsDB acceptable path or null | ||
*/ | ||
public static String getGenomicsDBPath(final String path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of cluttering FeatureDataSource
with these URI-parsing methods, could you move them all to IOUtils
? I think that the isGenomicsDBPath()
, getAbsolutePathWithGenDBScheme()
, getGenomicsDBAbsolutePath()
, and getGenomicsDBPath()
methods, plus GENOMIC_DB_URI_SCHEME
and genomicsdb_uri_pattern
should all be moved -- there's no good reason for them to be embedded in this class.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed. I was just following the existing pattern. @droazen, would you rather have a GenomicsDBUtils class in the org.broadinstitute.hellbender.tools.genomicsdb package instead? That might be cleaner.
return getGenomicsDBPath(path) != null; | ||
} | ||
|
||
public static String getAbsolutePathWithGenDBScheme(final String path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add javadoc for these new methods.
if (path != null && path.startsWith(GENOMIC_DB_URI_SCHEME)) { // Check if path starts with "gendb" | ||
Matcher matcher = genomicsdb_uri_pattern.matcher(path); | ||
if (matcher.find() && !matcher.group(3).isEmpty()) { // path contains "://" | ||
if (!matcher.group(1).isEmpty() && matcher.group(1).equals(".")) { // path has a period after gendb, so it is a URI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you escape the period in your regex for the first capture group, as suggested above, you shouldn't need the equals(".")
here (though I guess it doesn't hurt...)
} | ||
|
||
@Test(dataProvider = "GenomicsDBTestPathData") | ||
public void testGenomicsDBPathParsing(String path, String expectedPath, String gendbExpectedAbsolutePath, boolean expectedComparison){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move tests to IOUtilsUnitTest
after you move the underlying methods.
String workspace = BucketUtils.randomRemotePath(getGCPTestInputPath(), "",""); | ||
try { | ||
Assert.assertNotNull(getGoogleServiceAccountKeyPath()); | ||
System.gc(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do you need to invoke the garbage collector manually here? Can you remove this call?
|
||
@Test(groups = {"bucket"}) | ||
public void testWriteToAndQueryFromGCS() throws IOException { | ||
String workspace = BucketUtils.randomRemotePath(getGCPTestInputPath(), "",""); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use BucketUtils.getTempFilePath(getGCPTestStaging(), "")
instead (here and below)? This will use the staging bucket instead of the test data bucket, and will also mark the temp URI for delete on JVM exit.
@@ -779,4 +780,56 @@ public void testYouCantWriteIntoAnExistingDirectory(){ | |||
final String workspace = createTempDir("workspace").getAbsolutePath(); | |||
writeToGenomicsDB(LOCAL_GVCFS, INTERVAL, workspace, 0, false, 0, 1); | |||
} | |||
|
|||
private void cleanupGCSFolder(String path) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of implementing this cleanup method here, could you add a BucketUtils.deleteRecursively(Path)
method, and use the slightly cleaner approach below:
public static void deleteRecursively(final Path rootPath) {
final List<Path> pathsToDelete = Files.walk(rootPath).sorted(Comparator.reverseOrder()).collect(Collectors.toList());
for (Path path : pathsToDelete) {
Files.deleteIfExists(path);
}
}
Then modify BucketUtils.deleteOnExit()
to call into deleteRecursively()
instead of deleteFile()
. The advantage of this approach is that if you switch to calling BucketUtils.getTempFilePath()
in your test methods below, the returned temp paths will be automatically scheduled for deletion on JVM exit, and you can get rid of the finally blocks and the cleanupGCSFolder()
calls in your test code.
...est/java/org/broadinstitute/hellbender/tools/genomicsdb/GenomicsDBImportIntegrationTest.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Final review complete -- only trivial comments remaining. Since they are so trivial I'll address them myself in a separate copy of this branch. We need to make a copy anyway to make sure that tests pass, since the cloud tests that you added don't actually get run by travis for PRs from a fork.
private static FeatureReader<VariantContext> getGenomicsDBFeatureReader(final String path, final File reference) { | ||
if( !isGenomicsDBPath(path) ) { | ||
throw new IllegalArgumentException("Trying to create a GenomicsDBReader from a non-GenomicsDB input"); | ||
private static void verifyPathsAreReadable(final String ... paths) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Move this method to IOUtils
as well.
Assert.assertEquals(IOUtils.appendPathToDir("/path/to/dir", "anotherdir/file"), "/path/to/dir/anotherdir/file"); | ||
|
||
// hdfs: URI | ||
Path tempPath = IOUtils.getPath(MiniClusterUtils.getWorkingDir(MiniClusterUtils.getMiniCluster()).toUri().toString()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to stop the cluster when done.
public void testGenomicsDBPathParsing(String path, String expectedPath, String gendbExpectedAbsolutePath, boolean expectedComparison) { | ||
Assert.assertEquals(IOUtils.getGenomicsDBPath(path), expectedPath, "Got 1 "+IOUtils.getGenomicsDBPath(path)); | ||
Assert.assertEquals(IOUtils.getAbsolutePathWithGenDBScheme(path), gendbExpectedAbsolutePath); | ||
Assert.assertEquals(IOUtils.isGenomicsDBPath(path), expectedComparison, "Got 3 " + IOUtils.isGenomicsDBPath(path)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Assertion failure messages here could be better worded.
I've opened #5197 with the last few comments addressed. We'll see if the travis cloud tests pass (since they were never actually run on this PR), and if they do pass we can merge. |
@droazen, I have pushed a debugging test. It just prints out all the keys and not private and not id values in the service json pointed by GOOGLE_APPLICATION_CREDENTIALS env. Would it be possible to accept this into the nalinigans_genomicsdb_uri_support branch, so I can browse through the stdout for that test on Travis? By the way, the failures in the build seem to be unrelated to my change. Thanks. |
Closing this PR in favor of #5197 |
Currently, only Posix filesystem paths can be passed as workspaces and arrays to GenomicsDB via GenomicsDBImport and SelectVariants. This PR will allow for hdfs and gcs (and emrfs/s3) URIs to be supported as well.
Examples
GenomicsDB supports GCS via the Cloud Storage Connector. Set environment variable GOOGLE_APPLICATION_CREDENTIALS to point to the GCS Service Account json file.