-
Notifications
You must be signed in to change notification settings - Fork 597
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Migrate read arguments and downstream code to GATKPathSpecifier #6561
Conversation
@lbergelson @droazen This is the next phase in URI migration. I'm planning to make more changes, but we can checkpoint this at any time (though I'd remove the TODOs I sprinkled throughout as reminders to myself for things I need to follow up on). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Looks good. I think there are a few minor functions we should add now, and I had a few comments. Do you want to open tickets to track the todos?
|
||
/** | ||
* Get the list of BAM/SAM/CRAM files specified at the command line. | ||
* Paths are the preferred format, as this can handle both local disk and NIO direct access to cloud storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This mention of Paths being the preferred form is no longer true with this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
protected List<String> readIndices; | ||
common = true, | ||
optional = true) | ||
protected List<GATKPathSpecifier> readIndices; | ||
|
||
/** | ||
* Get the list of BAM/SAM/CRAM files specified at the command line. | ||
* Paths are the preferred format, as this can handle both local disk and NIO direct access to cloud storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment about Paths isn't accurate anymore.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
} | ||
/** | ||
* Get the list of BAM/SAM/CRAM files specified at the command line. | ||
* Paths are the preferred format, as this can handle both local disk and NIO direct access to cloud storage. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
comment is out of date
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -277,7 +281,7 @@ public CountingReadFilter makeReadFilter(){ | |||
|
|||
/** | |||
* Must be overridden in order to add annotation arguments to the engine. If this is set to true the engine will | |||
* dynamically discover all {@link Annotation}s in the package defined by {@link org.broadinstitute.hellbender.cmdline.GATKPlugin.GATKAnnotationPluginDescriptor#pluginPackageName} and automatically | |||
* dynamically discover all {@link Annotation}s in the packages defined by {@link GATKAnnotationPluginDescriptor#getPackageNames()} and automatically |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, thanks for updating this.
JavaRDD<GATKRead> output; | ||
// TODO: This if statement is a temporary hack until #959 gets resolve | ||
if (input.endsWith(".adam")) { | ||
if (inputSpecifier.getURIString().endsWith(".adam")) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't deal correctly with http path ending weirdness. I think we should add a "getExtension" or 'getFilename" method to avoid making this mistake.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed.
@@ -57,12 +58,13 @@ public static void logItemizedWarning(final Logger logger, final Collection<?> i | |||
/** | |||
* Same as GATKSparkTool's getRecommendedNumReducers(), but can specify input BAM path (for when --input is not used) | |||
*/ | |||
//TODO: fix this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we lift the GATKPathSpecifier higher here and get rid of the string or is that difficult?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It starts a cascade so it will be in a different pr.
@@ -152,12 +153,12 @@ | |||
final ReadsSparkSource readsSource) { | |||
if (path == null) return null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can these paths be input as Path specifiers instead of strings here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it pulls on a longer thread. The pathseq package needs its own PR.
@@ -136,13 +137,15 @@ | |||
|
|||
private int recommendedNumReducers = 0; | |||
|
|||
//TODO: fix this |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like the same thread as above, I assume the todos mean we should hit them later instead of in this PR? Should we open an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced these with #6610 to track remaining todos.
@@ -122,6 +132,7 @@ public static boolean isRemoteStorageUrl(String path) { | |||
* @param path the path | |||
* @return an absolute file path if the original path was a relative file path, otherwise the original path | |||
*/ | |||
//TODO: get rid of this.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's burn this whole class with fire.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but still more work to do to get there.
public static boolean isCramFile(final File inputFile) { | ||
return isCramFileName(inputFile.getName()); | ||
public static boolean isCramFile(final GATKPathSpecifier pathSpec) { | ||
return pathSpec != null && FileExtensions.CRAM.equalsIgnoreCase("." + FilenameUtils.getExtension(pathSpec.getURI().getPath())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this needs a convenience method
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
cbcb80c
to
c066a80
Compare
Travis reported job failures from build 30367
|
Back to you @lbergelson:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Some comments about the getBaseName/ getExtension methods. I agree that there isn't any single correct solution, but I think I lean towards including everything after any '.' as the extension. I can probably be talked out of that though. Maybe an alternative would be to have a method that strips known exceptions. "stripGivenExtensions(FASTA_EXTENSIONS) or something. I think ultimately what people really want is a method that says "give me a file in the same place but with this new extension. Maybe we should provide that? Or we can leave it off until adding resolve() and resolveSibling(). resolveSiblingWithExtensionReplaced(extensionsToStrip, newExtension)
Utils.validateArg(extension.charAt(0) == '.', "Target extension must include the leading '.'"); | ||
|
||
// We don't want to use {@code #getExtension} here, since it won't work correctly if we're comparing an | ||
// extension that uses multiple . chars, such as .fasta.gz., and {@code #getExtension} will throw if there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There needs to be a comment about the different behavior on both methods javadoc. Otherwise people will accidentally interchange file.getExtension().equals(".fasta.gz") and file.hasExtension(".fasta.gz")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
* @throws IllegalArgumentException if the hierarchical name ends with the default file system separator | ||
* (i.e. "/") or ".", or if the last component does not contain a ".". | ||
*/ | ||
default String getExtension() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to use this safely there probably needs to be a matching hasExtension()
. Otherwise people are forced to catch the exception in order to call this safely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I modified both getExtension and getBaseName to return "" if there is no extension or basename.
} | ||
} | ||
} | ||
throw new IllegalArgumentException(String.format("Input path (%s) has no extension", this)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like the wrong error message.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed the throw and now returns "".
* @throws IllegalArgumentException if the last component is empty (ie, the component ends in "/"), or the last | ||
* component exists but starts with "." | ||
*/ | ||
default String getBaseName() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it might be better to be greedy about the extension and include everything after the first .
as the extension? fasta.gz
and bam.bai
seem more common than "my.filename.has.random.dots.fasta". Maybe I'm wrong about that though.
Typically when people call baseName it's because they want to create a similarly name file with a different extension, so stripping all the extensions makes sense.
I think the comment about the componentExisting but starting with "." doesn't match the tests where you have ".fasta.gz".baseName() == ".fasta".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We might want to prefer returning empty string when the base name/ extension are empty. (not when they're missing because the file ends in "/" though...).
default String getExtension() { | ||
final String hierarchicalPath = getURI().getPath(); | ||
final int indexOfLastComponent = hierarchicalPath.lastIndexOf(FileSystems.getDefault().getSeparator()); | ||
if (indexOfLastComponent != -1 && indexOfLastComponent < hierarchicalPath.length() - 1) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this problematic with files that are at the root of the directory structure or with relative paths that don't include any path above the filename?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it is. The underlying URI always has a scheme with an (possibly empty) auth, and if the scheme is file
, the path component is always an absolute path from the root, even if the input string was a relative path. There are tests for such cases, i.e., localFile.bam
and /localFile.bam
, so I think its ok.
final GATKPathSpecifier outFile = new GATKPathSpecifier(new File(OUTPUT_DIRECTORY, base + keyName + extension).getAbsolutePath()); | ||
private SAMFileGATKReadWriter prepareSAMFileWriter(final String keyName) { | ||
final GATKPathSpecifier pathSpec = readArguments.getReadPathSpecifiers().get(0); | ||
final GATKPathSpecifier outFile = new GATKPathSpecifier( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we mark this as a todo when we get resolve methods on GATKPathSpecifier directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added to #6610.
{"/."}, | ||
{"/name/.fasta"}, | ||
{"localFile"}, | ||
{"gs://hellbender/test/resources"}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe this is inconsistent, but we should allow a basename to not have an extension. localFile's base name is just "localFile" I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense. Done.
|
||
// no extensions | ||
{"/", ".ext", false }, | ||
{".", ".ext", false }, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you add a test for the case ".".hasExtension(".") ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, it doesn't. "." gets turned into a URI with an absolute path that ends in "/./".
public void testIsHadoopURL(final String referenceSpec, final boolean expectedIsHadoop) { | ||
Assert.assertEquals(new GATKPathSpecifier(referenceSpec).isHadoopURL(), expectedIsHadoop); | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we simple tests for isSam, isBam, is Cram ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We've come this far so, yes.
@cmnbroad Sorry about this. I didn't really think about how nasty these methods might be when I asked for them in this pr. |
So, to summarize the final basename/extension rules:
|
Ah, now it has conflicts...rebasing... |
4db2db8
to
a7015d2
Compare
Need to resolve conflicts again...[Edit] done. |
448b180
to
7aef0fd
Compare
…KPathSpecifier - Phase 1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cmnbroad Looks good. You're back to last only? Seems fine. Lets get this merged...
Yeah last only (I think we only discussed a greedy one, but I never implemented it). Anyway, the only failures left are the SortSamSpark, so I'm merging. Next up - the GATKPathSpecifier name change. |
WIP. Some VCF outs are also converted, and probably a few other things as well.