Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Migrate read arguments and downstream code to GATKPathSpecifier #6561

Merged
merged 16 commits into from
Jun 2, 2020

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Apr 21, 2020

WIP. Some VCF outs are also converted, and probably a few other things as well.

@cmnbroad
Copy link
Collaborator Author

@lbergelson @droazen This is the next phase in URI migration. I'm planning to make more changes, but we can checkpoint this at any time (though I'd remove the TODOs I sprinkled throughout as reminders to myself for things I need to follow up on).

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Looks good. I think there are a few minor functions we should add now, and I had a few comments. Do you want to open tickets to track the todos?


/**
* Get the list of BAM/SAM/CRAM files specified at the command line.
* Paths are the preferred format, as this can handle both local disk and NIO direct access to cloud storage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This mention of Paths being the preferred form is no longer true with this PR.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

protected List<String> readIndices;
common = true,
optional = true)
protected List<GATKPathSpecifier> readIndices;

/**
* Get the list of BAM/SAM/CRAM files specified at the command line.
* Paths are the preferred format, as this can handle both local disk and NIO direct access to cloud storage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment about Paths isn't accurate anymore.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

}
/**
* Get the list of BAM/SAM/CRAM files specified at the command line.
* Paths are the preferred format, as this can handle both local disk and NIO direct access to cloud storage.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

comment is out of date

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -277,7 +281,7 @@ public CountingReadFilter makeReadFilter(){

/**
* Must be overridden in order to add annotation arguments to the engine. If this is set to true the engine will
* dynamically discover all {@link Annotation}s in the package defined by {@link org.broadinstitute.hellbender.cmdline.GATKPlugin.GATKAnnotationPluginDescriptor#pluginPackageName} and automatically
* dynamically discover all {@link Annotation}s in the packages defined by {@link GATKAnnotationPluginDescriptor#getPackageNames()} and automatically
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, thanks for updating this.

JavaRDD<GATKRead> output;
// TODO: This if statement is a temporary hack until #959 gets resolve
if (input.endsWith(".adam")) {
if (inputSpecifier.getURIString().endsWith(".adam")) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't deal correctly with http path ending weirdness. I think we should add a "getExtension" or 'getFilename" method to avoid making this mistake.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed.

@@ -57,12 +58,13 @@ public static void logItemizedWarning(final Logger logger, final Collection<?> i
/**
* Same as GATKSparkTool's getRecommendedNumReducers(), but can specify input BAM path (for when --input is not used)
*/
//TODO: fix this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we lift the GATKPathSpecifier higher here and get rid of the string or is that difficult?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It starts a cascade so it will be in a different pr.

@@ -152,12 +153,12 @@
final ReadsSparkSource readsSource) {
if (path == null) return null;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can these paths be input as Path specifiers instead of strings here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it pulls on a longer thread. The pathseq package needs its own PR.

@@ -136,13 +137,15 @@

private int recommendedNumReducers = 0;

//TODO: fix this
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like the same thread as above, I assume the todos mean we should hit them later instead of in this PR? Should we open an issue?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced these with #6610 to track remaining todos.

@@ -122,6 +132,7 @@ public static boolean isRemoteStorageUrl(String path) {
* @param path the path
* @return an absolute file path if the original path was a relative file path, otherwise the original path
*/
//TODO: get rid of this..
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's burn this whole class with fire.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but still more work to do to get there.

public static boolean isCramFile(final File inputFile) {
return isCramFileName(inputFile.getName());
public static boolean isCramFile(final GATKPathSpecifier pathSpec) {
return pathSpec != null && FileExtensions.CRAM.equalsIgnoreCase("." + FilenameUtils.getExtension(pathSpec.getURI().getPath()));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs a convenience method

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@lbergelson lbergelson assigned cmnbroad and unassigned lbergelson May 6, 2020
@cmnbroad cmnbroad force-pushed the cn_uri_migration_reads branch from cbcb80c to c066a80 Compare May 19, 2020 14:43
@gatk-bot
Copy link

gatk-bot commented May 19, 2020

Travis reported job failures from build 30367
Failures in the following jobs:

Test Type JDK Job ID Logs
cloud openjdk8 30367.1 logs
conda openjdk8 30367.5 logs
cloud openjdk11 30367.13 logs
unit openjdk11 30367.12 logs
integration openjdk11 30367.11 logs
variantcalling openjdk8 30367.4 logs
unit openjdk8 30367.3 logs
integration openjdk8 30367.2 logs

@cmnbroad
Copy link
Collaborator Author

Back to you @lbergelson:

  • I added some new methods to PathURI and/or GATKPathSpecifier (getExtension/hasExtension, getBaseName, isSam, isBam, isCram, isHadoopURL). See what you think about the getExtension/hasExtension semantics - there is inherent weirdness/ambiguity around treatment of names that have multiple "." segments, like some.fasta.gz. The methods I added are mutually consistent, but maybe there is a better way.
  • I also added Remaining GATKPath migration TODOs #6610 to track todos.
  • Unfortunately I had to rebase this and make some more changes in order to get the merge build to pass.
  • Once this goes in, I'll do the rename, and then continue with additional PRs.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Some comments about the getBaseName/ getExtension methods. I agree that there isn't any single correct solution, but I think I lean towards including everything after any '.' as the extension. I can probably be talked out of that though. Maybe an alternative would be to have a method that strips known exceptions. "stripGivenExtensions(FASTA_EXTENSIONS) or something. I think ultimately what people really want is a method that says "give me a file in the same place but with this new extension. Maybe we should provide that? Or we can leave it off until adding resolve() and resolveSibling(). resolveSiblingWithExtensionReplaced(extensionsToStrip, newExtension)

Utils.validateArg(extension.charAt(0) == '.', "Target extension must include the leading '.'");

// We don't want to use {@code #getExtension} here, since it won't work correctly if we're comparing an
// extension that uses multiple . chars, such as .fasta.gz., and {@code #getExtension} will throw if there
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There needs to be a comment about the different behavior on both methods javadoc. Otherwise people will accidentally interchange file.getExtension().equals(".fasta.gz") and file.hasExtension(".fasta.gz")

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

* @throws IllegalArgumentException if the hierarchical name ends with the default file system separator
* (i.e. "/") or ".", or if the last component does not contain a ".".
*/
default String getExtension() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to use this safely there probably needs to be a matching hasExtension(). Otherwise people are forced to catch the exception in order to call this safely.

Copy link
Collaborator Author

@cmnbroad cmnbroad May 26, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I modified both getExtension and getBaseName to return "" if there is no extension or basename.

}
}
}
throw new IllegalArgumentException(String.format("Input path (%s) has no extension", this));
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like the wrong error message.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the throw and now returns "".

* @throws IllegalArgumentException if the last component is empty (ie, the component ends in "/"), or the last
* component exists but starts with "."
*/
default String getBaseName() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it might be better to be greedy about the extension and include everything after the first . as the extension? fasta.gz and bam.bai seem more common than "my.filename.has.random.dots.fasta". Maybe I'm wrong about that though.

Typically when people call baseName it's because they want to create a similarly name file with a different extension, so stripping all the extensions makes sense.

I think the comment about the componentExisting but starting with "." doesn't match the tests where you have ".fasta.gz".baseName() == ".fasta".

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might want to prefer returning empty string when the base name/ extension are empty. (not when they're missing because the file ends in "/" though...).

default String getExtension() {
final String hierarchicalPath = getURI().getPath();
final int indexOfLastComponent = hierarchicalPath.lastIndexOf(FileSystems.getDefault().getSeparator());
if (indexOfLastComponent != -1 && indexOfLastComponent < hierarchicalPath.length() - 1) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this problematic with files that are at the root of the directory structure or with relative paths that don't include any path above the filename?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it is. The underlying URI always has a scheme with an (possibly empty) auth, and if the scheme is file, the path component is always an absolute path from the root, even if the input string was a relative path. There are tests for such cases, i.e., localFile.bam and /localFile.bam, so I think its ok.

final GATKPathSpecifier outFile = new GATKPathSpecifier(new File(OUTPUT_DIRECTORY, base + keyName + extension).getAbsolutePath());
private SAMFileGATKReadWriter prepareSAMFileWriter(final String keyName) {
final GATKPathSpecifier pathSpec = readArguments.getReadPathSpecifiers().get(0);
final GATKPathSpecifier outFile = new GATKPathSpecifier(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we mark this as a todo when we get resolve methods on GATKPathSpecifier directly?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added to #6610.

{"/."},
{"/name/.fasta"},
{"localFile"},
{"gs://hellbender/test/resources"},
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe this is inconsistent, but we should allow a basename to not have an extension. localFile's base name is just "localFile" I think.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Done.


// no extensions
{"/", ".ext", false },
{".", ".ext", false },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you add a test for the case ".".hasExtension(".") ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, it doesn't. "." gets turned into a URI with an absolute path that ends in "/./".

public void testIsHadoopURL(final String referenceSpec, final boolean expectedIsHadoop) {
Assert.assertEquals(new GATKPathSpecifier(referenceSpec).isHadoopURL(), expectedIsHadoop);
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we simple tests for isSam, isBam, is Cram ?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've come this far so, yes.

@lbergelson
Copy link
Member

@cmnbroad Sorry about this. I didn't really think about how nasty these methods might be when I asked for them in this pr.

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented May 27, 2020

So, to summarize the final basename/extension rules:
[Edited]:

  • getExtension will return Optional.empty() if the path has no file name component (ie, it ends in '/') or if the name has no extension, otherwise it will return the shortest possible extension (i.e., ".gz" for "my.fasta.gz"),
  • hasExtension will return true for ".gz" or ".fasta.gz". hasExtension will only throw if the extension passed in does not include a leading "." followed by at least one other char
  • getBaseName will return Optional.empty() if the path has no file name component (ie, it ends in '/') or has an extension, otherwise it will return the base of the file name component up to the last (possibly non-existent) "."

@cmnbroad cmnbroad marked this pull request as ready for review May 27, 2020 19:48
@cmnbroad
Copy link
Collaborator Author

Ah, now it has conflicts...rebasing...

@cmnbroad cmnbroad force-pushed the cn_uri_migration_reads branch from 4db2db8 to a7015d2 Compare May 27, 2020 20:06
@gatk-bot
Copy link

gatk-bot commented Jun 1, 2020

Travis reported job failures from build 30459
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30459.11 logs

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Jun 2, 2020

Need to resolve conflicts again...[Edit] done.

@cmnbroad cmnbroad force-pushed the cn_uri_migration_reads branch from 448b180 to 7aef0fd Compare June 2, 2020 15:40
@gatk-bot
Copy link

gatk-bot commented Jun 2, 2020

Travis reported job failures from build 30475
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30475.11 logs
integration openjdk8 30475.2 logs

@gatk-bot
Copy link

gatk-bot commented Jun 2, 2020

Travis reported job failures from build 30477
Failures in the following jobs:

Test Type JDK Job ID Logs
integration openjdk11 30477.11 logs
integration openjdk8 30477.2 logs

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad Looks good. You're back to last only? Seems fine. Lets get this merged...

@cmnbroad
Copy link
Collaborator Author

cmnbroad commented Jun 2, 2020

Yeah last only (I think we only discussed a greedy one, but I never implemented it). Anyway, the only failures left are the SortSamSpark, so I'm merging. Next up - the GATKPathSpecifier name change.

@cmnbroad cmnbroad merged commit 74c0e48 into master Jun 2, 2020
@lbergelson lbergelson deleted the cn_uri_migration_reads branch June 2, 2020 19:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants