Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement VCF bundles. #1703

Merged
merged 10 commits into from
Sep 3, 2024
Merged

Implement VCF bundles. #1703

merged 10 commits into from
Sep 3, 2024

Conversation

cmnbroad
Copy link
Collaborator

@cmnbroad cmnbroad commented Jan 23, 2024

Draft until #1702 goes in.

@cmnbroad cmnbroad marked this pull request as draft January 23, 2024 21:17
Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cmnbroad I'm pretty sure the resolve index method is completely broken. I think either delete it or add some tests.

I'm finding it very confusing to differentiate where to use VARIANT_CONTEXTS vs VARIANTS_VCF. Maybe we should change the naming or put them in different classes? Alternatively, it might be less confusing if we added VARIANTS_BCF and constants for the different indexes? It's not super important but I found it confusing, especially in the tests where in some places you use VARIANTS_VCF to specify the extension while VARIANTS_INDEX is different. We can definitely punt on it in this pr.

Other than that, intellij found a few minor suggestions.

* this type.
*/
public class VariantsBundle<T extends IOPath> extends Bundle implements Serializable {
private static final long serialVersionUID = 1L;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably good practice to start annotating these with @Serial.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done (on the variable itself - though I'm still not sure exactly what that does).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It throws a compiler error if the variable or method isn't actually part of the serialization system. It prevents things like serialVersionID which has been a problem for us back in the spark days. Annoyingly hard to notice. Sort of like the Override annotation, just a helpful sanity check.

* for variants sources that are backed by streams or otherv{@link BundleResource} types, the {@link Bundle} and
* {@link BundleBuilder} classes can be used to construct such bundles directly.
*
* @param <T> The type to use when creating a new IOPathResource for a {@link htsjdk.beta.plugin.variants.VariantsBundle}.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe someday java will add self typed calls.

* @param vcfPath An {@link IOPath}-derived object that represents a source of variants.
*/
public VariantsBundle(final T vcfPath) {
this(Arrays.asList(toInputResource(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're free from the tyranny of Arrays.asLis()! List.of(), unless of course that causes some horrible confusing serialization error...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*/
public VariantsBundle(final T vcfPath, final T indexPath) {
this(Arrays.asList(
toInputResource(BundleResourceType.VARIANT_CONTEXTS,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

toInputResource() is called with this verbose nonNull construct everywhere. It looks like that could be just moved into the method itself since the first parameter seems to always want to be the same as error message name in the null check.

(I think in this case it actually should VARIANT_CONTEXT and not VARIANTS_VCF also. It seems inconsistent.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks like there is an identical copy of this method in ReadsBundle that has the same unnecessary verbosity also. Maybe it should be pulled out into a utility

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I got rid of the nonNull business -it turns out the called method already had that anyway. As for the reads one, its slightly different in that they each use different (content type-specific) getInferredContentTypes methods.

}

/**
* return the {@link BundleResourceType#VARIANTS_VCF} {@link BundleResource} for this {@link htsjdk.beta.plugin.variants.VariantsBundle}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Drop the duplicate comment line and just keep the `@returns'?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

public static boolean looksLikeAVariantsBundle(final IOPath rawVCFPath) {
return rawVCFPath.getURI().getPath().endsWith(BundleJSON.BUNDLE_EXTENSION);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this use rawVCFPath.hasExtension()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. But I've removed this method until such a time as we actually need it (so far I don't).

// and correctly infer the types in all cases without reproducing all the logic embedded in all the
// codecs (i.e., an htsget IOPath ends in ".bam", but has format HTSGET_BAM, not BAM - detecting
// that here would require parsing the entire IOPath structure, which is best left to the codecs
// themselves). So for now its advisory, but maybe it should be abandoned altogether.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this check is better than nothing.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Retained this for now, since it does some checking, though it could do a lot more.

/**
* return the {@link BundleResourceType#VARIANTS_VCF} {@link BundleResource} for this {@link htsjdk.beta.plugin.variants.VariantsBundle}
*
* @return the {@link BundleResourceType#VARIANTS_VCF} {@link BundleResource} for this {@link htsjdk.beta.plugin.variants.VariantsBundle}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the doc say VARIANT_CONTEXTS instead off VARIANTS_VCF? There are few times it refers to requiring a resource with the type VARIANTs_VCF but the actual check seems to be for the more general VARIANTS_CONTEXT type instead of the VCF specifically.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - those names are a bit of a mess. I'll update them.

"ALIGNED_READS":{"path":"my.cram","format":"READS_CRAM"},
"primary":"ALIGNED_READS"
}
""".formatted();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unnecessary format

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. I replaced some of the static text in these cases with additional parameters anyway.

}
""".formatted(VCF_FILE, VCF_INDEX_FILE),
// VariantsBundle doesn't automatically infer format, so create one manually
new VariantsBundle(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

VariantsBundle<>

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed the type params from VariantsBundle altogether, since they really weren't needed anyway.

Copy link
Member

@lbergelson lbergelson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems pretty straight forward. Just a lot of it. I think it might makes ense to change the new utility method to be two different ones instead of one with a mode toggle. It's not a blocker though.

Another musing:
I wonder if t would make more sense to write the schema as:

schema: {verson: x, name: htsbundle} 

instead of:

schemaVersion: x
schemaName: htsbundle

Maybe for version x+1...


/**
* Takes an IOPath and returns a new IOPath object that keeps the same basename as the original but has
* a new extension. If append is set to false, only the last component of an extension will be replaced.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only the last -> the last
I read that as implying that append == true will replace all of the components.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh I see. Will separate append/replace to eliminate confusion.

* @param ioPathConstructor a function that takes a string and returns an IOPath-derived class of type <T>
* @return A new IOPath object with the new extension
*/
public static <T extends IOPath> T replaceExtension(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a favor to my future self seeing this fail at runtime, is there a reason we shouldn't make it accept it either with or without the dot?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it make sense split these into two different functions? replaceExtension() and appendExtension()?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I like that better. I back ported this from one of Takuto's Picard PRs since I needed it. Also, I made him add tests for it ;-). Will separate these and add some tests.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, wait I did add tests. Good for me. Now I have to update them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

String.format("A JSON string with more than one bundle was provided but only a single bundle is allowed in this context (%s)",
e.getMessage()));
}
return bundles.stream().findFirst().get();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this seems like a weird way to get the first element of a list. why not just bundles.get(0)? Or update to java 21 and just use bundles .getFirst() :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, yeah. Done.

}
}
if (bundles.isEmpty()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should refer to a collection of bundles as a bindle for added alliteration. Definitely won't make it more confusing.

} else if (ext.equals((FileExtensions.SAM))) {
return Optional.of(new Tuple<>(BundleResourceType.ALIGNED_READS, BundleResourceType.READS_SAM));
return Optional.of(new Tuple<>(BundleResourceType.CT_ALIGNED_READS, BundleResourceType.FMT_READS_SAM));
}
//TODO: finish this, else SRA, htsget,...
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can probably leave this as a todo for now.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, ReadsBundle needs some work since it was never really finished. But its not used by my current work, so I'll do that in a separate PR.

* this type.
*/
public class VariantsBundle<T extends IOPath> extends Bundle implements Serializable {
private static final long serialVersionUID = 1L;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It throws a compiler error if the variable or method isn't actually part of the serialization system. It prevents things like serialVersionID which has been a problem for us back in the spark days. Annoyingly hard to notice. Sort of like the Override annotation, just a helpful sanity check.

@@ -77,6 +77,14 @@ default String getScheme() {
return getURI().getScheme();
}

// default String resolveSibling(final String sibling) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be deleted?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Om, yeah, I have no idea why this is here. Deleted.

"schemaVersion":"%s",
"primary":"%s",
"%s":{"path":"%s","format":"BAM"}
}""".formatted(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The triple quotes string is so much more readable.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next year in string interpolation!

@@ -194,49 +213,143 @@ public void testFromJSONValidWithPathOverride(final String jsonString, final Str
});
}

// Make sure that any JSON that is a valid single Bundle can also be parsed as a Bundle collection, so that users
// can provide an individual bundle to any client that expects a collection of bundles.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This requirement added complexity but I think it will save so much frustration.

@cmnbroad cmnbroad marked this pull request as ready for review September 3, 2024 14:45
@cmnbroad cmnbroad merged commit 46f6f0f into master Sep 3, 2024
4 checks passed
@cmnbroad cmnbroad deleted the cn_vcf_bundles branch September 3, 2024 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants