Evaluate bdg-convert external conversion library proposal #1197

heuermh · 2016-10-05T19:56:06Z

@tomwhite @ryan-williams @fnothaft @massie

Ping for review of a proposal for a new external bdg-formats <--> { htsjdk, ga4gh, string, etc. } conversion library. I've migrated the repo from heuermh to the bigdatagenomics organization here:

https://github.com/bigdatagenomics/bdg-convert

fnothaft · 2016-10-07T06:09:40Z

Generally looks OK to me. I'm +0. A variety of nits:

- I imagine that there's going to be lots of common logic that is duplicated across converters (e.g., the logic that is shared between AlignmentRecordToSamRecord and AlignmentRecordToFastqLine, etc.). Since the classes here are specific to a single conversion sink/source pair, is the plan to factor those out into some private static classes?
- Why Java? I'm weakly anti-Java, esp. since all of our other implementation code is in Scala.
- I'm not entirely sure that splitting it out into a separate repo makes sense, since it is ultimately going to rely on some of our models (e.g., SAMRecord <-> AlignmentRecord needs RecordGroupDictionary access, IIRC).
- There's a bit of weird package naming going on, e.g., org.bdgenomics.convert.bdgenomics, org.bdgenomics.convert.htsjdk. What's the reasoning behind this scheme? It reads really weird.
- Can't we just use HTSJDK's validation stringency instead of this? https://github.com/bigdatagenomics/bdg-convert/blob/master/src/main/java/org/bdgenomics/convert/ConversionStringency.java

heuermh · 2016-10-07T20:48:11Z

Thanks for the review.

There are more implementations here, including some with nested converters
https://github.com/heuermh/dishevelled-bio/tree/master/convert/src/main/java/org/dishevelled/bio/convert

For example, the VcfRecordToGenotypes converter delegates to VcfRecordToVariant (see VcfRecordToGenotypes.java#L85).

Other shared code could be extracted to (package) private static classes.
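To make the nested-converter idea concrete, here is a minimal sketch of one converter delegating to another. The `Converter` interface and the simplified `VcfRecord`, `Variant`, and `Genotype` classes are hypothetical stand-ins written for illustration; they are not the actual bdg-convert or dishevelled-bio types.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal converter contract; the real interface may differ.
interface Converter<S, T> {
    T convert(S source);
}

// Simplified stand-in for a parsed VCF line with per-sample genotype calls.
final class VcfRecord {
    final String contig;
    final long position;
    final String ref;
    final String alt;
    final List<String> calls;

    VcfRecord(String contig, long position, String ref, String alt, List<String> calls) {
        this.contig = contig;
        this.position = position;
        this.ref = ref;
        this.alt = alt;
        this.calls = calls;
    }
}

// Simplified stand-in for a variant model.
final class Variant {
    final String contig;
    final long position;
    final String ref;
    final String alt;

    Variant(String contig, long position, String ref, String alt) {
        this.contig = contig;
        this.position = position;
        this.ref = ref;
        this.alt = alt;
    }
}

// Simplified stand-in for a genotype model: one call attached to a variant.
final class Genotype {
    final Variant variant;
    final String call;

    Genotype(Variant variant, String call) {
        this.variant = variant;
        this.call = call;
    }
}

final class VcfRecordToVariant implements Converter<VcfRecord, Variant> {
    @Override
    public Variant convert(VcfRecord source) {
        return new Variant(source.contig, source.position, source.ref, source.alt);
    }
}

// The outer converter reuses the variant logic instead of duplicating it.
final class VcfRecordToGenotypes implements Converter<VcfRecord, List<Genotype>> {
    private final Converter<VcfRecord, Variant> variantConverter;

    VcfRecordToGenotypes(Converter<VcfRecord, Variant> variantConverter) {
        this.variantConverter = variantConverter;
    }

    @Override
    public List<Genotype> convert(VcfRecord source) {
        Variant variant = variantConverter.convert(source); // delegate
        List<Genotype> genotypes = new ArrayList<>();
        for (String call : source.calls) {
            genotypes.add(new Genotype(variant, call));
        }
        return genotypes;
    }
}
```

Passing the inner converter through the constructor keeps each class specific to one source/target pair while still sharing logic.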

I proposed to write it in Java because I have found calling Java from Scala to be less troublesome than calling Scala from Java, and some potential clients of this library are implemented in Java (e.g. my stuff above, GATK4, UCSD's https://github.com/biojava/biojava-spark, etc.)

A dependency on our Scala models may answer the question though, and the separate repo question as well.

I don't like the package naming either. Suggestions are welcome.

For ConversionStringency, I don't like the name, and would rather not have a public API dependency on a third party class/enum out of our control. It may also allow, say, clients of the biojava package to exclude the htsjdk transitive dependency.
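As a rough illustration of a library-owned stringency type, the sketch below defines an enum with the same three levels as htsjdk's ValidationStringency (STRICT, LENIENT, SILENT) and a converter that honors it. The enum values, the `ConversionException` class, the `StringToPosition` converter, and the in-memory warning list are all assumptions made for this example, not the contents of the linked ConversionStringency.java.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical library-owned enum, so the public API does not expose htsjdk types.
enum ConversionStringency {
    STRICT,   // fail on the first conversion problem
    LENIENT,  // record a warning and continue
    SILENT    // ignore the problem and continue
}

// Hypothetical unchecked exception raised under STRICT stringency.
final class ConversionException extends RuntimeException {
    ConversionException(String message, Throwable cause) {
        super(message, cause);
    }
}

// Hypothetical converter; the warnings list stands in for a real logger.
final class StringToPosition {
    final List<String> warnings = new ArrayList<>();

    Long convert(String source, ConversionStringency stringency) {
        try {
            return Long.valueOf(source);
        } catch (NumberFormatException e) {
            String message = "could not convert '" + source + "' to a position";
            switch (stringency) {
                case STRICT:
                    throw new ConversionException(message, e);
                case LENIENT:
                    warnings.add(message);
                    return null;
                default: // SILENT
                    return null;
            }
        }
    }
}
```

A client that also uses htsjdk could map its ValidationStringency onto this enum at the boundary, while a client that does not use htsjdk never sees the third-party type.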

fnothaft · 2016-10-07T22:57:27Z

> A dependency on our Scala models may answer the question though, and the separate repo question as well.

We may be able to defer to the Avro companions to said models. E.g., org.bdgenomics.adam.models.SequenceRecord <-> org.bdgenomics.formats.avro.Contig, org.bdgenomics.adam.models.RecordGroup <-> org.bdgenomics.formats.avro.RecordGroupMetadata.

> I don't like the package naming either. Suggestions are welcome.

What's the general goal of the present naming scheme? I would suggest something along the lines of org.bdgenomics.convert.<recordname>, similar to what we do with org.bdgenomics.adam.rdd.*. I know that has its own problems, but I think it's a bit cleaner to organize by datatype.

> For ConversionStringency, I don't like the name, and would rather not have a public API dependency on a third party class/enum out of our control. It may also allow, say, clients of the biojava package to exclude the htsjdk transitive dependency.

I get your point, and agree in spirit, but it's pretty hard to avoid public third party API dependencies in a library that converts to/from third party formats. ;)

heuermh · 2016-10-09T18:39:31Z

> I would suggest something along the lines of org.bdgenomics.convert.<recordname>, similar to what we do with org.bdgenomics.adam.rdd.*.

While I don't care for the awkward package names, the current packaging is essential.

The only proper public API is in the convert package. Extensibility is provided by the public *Module classes in each third party dependency-specific package. The packages could be split out into separate Maven modules. Clients pick and choose which dependencies to enable by assembling modules when the injector is instantiated.

If packaging were done by recordname, then the module would have to refer to all the different third party classes, and there would be no extensibility.
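The module-assembly argument can be sketched with a small hand-rolled registry standing in for a dependency-injection injector. Everything here (`ConverterRegistry`, `ConverterModule`, `StringModule`, `HexModule`) is hypothetical and uses only JDK types; it illustrates the shape of the design, not the actual *Module classes in bdg-convert.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical module contract: each dependency-specific package contributes one module.
interface ConverterModule {
    void configure(ConverterRegistry registry);
}

// Hypothetical stand-in for an injector: holds only the converters of the chosen modules.
final class ConverterRegistry {
    private final Map<String, Function<?, ?>> converters = new HashMap<>();

    // Assemble a registry from whichever modules the client selects.
    static ConverterRegistry of(ConverterModule... modules) {
        ConverterRegistry registry = new ConverterRegistry();
        for (ConverterModule module : modules) {
            module.configure(registry);
        }
        return registry;
    }

    <S, T> void bind(Class<S> source, Class<T> target, Function<S, T> converter) {
        converters.put(key(source, target), converter);
    }

    @SuppressWarnings("unchecked")
    <S, T> Function<S, T> lookup(Class<S> source, Class<T> target) {
        Function<?, ?> converter = converters.get(key(source, target));
        if (converter == null) {
            throw new IllegalStateException("no converter bound for " + key(source, target));
        }
        // Safe because bind() only stores a Function<S, T> under the (S, T) key.
        return (Function<S, T>) converter;
    }

    private static String key(Class<?> source, Class<?> target) {
        return source.getName() + "->" + target.getName();
    }
}

// Two illustrative modules; a real one would wrap converters for a single third party library.
final class StringModule implements ConverterModule {
    @Override
    public void configure(ConverterRegistry registry) {
        registry.bind(String.class, Integer.class, s -> Integer.valueOf(s));
    }
}

final class HexModule implements ConverterModule {
    @Override
    public void configure(ConverterRegistry registry) {
        registry.bind(Integer.class, String.class, i -> Integer.toHexString(i));
    }
}
```

A client that calls `ConverterRegistry.of(new StringModule())` never loads `HexModule`, which is the property that per-dependency packaging preserves: a module organized by record name would have to reference every third party library at once.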

heuermh · 2017-06-01T15:29:00Z

bdg-convert version 0.1 was released to Maven Central on May 26, 2017.
https://github.com/bigdatagenomics/bdg-convert/releases

heuermh mentioned this issue Oct 5, 2016

Clean up packaging of conversion methods #1170

Closed

heuermh closed this as completed Jun 1, 2017

heuermh modified the milestone: 0.23.0 Jul 22, 2017
