Future of schemas in bdg-formats #925

Closed
tdanford opened this issue Jan 20, 2016 · 22 comments
@tdanford
Contributor

Thoughts:
(1) We need to make progress towards a 1.0 release
(2) Part of such a release, maybe the most important part, is a clear way for us to commit to supporting certain 'stable' Avro record schemas and data models,
(3) the current models package is a mess, built without Avro (Scala classes mostly) and whose conceptual overlap with the models from bdg-formats is unclear (even to us!)
(4) the current bdg-formats package is also a mess, containing both record schemas that we rely on (AlignmentRecord!) as well as a number of duplicative or more experimental schemas.

Prior to the Amplab retreat last week (and unfortunately, @fnothaft, after you had left for the day) we -- myself, Matt, Uri, Michael, and Justin -- tried to talk through how we might approach some of these problems in a measured way.

Here are some of the ideas we came up with:
(A) bdg-formats will be, post-1.0 release, the "supported" schemas: pared down as much as possible, changing as slowly as possible, and with attention paid to migrations when they change,
(B) some of the more experimental, unused, or redundant schemas will be moved back into the models package in the adam repo,
(C) the models package will become our experimental 'staging' area, with interfaces that we use but aren't committed to supporting in their current form for the long term,
(D) the existing model classes will each be split into an Avro record schema and (for the class methods that live inside each model class at the moment) a paired "Utils" class containing static methods (see the sketch below).
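
To make (D) concrete, here's a rough sketch of what splitting one model class might look like. The names are made up for illustration (not actual ADAM classes): the data fields move into an Avro record, and the former instance methods move into a paired "Utils" object of static methods.

```scala
// Today (roughly): a Scala case class in org.bdgenomics.adam.models that
// carries both its data and its behavior as instance methods.
case class ExampleInterval(referenceName: String, start: Long, end: Long) {
  def length: Long = end - start
  def overlaps(other: ExampleInterval): Boolean =
    referenceName == other.referenceName && start < other.end && other.start < end
}

// After the proposed split, the data would live in an Avro record schema
// (sketched in Avro IDL below, which would be compiled to a bean-like class):
//
//   record ExampleInterval {
//     string referenceName;
//     long start;
//     long end;
//   }
//
// ...and the behavior would live in a paired "Utils" object whose static
// methods take the generated record as an argument.
object ExampleIntervalUtils {
  def length(i: ExampleInterval): Long = i.end - i.start
  def overlaps(a: ExampleInterval, b: ExampleInterval): Boolean =
    a.referenceName == b.referenceName && a.start < b.end && b.start < a.end
}
```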

Here's a screenshot of one of the whiteboards from last week's conversation, showing a list of all the record schemas in bdg-formats today and some rough notation as to which schemas we think might stay, disappear, or move back into models:
[screenshot: amplab-todo whiteboard]

One of the questions arising from the discussion was "how difficult will (D) be?" That's the question I'm trying to address with the first PR.

I'll editorialize: I'm not sure this is the right answer, but I think (personally) it's a move in the right direction. We need to be clear about what models is, which interfaces we're supporting, and how the models stuff interacts with the bdg-formats stuff. Also, we've got a lot of single-use classes in models (Frank, as you pointed out, SingleReadBucket is mostly used by MarkDuplicates) which seems like a code-smell to me. At any rate, even if this exercise is only an impetus to re-evaluate all the schemas that are currently in bdg-formats and to pare down the (unmanageably large) number in there at the moment to something more usable, I believe it will have been worth it.

@tdanford
Contributor Author

I'll try to render that whiteboard as text soon.

@heuermh
Member

heuermh commented Jan 20, 2016

Feel free to copy from the Google doc as necessary.

@tdanford
Contributor Author

Will do. I mostly want this Issue to be about the Avro records / models issue and designating a core of the bdg-formats schemas as "supported."

@fnothaft
Member

Thanks for putting this together, @tdanford!

(3) the current models package is a mess, built without Avro (Scala classes mostly) and whose conceptual overlap with the models from bdg-formats is unclear (even to us!)

I think this is too imprecise to be actionable. Specifically, what is a mess about our current models? For example:

  • Do we have unused models?
  • Should some of the models be migrated into another package or made package private? <-- I think this is definitely the case for models like Alphabet, SnpTable, IndelTable, Consensus...
  • Do we have dead code in some of our models?
  • Do the data models that our models present not fit their intended use cases?
  • Should our models do less? I.e., should they just be bean-like?

Also, why is an Avro-defined class preferable here to Scala classes/case classes?

As for the overlap between org.bdgenomics.adam.models and bdg-formats, I've always conceived the separation as:

  • bdg-formats: primary genomic datatypes that we expect to store
  • org.bdgenomics.adam.models: common classes that describe secondary genomic data (e.g., genomic positions) or metadata (e.g., collections of read groups)
  • org.bdgenomics.adam.rich: classes that enrich our basic bdg-formats records

(4) the current bdg-formats package is also a mess, containing both record schemas that we rely on (AlignmentRecord!) as well as a number of duplicative or more experimental schemas.

I would describe the feature, variant annotation, and fragment schemas as experimental. My preference would be to harden the important schemas to production quality, instead of marking them as experimental. For example, we can't do a round trip from annotated VCF --> ADAM --> annotated VCF without a quality DatabaseVariantAnnotation schema.

From my point of view, it is much more important for a 1.0 release to be feature complete and concordant for basic features (like data import and export) than for the 1.0 release to have a maximally clean API. Not that API cleanliness isn't important—it's definitely a high priority—but who cares about what the APIs or schemas look like if we don't have the fundamentals down? Essentially, if we don't have a variant annotation schema that we're going to stand behind, then we can't do annotated VCF --> ADAM --> annotated VCF, and if we can't do that, then we shouldn't be tagging a 1.0.

I'll editorialize: I'm not sure this is the right answer, but I think (personally) it's a move in the right direction. We need to be clear about what models is, which interfaces we're supporting, and how the models stuff interacts with the bdg-formats stuff. Also, we've got a lot of single-use classes in models (Frank, as you pointed out, SingleReadBucket is mostly used by MarkDuplicates) which seems like a code-smell to me.

I think the code smell is that we're exposing these classes as public classes in the models package. IMO, the solution is much simpler than what this issue posits: we make those classes private, and move them to the package where their algorithm lives.
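
A rough sketch of what I mean, using SingleReadBucket as the example. The package name, constructor, and fields below are illustrative only, not an actual patch:

```scala
// Before: SingleReadBucket is a public class in org.bdgenomics.adam.models,
// even though essentially only MarkDuplicates consumes it.

// After: it moves next to MarkDuplicates and becomes package-private, so it
// no longer counts as public API surface we have to support.
package org.bdgenomics.adam.rdd.read

import org.bdgenomics.formats.avro.AlignmentRecord

private[read] case class SingleReadBucket(
    primaryMapped: Iterable[AlignmentRecord] = Iterable.empty,
    secondaryMapped: Iterable[AlignmentRecord] = Iterable.empty,
    unmapped: Iterable[AlignmentRecord] = Iterable.empty) {

  // All reads in the bucket, regardless of mapping status.
  def allReads: Iterable[AlignmentRecord] =
    primaryMapped ++ secondaryMapped ++ unmapped
}
```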

At any rate, even if this exercise is only an impetus to re-evaluate all the schemas that are currently in bdg-formats and to pare down the (unmanageably large) number in there at the moment to something more usable, I believe it will have been worth it.

@fnothaft
Member

Sorry, that comment posted before it was finished... Part 2 coming shortly.

@fnothaft
Member

(D) the existing model classes will each be split into an Avro record schema and (for the class methods that live inside each model class at the moment) a paired "Utils" class containing static methods.

IMO, this is pretty smelly. I don't understand why this would be preferable to having a case class whose instance methods do what the new Utils class would do. To do this for all model classes would be a major refactor, and I'm just not seeing what we'd gain...?

At any rate, even if this exercise is only an impetus to re-evaluate all the schemas that are currently in bdg-formats and to pare down the (unmanageably large) number in there at the moment to something more usable, I believe it will have been worth it.

I think that re-evaluating the schemas is wise and is something we should be doing all the time, but I don't think that we have an unmanageably large set of schemas... 18 individual objects is a bit steep, I suppose, but that's mostly because our schemas are fairly deeply nested. E.g., a VCF line explodes out into something like this:

  • DatabaseVariantAnnotation
  • Genotype
    • Variant
      • Contig
      • StructuralVariant
        • StructuralVariantType
    • GenotypeAllele
    • VariantCallingAnnotations

So, 18 schemas is a bit steep I suppose, but I think the number of schemas we have just reflects the complexity of the data we're working with and the way that we've decomposed our data model.
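
As a rough illustration of that nesting, here's the shape sketched as plain Scala case classes. The field names are simplified and partly made up; the real definitions are the Avro records in bdg-formats, with many more fields:

```scala
// Conceptual shape of a VCF line in our data model; not the actual schemas.
case class Contig(contigName: String, contigLength: Long)
case class StructuralVariant(svType: String) // StructuralVariantType is an enum in the real schema
case class Variant(
    contig: Contig,
    start: Long,
    end: Long,
    referenceAllele: String,
    alternateAllele: String,
    structuralVariant: Option[StructuralVariant])
case class VariantCallingAnnotations(readDepth: Option[Int], variantIsPassing: Option[Boolean])
case class Genotype(
    variant: Variant,
    sampleId: String,
    alleles: Seq[String], // GenotypeAllele is an enum in the real schema
    variantCallingAnnotations: Option[VariantCallingAnnotations])
case class DatabaseVariantAnnotation(variant: Variant, dbSnpId: Option[Int])
```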

@heuermh
Member

heuermh commented Jan 20, 2016

I care about two things:

  1. a version 1.0 of the schemas should mean that users can leave data around on disk in those formats for long periods of time without worrying that they will become incompatibly out of date. From that point forward we either need to not change things, or we need to provide simple conversion tools.

  2. any public APIs in version 1.0 of the code become subject to binary compatibility rules going forward, which makes it more difficult to make changes. Anything public at that point should be worth committing to.

By some combination of hardening the schemas and APIs we care about, and moving to experimental (or making private) the ones we don't, we'll get there.

@tdanford
Contributor Author

  • Do we have unused models?

I believe we do.

  • Should some of the models be migrated into another package or made package private? <-- I think this is definitely the case for models like Alphabet, SnpTable, IndelTable, Consensus...

Yeah, I agree, and I'm not claiming that everything in models should stay there.

  • Do we have dead code in some of our models?

Yes.

  • Do the data models that our models present not fit their intended use cases?

Clearly, for two reasons:

  1. It's not clear what the intended use case is for a lot of the models, and
  2. more importantly, it's not clear to me why some models should be Avro records with schemas in bdg-formats, and some should be classes living in the models package. It's not clear to me which schemas, and which models, are "supported" and which are "experimental."

  • Should our models do less? I.e., should they just be bean-like?

This was exactly Matt's claim -- that the models should be more bean-like. @massie want to elaborate?

@fnothaft
Member

One of the reasons I listed the bullets that I did is that I think we can get better bang for the buck from the small refactoring those bullets imply than from rewriting our models. I think this issue conflates two problems:

  1. Our models need to be refactored and cleaned up.
  2. It isn't clear what schemas in bdg-formats are supported/experimental.

I don't really support rewriting our models as bean-like classes with singleton objects that provide all of our current functionality via static methods, mostly because this seems unnecessarily expansive. Realistically, that would mean rewriting the entire models package. I think we would get better bang for the buck from doing the simple refactoring I suggested above (moving some models out of the models package and cleaning up dead code). Rewriting the models package just doesn't seem like a good use of effort—we've got many other issues to spread our effort across between now and 1.0.0. Perhaps I'm missing the benefits of rewriting our models to follow the bean + singleton pattern?

As for the experimental/supported schema issue, it might be good here to work backwards from a 1.0.0 release. Specifically, what formats do we feel like we would need to support in a 1.0.0? Once we've identified which schemas need to be supported, we can triage any remaining schemas into "supported" and "experimental" based on their maturity.

@tdanford
Contributor Author

Okay, sorry, let me say this in a different way:

I believe we need a way to mark some of our Avro schemas as "experimental" or in "development." We can do this with comments, with annotations (which I don't think are a thing in Avro?) or by putting them in a new, separate place.

I support the third option (new location), and I think the models package is a good place for those experimental schemas.

This is orthogonal to whether some of the existing model classes should move into other, private, tool- or application-specific packages (I believe that that's probably a good idea anyway). If you accept that models is a good location for experimental Avro schemas, then it seems clear (to me) that some of our current model classes will be (should become) Avro schemas, and then the question is: how hard is it to make this transition from class -> schema.

That's what the PR was an attempt to explore: how hard is back-porting a class to a schema?

Answer: probably not the worst.

If we decide that SingleReadBucket should just stay a class, but should live in another package, that's fine too. If we decide that the experimental Avro schemas should live somewhere else and not in models, that's fine too (and I think people should start making suggestions as to where that should be). If we think we'd rather use another method for marking out "experimental" schemas, then let's hear that as well.

Make sense?

@fnothaft
Member

I believe we need a way to mark some of our Avro schemas as "experimental" or in "development." We can do this with comments, with annotations (which I don't think are a thing in Avro?) or by putting them in a new, separate place.

Agreed. Sorry, I think I misinterpreted this issue as focusing on:

This is orthogonal to whether some of the existing model classes should move into other, private, tool- or application-specific packages (I believe that that's probably a good idea anyway). If you accept that models is a good location for experimental Avro schemas, then it seems clear (to me) that some of our current model classes will be (should become) Avro schemas, and then the question is: how hard is it, to make this transition from class -> schema.

Again, if the issue is about making a home for experimental schemas, that sounds reasonable to me. I think though—and this may be because I missed the conversation where the original discussion took place—that the correct way to determine which schemas should be supported and which schemas should be experimental is to figure out what data import/export we want to support in a 1.0, and to work backwards from that. As I understand, the image posted with this ticket was us triaging our current schemas into supported/experimental on the basis of current usage/maturity. Since we control the timeline to a 1.0, I think it is preferable to work backwards from what we want to support, instead of working forwards from where we are right now.

This is orthogonal to whether some of the existing model classes should move into other, private, tool- or application-specific packages (I believe that that's probably a good idea anyway). If you accept that models is a good location for experimental Avro schemas, then it seems clear (to me) that some of our current model classes will be (should become) Avro schemas, and then the question is: how hard is it, to make this transition from class -> schema.

It isn't clear to me at all that any of our current model classes should become Avro schemas. What is the advantage of having Avro beans + singleton objects vs. Scala case classes with instance methods?

That's what the PR was an attempt to explore: how hard is back-porting a class to a schema?

Answer: probably not the worst.

I think this is a false extrapolation. SingleReadBucket is easy to rewrite because it is used in one place (MarkDuplicates), while other models (ReferenceRegion/Position, SequenceDictionary, RecordGroupDictionary, etc.) are used pervasively throughout our codebase. The models that are used in a single algorithm (SnpTable, IndelTable, Consensus, etc.) probably shouldn't be rewritten; they should just be moved into the package where the algorithm that uses them lives, and be made package private.

To summarize my opinions:

  • I think marking schemas as experimental/supported is a good idea for a 1.0 release. I am agnostic as to whether they should go in an "experimental package" or be annotated or whatever.
  • I think we should work backwards from what we plan to support in a 1.0 release to determine which schemas should be supported vs. experimental.
  • I think we should migrate single purpose models into their appropriate package and make them package private.
  • I do not think we should rewrite our existing models as Avro schemas + Singleton objects, because I do not understand why this is better than our current approach.

@tdanford
Contributor Author

It isn't clear to me at all that any of our current model classes should become Avro schemas.
What is the advantage of having Avro beans + singleton objects vs. Scala case classes with
instance methods?

Serialization! Managed data structures on disk, matching up metadata to external databases, using the same metadata for different data structures, etc.
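
To illustrate: an Avro-backed record gives us a self-describing container file on disk, where the schema travels with the data and any Avro reader can open it without our classes on the classpath. Here's a minimal sketch using the generic Avro API and a made-up schema (no ADAM classes involved):

```scala
import java.io.File
import org.apache.avro.Schema
import org.apache.avro.file.{ DataFileReader, DataFileWriter }
import org.apache.avro.generic.{ GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord }

object AvroRoundTripSketch {

  // A toy schema standing in for a bdg-formats record; the real schemas are
  // defined in .avdl files and compiled to specific classes.
  val schema: Schema = new Schema.Parser().parse(
    """{"type": "record", "name": "Interval", "namespace": "example",
      | "fields": [
      |   {"name": "referenceName", "type": "string"},
      |   {"name": "start", "type": "long"},
      |   {"name": "end", "type": "long"}
      | ]}""".stripMargin)

  def main(args: Array[String]): Unit = {
    val record = new GenericData.Record(schema)
    record.put("referenceName", "chr1")
    record.put("start", 100L)
    record.put("end", 200L)

    // Write an Avro container file: the schema is embedded in the file, so the
    // data stays readable even after the writing code has moved on.
    val file = new File("interval.avro")
    val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
    writer.create(schema, file)
    writer.append(record)
    writer.close()

    // Read it back without referencing the writer's generated classes at all.
    val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord]())
    while (reader.hasNext) {
      println(reader.next())
    }
    reader.close()
  }
}
```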

SequenceDictionary was, you may recall, originally an Avro schema. RecordGroupDictionary could be too. Managing and carrying along metadata for things like variants and alignment records and raw reads and features and reference sequence is, I argue, one of the major blind spots in ADAM right now.

There was also a suggestion (@massie? @laserson?) that Avro schemas might make it easier to support Java APIs and (through the Java API) the Python API in the future.

@tdanford
Contributor Author

I think this is a false extrapolation.

Actually, the question about "how hard is the refactoring" was mostly a question of things like: imports between Avro .avdls, usage of IntelliJ, etc. The original question was really mechanical, not so much about the particular class that I chose first.

@tdanford
Contributor Author

As I understand, the image posted with this ticket was us triaging our current schemas
into supported/experimental on the basis of current usage/maturity. Since we control the
timeline to a 1.0, I think it is preferable to work backwards from what we want to support,
instead of working forwards from where we are right now.

Yeah, that whiteboard image wasn't a final answer; it was an example of the kinds of decisions we might make. I posted it for completeness, and as an example of what we might work backwards from.

The priority here is to get moving on 1.0. We all basically agreed with you, Frank: getting to a 1.0 release is important to a number of us, and no one wants you to be the only person doing any of the work.

@fnothaft
Member

SequenceDictionary was, you may recall, originally an Avro schema. RecordGroupDictionary could be too. Managing and carrying along metadata for things like variants and alignment records and raw reads and features and reference sequence is, I argue, one of the major blind spots in ADAM right now.

Both are backed on disk by Avro schemas (from bdg-formats) as of #906. #909 is open to port this exact fix over to RDD[Genotype].

@tdanford changed the title from 'Convert ADAM models classes into Avro records + "Utils" classes' to 'Future of schemas in bdg-formats' on Jan 26, 2016
@tdanford
Contributor Author

In the interest of moving this issue forward, I'm going to focus on this suggestion

I think we should work backwards from what we plan to support in a 1.0 release to
determine which schemas should be supported vs. experimental.

@fnothaft
Member

In the interest of moving this issue forward, I'm going to focus on this suggestion

+1

I'd like to support:

  • AlignmentRecord <-> SAM, BAM
  • Genotype and all schemas nested underneath <-> VCF
  • DatabaseVariantAnnotation and all schemas nested underneath <-> VCF

@heuermh
Member

heuermh commented Jan 26, 2016

The red scribbles on the left side of the board were our high-level suggestions:

  • Reference sequence (refactor Contig and NucleotideContigFragment to look more like GA4GH)
  • Sequence annotations/Features (support GFF3 and BED-like formats)
  • Variants and Genotypes
    • Variant annotations (merge DatabaseVariantAnnotation and VCF ANN spec stuff)
  • Sample tracking/metadata (sampleId might not be enough)

Some of this might be considered new/experimental, though I'm not concerned we couldn't get it all whipped into shape for a version 1.0.

@tdanford
Contributor Author

There were also some high-level comments, like "factor out all the enums into a separate file and align their values with external ontologies, explicitly noted in the comments."

@laserson
Contributor

laserson commented Apr 1, 2016

+1 to working backwards from what to support in 1.0. I think @fnothaft's limited selection is a good start. I'd also support @heuermh's additions, though I don't think they are as crucial. Since we don't operate actively on those additional data types (AFAIK), we're probably farther from production-readiness there.

@fnothaft
Member

With the various refactors we've done on the schemas and the conversion datapaths recently, I don't think this issue is relevant anymore. Does anyone have an objection to closing? CC @heuermh @laserson?

@heuermh
Member

heuermh commented Jun 22, 2017

I'd be OK with closing this. The things mentioned here that I see as still unresolved are covered by separate issues on bdg-formats.

@heuermh modified the milestones: 1.0.0, 0.23.0 on Dec 7, 2017