Future of schemas in bdg-formats #925
I'll try to render that whiteboard as text soon.
Feel free to copy from the Google doc as necessary.
Will do. I mostly want this issue to be about the Avro records/models question, and about designating a core of the bdg-formats schemas as "supported."
Thanks for putting this together, @tdanford!
I think this is too imprecise to be actionable. Specifically, what is a mess about our current models? E.g.:
Also, why is an Avro-defined class preferable here to Scala classes/case classes? As for the overlap between `models` and `bdg-formats`, what specifically is unclear about it?
I would describe the feature, variant annotation, and fragments schemas as experimental. My preference would be to harden important schemas to production quality, instead of making the schemas experimental. For example, we can't do a round trip from annotated VCF --> ADAM --> annotated VCF without a quality variant annotation schema.

From my point of view, it is much more important for a 1.0 release to be feature complete and concordant for basic features (like data import and export) than for the 1.0 release to have a maximally clean API. Not that API cleanliness isn't important—it's definitely a high priority—but who cares about what the APIs or schemas look like if we don't have the fundamentals down? Essentially, if we don't have a variant annotation schema that we're going to stand behind, then we can't do annotated VCF --> ADAM --> annotated VCF, and if we can't do that, then we shouldn't be tagging a 1.0.
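To make the round-trip requirement concrete, here is a minimal sketch of the check being described. Every helper here is a hypothetical stub, not the real ADAM API; only the structure of the check matters.

```scala
// Shape of the round-trip concordance check: annotated VCF -> ADAM -> annotated VCF.
// All helpers are hypothetical stubs (???), not real ADAM calls.
case class AnnotatedVcf(records: Seq[String])

def loadAnnotatedVcf(path: String): AnnotatedVcf = ???
def saveAsAdam(data: AnnotatedVcf, path: String): Unit = ???
def loadAdam(path: String): AnnotatedVcf = ???
def writeAnnotatedVcf(data: AnnotatedVcf, path: String): Unit = ???

def roundTripIsLossless(vcf: String, adam: String, out: String): Boolean = {
  saveAsAdam(loadAnnotatedVcf(vcf), adam)        // annotated VCF -> ADAM
  writeAnnotatedVcf(loadAdam(adam), out)         // ADAM -> annotated VCF
  loadAnnotatedVcf(out) == loadAnnotatedVcf(vcf) // no annotations may be dropped
}
```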
I think the code smell is that we're exposing these classes as public classes in the models package. IMO, the solution is much simpler than what this issue posits: we make those classes private, and move them to the package where their algorithm lives.
Sorry, that comment posted before it was finished... Part 2 coming shortly.
IMO, this is pretty smelly. I don't understand why this would be preferable to having a case class whose instance methods do what the new Utils class would do. To do this for all model classes would be a major refactor, and I'm just not seeing what we'd gain...?
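To make the comparison concrete, here is a minimal sketch of the two styles in question. The class names are illustrative, not actual ADAM classes.

```scala
// Style 1: a Scala case class whose instance methods carry the logic
// (roughly how the current models package works).
case class ReadBucket(primary: Seq[String], secondary: Seq[String]) {
  def allReads: Seq[String] = primary ++ secondary
}

// Style 2: a behavior-free, bean-like record paired with a singleton "Utils"
// object holding the same logic as static-style methods (the proposed split).
class ReadBucketRecord(val primary: Seq[String], val secondary: Seq[String])

object ReadBucketUtils {
  def allReads(bucket: ReadBucketRecord): Seq[String] =
    bucket.primary ++ bucket.secondary
}
```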
I think that re-evaluating the schemas is wise and is something we should be doing all the time, but I don't think that we have an unimaginably large set of schemas... 18 individual objects is a bit steep, I suppose, but that's mostly because our schemas are fairly deeply nested. E.g., a VCF line explodes out into something like this:
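Roughly, sketched with Scala case classes standing in for the Avro records (field names and types are illustrative, not the exact bdg-formats definitions):

```scala
// Illustrative stand-ins for the Avro records a single VCF line fans out into.
case class Contig(name: String, length: Long)
case class Variant(contig: Contig, start: Long, ref: String, alt: String)
case class VariantAnnotation(effect: String, impact: String) // e.g. snpEff output
case class Genotype(variant: Variant,
                    sampleId: String,
                    alleles: Seq[String],
                    annotation: Option[VariantAnnotation])
// One VCF line thus becomes one Variant plus a Genotype per sample, each
// dragging the nested Contig and annotation records along with it.
```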
So, 18 schemas is a bit steep, I suppose, but I think the number of schemas we have just reflects the complexity of the data we're working with and the way that we've decomposed our data model.
I care about two things:
By some combination of hardening the schemas and APIs we care about, and moving to experimental or making private those we don't, we'll get there.
I believe we do.
Yeah, I agree, and I'm not claiming that everything in models should stay there.
Yes.
Clearly, for two reasons:
This was exactly Matt's claim -- that the models should be more bean-like. @massie want to elaborate?
One of the reasons that I listed the bullets that I did is that I think we can get a better bang for the buck by doing the small refactoring that these bullets imply than by rewriting our models. I think this issue conflates two issues:
I don't really support rewriting our models as bean-like classes with singleton objects that provide all of our current functionality via static methods, mostly because this seems unnecessarily expansive. Realistically, that would mean rewriting the entire models package. I think we would get better bang for the buck from doing the simple refactoring I suggested above (moving some models out of the models package and cleaning up dead code). Rewriting the models package just doesn't seem like a good use of effort—we've got many other issues to spread our effort across between now and 1.0.0. Perhaps I'm missing the benefits of rewriting our models to follow the bean + singleton pattern?

As for the experimental/supported schema issue, it might be good here to work backwards from a 1.0.0 release. Specifically, what formats do we feel we would need to support in a 1.0.0? Once we've identified which schemas need to be supported, we can triage any remaining schemas into "supported" and "experimental" based on their maturity.
Okay, sorry, let me say this in a different way: I believe we need a way to mark some of our Avro schemas as "experimental" or in "development." We can do this with comments, with annotations (which I don't think are a thing in Avro?), or by putting them in a new, separate place. I support the third option (new location), and I think the `models` package is a good place for those experimental schemas. This is orthogonal to whether some of the existing model classes should move into other, private, tool- or application-specific packages (I believe that's probably a good idea anyway).

If you accept that framing, the remaining question is mostly mechanical, and that's what the PR was an attempt to explore: how hard is back-porting a class to a schema? Answer: probably not the worst. If we decide that `SingleReadBucket` should just stay a class, but should live in another package, that's fine too. If we decide that the experimental Avro schemas should live somewhere else and not in `models`, that works as well.

Make sense?
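For what it's worth, a sketch of what the "annotations" option could look like at the class level, in the spirit of Spark's `@Experimental` marker. Nothing like this exists in ADAM or bdg-formats today; the annotated class name is hypothetical.

```scala
// A marker annotation to flag in-development interfaces; a sketch of one
// possible convention, not an existing ADAM/bdg-formats facility.
import scala.annotation.StaticAnnotation

class Experimental(reason: String = "") extends StaticAnnotation

@Experimental("schema still in flux; may change before 1.0")
class SingleReadBucketModel // hypothetical stand-in for an experimental model class
```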
Agreed. Sorry, I think I misinterpreted this issue as focusing on:
Again, if the issue is about making a home for experimental schemas, that sounds reasonable to me. I think though—and this may be because I missed the conversation where the original discussion took place—that the correct way to determine which schemas should be supported and which should be experimental is to figure out what data import/export we want to support in a 1.0, and to work backwards from that. As I understand it, the image posted with this ticket was us triaging our current schemas into supported/experimental on the basis of current usage/maturity. Since we control the timeline to a 1.0, I think it is preferable to work backwards from what we want to support, instead of working forwards from where we are right now.
It isn't clear to me at all that any of our current model classes should become Avro schemas. What is the advantage of having Avro beans + singleton objects vs. Scala case classes with instance methods?
I think this is a false extrapolation. To summarize my opinions:
Serialization! Managed data structures on disk, matching up metadata to external databases, using the same metadata for different data structures, etc.
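As a concrete illustration of the serialization point, here is a minimal sketch using the plain Avro API; the toy `Variant` schema below is an assumption for demonstration, not a bdg-formats schema.

```scala
// An Avro schema is stored with the data, so the file is self-describing on
// disk. Plain Avro API, independent of ADAM.
import java.io.File
import org.apache.avro.{Schema, SchemaBuilder}
import org.apache.avro.file.{DataFileReader, DataFileWriter}
import org.apache.avro.generic.{GenericData, GenericDatumReader, GenericDatumWriter, GenericRecord}

val schema: Schema = SchemaBuilder.record("Variant").fields()
  .requiredString("contig")
  .requiredLong("start")
  .endRecord()

val rec = new GenericData.Record(schema)
rec.put("contig", "chr1")
rec.put("start", 12345L)

val file = new File("variant.avro")
val writer = new DataFileWriter[GenericRecord](new GenericDatumWriter[GenericRecord](schema))
writer.create(schema, file) // the schema is embedded in the container file
writer.append(rec)
writer.close()

val reader = new DataFileReader[GenericRecord](file, new GenericDatumReader[GenericRecord](schema))
val back = reader.next()    // round-trips with the schema intact
reader.close()
```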
There was also a suggestion (@massie, @laserson?) that Avro schemas might make it easier to support Java APIs and (through the Java API) the Python API in the future.
Actually, the question about "how hard is the refactoring" was mostly a question of things like imports between Avro .avdls, usage of IntelliJ, etc. The original question was really mechanical, not so much about the particular class that I chose first.
Yeah, that whiteboard image wasn't a final answer; it was an example of what kinds of decisions we might make. I posted it for completeness, and as an example of what we might work backwards from. The priority here is to get moving on 1.0. We all basically agreed with you, Frank: getting to a 1.0 release is important to a number of us, and no one wants you to be the only person doing any of the work.
Both are backed on disk by Avro schemas (from bdg-formats) as of #906. #909 is open to port this exact fix over to |
In the interest of moving this issue forward, I'm going to focus on this suggestion:
+1. I'd like to support:
The red scribbles on the left side of the board were our high-level suggestions
Some of this might be considered new/experimental, though I'm confident we could get it all whipped into shape for a 1.0.
There were also some high-level comments, like "factor out all the enums to a separate file and align their values with external ontologies, explicitly noted in the comments."
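For instance, a sketch of what ontology-aligned enum values might look like. The enum and its values are illustrative, not actual bdg-formats types; the Sequence Ontology accessions themselves are real.

```scala
// Each enum value carries the Sequence Ontology accession it maps to,
// noted explicitly in a comment.
object FeatureType extends Enumeration {
  val Gene = Value("gene") // SO:0000704
  val MRNA = Value("mRNA") // SO:0000234
  val Exon = Value("exon") // SO:0000147
}
```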
+1 to working backwards from what to support in 1.0. I think @fnothaft's limited selection is a good start. I'd also support @heuermh's additions, though I don't think they are as crucial. Since we don't operate actively on those additional data types (AFAIK), we're probably farther from production-readiness there.
I'd be OK with closing this. The things mentioned here that I see as still unresolved are tracked in separate issues on bdg-formats.
Thoughts:
(1) We need to make progress towards a 1.0 release.

(2) Part, maybe the most important part, of such a release is a clear way where we can commit to supporting certain "stable" Avro record schemas and data models.

(3) The current `models` package is a mess, built without Avro (Scala classes mostly) and whose conceptual overlap with the models from `bdg-formats` is unclear (even to us!).

(4) The current `bdg-formats` package is also a mess, containing both record schemas that we rely on (`AlignmentRecord`!) as well as a number of duplicative or more experimental schemas.

Prior to the Amplab retreat last week (and unfortunately, @fnothaft, after you had left for the day) we -- myself, Matt, Uri, Michael, and Justin -- tried to talk through how we might approach some of these problems in a measured way.
Here are some of the ideas we came up with:
(A) `bdg-formats` will be, post-1.0 release, the "supported" schemas: pared down as much as possible, changing as slowly as possible, and with attention paid to migrations when they change.

(B) Some of the more experimental, unused, or redundant schemas will be moved back into the `models` package in the `adam` repo.

(C) The `models` package will become our experimental "staging" area, with interfaces that we use but aren't committed to supporting in their current form for the long term.

(D) The existing model classes will each be split into an Avro record schema and (for the class methods that live inside each model class at the moment) a paired "Utils" class containing static methods.
Here's a screenshot of one of the whiteboards from last week's conversation, showing a list of all the record schemas in `bdg-formats` today and some rough notation as to which schemas we think might stay, disappear, or move back into `models`:
One of the questions arising from the discussion was "how difficult will (D) be?" That's the question I'm trying to address with the first PR.
I'll editorialize: I'm not sure this is the right answer, but I think (personally) it's a move in the right direction. We need to be clear about what `models` is, which interfaces we're supporting, and how the `models` stuff interacts with the `bdg-formats` stuff. Also, we've got a lot of single-use classes in `models` (Frank, as you pointed out, `SingleReadBucket` is mostly used by `MarkDuplicates`), which seems like a code smell to me. At any rate, even if this exercise is only an impetus to re-evaluate all the schemas that are currently in `bdg-formats` and to pare down the (unmanageably large) number in there at the moment to something more usable, I believe it will have been worth it.