discovery and schema inference #640
-
I agree that our current approach to the problem seems wrong, and it seems preferable in case 2 to start with a permissive schema and then tighten it later. The main question that I think still needs to be answered relates to the impact on the UX for materializations and derivations. Schema inference will generally output a schema that's as restrictive as possible while still allowing all observed documents to validate against it. This seems to indicate that automatically inferred schemas won't gradually become more restrictive over time; they can only relax. Consider the following hypothetical:
More generally, the schema represents a guarantee to downstream consumers of the data. Relaxation of those guarantees seems far more likely to cause problems for those consumers than tightening them. If a field previously was allowed to be of only a single type and is later relaxed to permit others, consumers that relied on the original guarantee may break.

Idea?

This has me thinking that maybe we actually want there to be multiple schemas. We could add a new schema that is purely informational, and kept up to date continuously by feeding the collection's data through schema inference. This schema would start out as maximally restrictive, and may relax over time. Tasks would not use the informational schema at runtime; it's only there to inform user actions in the control plane. It's OK for it to relax over time because we're not using it for tasks, only to power UX.

The other schema is the collection schema, which is used by tasks for validation. This would start out very permissive, and would gradually tighten over time. The schema would tighten through user action, ideally at the point where they want to actually use the data. Say they're starting with a collection that has a very permissive schema. They want to materialize it, and there are basically no projections and no useful columns. So we present them with an editor to edit the collection schema and projections. We use the informational schema to power an editor that lets them select the fields they want to materialize, which get added to the collection schema and projections. This lets the schema tighten only as much as it needs to in order to accommodate the tasks that the user wants to set up.

The other thought is that there's conceptually little distance between an "informational" schema and data profiling output. It seems like it would work just fine to have that continuous automated process that profiles their data, and use the output of that to power the schema/projection editing. Process-wise, it also seems like it could be a drop-in replacement for schema inference, because the outputs ought to be reducible in the same way.
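To sketch where that field-selection flow could land, here's a rough, hypothetical example of a collection whose permissive schema has been tightened with a couple of user-selected fields plus matching projections. The collection name, fields, and layout are illustrative assumptions, not anything decided in this discussion:

```yaml
collections:
  acmeCo/tickets:              # hypothetical collection with an initially permissive schema
    key: [/id]
    schema:
      type: object
      required: [id]           # originally just the collection key
      properties:
        id: {type: string}
        # Added when the user selected these fields for materialization,
        # guided by the informational (inferred) schema:
        status: {type: string}
        created_at: {type: string, format: date-time}
    projections:
      status: /status
      created_at: /created_at
```

The informational schema only supplies the candidates; nothing is added to the collection schema until the user picks the fields they actually want to materialize.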
-
@psFried, @jgraettinger and I talked about this more today, and came to the following. Collections can have two different schemas: a write-time schema, and an optional read-time schema.

Schema types:
- Write-time schema: applied as documents are captured into the collection. It can be strict when the source offers strong data-shape guarantees, or very permissive when it doesn't.
- Read-time schema: optional, applied as documents are read by derivations and materializations. This is the schema that gets inferred and that evolves over time.

Currently, collection specs contain a single schema that serves both purposes.

Intent

Separating read-time schemas from write-time schemas allows us to capture potentially dirty data and then apply an evolving schema to it after the fact, while also retaining the ability to apply a hard schema upfront when capturing from a data source with strong data-shape guarantees. (A sketch of how this could look in a collection spec follows at the end of this comment.)

Behavior

When we don't have a strong write-schema, we'll need a significant amount of data from which to infer a meaningful read-schema. This is why we'll wait until the user goes to actually do something (derive, materialize) with their collection before attempting to infer its read-time schema. At the moment this will probably look like running a one-off, time-limited schema inference job that returns a schema, but in the future we could very well have a live-tailing schema inference task that keeps track of inferred schemas for each collection. This has the added benefit of not losing data like change history from capture sources that don't allow recovery of this data.

Adapting to change

Over time, the inferred schema for a collection may change as a result of seeing new records that don't match the existing inferred schema. When we encounter this scenario, the consequence is that derivations/materializations will stop, and the user will have to go into the UI and approve an update to the schema that allows the new data through. As part of that process, all existing derivations/materializations will get re-validated with the new schema, which will result either in success if the new schema doesn't conflict with the existing one, or an error if it does. In the case of an error, you'll have to pause or delete the failing task, re-apply the schema, then re-create the task using the new schema, which will, in the case of materializations, require replaying all data.

Note: this is the only place that I feel is sticky. My proposal was to introduce a second option in addition to "approve schema changes", which would be something like "filter data like this". The intent is to allow users to keep processing with the existing schema, and simply drop data that differs from the existing schema in a specific way. The upshot was that this is a conceptual addition, and while it might very well make sense, we should hold off on implementing it until we see clear customer demand, which I do agree with. The workaround for now is to create a derivation that takes the broader schema from the capture and filters out data that doesn't match the desired schema. This necessitates the same replaying of materializations as above, but at least it offers a path if someone wants to do it. The requirement of deleting/re-creating materializations is largely technical, not product. In the future, the desire is to support migrating materializations in specific ways, where possible; we just haven't built that yet.
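To make the write-time/read-time split concrete, here's a sketch of how a collection spec might express the two schemas. The `writeSchema`/`readSchema` key names, the collection, and the fields are illustrative assumptions rather than a settled spec format:

```yaml
collections:
  acmeCo/raw_events:           # hypothetical collection captured from a vendor API
    key: [/id]
    writeSchema:               # permissive: accept whatever the capture produces
      type: object
      required: [id]
      properties:
        id: {type: string}
    readSchema:                # inferred and user-approved; may evolve over time
      type: object
      required: [id, event_type]
      properties:
        id: {type: string}
        event_type: {type: string}
        amount: {type: [number, "null"]}
```

Documents would be validated against the write-time schema as they're captured, and against the read-time schema as derivations and materializations read them, which is what lets the read-time schema evolve without rejecting data at capture time.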
-
Background
We have a few rough categories of capture connectors:
1) connectors where the endpoint itself has a well-defined schema that's known to the connector, and
2) connectors where it doesn't, either:
2a) where the user strictly controls the endpoint schema (because, say, they produce the file objects), or
2b) where the user has no control over the endpoint schema (because the documents are produced by a 3rd party or vendor).
Case 1) is more-or-less a solved problem. The connector can quickly produce an appropriate, restrictive JSON schema as part of the discovery workflow.
Cases 2a) and 2b) are more interesting. Previously, discovery would produce a strict but minimal schema, covering properties we knew would be present (such as the source file path and record offset).
Recently we introduced a workflow where discovery will actively read documents from the endpoint and feed them into a schema inference tool, in order to produce a usefully restrictive schema for downstream materialization / derivations.
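For case 2, the pre-inference discovery output described above looked roughly like the following sketch (the `_meta` property names are illustrative, not the exact properties any particular connector emits):

```yaml
# Strict about the few properties we know will be present, permissive otherwise.
schema:
  type: object
  required: [_meta]
  properties:
    _meta:
      type: object
      required: [file, offset]
      properties:
        file: {type: string}      # source file path
        offset: {type: integer}   # record offset within that file
  additionalProperties: true
```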
Problem
After some real-world testing, we're coming across issues with schema inference during discovery: `required` properties which later turn out to be optional, and properties which are almost-always `string` but occasionally show up as `integer`.
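As a made-up example of that failure mode: inference over the documents sampled during discovery might emit the schema below, and a document arriving later then fails validation against it.

```yaml
# Inferred from the discovery sample, where every document had a quoted zip_code:
schema:
  type: object
  required: [id, zip_code]
  properties:
    id: {type: integer}
    zip_code: {type: string}

# A later document that breaks both assumptions:
laterDocument:
  id: 42
  zip_code: 10464        # now an integer; other documents omit zip_code entirely
```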
Proposal
I think we got this all a bit wrong when we tried to place schema inference into the discovery workflow. Discovery should be a fast operation. In cases where we know the schema (1 above), great. In cases where we don't, we should output a permissive and minimal schema, as we had been doing before.
What we should do instead is to offer tools for "tightening" the schema of existing collections after-the-fact, using our knowledge of their existing documents:
For any collection, we should be able to tell the user what a suitable schema is, and offer them options to take action (say, by updating the collection schema through a UI action).
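Concretely, the tightening tool might take the permissive discovery schema and, from the collection's observed documents, suggest an update like this one for the user to apply (fields again illustrative):

```yaml
# Suggested collection schema, derived from all documents seen so far.
# Only constraints that every observed document satisfies are proposed.
schema:
  type: object
  required: [_meta, id]
  properties:
    _meta:
      type: object
      required: [file, offset]
      properties:
        file: {type: string}
        offset: {type: integer}
    id: {type: string}
    zip_code: {type: [string, integer]}   # observed as both types; left optional
  additionalProperties: true              # stay permissive about unseen fields
```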
What this change buys:
- Discovery stays fast, since it no longer reads and infers over endpoint data.
- Schema tightening is informed by the collection's actual documents rather than a brief discovery sample, so it's far less likely to mark properties as `required` that turn out to be optional, or to pick overly narrow types.
- The user stays in control: we can tell them what a suitable schema is, and they choose whether and when to apply it.
Implementation Challenges
One approach is to infer collection schema using a one-time batch map/reduce operation, where "map" consumes a journal fragment and produces a Shape, and "reduce" unions these Shapes into a final output schema. It's clear that this could work, and is embarrassingly parallel.
Ideally we'd keep a running view of the collection's Shape as we're processing it, and would be able to incrementally / cheaply offer a current schema given all documents to-date.
At first blush this seems doable -- after all, we're processing these documents in shards already in, say, a capture -- but there are some major caveats: a collection can be bound to multiple captures, or can both be a derivation and also be captured into, for example. In cases where multiple tasks produce into a collection, no single task can have a complete picture of the collection's schema. Additionally, a collection need not always have a task which is reading or writing it. Inference should still work in all of these cases.
Broader Context
Incremental schema inference requires reading the collection content and (because there can be more than one producer) cannot be instrumented within any single producer.
Inference is not alone in this: a related use case is exactly-once, random-access reads at arbitrary locations within collection journals. Such reads are useful to support pull-based materializations (Kafka read API) as well as batch map/reduce backfills over historical collection content. These random reads are challenging because exactly-once requires that the reader maintain sequencer state, built up by incrementally reading the journal from byte zero.
For this use case, we probably want an index where, for a given fragment, we can retrieve a sequencer snapshot to use in reading that fragment. This index could be produced either lazily (on-demand) or proactively. If we're reading the data anyway, it seems reasonable that this operation could also accumulate a schema Shape over the given fragment, which can be further reduced as-needed.