discovery and schema inference #640

jgraettinger · 2022-08-21T16:28:46Z

jgraettinger
Aug 21, 2022
Maintainer

Background

We have a few rough categories of capture connectors:

Captures which have a strict, known schema from their captured endpoint system (such as SQL CDC)
Captures over unstructured stores, like cloud storage...
a) where the user strictly controls the endpoint schema (because, say, they produce the file objects), or
b) where the user has no control over endpoint schema (because they're produced by a 3rd party or vendor).

Case 1) is more-or-less a solved problem. The connector can quickly produce an appropriate, restrictive JSON schema as part of the discovery workflow.

Cases 2a) and 2b) are more interesting. Previously, discovery would produce a strict but minimal schema, covering properties we knew would be present (such as the source file path and record offset).

Recently we introduced a workflow where discovery will actively read documents from the endpoint and feed them into a schema inference tool, in order to produce a usefully restrictive schema for downstream materialization / derivations.

Problem

After some real-world testing, we're coming across issues with schema inference during discovery:

It's slow. Connectors we've implemented must read a meaningful amount of data in order to come up with a useful schema. Today when the user clicks "Discover Endpoint", they essentially watch a spinner for an indefinite amount of time without an indication of what's happening or why. This is doubly-true with realtime or streaming data sources where we must await a reasonable amount of data that's trickling in.
It's wrong. Even after waiting for a reasonable chunk of data to infer over, we frequently see that schemas are still incorrect with respect to future documents. Specific examples are required properties which later turn out to be optional, and properties which are almost-always string but occasionally show up as integer.
It can be hard to get a good sample. It's impractical to introspect a bucket with terrabytes of data as a quick user-in-the-loop operation. We've spent a bit of time adding reservoir sampling heuristics, but schemas often change over time and there are no guarantees we'll get a representative sample.
Every connector is a bit different. We're finding we must implement customized discover logic for each connectors, which is untennable and tends to largely repeat work we've already done (for example, listing file objects and reading a selection in a slightly different way than that done at runtime).

Proposal

I think we got this all a bit wrong, when we tried to place schema inference into the discovery workflow. Discovery should be a fast operation. In cases where we know the schema (1 above), great. In cases where we don't, we should output a permissive and minimal schema as we had been before.

What we should do instead is to offer tools for "tightening" the schema of existing collections after-the-fact, using our knowledge of their existing documents:

For 2a) above, the user might update the collection schema to a restrictive version after capturing all current data.
For 2b), the user might leave the permissive schema in place but use a restrictive read-time schema for derivations.

For any collection, we should be able to tell the user what a suitable schema is, and offer them options to take action (say, by updating the collection schema through a UI action).

What this change buys:

Discovery is once again a fast operation, and doesn't require looking at actual data.
Inferred schemas are comprehensive over the entire contents of a collection, and not just a sample.
Connectors are significantly simplified, as they no longer concern themselves with inference.

Implementation Challenges

One approach is to infer collection schema using a one-time batch map/reduce operation, where "map" consumes a journal fragment and produces a Shape, and "reduce" unions these Shapes into a final output schema. It's clear that this could work, and is embarrassingly parallel.

Ideally we'd keep a running view of the collection's Shape as we're processing it, and would be able to incrementally / cheaply offer a current schema given all documents to-date.

At first blush this seems doable -- after all, we're processing these documents in shards already in, say, a capture-- but there are some major caveats: a collection can be bound to multiple captures, or both be a derivation and also be captured into, for example. In cases where multiple tasks produce into a collection, no single task can have a complete picture of the collection's schema. Additionally, a collection need not always have a task which is reading or writing it. Inference should still work in all of these cases.

Broader Context

Incremental schema inference requires reading the collection content and (because there can be more than one producer) cannot be instrumented within the producer.

Inference is not alone in this: a related use case is exactly-once random access reads at arbitrary locations within collection journals. Such reads are useful to support pull-based materializations (Kafka read API) as well as batch map/reduce backfills over historical collection content. These random reads are a related challenge because exactly-once requires that the reader maintain sequencer state by incrementally reading the journal from byte zero.

For this use case, we probably want an index where, for a given fragment, we can retrieve a sequencer snapshot to use in reading that fragment. This index could be produced either lazily (on-demand) or proactively. If we're reading the data anyway, it seems reasonable that this operation could also accumulate a schema Shape over the given fragment, which can be further reduced as-needed.

psFried · 2022-08-22T22:30:17Z

psFried
Aug 22, 2022
Maintainer

I agree that our current approach to the problem seems wrong, and it seems preferable in case 2 to start with a permissive schema and then tighten it later.

The main question that I think still needs answered relates to the impact on the UX for materializations and derivations. Schema inference will generally output a schema that's as restrictive as possible while allowing all observed documents to validate against it. This seems to indicate that schemas won't gradually become more restrictive over time if they're being inferred automatically. Consider the following hypothetical:

User creates a capture from cloud storage, which starts out with a very permissive schema.
The first fragment of data is written, and schema inference is run and the user updates the schema with the result.
Subsequent fragment files are written and schema inference is run again. It now outputs a less restrictive schema because it's observed new types of values that weren't in the original, smaller corpus.
- As we periodically update the schema, it might need to become less restrictive, but it will never become more restrictive. So the schema that's output from that first run is likely to be the most restrictive schema that would ever be applied to that collection.
Now the the user creates a materialization of the collection, which pulls in all the fields that have a single scalar value (because Flow helpfully infers and selects all such projections by default).
When schema inference is run again, one of those fields is no longer a single scalar value because a new type of value was observed. A less restrictive schema is output.
The new schema can't be applied to the collection now because the materialization driver will fail validation because you can't materialize a field with multiple types.

More generally, the schema represents a guarantee to downstream consumers of the data. Relaxation of those guarantees seems far more likely to cause problems for those consumers than tightening them. If a field previously was allowed to be of type ["string", "null"], then it's unlikely to cause problems for consumers of that data if you update the schema to only allow "string". But if you went the other way and change it to ["string", "null", "integer"], then 💥 So if we're updating the schema over time, it seems best to try to make it gradually more restrictive rather immediately going to the most restrictive schema possible.

Idea?

This has me thinking that maybe we actually want there to be multiple schemas.

We could add a new schema that is purely informational, and kept up to date continuously by feeding their data through schema inference. This schema would start out as maximally restrictive, and may relax over time. Tasks would not use the informational schema at runtime. It's only there to inform user actions in the control plane. It's OK for it to relax over time because we're not using it for tasks, only to power UX.

The other schema is the collection schema, which is used by tasks for validation. This would start out very permissive, and would gradually tighten over time. The schema would tighten through user action, ideally at the point where they want to actually use the data. Say they're starting with a collection that has a very permissive schema. They want to materialize it, and there's basically no projections, and no useful columns. So we present them with an editor to edit the collection schema and projections. We use the informational schema to power an editor that lets them select the fields they want to materialize, which get added to the collection schema and projections. This lets the schema tighten only as much as it needs to in order to accommodate the tasks that the user wants to setup.

The other thought is that there's conceptually little distance between an "informational" schema and data profiling output¹. It seems like it would work just fine to have that continuous automated process that profiles their data, and use the output of that to power the schema/projection editing. Process wise, it also seems like it could be a drop-in replacement for schema inference because the outputs ought to be reducible in the same way.

Profiling here means counting the number of observed values of different types at all locations within each document. We want to do this anyway, so we can tell users about how often various fields are null/undefined, and other information like that. Profiling is a rich and contentious subject, but for our purposes, even fairly simplistic data profiling strategies will be able to produce output that's good enough to turn into a JSON schema. ↩

0 replies

jshearer · 2022-09-28T16:55:08Z

jshearer
Sep 28, 2022
Maintainer

@psFried, @jgraettinger and I talked about this more today, and came to the following. Collections can have two different schemas: a write-time schema, and an optional read-time schema.

Schema types:

Write-Time schemas exist to constrain the data coming into a collection with constraints we know for sure to be true. These probably come from the capture's schema output, which should only contain constraints that can be guaranteed by the data source in question. These are the only schemas that collections have right now.
Read-Time schemas' purpose is to allow us to evolve (relax) the collection schema over time, as new data is read. In the case of a well-constrained capture, the read-time schema doesn't need to be provided and will default to being equivalent to the write-time schema. A collection's read-time schema will be used by derivations/materializations of that collection.

Paint Color

Currently, collection specs contain {"schema": {...}}. The proposal is to define that "schema" as the collection's write-time schema, add an additional {"read_schema":{...}} which, when specified, would be the read-time schema for that collection. If not specified, the read-time schema will default to the write-time schema.

Intent

Separating read-time schemas from write-time schemas allow us to capture potentially dirty data and then apply an evolving schema to it after-the-fact, while also retaining the ability to apply a hard schema upfront when capturing from a datasource with strong data-shape guarantees.

Behavior

When we don't have a strong write-schema, we'll need a significant amount of data from which to infer a meaningful read-schema. This is why we'll wait until the user goes to actually do something (derive, materialize) with their collection before attempting to infer its read-time schema.

At the moment this will probably look like running a once-off time-limited schema inference job that returns you a schema, but in the future we could very well have a live-tailing schema inference task that keeps track of inferred schemas for each collection. This has the added benefit of not losing data like change-history from capture sources that don't allow recovery of this data.

Adapting to change

Over time, the inferred schema for a collection may change as a result of seeing new records that don't match the existing inferred schema. When we encounter this scenario, the consequence is that derivations/materializations will stop, and the user will have to go into the UI and approve an update to the schema that allows the new data through.

As part of that process, all existing derivations/materializations will get re-validated with the new schema, which will result either in success if the new schema doesn't conflict with the existing one, or an error if it does. In the case of an error, you'll have to pause or delete the failing task, re-apply the schema, then re-create the task using the new schema which will, in the case of materializations, require replaying all data.

Note: This is the only place that I feel is sticky here. My proposal was to introduce a second option in addition "approve schema changes", which would be something like "filter data like this". The intent is to allow users to keep processing with the existing schema, and simply drop data that differs from the existing schema in a specific way. The upshot was that this is a conceptual addition, and while it might very well make sense, we should hold off on implementing it until we see clear customer demand, which I do agree with.

The work-around for now is to create a derivation that takes the broader schema from the capture, and filters out data that doesn't match the desired schema. This necessitates the same replaying of materializations as above, but at least it does offer a path if someone wants to do it.

The requirement of deleting/re-creating materializations is largely technical, not product. In the future, the desire is to support migrating materializations in specific ways, where possible. We just haven't built that yet.

1 reply

jgraettinger Sep 28, 2022
Maintainer Author

This separation of write vs read schema also opens up a really interesting capability:

the task which is producing documents to a collection already knows what it's read schema is, and can proactively "do something" if it wrote documents which passed the write-schema but failed the read-schema. For example, it could produce a summary of the shape of violating documents into an ops/ collection.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

discovery and schema inference #640

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 1 reply

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

Select a reply

discovery and schema inference #640

jgraettinger Aug 21, 2022 Maintainer

Background

Problem

Proposal

Implementation Challenges

Broader Context

Replies: 2 comments · 1 reply

psFried Aug 22, 2022 Maintainer

Idea?

Footnotes

jshearer Sep 28, 2022 Maintainer

Schema types:

Paint Color

Intent

Behavior

Adapting to change

jgraettinger Sep 28, 2022 Maintainer Author

jgraettinger
Aug 21, 2022
Maintainer

Replies: 2 comments 1 reply

psFried
Aug 22, 2022
Maintainer

jshearer
Sep 28, 2022
Maintainer

jgraettinger Sep 28, 2022
Maintainer Author