How to better merge discovered bindings #1268

psFried · 2023-10-31T18:24:04Z

Maintainer

Tracking issue for implementation: #1285

A discover job runs a connector to request information about resources that can be captured, and merges that information with specs from an existing draft and/or live specs. There are some problems with the way that we're merging the discovered bindings and collections.

Original issue

The discover merge logic iterates over the newly discovered bindings and tries to find an existing binding that corresponds to each one. It determines this based on whether the resource object of the existing binding is a strict superset of the resource object of the discovered binding. If so, then it's a match. Here's a description of a recent issue we encountered with that:

The connector discovered a bunch of bindings where the resources looked something like {"name":"foo","template":"/* default template foo */"}
The user edited the resource config to set "template":"<something-else>", and published.
Later, discover ran again and {"name":"foo","template":"/* default template foo */"} did not match {"name":"foo","template":"<something-else>"}, so the binding was removed (since addNewBindings was false).
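To make the failure concrete, here is a minimal sketch of the superset check described above. The function name `is_superset` is hypothetical and this is not the agent's actual code; it also treats an identical resource as a match, which is an assumption about how the check behaves on an unedited binding.

```python
def is_superset(existing: dict, discovered: dict) -> bool:
    """Return True if every key/value pair of the discovered resource
    is also present, with the same value, in the existing resource."""
    return all(
        key in existing and existing[key] == value
        for key, value in discovered.items()
    )


# Resource config as emitted by the connector during discovery.
discovered = {"name": "foo", "template": "/* default template foo */"}

# The binding as first published, and after the user edited the template.
original = dict(discovered)
edited = {"name": "foo", "template": "<something-else>"}

print(is_superset(original, discovered))  # True
print(is_superset(edited, discovered))    # False
```

Because the edited `template` no longer equals the discovered one, the existing binding is treated as unrelated to the discovered binding, even though both refer to the same resource `foo`.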

evergreen

The plan is to:

Update the Discovered response to add a resource_path field to each binding
Update all connectors to set resource_path in the discovered response bindings
Expect and require that every discovered binding has a unique (to the capture) resource_path
Have agent use the resource_path as the merge key
Existing bindings on the live spec having a resource_path that doesn't exist in the set of discovered bindings will be removed from the live spec
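The plan above could be sketched roughly as follows. This is an illustration under assumptions, not the agent's implementation: `Binding`, `merge_by_resource_path`, and the choice to keep the live binding's resource config untouched on a match are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

ResourcePath = Tuple[str, ...]


@dataclass
class Binding:
    resource_path: ResourcePath
    resource: Dict[str, str]


def merge_by_resource_path(
    live: List[Binding],
    discovered: List[Binding],
    add_new_bindings: bool,
) -> Tuple[List[Binding], List[Binding]]:
    """Merge using resource_path as the key. Returns (merged, removed)."""
    discovered_paths = [b.resource_path for b in discovered]
    # Every discovered binding must have a resource_path unique to the capture.
    if len(set(discovered_paths)) != len(discovered_paths):
        raise ValueError("each discovered binding needs a unique resource_path")

    live_by_path = {b.resource_path: b for b in live}
    merged: List[Binding] = []
    for d in discovered:
        existing = live_by_path.get(d.resource_path)
        if existing is not None:
            # Assumption: a matched live binding keeps the user's edits.
            merged.append(existing)
        elif add_new_bindings:
            merged.append(d)

    # Live bindings whose path was not discovered are dropped.
    known = set(discovered_paths)
    removed = [b for b in live if b.resource_path not in known]
    return merged, removed
```

With this key, a live binding with path `("foo",)` and an edited `template` still matches the discovered binding for `("foo",)`, so the edit no longer causes the binding to be removed.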

psFried · 2023-10-31T18:24:27Z

Maintainer Author

Resource path merging

We think that we can improve on this by changing the logic of how we match up discovered bindings with the bindings of the live spec. We already have a resource_path attached to each binding in the built_spec, which uniquely identifies resources that can be captured. This mirrors the resource_path in the materialization protocol.

The Validate RPC returns the resource_path of each binding, which gets included in the built_spec. So the remaining need is just to add resource_path as part of the Discovered response, so that the discovered bindings can be matched to those in the built spec. Something like this:

  message Discovered {
    message Binding {
      string recommended_name = 1;
      string resource_config_json = 2;
      string document_schema_json = 3;
      repeated string key = 4;
      bool disable = 5;
      repeated string resource_path = 6;  // <- Add resource path here
    }
    repeated Binding bindings = 1;
  }

Deleting bindings from live captures

The current semantics of the discovered response are that it enumerates every possible resource that could be captured. In other words, any resource that isn't explicitly included in the discover response is assumed to be un-capturable, and is removed from the live spec. For many source systems, such as database CDC captures, this is indeed the case. If the discover response from a CDC capture connector doesn't include a given table, then it's probably because it's been dropped or the user no longer has access. In either case, the capture is likely to fail at runtime if those bindings are not removed.

But for other systems, it may not be possible or practical to enumerate every distinct resource that could be captured. For example, cloud storage captures have a stream in their resource configs that can represent an arbitrary bucket/prefix combination to capture from. Cloud storage connectors don't emit every possible prefix in their discover responses, so if a user edits the stream in the resource config, then a subsequent discover would delete their binding.

Another example is the database batch connectors. Users could enter any arbitrary query in the template, which need not correspond directly to a single extant table.

With the current logic, we don't really have the option to avoid deleting such "custom" bindings. The introduction of resource_path to discover responses opens up some possible approaches. I'm seeing a few different paths we can explore:

Ascribe special meaning to a zero-valued resource_path to indicate a "custom" binding that should not be removed automatically.
Pass existing resource configs as part of a discover request, so that connectors can return them as part of the response.
Do nothing for now, and hope that users don't customize the resource and enable autoDiscover

One motivation for thinking about this now is that the solution may involve tweaking some things related to resource_paths, and it might be nice to update connectors while we're at it anyway.

I can unpack those ideas a little more, but I'd like to first pause and see what capacity everyone has for even thinking about it.

6 replies

jgraettinger Nov 2, 2023
Maintainer

Ascribe special meaning to a zero-valued resource_path to indicate a "custom" binding that should not be removed automatically.

It's a possibility. I'm not hard-no, but disposed against because it means a connector's Validate RPC needs to effectively do a Discover as well, in order to figure out if a validated resource is one that would be emitted by discovery (in which case, send a full resourcePath) or one that wouldn't (in which case, send an empty resource path).

It also means updating a bunch of validation checks that assert resourcePaths are unique -- they'd all need to know about & handle this exception.
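As a rough illustration of that second cost, here is a sketch of a uniqueness check that has to carve out the zero-valued case. The function `check_resource_paths` and its `allow_custom` flag are hypothetical and only show the shape of the exception each such check would need.

```python
from typing import Iterable, Sequence


def check_resource_paths(paths: Iterable[Sequence[str]], allow_custom: bool) -> None:
    """Raise ValueError on a duplicate resource path.

    If allow_custom is True, empty paths are treated as "custom" bindings
    and are exempt from the uniqueness requirement.
    """
    seen = set()
    for path in paths:
        key = tuple(path)
        if not key and allow_custom:
            # Zero-valued path: skip the uniqueness check entirely.
            continue
        if key in seen:
            raise ValueError(f"duplicate resource_path: {list(key)}")
        seen.add(key)
```

Without the exemption, two "custom" bindings with empty paths would collide with each other and fail validation.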

Pass existing resource configs as part of a discover request, so that connectors can return them as part of the response.

I think this is a non-starter because it blows up the scope of what a connector needs to do in its discover RPC (it's now responsible for reconciliation).

Do nothing for now, and hope that users don't customize the resource and enable autoDiscover

I think this is our only realistic option right now. Captures should either have only (customized) resources returned by discovery, or should never have used discovery in the first place. If we do it this way, I cannot see a clear argument why that would be insufficient or unreasonable in the future. They're wildly different use cases and setup workflows.

jgraettinger Nov 2, 2023
Maintainer

I think that "do nothing" isn't an option here.

@dyaffe what this is in reference to is whether users would be able to have customized bindings / tables in their capture that are mixed with bindings that were never returned by discovery and were crafted from whole cloth by the user (this is what @psFried meant by "custom" in this context).

An example would be a binding that's querying a completely arbitrary SELECT ... query using a batch-sql capture that's mixed -- in the same capture -- with bindings that were originally returned via discovery of actual tables and perhaps customized from there.

IMO our position should be "use a different, GitOps-managed capture without auto-discovery for non-discovered bindings". I think that's the right workflow anyway even without the technical implementation considerations.

psFried Nov 2, 2023
Maintainer Author

I cannot see a clear argument why that would be insufficient or unreasonable in the future.

One argument is that users will likely want to use schema inference with "custom" bindings, and they would have no way to automatically update the effective inferred schema without autoDiscover. This would be true regardless of whether they used a separate capture or not.

Some other options that come to mind:

We could add a retainUnknownBindings option under autoDiscover, and make users opt in or out on a per-capture basis. It would probably still be best to segregate the "custom" bindings from the discovered ones in a separate capture. But at least you'd be able to use autoDiscover.

We could also add a retainUnknownBindings flag on the spec response, and use that either to set the default value of autoDiscover.retainUnknownBindings, or else just use that flag directly within the discovers handler. I can think of at least one example (source-http-ingest) where this would seem pretty reasonable, since all of the bindings of that connector are effectively "custom".
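Here is a small sketch of how a retainUnknownBindings option might change the outcome of a discover, assuming bindings are keyed by resource path. The function `plan_bindings` is hypothetical, and whether the flag comes from autoDiscover or from the spec response is left open.

```python
from typing import List, Tuple

ResourcePath = Tuple[str, ...]


def plan_bindings(
    live: List[ResourcePath],
    discovered: List[ResourcePath],
    retain_unknown_bindings: bool,
) -> List[ResourcePath]:
    """Return the resource paths that remain after merging a discover."""
    discovered_set = set(discovered)
    live_set = set(live)
    # Live bindings survive if they were discovered, or if unknown
    # ("custom") bindings are explicitly retained.
    kept = [p for p in live if p in discovered_set or retain_unknown_bindings]
    # Newly discovered resources are appended after the kept bindings.
    added = [p for p in discovered if p not in live_set]
    return kept + added


live = [("public", "users"), ("custom-query",)]
discovered = [("public", "users"), ("public", "orders")]

print(plan_bindings(live, discovered, retain_unknown_bindings=False))
print(plan_bindings(live, discovered, retain_unknown_bindings=True))
```

With the flag off, the hand-written `("custom-query",)` binding is dropped; with it on, the binding is kept alongside the discovered ones.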

jgraettinger Nov 2, 2023
Maintainer

One argument is that users will likely want to use schema inference with "custom" bindings, and they would have no way to automatically update the effective inferred schema without autoDiscover.

That's true. But I think it's a symptom of a different problem: capture autoDiscover is a kludge, with regard to updating materializations due to a change of the inferred schema of a collection.

It also doesn't really work for derivations, and only happens to work today because we are over-broad in the graph expansion we do within the control plane. We ought to expand the publication graph less, but supposing we did that, it would definitely be a problem for derivations as well.

I think we need to separate these problems.

psFried Nov 2, 2023
Maintainer Author

I think it's a symptom of a different problem: capture autoDiscover is a kludge, with regard to updating materializations due to a change of the inferred schema of a collection.

I think we're seeing this differently, assuming I've understood your meaning. I don't see autoDiscover as a kludge. The way I see it, the point of autoDiscover isn't to try to update materializations per se, but rather to update the collections (and the capture spec, of course). The fact that materializations and derivations that read from a given collection get updated when you publish a change to that collection seems right to me. I see this as the basic mechanism that we use in order to maintain consistency. I thought this was by design, so that we don't end up in situations where every consumer of a collection has a slightly different view of what they think the read schema is.

And having autoDiscover be responsible for updating the schemas of captured collections also seems in line with the established principle that the source connector "owns" the definitions of the schemas. Note that it's potentially also updating the writeSchema as part of this. The fact that autoDiscover doesn't also handle updating derived collection schemas doesn't seem like a deficiency to me. My understanding was that we were good with the plan to introduce a separate mechanism for updating derivation schemas.

I'm also not really sure how to take the comment about graph expansion being overly broad. My understanding is that the general idea is that we perform graph expansion as required to ensure that all tasks bound to a given collection will use the exact same collection spec when reading from or writing to it. So given that, I'm not seeing how we can do less of that while still maintaining a consistent view of collection specs across all tasks that are bound to it.

I'll pause here and ask that you try to point out where you're seeing things differently.

jgraettinger · 2023-11-03T14:31:12Z

Maintainer

I wasn't precise enough: autoDiscover is a kludge when relied on for updating a long-distance graph of derivations and materializations.

The primary role of auto-discover is to add and update bound collections and type-A (coming from the endpoint) schemas. It's perfect for that.

But we've also leaned on it heavily to kick derivations and materializations at-a-distance due to inferred schema changes of type-B systems (coming through inference), even where those schema changes may propagate through layers of derivations. That part is a kludge.

The missing feature would be something like an "auto publish", where consumers of a collection are automatically re-published based on inferred schema changes. We've previously discussed monitoring logs of derivations/materializations to look for read schema violations over inferred schemas so we could queue those tasks for re-publication: this is in-essence an implementation of an "auto publish" feature.

Today, if tenant A has a capture A/capture writing to inferred-schema collection A/collection, that's materialized by tenant B through materialization B/materialize, and A doesn't have or want autoDiscover enabled, then tenant B has no recourse for having their materialization automatically adapt to schema changes of A/collection. That smells. Why should A have to do that? Why does automatically discovering new bindings / collections / type-A schemas need to have anything to do with reacting to inference updates of already-established, type-B collections?

Getting back to the original discussion, if they weren't coupled as they currently are, then a capture with hand-written bindings and no auto-discovery would Just Work.

My understanding is that the general idea is that we perform graph expansion as required to ensure that all tasks bound to a given collection will use the exact same collection spec when reading from or writing to it.

Let's be precise about what's meant by "spec" here: does "spec" encapsulate just user-directed configuration? Or also the current state of an inferred schema? IMO ideally it's just the former, and we ought to be able to limit graph expansion to roll up immediate consumers of collections that have changes to user-directed configuration.

Another thought experiment and going back to the tenant A & B example: suppose A is publishing a change to A/capture that does not directly change or relate to the existing A/collection. However B/materialize happens to be failed due to an inferred schema change in A/collection that cannot be migrated without user intervention. Today, our graph expansion means that A's attempt to publish their capture will fail, and there's nothing they can do about it, as they cannot themselves fix B/materialize (they don't admin it). That seems broken, and fixing it would require not expanding to B/materialize.

Even this may not be enough. Consider a future where we've opened up ops logs & stats for direct user use, and there are (say) thousands of materializations of the stats collection. How can we practically update the stats collection? I don't think the answer is in one publication applying all of those materializations at-once. I think we'll ultimately need to invert this and have consuming tasks asynchronously update themselves in response to upstream changes, in smaller publication units.

But we've probably derailed this discussion enough as it is lol.

2 replies

psFried Nov 3, 2023
Maintainer Author

Thanks for this, I think I can see where we weren't on the same page now.

does "spec" encapsulate just user-directed configuration? Or also the current state of an inferred schema? IMO ideally it's just the former, and we ought to be able to limit graph expansion to roll up immediate consumers of collections that have changes to user-directed configuration.

I see the sense in this, but I also see it as being incompatible with a few aspects of the current system. Most recently, #1235 seems to be based on the opposite framing where the inferred schema is treated as part of the collection spec. But more importantly, there's not currently any way to publish a collection spec without causing the inferred schema to be updated for all consumers of it.

going back to the tenant A & B example: suppose A is publishing a change to A/capture that does not directly change or relate to the existing A/collection. However B/materialize happens to be failed due to an inferred schema change in A/collection that cannot be migrated without user intervention. Today, our graph expansion means that A's attempt to publish their capture will fail, and there's nothing they can do about it, as they cannot themselves fix B/materialize (they don't admin it). That seems broken, and fixing it would require not expanding to B/materialize.

Fixing this seems like it would require reverting #1235, and moving to an approach where inferred schemas are only updated by the separate "auto publish" process. If we include the collection in the publication, then we still have to expand the graph to include B/materialize, yeah? So our only recourse is to not have autoDiscovers include collections where the spec is identical but the inferred schema has changed.

I would think that we'd also need to change how we track the live_inferred_schema_md5, which would need to instead track the effective inferred schema md5 for each binding of each consumer of the collection (since they can change independently).

I understand this is a bit off topic, but I feel like I got whooshed regarding the data model, and want to understand the scope of implementation that isn't aligned with the "new" model.

jgraettinger Nov 3, 2023
Maintainer

Sure. Though there isn't a "new" model per-se, and I don't have definite answers here. The only take-aways I have to offer right now are:

a) I don't think we've totally nailed the dynamic between inferred schema changes and resulting publications
b) autoDiscovers -- as a practical implementation -- is super helpful for inferred schemas, but IMO we are asking a lot of it
c) we're probably missing a control feedback loop that reacts to observed inferred schema failures
d) given all that, we shouldn't couple or require that autoDiscovers be an answer for hand-written capture bindings, which is how the discussion led here.
