Resource path merging

We think that we can improve on this by changing the logic of how we match up discovered bindings with the bindings of the live spec. We already have a The

    message Discovered {
      message Binding {
        string recommended_name = 1;
        string resource_config_json = 2;
        string document_schema_json = 3;
        repeated string key = 4;
        bool disable = 5;
        repeated string resource_path = 6; // <- Add resource path here
      }
      repeated Binding bindings = 1;
    }

Deleting bindings from live captures

The current semantics of the discovered response are that it enumerates every possible resource that could be captured. In other words, any resource that isn't explicitly included in the discover response is assumed to be un-capturable, and is removed from the live spec. For many source systems, such as database CDC captures, this is indeed the case. If the discover response from a CDC capture connector doesn't include a given table, then it's probably because it's been dropped or the user no longer has access. In either case, the capture is likely to fail at runtime if those bindings are not removed.

But for other systems, it may not be possible or practical to enumerate every distinct resource that could be captured. For example, cloud storage captures have a Another example is the database batch connectors. Users could enter any arbitrary query in the With the current logic, we don't really have the option to avoid deleting such "custom" bindings. The introduction of
One motivation for thinking about this now is that the solution may involve tweaking some things related to I can unpack those ideas a little more, but I'd like to first pause and see what capacity everyone has for even thinking about it.
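To make the current deletion semantics concrete, here is a minimal sketch in Python. It is not the agent's actual code: the function name, the dictionary layout of a binding, and the `name` and `target` keys are all assumptions made for illustration. It only shows the rule described above, where any live binding whose resource is absent from the discover response is dropped.

```python
def prune_undiscovered(live_bindings, discovered_names):
    """Split live bindings into (kept, removed) under the current semantics:
    a binding survives only if its resource name appears in the discover response.

    The binding layout and the "name" key are hypothetical.
    """
    kept, removed = [], []
    for binding in live_bindings:
        name = binding["resource"]["name"]
        if name in discovered_names:
            kept.append(binding)
        else:
            removed.append(binding)
    return kept, removed


# A live capture with one discoverable table and one hand-written "custom" binding.
live = [
    {"resource": {"name": "public.users"}, "target": "acme/users"},
    {"resource": {"name": "custom_query"}, "target": "acme/custom"},
]

# The connector can only enumerate real tables, so the custom binding is not listed.
kept, removed = prune_undiscovered(live, {"public.users", "public.orders"})
```

In this sketch the custom binding ends up in `removed` even though nothing is wrong with it, which is the situation the comment above describes for connectors that cannot enumerate every capturable resource.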
-
I wasn't precise enough: the primary role of auto-discover is to add and update bound collections and type-A (coming from the endpoint) schemas. It's perfect for that.

But we've also leaned on it heavily to kick derivations and materializations at a distance due to inferred schema changes of type-B systems (coming through inference), even where those schema changes may propagate through layers of derivations. That part is a kludge.

The missing feature would be something like an "auto publish", where consumers of a collection are automatically re-published based on inferred schema changes. We've previously discussed monitoring logs of derivations/materializations to look for read schema violations over inferred schemas so we could queue those tasks for re-publication: this is in essence an implementation of an "auto publish" feature.

Today, if tenant A has a capture Getting back to the original discussion, if they weren't coupled as they currently are, then a capture with hand-written bindings and no auto-discovery would Just Work.
Let's be precise about what's meant by "spec" here: does "spec" encapsulate just user-directed configuration? Or also the current state of an inferred schema? IMO ideally it's just the former, and we ought to be able to limit graph expansion to roll up immediate consumers of collections that have changes to user-directed configuration.

Another thought experiment and going back to the tenant A & B example: suppose A is publishing a change to Even this may not be enough. Consider a future where we've opened up ops logs & stats for direct user use, and there are (say) thousands of materializations of the stats collection. How can we practically update the stats collection? I don't think the answer is in one publication applying all of those materializations at once. I think we'll ultimately need to invert this and have consuming tasks asynchronously update themselves in response to upstream changes, in smaller publication units.

But we've probably derailed this discussion enough as it is lol.
-
Tracking issue for implementation: #1285
A discovers job runs a connector to request information about resources that can be captured, and merges that information with specs from an existing draft and/or live specs. There are some problems with the way that we're merging the discovered bindings and collections.

Original issue
The discover merge logic iterates over the newly discovered bindings and tries to find, for each one, an existing binding that corresponds to it. It determines this based on whether the resource object of the existing binding is a strict superset of the resource object of the discovered binding. If so, then it's a match. Here's a description of a recent issue we encountered with that:

- {"name":"foo","template":"/* default template foo */"}
- "template":"<something-else>", and published.
- {"name":"foo","template":"/* default template */"} did not match {"name":"foo","template":"<something-else>"}, so the binding was removed (since addNewBindings was false)
- evergreen
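The mismatch described above can be reproduced with a small sketch of superset-based matching. This is an illustration only, not the merge code itself: `matches_existing` is a hypothetical name, and the check treats identical objects as a match, which is an assumption about what "superset" means here.

```python
def matches_existing(existing_resource: dict, discovered_resource: dict) -> bool:
    """Return True if every key/value pair of the discovered resource also
    appears, unchanged, in the existing binding's resource object.

    Hypothetical helper; equal objects count as a match in this sketch.
    """
    return all(
        key in existing_resource and existing_resource[key] == value
        for key, value in discovered_resource.items()
    )


# The resource as originally discovered and published.
original = {"name": "foo", "template": "/* default template */"}

# The same binding after the user edited the template and published.
edited = {"name": "foo", "template": "<something-else>"}

# A later discover returns the connector's default template again.
rediscovered = {"name": "foo", "template": "/* default template */"}

original_matches = matches_existing(original, rediscovered)
edited_matches = matches_existing(edited, rediscovered)
```

Because the comparison covers every field of the resource object, a user edit to any field that the connector also emits (here `template`) is enough to break the match, even though both objects clearly refer to the same resource `foo`.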
The plan is to:

- Discovered response to add a resource_path field to each binding
- resource_path in the discovered response bindings
- resource_path
- agent use the resource_path as the merge key
- resource_path that doesn't exist in the set of discovered bindings will be removed from the live spec
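A rough sketch of what merging on `resource_path` could look like is shown below. Everything beyond the `resource_path` field and the `addNewBindings` flag is an assumption: the function name, the dictionary layout, and the idea that each live binding already carries a known `resource_path` are illustrative only, and the resource config is shown as a plain dict rather than `resource_config_json` for brevity.

```python
def merge_by_resource_path(live_bindings, discovered_bindings, add_new_bindings=False):
    """Merge discovered bindings into live ones, keyed on resource_path.

    - A live binding whose path was discovered is kept as-is, so user edits
      to its resource config survive.
    - A live binding whose path was not discovered is removed.
    - A discovered binding with an unseen path is added only if add_new_bindings.
    """
    discovered_by_path = {tuple(d["resource_path"]): d for d in discovered_bindings}
    live_paths = {tuple(b["resource_path"]) for b in live_bindings}

    merged = [b for b in live_bindings if tuple(b["resource_path"]) in discovered_by_path]

    if add_new_bindings:
        for path, discovered in discovered_by_path.items():
            if path not in live_paths:
                merged.append({"resource": discovered["resource"], "resource_path": list(path)})
    return merged


live = [
    {"resource": {"name": "foo", "template": "<something-else>"}, "resource_path": ["foo"]},
    {"resource": {"name": "gone"}, "resource_path": ["gone"]},
]
discovered = [
    {"resource": {"name": "foo", "template": "/* default template */"}, "resource_path": ["foo"]},
    {"resource": {"name": "bar", "template": "/* default template */"}, "resource_path": ["bar"]},
]

merged = merge_by_resource_path(live, discovered, add_new_bindings=False)
merged_with_new = merge_by_resource_path(live, discovered, add_new_bindings=True)
```

With the path as the key, the edited `foo` binding is retained along with its custom template, while `gone` is still removed because its path no longer appears in the discover response.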