-
Notifications
You must be signed in to change notification settings - Fork 63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Phil/discover merge #1291
Phil/discover merge #1291
Conversation
Updates the `connector_tags` handler to set the `resource_path_pointers` column based on the spec response. Also updates the agent to use the new rust `runtime` crate for executing `spec` RPCs instead of the old `flowctl-go api spec`, which will be removed in a subsequent commit.
Introduces a new way of merging discovered and live capture bindings, based on the `resource_path_pointers` from the connector's `spec` response. If `resource_path_pointers` are unavailable, then it will fallback to the previous behavior for matching bindings. This is intended to be left in place only for the short term, while we update connectors to return `resource_path_pointers` as part of their `spec`. The new merge behavior allows for other fields (besides those that are part of `resource_path_pointers`) to differ between the discovered and live bindings while still being recognized as being "the same". In practice, this lets connectors return discovered bindings containing `resource` fields that can be changed by users, without having those bindings be removed by the next discover operation.
b36d45a
to
ff6c0b1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
.arg(&self.connector_network), | ||
) | ||
.await?; | ||
let proto_type = match runtime::flow_runtime_protocol(&image_composed).await { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI no action -- we've not really talked about it IIRC -- but I've been thinking / endeavoring to keep the protobuf tags and names equal so that Spec request / response can be a "universal" RPC that's applied without knowing its type a priori.
This is still pretty half baked. I think it would mean, for example, arbitrarily picking Runtime::unary_capture
to get a spec response even though you have no idea what the actual connector type is.
(And Derivation connectors do also have a Spec RPC)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah, and this code makes the problem pretty apparent. In the case of image connectors, we can know the protocol up front by inspecting the image. But for other (local) connectors, there's no way to tell ahead of time unless we store the configuration elsewhere.
I think it would be nice if we could come up with a way to make spec
a "universal" RPC. Another idea would be to pull the protocol
from the connector_tags
table instead of the other way around. Basically, make whoever deploys the connector specify what type of connector it is.
|
||
/// Extracts the value of each of the given `resource_path_pointers` and encodes | ||
/// them into a `ResourcePath`. Each pointer always gets mapped to a string | ||
/// value, with the string `undefined` being used to represent a missing value. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to have a more cohesive story for how ResourcePath
above, and the repeated string resource_path =
in the connector protocols, become one and the same thing.
Today, resource_path is always a bare string extracted from resource configs in every connector we have, IIRC. I'm thinking we ought to have a glide path where we're no longer asking the connector to produce resource_path
values, but if it does, it had better be the same as our computed ResourcePath
.
That's in tension, though, with using a JSON string serialization here. I recognize that this doesn't really matter just yet, as this representation only lasts for the duration of a single discovery merge, so no immediate action required.
We could do nothing just yet and land this as-is. I'm cool with that.
A proposal, though: we could require that resource path pointers always point to a string in the resource config, and make it an error if they aren't a string.
If they're missing, they're coerced into the empty string (which I don't necessarily love, but think I would support because of 👇. IMO the ideal would be to error if they're absent)
This would work for every specific connector resource config I'm aware of and would also match the extraction behavior of their various Validate RPCs as they exist today. Once migrated, we can always relax it later...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they're missing, they're coerced into the empty string (which I don't necessarily love, but think I would support because of 👇. IMO the ideal would be to error if they're absent)
I came across a few cases that I think require us to support undefined values. The most notable is airbyte connectors, which would properly define resource_path_pointers
as [/stream, /namespace]
. The /namespace
can be used to disambiguate e.g. tables having the same name in different schemas. But the property itself is optional, and typically doesn't appear.
I could definitely get behind the idea of restricting resource paths to string values, just out of a sense of caution. And same goes for validating that any path that's returned from the connector should match the one we've computed. But I think at this point I'm inclined to merge this as is, and take that as a separate effort.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following up here. I confirmed that we do actually have connectors that conditionally use the optional namespace
portion of the resource config. Also, the logic that's currently in place for returning the resource_path
is:
if ns := config.get("namespace"):
resource_path = [ns, config["stream"]]
else:
resource_path = [config["stream"]]
I'm thinking that the safe option is to preserve this existing behavior if/when we extract the resource path using resource_path_pointers
. If we did, then this rust function should actually look more like:
fn resource_path(
resource_path_pointers: &[doc::Pointer],
resource: &serde_json::Value,
) -> anyhow::Result<ResourcePath> {
let mut path = Vec::new();
for pointer in resource_path_pointers {
match pointer.query(resource) {
Some(serde_json::Value::String(str_val)) => path.push(str_val.to_string()),
Some(other) => {
anyhow::bail!("resource path component must be a string, got: {other:?}")
}
None => { /* this is fine */ }
}
}
Ok(path)
}
This actually seems more reasonable to me than the current behavior, in terms of how it (doesn't) represent path components that aren't actually defined by the resource
. The "undefined"
placeholder is smelly, and technically could conflict with an absurdly named postgres schema.
Regarding how we might rectify the resource_path_pointers
with the resource_path
in Validated:
I'm still of two minds. One the one hand, I think it makes sense to just validate that the resource_path
on the Validated response matches the path that's extracted using resource_path_pointers
. On the other hand, we don't actually need the resource_path
to be part of the Validated response at all, and would be a simplification to deprecate and remove it. If there's value in keeping resource_path
in Validated, it seems to be as a sanity check that the current connector code is in agreement with the known spec. That's something, but...
I've also been thinking that we should be extremely careful about any changes to the resource_path
, to the point were it probably makes sense for us to validate that a connector_tags
job doesn't change an existing value of resource_path_pointers
. This is because the resource path is used to key per-binding task states, so any changes to it will obviously cause problems.
So coming full circle, I'm leaning toward:
- Use the
resource_path_pointers
to extract the resource path during validation- bikeshed: how should we thread through the connector spec, so validation can use the pointers?
- Deprecate and remove
resource_path
from the Validated response - Short term: validate that
resource_path
in Validated response, if present, matches the extracted resource path - Update the
connector_tags
job to error out if it would change an existing non-null value ofresource_path_pointers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of this sounds right to me, including your proposed implementation for resource_path
and handling for missing path pointers.
bikeshed: how should we thread through the connector spec, so validation can use the pointers?
I think you're referring to kind-of inverting how validation currently works: instead of the response having a resource path that the connector gives the runtime, instead the request would have a resource path given to the connector by the runtime?
This makes sense to me. As a matter of protocol philosophy (which we've already been living), IMO the runtime should over-communicate to the connector and expect as little as possible in return.
Description:
Introduces a new approach to discover merge, based on a configured array of JSON pointers that are used to extract a
resourcePath
from theresource
of capture bindings.The
resource_path_pointers
can be returned by connectors as part of theirspec
response.If they are, then they'll be persisted as part of the
connector_tags
table.Eventually,
resource_path_pointers
will be a required field in the response, but for now, we're allowing connectors to still omit them during the transition period.If the
resource_path_pointers
are available, then we'll use them to merge discovered bindings with the bindings on the live spec. Otherwise, we'll fallback to the original behavior for doing that merge.The new discover merge behavior works by using the
resource_path_pointers
to extract aResourcePath
(Vec<String>
under the hood) from each discovered and liveresource
. The extractedResourcePath
is then used to define equality between bindings. This lets it ignore changes to any fields that do not belong to theResourcePath
when merging bindings.Workflow steps:
I tested this locally by:
source-http-ingest
andsource-postgres-batch
resource_path_pointers
value inconnector_tags
resource_path_pointers = array['/stream']
forsource-http-ingest
resource_path_pointers = array['/name']
forsource-postgres-batch
Documentation links affected:
None
Notes for reviewers:
Connector developers will need to be extremely careful about any updates to
resource_path_pointers
. If you change the pointers that are used to merge bindings, it can easily result in discovers removing bindings where it wasn't intended. I think a good general rule should be to just not changeresource_path_pointers
without introducing a new majorconnector_tags
version. Though there's probably ways to add pointers without breaking things in some cases, so I've chosen not to actually prevent you from changingresource_path_pointers
at this time.The migration is numbered to slot it after @kiahna-tucker's migration that's being added in #1248, which is why it skips a number.
This change isdata:image/s3,"s3://crabby-images/d0bb7/d0bb7f7625ca5bf5c3cf7a2b7a514cf841ab8395" alt="Reviewable"