Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Phil/discover merge #1291

Merged
merged 3 commits into from
Nov 21, 2023
Merged

Phil/discover merge #1291

merged 3 commits into from
Nov 21, 2023

Conversation

psFried
Copy link
Member

@psFried psFried commented Nov 21, 2023

Description:

Introduces a new approach to discover merge, based on a configured array of JSON pointers that are used to extract a resourcePath from the resource of capture bindings.

The resource_path_pointers can be returned by connectors as part of their spec response.
If they are, then they'll be persisted as part of the connector_tags table.
Eventually, resource_path_pointers will be a required field in the response, but for now, we're allowing connectors to still omit them during the transition period.

If the resource_path_pointers are available, then we'll use them to merge discovered bindings with the bindings on the live spec. Otherwise, we'll fallback to the original behavior for doing that merge.

The new discover merge behavior works by using the resource_path_pointers to extract a ResourcePath (Vec<String> under the hood) from each discovered and live resource. The extracted ResourcePath is then used to define equality between bindings. This lets it ignore changes to any fields that do not belong to the ResourcePath when merging bindings.

Workflow steps:

I tested this locally by:

  • Installing source-http-ingest and source-postgres-batch
  • Since the connectors have not yet been updated, there is no resource_path_pointers value in connector_tags
  • Setup a capture for each, update the query template for the batch sql capture to a custom value
  • Edit the captures and hit the "re-fresh" button to perform discovers, observe that the binding with the custom template was removed. Cancel the edit
  • Manually update connector tags to set resource_path_pointers = array['/stream'] for source-http-ingest
  • Manually update connector tags to set resource_path_pointers = array['/name'] for source-postgres-batch
  • Now try to edit the captures again and when you hit "re-fresh" the bindings are no longer getting deleted. They should be merged as expected

Documentation links affected:

None

Notes for reviewers:

Connector developers will need to be extremely careful about any updates to resource_path_pointers. If you change the pointers that are used to merge bindings, it can easily result in discovers removing bindings where it wasn't intended. I think a good general rule should be to just not change resource_path_pointers without introducing a new major connector_tags version. Though there's probably ways to add pointers without breaking things in some cases, so I've chosen not to actually prevent you from changing resource_path_pointers at this time.

The migration is numbered to slot it after @kiahna-tucker's migration that's being added in #1248, which is why it skips a number.


This change is Reviewable

@psFried psFried requested a review from jgraettinger November 21, 2023 17:11
Updates the `connector_tags` handler to set the `resource_path_pointers` column
based on the spec response.

Also updates the agent to use the new rust `runtime` crate for executing `spec`
RPCs instead of the old `flowctl-go api spec`, which will be removed in a
subsequent commit.
Introduces a new way of merging discovered and live capture bindings, based on
the `resource_path_pointers` from the connector's `spec` response.  If
`resource_path_pointers` are unavailable, then it will fallback to the previous
behavior for matching bindings.  This is intended to be left in place only for
the short term, while we update connectors to return `resource_path_pointers`
as part of their `spec`.

The new merge behavior allows for other fields (besides those that are part of
`resource_path_pointers`) to differ between the discovered and live bindings
while still being recognized as being "the same".  In practice, this lets
connectors return discovered bindings containing `resource` fields that can be
changed by users, without having those bindings be removed by the next discover
operation.
@psFried psFried force-pushed the phil/discover-merge branch from b36d45a to ff6c0b1 Compare November 21, 2023 18:16
Copy link
Member

@jgraettinger jgraettinger left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

.arg(&self.connector_network),
)
.await?;
let proto_type = match runtime::flow_runtime_protocol(&image_composed).await {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI no action -- we've not really talked about it IIRC -- but I've been thinking / endeavoring to keep the protobuf tags and names equal so that Spec request / response can be a "universal" RPC that's applied without knowing its type a priori.

This is still pretty half baked. I think it would mean, for example, arbitrarily picking Runtime::unary_capture to get a spec response even though you have no idea what the actual connector type is.

(And Derivation connectors do also have a Spec RPC)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, and this code makes the problem pretty apparent. In the case of image connectors, we can know the protocol up front by inspecting the image. But for other (local) connectors, there's no way to tell ahead of time unless we store the configuration elsewhere.

I think it would be nice if we could come up with a way to make spec a "universal" RPC. Another idea would be to pull the protocol from the connector_tags table instead of the other way around. Basically, make whoever deploys the connector specify what type of connector it is.


/// Extracts the value of each of the given `resource_path_pointers` and encodes
/// them into a `ResourcePath`. Each pointer always gets mapped to a string
/// value, with the string `undefined` being used to represent a missing value.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to have a more cohesive story for how ResourcePath above, and the repeated string resource_path = in the connector protocols, become one and the same thing.

Today, resource_path is always a bare string extracted from resource configs in every connector we have, IIRC. I'm thinking we ought to have a glide path where we're no longer asking the connector to produce resource_path values, but if it does, it had better be the same as our computed ResourcePath.

That's in tension, though, with using a JSON string serialization here. I recognize that this doesn't really matter just yet, as this representation only lasts for the duration of a single discovery merge, so no immediate action required.

We could do nothing just yet and land this as-is. I'm cool with that.

A proposal, though: we could require that resource path pointers always point to a string in the resource config, and make it an error if they aren't a string.

If they're missing, they're coerced into the empty string (which I don't necessarily love, but think I would support because of 👇. IMO the ideal would be to error if they're absent)

This would work for every specific connector resource config I'm aware of and would also match the extraction behavior of their various Validate RPCs as they exist today. Once migrated, we can always relax it later...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they're missing, they're coerced into the empty string (which I don't necessarily love, but think I would support because of 👇. IMO the ideal would be to error if they're absent)

I came across a few cases that I think require us to support undefined values. The most notable is airbyte connectors, which would properly define resource_path_pointers as [/stream, /namespace]. The /namespace can be used to disambiguate e.g. tables having the same name in different schemas. But the property itself is optional, and typically doesn't appear.

I could definitely get behind the idea of restricting resource paths to string values, just out of a sense of caution. And same goes for validating that any path that's returned from the connector should match the one we've computed. But I think at this point I'm inclined to merge this as is, and take that as a separate effort.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up here. I confirmed that we do actually have connectors that conditionally use the optional namespace portion of the resource config. Also, the logic that's currently in place for returning the resource_path is:

            if ns := config.get("namespace"):
                resource_path = [ns, config["stream"]]
            else:
                resource_path = [config["stream"]]

I'm thinking that the safe option is to preserve this existing behavior if/when we extract the resource path using resource_path_pointers. If we did, then this rust function should actually look more like:

fn resource_path(
    resource_path_pointers: &[doc::Pointer],
    resource: &serde_json::Value,
) -> anyhow::Result<ResourcePath> {
    let mut path = Vec::new();
    for pointer in resource_path_pointers {
        match pointer.query(resource) {
            Some(serde_json::Value::String(str_val)) => path.push(str_val.to_string()),
            Some(other) => {
                anyhow::bail!("resource path component must be a string, got: {other:?}")
            }
            None => { /* this is fine */ }
        }
    }
    Ok(path)
}

This actually seems more reasonable to me than the current behavior, in terms of how it (doesn't) represent path components that aren't actually defined by the resource. The "undefined" placeholder is smelly, and technically could conflict with an absurdly named postgres schema.

Regarding how we might rectify the resource_path_pointers with the resource_path in Validated:

I'm still of two minds. One the one hand, I think it makes sense to just validate that the resource_path on the Validated response matches the path that's extracted using resource_path_pointers. On the other hand, we don't actually need the resource_path to be part of the Validated response at all, and would be a simplification to deprecate and remove it. If there's value in keeping resource_path in Validated, it seems to be as a sanity check that the current connector code is in agreement with the known spec. That's something, but...

I've also been thinking that we should be extremely careful about any changes to the resource_path, to the point were it probably makes sense for us to validate that a connector_tags job doesn't change an existing value of resource_path_pointers. This is because the resource path is used to key per-binding task states, so any changes to it will obviously cause problems.

So coming full circle, I'm leaning toward:

  • Use the resource_path_pointers to extract the resource path during validation
    • bikeshed: how should we thread through the connector spec, so validation can use the pointers?
  • Deprecate and remove resource_path from the Validated response
  • Short term: validate that resource_path in Validated response, if present, matches the extracted resource path
  • Update the connector_tags job to error out if it would change an existing non-null value of resource_path_pointers

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All of this sounds right to me, including your proposed implementation for resource_path and handling for missing path pointers.

bikeshed: how should we thread through the connector spec, so validation can use the pointers?

I think you're referring to kind-of inverting how validation currently works: instead of the response having a resource path that the connector gives the runtime, instead the request would have a resource path given to the connector by the runtime?

This makes sense to me. As a matter of protocol philosophy (which we've already been living), IMO the runtime should over-communicate to the connector and expect as little as possible in return.

@psFried psFried merged commit 01dbb5b into master Nov 21, 2023
@psFried psFried deleted the phil/discover-merge branch November 21, 2023 19:39
@psFried psFried mentioned this pull request Nov 28, 2023
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants