From c6b110611e8c5aba0a4abbdd78cccc9942ff3760 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Thu, 21 Nov 2024 22:48:46 +0000 Subject: [PATCH 01/20] WIP --- A53-xds-ignore-resource-deletion.md | 1 + A88-xds-data-error-handling.md | 359 ++++++++++++++++++++++++++++ 2 files changed, 360 insertions(+) create mode 100644 A88-xds-data-error-handling.md diff --git a/A53-xds-ignore-resource-deletion.md b/A53-xds-ignore-resource-deletion.md index def859495..fac9ce151 100644 --- a/A53-xds-ignore-resource-deletion.md +++ b/A53-xds-ignore-resource-deletion.md @@ -6,6 +6,7 @@ A53: Option for Ignoring xDS Resource Deletion * Implemented in: C-core, Java * Last updated: 2022-05-19 * Discussion at: https://groups.google.com/g/grpc-io/c/eh7En9CY3oY +* Updated by: [A88: xDS Data Error Handling](A88-xds-data-error-handling.md) ## Abstract diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md new file mode 100644 index 000000000..b8f9d3722 --- /dev/null +++ b/A88-xds-data-error-handling.md @@ -0,0 +1,359 @@ +A88: xDS Data Error Handling +---- +* Author(s): @markdroth +* Approver: @ejona86, @dfawley +* Status: {Draft, In Review, Ready for Implementation, Implemented} +* Implemented in: +* Last updated: 2024-11-21 +* Discussion at: (filled after thread exists) + +## Abstract + +This document proposes adding support for the new xDS error propagation +mechanism described in [xRFC TP3], which provides a cleaner mechanism +for detecting non-existing resources. It also defines improved behavior +for handling all xDS-related data errors. + +## Background + +### xDS Error Handling + +There are two basic categories of errors that can occur in an xDS client: +- **Transient errors:** These are connectivity problems reaching the + xDS server, such as the xDS channel reporting TRANSIENT_FAILURE or the + ADS stream failing without having received a response from the server, + as described in [gRFC A57]. +- **Data errors:** This includes things like resource validation failures + that cause the client to NACK an update. It also includes errors + reported by the control plane via the new mechanism described in + [xRFC TP3], including the case where a subscribed resource does not + exist. + +For either category of error, if the error occurs when we do not yet +have a cached version of the resource, then we have no choice but to +fail data plane RPCs. That is already our behavior today, and this +proposal does not modify that behavior. + +For transient errors, if we do already have a previously cached version +of the resource, the desired behavior is to ignore the error and keep +using the previously cached version of the resource. This is already +our behavior today, and this proposal does not modify that behavior. + +However, for data errors, if we do already have a previously cached +version of the resource, our current behavior is somewhat inconsistent +and ad hoc. It is useful to consider the desired behavior here from +first principles. There are two main approaches that we can take here: + +1. We can ignore the error and keep using the previously cached resource. + This approach is safer from the perspective of helping to avoid an + immediate outage. However, it does rely on the control plane identifying + the problem and alerting humans to take action, or else the problem + becomes completely silent, which might simply defer the outage until + some arbitrary amount of time later when the clients are restarted, + at which point it will be harder to debug. + +2. We can throw away the previously cached resource and act as if we + never saw it, thus failing data plane requests. This makes the problem + immediately obvious to human operators but may cause a user-visible + outage. + +Because different xDS servers may have different capabilities in terms +of how effectively they can alert humans to take action on data errors, +we cannot know which of those two approaches we should take for a given +xDS server. Therefore, a choice between these two approaches is ideally +something that should be configurable on a per-server basis. + +Unfortunately, our current behavior for data errors does not follow the +principles outlined above. Currently, we have special handling for +the resource deletion case: by default, we throw away the previously +cached resource (i.e., approach 2 above), but we provide an option in +the bootstrap config to inhibit this behavior (see [gRFC A53]). However, +all other data errors are actually treated the same as transient errors: +we unconditionally stick with the previously cached resource (i.e., +approach 1 above). + +This proposal changes the data error handling behavior to match the +principles outlined above. + +### The Resource Timer + +The xDS SotW protocol does not currently provide a way for servers to +explicitly inform clients when a subscribed resource does not exist, +so clients have to impose a 15-second does-not-exist timer for each +individual resource when the client initially subscribes to it, as +described in the [xDS protocol +documentation](https://www.envoyproxy.io/docs/envoy/latest/api-docs/xds_protocol#knowing-when-a-requested-resource-does-not-exist). The exact semantics of +this resource timer are tricky to get right, as described in [gRFC A57].[^1] + +[^1]: Note that gRFC A57 primarily addresses transient errors, while the + does-not-exist timer actually reflects a data error. However, + the two needed to be addressed together, since we wanted the timer + to run only when there is no transient error condition. + +We have seen a number of cases where the resource timer has triggered +in production. These cases have all been very hard to debug due to +the lack of a better error message, and in almost all cases there was +at least the suspicion that the timer triggered even when the resource +really did exist due to the xDS server simply being slow. + +As a result of these problems, we want to work toward eliminating this +resource timer in the client in the long term. As a first step toward +that goal, we propose some concrete changes to minimize the impact of +this timer. + +A key part of this is adding support for [xRFC TP3], which proposes +a mechanism for the xDS server to communicate errors to clients in +cases where the server is unable to send a resource that the client has +subscribed to. + +### Related Proposals: +- [gRFC A53: Option for Ignoring xDS Resource Deletion][gRFC A53] +- [gRFC A57: XdsClient Failure Mode Behavior][gRFC A57] +- [xRFC TP3: Error Propagation][xRFC TP3] + +[gRFC A53]: A53-xds-ignore-resource-deletion.md +[gRFC A57]: A57-xds-client-failure-mode-behavior.md +[xRFC TP3]: https://github.com/cncf/xds/pull/107 + +## Proposal + +This proposal has the following parts: +- changes to the resource timer behavior +- a mechanism to control data error handling behavior +- changes in the XdsClient watcher APIs +- support for errors provided by the xDS server as per [xRFC TP3] +- deprecating the "ignore_resource_deletion" server feature + +### Changes to Resource Timer Behavior + +If we know that the xDS server will use [xRFC TP3] to inform us of +non-existing resources, then we no longer need to rely on the resource +timer to detect non-existing resources, so we can instead re-purpose +the timer to detect slow xDS servers. + +To that end, we will introduce a new server feature in the bootstrap +config called "resource_timer_indicates_transient_failure". When this +server feature is present, there will be two changes to the behavior of +the resource timer: +- The timer will be set for 30 seconds instead of 15 seconds. +- When the timer fires, it will be treated as a transient error + instead of a data error. + +This server feature should be used only in cases where the server is known +to support [xRFC TP3]. If the server feature is used when the server +does *not* support [xRFC TP3], then the client will never report a +non-existant resource; it will instead treat a timeout as a transient +error talking to the xDS server. + +Note that the client will honor [xRFC TP3] errors returned by the server +regardless of whether this server feature is present. The presence of +this server feature actually controls only whether we can *depend* on the +server to provide these errors, so that we can be more lax about the +resource timer. + +### Mechanism to Control Data Error Handling Behavior + +As described above, the behavior that we want on data errors depends +on whether we can depend on the control plane to properly alert human +operators when these errors occur. To configure this, we will introduce a +new server feature in the bootstrap config called "fail_on_data_errors", +which indicates that we cannot depend on the control plane to properly +alert human operators. + +This server feature controls the behavior for handling data errors when +we already have a cached version of the resource. By default, we will +ignore the error and continue using the cached version of the resource. +However, if this server feature is set, then we will instead discard any +previously cached version of the resource and start failing data plane +RPCs. + +### Changes to XdsClient Watcher APIs + +The current `XdsClient` API has two error-handling methods: +- `OnError(Status status)`: Invoked for all transient errors and for some data + errors (NACKs). The watcher is generally expected to ignore the error + if it already has a valid cached resource. +- `OnResourceDoesNotExist()`: Invoked specifically for the does-not-exist + case. The watcher is generally expected to stop using any previously + cached resource and put itself into a failing state. + - Note: The mechanism described in [gRFC A53] inhibits these + notifications for only the case where we already had a cached version + of the resource, which in the SotW protocol applies only to LDS + and CDS. That mechanism does not affect the case where we do not + already have a cached version of the resource, which is where the + does-not-exist timer is used. + +These methods do not map well to the two error categories defined above. + +This proposal replaces those two methods with the following new methods, +which do map cleanly to the two error categories defined above: +- `OnTransientError(Status status)`: Will be invoked for transient errors only. + The watcher is generally expected to ignore the error if it already has + a valid cached resource. +- `OnDataError(Status status, bool fail_on_data_errors)`: Will be invoked for + data errors only. The `fail_on_data_errors` parameter indicates + whether the server feature of the same name was present in the + bootstrap configuration for the xDS server that caused the data error. + If the watcher already has a valid resource, it will use the two + parameters to determine whether it should continue using that resource + or whether it should stop using that resource and put itself into a + failing state. + +For transient errors, we will now call `OnTransientError()` instead of +`OnError()`, but the semantics and behavior will remain the same. + +For data errors, we will now call `OnDataError()` instead of `OnError()` +or `OnResourceDoesNotExist()`. The semantics here will depend on the +bootstrap configuration and, in the case of a status returned by the +xDS server via the mechanism described in [xRFC TP3], the status code. + + + + + +Watchers will be expected to check the +status code to determine how to handle the error. As per pending xRFC +TP3, the status codes NOT_FOUND and PERMISSION_DENIED will cause the +watcher to drop any previously cached version of the resource and put +themselves in a failing state. All other status codes will be ignored +if the watcher already has a previously cached version of the resource. + + +### Support for Errors Provided by the xDS server as per [xRFC TP3] + +We will unconditionally honor the new fields added in [xRFC TP3]. + +When we receive an error for a given resource, we will cancel the resource +timer, if any, and report the error to the watchers' `OnDataError()` +methods. + +### Deprecating "ignore_resource_deletion" Server Feature + +Because we now want the ability to control the behavior of all data +errors consistently rather than special-casing resource deletion, we +will deprecate the "ignore_resource_deletion" server feature that was +added in [gRFC A53]. + +For existing bootstrap configs that have the "ignore_resource_deletion" +server feature, gRPC will ignore that server feature, but the new +default behavior is to ignore all data errors, including resource +deletion, so they will still get the right behavior. + +For existing bootstrap configs that do *not* have the +"ignore_resource_deletion" server feature, the behavior will change: we +will now ignore all data errors (including resource deletions) by default. +If the client wants to instead drop existing cached resources and start +failing data plane RPCs on data errors, they may update their bootstrap +to use the new "fail_on_data_errors" server feature instead. + +### Summary of Behavior Changes + +The following table shows the old and new behavior for each case. + +Data Error | fail_on_data_errors Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior +---------- | ---------------------------------- | ------------------------ | ------------ | ------------------------ | ------------ +NACK from client | false | `OnError(status)` | Ignore | `OnDataError(status, false)` | Ignore if already have resource +NACK from client | true | `OnError(status)` | Ignore | `OnDataError(status, true)` | Drop existing resource and fail RPCs +Does-not-exist timeout | false | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Ignore +Does-not-exist timeout | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnDataError(status, true)` | Drop existing resource and fail RPCs +LDS or CDS resource deletion from server | false | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Ignore +LDS or CDS resource deletion from server | true | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Drop existing resource and fail RPCs +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | N/A | N/A | `OnDataError(status, false)` | Ignore +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | N/A | N/A | `OnDataError(status, true)` | Drop existing resource and fail RPCs +[xRFC TP3] error with other status | false | N/A | N/A | `OnDataError(status, false)` | Ignore +[xRFC TP3] error with other status | true | N/A | N/A | `OnDataError(status, true)` | Ignore + +### Temporary environment variable protection + +[Name the environment variable(s) used to enable/disable the feature(s) this proposal introduces and their default(s). Generally, features that are enabled by I/O should include this type of control until they have passed some testing criteria, which should also be detailed here. This section may be omitted if there are none.] + +## Rationale + +[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] + + +## Implementation + +[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.] + +## Open issues (if applicable) + +[A discussion of issues relating to this proposal for which the author does not know the solution. This section may be omitted if there are none.] + +## Future Direction + +To avoid problems where newly started clients fail when the control plane +is down, we have designed xDS fallback functionality, as described in +[gRFC A71](A71-xds-fallback.md). However, because we do not want to +trigger fallback due to an individual resource not existing if the control +plane is actually responding, we do not currently trigger fallback when +the resource timer fires if we are still connected to the control plane. +This has led to concerns about failure modes where control plane slowness +may cause clients to act as if resources do not exist rather than using a +fallback server. Ideally, it seems like that we would like to use a timer +to consider the xDS server connection to have failed if we haven't gotten +back a response in a certain amount of time, but we cannot do that while +we are depending on the timer to detect individual resources not existing. + +As future work, we should consider whether there are ways to safely +trigger xDS fallback if the primary server is responding too slowly. +There are a number of challenges here: + +- Given the number of issues we've seen where the control plane has been + slow to respond, it's not clear to me that we would want all of those + cases to trigger fallback, since we'd likely hit fallback way too often. + I suspect that we would first need the control plane to monitor and provide + an SLO for response time upon initial subscription to a resource. + +- The current timer semantics are that we start a separate timer for each + individual resource when it is initially subscribed to. This ensures + that any control plane slowness that affects the client's functionality + at any time during the stream will cause us to see a problem. + However, given that our criteria for switching back to the primary + server is when we get a single resource back from the server, there + is an incompatibility here: if the control plane is sending back at + least one resource but is timing out for others, then we will have + simultaneously met the criteria for using fallback and the criteria + for switching back to the primary server. + +- Given the above, it might seem beneficial to consider changing the timer + semantics to just check for a timeout on any xDS response, rather + than having a separate timer per resource. Unfortunately, this won't + actually work, because the xDS protocol does not require any minimum + interval between responses. Once the server has sent all of the + resources to the client, if those resources never change, then the + server never needs to send another response, and we would not want to + trigger fallback in this case. + +- As a middle ground, we could consider only having a timeout on the very + first response on the ADS stream to trigger fallback, or even the + very first response for each resource type. However, this would + still leave us blind to cases where the control plane is slow when new + subscriptions are added later in the stream's lifetime, which we have + seen in the wild. + +- Even if we could identify the right timeout criteria to trigger fallback, + we would need to decide whether or not to close the ADS stream to + the primary, and both options here are problematic. If we don't + close the stream to the primary while the primary is timing out for + some resources but still sending updates for others, then it's not + clear how we decide when to switch back to the primary. However, + if we do close the stream and retry, and the new stream gets back at + least one response before hitting another timeout, then we will wind + up continually flapping between the primary and fallbacks servers, + thus DoSing the fallback server with short-lived connections. + +- Note: There's a pre-existing case here where the primary server + can send an initial response and then the stream can die, and if that + happens repeatedly over and over again, it will never cause us to + trigger fallback, because we don't consider the stream to have failed + if we got a response on it before it died. That also means that we + don't impose exponential backoff between stream attempts, so we could + DoS the server. But that problem already exists and is not new here. + +It's not at all clear what the actual requirements are here, nor is it +clear what timer-based behavior can actually be safely used to trigger +xDS fallback, so we'll need a lot more discussion to come to consensus +on what we need here. As a result, I am proposing that we proceed +with only the proposal above and handle this potential improvement as +a subsequent effort. From 040623fbe18b1e3712fb8cb3e4ea4421e799b3f5 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 22 Nov 2024 01:42:09 +0000 Subject: [PATCH 02/20] more details --- A88-xds-data-error-handling.md | 180 +++++++++++++++++++++++---------- 1 file changed, 125 insertions(+), 55 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index b8f9d3722..5b2fc847c 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -16,6 +16,19 @@ for handling all xDS-related data errors. ## Background +### Reporting Errors from the xDS Server + +Currently, the xDS transport protocol does not provide a way for the +server to report per-resource errors to the client. The only thing that +the server can do is react as if the resource does not exist, which +means that the client lacks an informative error message that could be +used for debugging purposes. + +[xRFC TP3] provides a mechanism for the xDS server to send per-resource +errors to the client. This will cover both cases like the resource not +existing (which also allows the client to detect this case more quickly, +as described below) and other errors, such as permission problems. + ### xDS Error Handling There are two basic categories of errors that can occur in an xDS client: @@ -87,7 +100,7 @@ this resource timer are tricky to get right, as described in [gRFC A57].[^1] [^1]: Note that gRFC A57 primarily addresses transient errors, while the does-not-exist timer actually reflects a data error. However, - the two needed to be addressed together, since we wanted the timer + the two needed to be addressed together, since we want the timer to run only when there is no transient error condition. We have seen a number of cases where the resource timer has triggered @@ -137,7 +150,8 @@ server feature is present, there will be two changes to the behavior of the resource timer: - The timer will be set for 30 seconds instead of 15 seconds. - When the timer fires, it will be treated as a transient error - instead of a data error. + instead of a data error. (See below for how this is reflected in the + watcher API.) This server feature should be used only in cases where the server is known to support [xRFC TP3]. If the server feature is used when the server @@ -170,9 +184,11 @@ RPCs. ### Changes to XdsClient Watcher APIs The current `XdsClient` API has two error-handling methods: -- `OnError(Status status)`: Invoked for all transient errors and for some data - errors (NACKs). The watcher is generally expected to ignore the error - if it already has a valid cached resource. + +- `OnError(Status status)`: Invoked for all transient errors and for some + data errors (NACKs). The watcher is generally expected to ignore the + error if it already has a valid cached resource. + - `OnResourceDoesNotExist()`: Invoked specifically for the does-not-exist case. The watcher is generally expected to stop using any previously cached resource and put itself into a failing state. @@ -186,47 +202,92 @@ The current `XdsClient` API has two error-handling methods: These methods do not map well to the two error categories defined above. This proposal replaces those two methods with the following new methods, -which do map cleanly to the two error categories defined above: -- `OnTransientError(Status status)`: Will be invoked for transient errors only. - The watcher is generally expected to ignore the error if it already has - a valid cached resource. -- `OnDataError(Status status, bool fail_on_data_errors)`: Will be invoked for - data errors only. The `fail_on_data_errors` parameter indicates +which do map more cleanly to the two error categories defined above: + +- `OnTransientError(Status status)`: Will be invoked for transient errors + only. This includes the ADS channel reporting TRANSIENT_FAILURE and + the ADS stream terminating without receiving a response, as described + in [gRFC A57]. It also includes the resource timer firing if the + "resource_timer_indicates_transient_failure" server feature is present. + The watcher is generally expected to ignore the error if it already + has a valid cached resource. + +- `OnClientDataError(Status status, bool fail_on_data_errors)`: Will be + invoked for client-generated data errors only. This is used when + we NACK a resource. The `fail_on_data_errors` parameter indicates whether the server feature of the same name was present in the - bootstrap configuration for the xDS server that caused the data error. - If the watcher already has a valid resource, it will use the two - parameters to determine whether it should continue using that resource - or whether it should stop using that resource and put itself into a - failing state. - -For transient errors, we will now call `OnTransientError()` instead of -`OnError()`, but the semantics and behavior will remain the same. - -For data errors, we will now call `OnDataError()` instead of `OnError()` -or `OnResourceDoesNotExist()`. The semantics here will depend on the -bootstrap configuration and, in the case of a status returned by the -xDS server via the mechanism described in [xRFC TP3], the status code. - - - - - -Watchers will be expected to check the -status code to determine how to handle the error. As per pending xRFC -TP3, the status codes NOT_FOUND and PERMISSION_DENIED will cause the -watcher to drop any previously cached version of the resource and put -themselves in a failing state. All other status codes will be ignored -if the watcher already has a previously cached version of the resource. - + bootstrap configuration for the xDS server that caused the data + error. If the watcher already has a valid resource, it will use the + `fail_on_data_errors` parameter to determine whether it should continue + using that resource or whether it should stop using that resource and + put itself into a failing state. + +- `OnServerDataError(Status status, bool fail_on_data_errors)`: Will be + invoked for server-generated data errors only. This is used for + [xRFC TP3] errors and when the does-not-exist timer fires when the + "resource_timer_indicates_transient_failure" server feature is *not* + present. The `fail_on_data_errors` parameter indicates whether + the server feature of the same name was present in the bootstrap + configuration for the xDS server that caused the data error. If the + watcher already has a valid resource, it will use the two parameters to + determine whether it should continue using that resource or whether it + should stop using that resource and put itself into a failing state. + See [xRFC TP3] for guidance on what status codes should be used to + decide to drop an existing resource. + +For transient connectivity errors, including the ADS channel reporting +TRANSIENT_FAILURE and the ADS stream terminating without receiving a +response (as described in [gRFC A57]), we will call `OnTransientError()` +instead of `OnError()`, but the semantics and behavior will remain +the same. The watcher will ignore these errors if there is a previously +cached version of the resource. + +For NACKs, we will call `OnClientDataError()`. The watcher behavior will +depend on the presence of the "fail_on_data_errors" server feature in +the bootstrap configuration: if the server feature is present, we will +drop any previously cached resource and fail data plane RPCs; otherwise, +we will continue using the previously cached resource if we have one. + +For the resource timer, the behavior will depend on the presence of the +"resource_timer_indicates_transient_failure" server feature in the +bootstrap configuration. If that server feature is present, then when +the timer fires, we will call `OnTransientError()` with a status code +of UNAVAILABLE. If the server feature is *not* present, then when +the timer fires, we will call `OnServerDataError()` with a status code +of NOT_FOUND. In either case, since we use this timer only when we do +not have a previously cached version of the resource, the result will +be the same: we will fail data plane RPCs. + +For [xRFC TP3] errors, we will call `OnServerDataError()`. The watcher +behavior will depend on the value of the "fail_on_data_errors" parameter +and the status code. In gRPC, our watchers will throw away a previously +seen resource if `fail_on_data_errors` is true and the status code is +either NOT_FOUND or PERMISSION_DENIED; in any other case, we will retain +the previously seen resource, if any. + +TODO: Do we need to batch updates to the watcher somehow for the case +where we've got a previous version of a resource cached plus an error? +Previously, that was okay, because we'd just deliver both updates to the +watcher, one at a time. But if the watcher is going to decide to throw +away the cached resource when it sees the error, maybe we need to +deliver both at once, in case a new watcher starts when we've already +got both the cached resource and the error? ### Support for Errors Provided by the xDS server as per [xRFC TP3] -We will unconditionally honor the new fields added in [xRFC TP3]. +We will add support for the new response fields added in [xRFC TP3]. +There will be no configuration telling gRPC to use these fields; we will +unconditionally use them if they are present. When we receive an error for a given resource, we will cancel the resource timer, if any, and report the error to the watchers' `OnDataError()` methods. +In the XdsClient cache, when we receive an error for a resource, we will +record the error but still keep the previous version of the resource +cached. We will record the cache status as `RECEIVED_ERROR`, which will +be reported in CSDS. + ### Deprecating "ignore_resource_deletion" Server Feature Because we now want the ability to control the behavior of all data @@ -250,37 +311,46 @@ to use the new "fail_on_data_errors" server feature instead. The following table shows the old and new behavior for each case. -Data Error | fail_on_data_errors Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior ----------- | ---------------------------------- | ------------------------ | ------------ | ------------------------ | ------------ -NACK from client | false | `OnError(status)` | Ignore | `OnDataError(status, false)` | Ignore if already have resource -NACK from client | true | `OnError(status)` | Ignore | `OnDataError(status, true)` | Drop existing resource and fail RPCs -Does-not-exist timeout | false | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Ignore -Does-not-exist timeout | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnDataError(status, true)` | Drop existing resource and fail RPCs -LDS or CDS resource deletion from server | false | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Ignore -LDS or CDS resource deletion from server | true | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Drop existing resource and fail RPCs -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | N/A | N/A | `OnDataError(status, false)` | Ignore -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | N/A | N/A | `OnDataError(status, true)` | Drop existing resource and fail RPCs -[xRFC TP3] error with other status | false | N/A | N/A | `OnDataError(status, false)` | Ignore -[xRFC TP3] error with other status | true | N/A | N/A | `OnDataError(status, true)` | Ignore +Data Error | fail_on_data_errors Server Feature | resource_timer_indicates_transient_failure Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior +---------- | ---------------------------------- | --------------------------------------------------------- | ------------------------ | ------------ | ------------------------ | ------------ +NACK from client | false | | `OnError(status)` | Ignore | `OnClientDataError(status, false)` | Ignore if already have resource +NACK from client | true | | `OnError(status)` | Ignore | `OnClientDataError(status, true)` | Drop existing resource and fail RPCs +Resource timeout | false | false | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnClientDataError(Status(NOT_FOUND), false)` | Ignore +Resource timeout | true | false | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnClientDataError(Status(NOT_FOUND), true)` | Drop existing resource and fail RPCs +Resource timeout | false | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnTransientError(Status(NOT_FOUND), false)` | Ignore +Resource timeout | true | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnTransientError(Status(NOT_FOUND), true)` | Ignore +LDS or CDS resource deletion from server | false | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Ignore +LDS or CDS resource deletion from server | true | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Drop existing resource and fail RPCs +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | | N/A | N/A | `OnDataError(status, false)` | Ignore +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | | N/A | N/A | `OnDataError(status, true)` | Drop existing resource and fail RPCs +[xRFC TP3] error with other status | false | | N/A | N/A | `OnDataError(status, false)` | Ignore +[xRFC TP3] error with other status | true | | N/A | N/A | `OnDataError(status, true)` | Ignore ### Temporary environment variable protection -[Name the environment variable(s) used to enable/disable the feature(s) this proposal introduces and their default(s). Generally, features that are enabled by I/O should include this type of control until they have passed some testing criteria, which should also be detailed here. This section may be omitted if there are none.] +The new functionality will be guarded by the +`GRPC_EXPERIMENTAL_XDS_DATA_ERROR_HANDLING` env var until it passes +interop testing. This env var will guard using the new response fields +from [xRFC TP3] and the new server features "fail_on_data_errors" and +"resource_timer_indicates_transient_failure". + +TODO: figure out how to handle legacy "ignore_resource_deletion" server +feature for env var ## Rationale +TODO: Anything we need to note here? + [A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] ## Implementation -[A description of the steps in the implementation, who will do them, and when. If a particular language is going to get the implementation first, this section should list the proposed order.] - -## Open issues (if applicable) +@markdroth will implement in C-core. -[A discussion of issues relating to this proposal for which the author does not know the solution. This section may be omitted if there are none.] +Will also be implemented in Java, Go, and Node. -## Future Direction +## Open issues To avoid problems where newly started clients fail when the control plane is down, we have designed xDS fallback functionality, as described in From 5f80d8154319cf4b2dd12ce3a0aef5b48516d90c Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 22 Nov 2024 01:44:42 +0000 Subject: [PATCH 03/20] update method names --- A88-xds-data-error-handling.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 5b2fc847c..70c7c52b7 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -280,7 +280,7 @@ There will be no configuration telling gRPC to use these fields; we will unconditionally use them if they are present. When we receive an error for a given resource, we will cancel the resource -timer, if any, and report the error to the watchers' `OnDataError()` +timer, if any, and report the error to the watchers' `OnServerDataError()` methods. In the XdsClient cache, when we receive an error for a resource, we will @@ -319,12 +319,12 @@ Resource timeout | false | false | `OnResourceDoesNotExist()` | Drop existing re Resource timeout | true | false | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnClientDataError(Status(NOT_FOUND), true)` | Drop existing resource and fail RPCs Resource timeout | false | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnTransientError(Status(NOT_FOUND), false)` | Ignore Resource timeout | true | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnTransientError(Status(NOT_FOUND), true)` | Ignore -LDS or CDS resource deletion from server | false | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Ignore -LDS or CDS resource deletion from server | true | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnDataError(status, false)` | Drop existing resource and fail RPCs -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | | N/A | N/A | `OnDataError(status, false)` | Ignore -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | | N/A | N/A | `OnDataError(status, true)` | Drop existing resource and fail RPCs -[xRFC TP3] error with other status | false | | N/A | N/A | `OnDataError(status, false)` | Ignore -[xRFC TP3] error with other status | true | | N/A | N/A | `OnDataError(status, true)` | Ignore +LDS or CDS resource deletion from server | false | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnServerDataError(status, false)` | Ignore +LDS or CDS resource deletion from server | true | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnServerDataError(status, false)` | Drop existing resource and fail RPCs +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | | N/A | N/A | `OnServerDataError(status, false)` | Ignore +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | | N/A | N/A | `OnServerDataError(status, true)` | Drop existing resource and fail RPCs +[xRFC TP3] error with other status | false | | N/A | N/A | `OnServerDataError(status, false)` | Ignore +[xRFC TP3] error with other status | true | | N/A | N/A | `OnServerDataError(status, true)` | Ignore ### Temporary environment variable protection From 4903bf62fc4ecd96e38d5178a33df43beea6c243 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 7 Dec 2024 00:06:27 +0000 Subject: [PATCH 04/20] fix table entries for resource timer --- A88-xds-data-error-handling.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 70c7c52b7..26540c36b 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -315,10 +315,10 @@ Data Error | fail_on_data_errors Server Feature | resource_timer_indicates_trans ---------- | ---------------------------------- | --------------------------------------------------------- | ------------------------ | ------------ | ------------------------ | ------------ NACK from client | false | | `OnError(status)` | Ignore | `OnClientDataError(status, false)` | Ignore if already have resource NACK from client | true | | `OnError(status)` | Ignore | `OnClientDataError(status, true)` | Drop existing resource and fail RPCs -Resource timeout | false | false | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnClientDataError(Status(NOT_FOUND), false)` | Ignore -Resource timeout | true | false | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnClientDataError(Status(NOT_FOUND), true)` | Drop existing resource and fail RPCs -Resource timeout | false | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnTransientError(Status(NOT_FOUND), false)` | Ignore -Resource timeout | true | true | `OnResourceDoesNotExist()` | Drop existing resource and fail RPCs | `OnTransientError(Status(NOT_FOUND), true)` | Ignore +Resource timeout | false | false | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnClientDataError(Status(NOT_FOUND), false)` | Fail RPCs (no existing resource) +Resource timeout | true | false | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnClientDataError(Status(NOT_FOUND), true)` | Fail RPCs (no existing resource) +Resource timeout | false | true | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnTransientError(Status(NOT_FOUND), false)` | Fail RPCs (no existing resource) +Resource timeout | true | true | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnTransientError(Status(NOT_FOUND), true)` | Fail RPCs (no existing resource) LDS or CDS resource deletion from server | false | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnServerDataError(status, false)` | Ignore LDS or CDS resource deletion from server | true | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnServerDataError(status, false)` | Drop existing resource and fail RPCs [xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | | N/A | N/A | `OnServerDataError(status, false)` | Ignore From e68b3ed054b6d06ad378eda00c824b8bf7f7ea42 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 7 Dec 2024 00:25:35 +0000 Subject: [PATCH 05/20] shorten server feature name --- A88-xds-data-error-handling.md | 42 +++++++++++++++++----------------- 1 file changed, 21 insertions(+), 21 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 26540c36b..52d105927 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-11-21 +* Last updated: 2024-12-06 * Discussion at: (filled after thread exists) ## Abstract @@ -145,9 +145,9 @@ timer to detect non-existing resources, so we can instead re-purpose the timer to detect slow xDS servers. To that end, we will introduce a new server feature in the bootstrap -config called "resource_timer_indicates_transient_failure". When this -server feature is present, there will be two changes to the behavior of -the resource timer: +config called "resource_timer_is_transient_error". When this server +feature is present, there will be two changes to the behavior of the +resource timer: - The timer will be set for 30 seconds instead of 15 seconds. - When the timer fires, it will be treated as a transient error instead of a data error. (See below for how this is reflected in the @@ -208,9 +208,9 @@ which do map more cleanly to the two error categories defined above: only. This includes the ADS channel reporting TRANSIENT_FAILURE and the ADS stream terminating without receiving a response, as described in [gRFC A57]. It also includes the resource timer firing if the - "resource_timer_indicates_transient_failure" server feature is present. - The watcher is generally expected to ignore the error if it already - has a valid cached resource. + "resource_timer_is_transient_error" server feature is present. The + watcher is generally expected to ignore the error if it already has a + valid cached resource. - `OnClientDataError(Status status, bool fail_on_data_errors)`: Will be invoked for client-generated data errors only. This is used when @@ -225,15 +225,15 @@ which do map more cleanly to the two error categories defined above: - `OnServerDataError(Status status, bool fail_on_data_errors)`: Will be invoked for server-generated data errors only. This is used for [xRFC TP3] errors and when the does-not-exist timer fires when the - "resource_timer_indicates_transient_failure" server feature is *not* - present. The `fail_on_data_errors` parameter indicates whether - the server feature of the same name was present in the bootstrap - configuration for the xDS server that caused the data error. If the - watcher already has a valid resource, it will use the two parameters to - determine whether it should continue using that resource or whether it - should stop using that resource and put itself into a failing state. - See [xRFC TP3] for guidance on what status codes should be used to - decide to drop an existing resource. + "resource_timer_is_transient_error" server feature is *not* present. + The `fail_on_data_errors` parameter indicates whether the server feature + of the same name was present in the bootstrap configuration for the + xDS server that caused the data error. If the watcher already has a + valid resource, it will use the two parameters to determine whether it + should continue using that resource or whether it should stop using + that resource and put itself into a failing state. See [xRFC TP3] + for guidance on what status codes should be used to decide to drop an + existing resource. For transient connectivity errors, including the ADS channel reporting TRANSIENT_FAILURE and the ADS stream terminating without receiving a @@ -249,9 +249,9 @@ drop any previously cached resource and fail data plane RPCs; otherwise, we will continue using the previously cached resource if we have one. For the resource timer, the behavior will depend on the presence of the -"resource_timer_indicates_transient_failure" server feature in the -bootstrap configuration. If that server feature is present, then when -the timer fires, we will call `OnTransientError()` with a status code +"resource_timer_is_transient_error" server feature in the bootstrap +configuration. If that server feature is present, then when the +timer fires, we will call `OnTransientError()` with a status code of UNAVAILABLE. If the server feature is *not* present, then when the timer fires, we will call `OnServerDataError()` with a status code of NOT_FOUND. In either case, since we use this timer only when we do @@ -311,7 +311,7 @@ to use the new "fail_on_data_errors" server feature instead. The following table shows the old and new behavior for each case. -Data Error | fail_on_data_errors Server Feature | resource_timer_indicates_transient_failure Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior +Data Error | fail_on_data_errors Server Feature | resource_timer_is_transient_error Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior ---------- | ---------------------------------- | --------------------------------------------------------- | ------------------------ | ------------ | ------------------------ | ------------ NACK from client | false | | `OnError(status)` | Ignore | `OnClientDataError(status, false)` | Ignore if already have resource NACK from client | true | | `OnError(status)` | Ignore | `OnClientDataError(status, true)` | Drop existing resource and fail RPCs @@ -332,7 +332,7 @@ The new functionality will be guarded by the `GRPC_EXPERIMENTAL_XDS_DATA_ERROR_HANDLING` env var until it passes interop testing. This env var will guard using the new response fields from [xRFC TP3] and the new server features "fail_on_data_errors" and -"resource_timer_indicates_transient_failure". +"resource_timer_is_transient_error". TODO: figure out how to handle legacy "ignore_resource_deletion" server feature for env var From 57adb03677ab63be7480861a478ca9a837a92ceb Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 9 Dec 2024 21:11:05 +0000 Subject: [PATCH 06/20] new watcher approach --- A88-xds-data-error-handling.md | 325 ++++++++++++++++----------------- 1 file changed, 159 insertions(+), 166 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 52d105927..dba2efc92 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-12-06 +* Last updated: 2024-12-09 * Discussion at: (filled after thread exists) ## Abstract @@ -12,7 +12,8 @@ A88: xDS Data Error Handling This document proposes adding support for the new xDS error propagation mechanism described in [xRFC TP3], which provides a cleaner mechanism for detecting non-existing resources. It also defines improved behavior -for handling all xDS-related data errors. +for handling all xDS-related data errors, including improved behavior +for the xDS resource timer. ## Background @@ -35,12 +36,14 @@ There are two basic categories of errors that can occur in an xDS client: - **Transient errors:** These are connectivity problems reaching the xDS server, such as the xDS channel reporting TRANSIENT_FAILURE or the ADS stream failing without having received a response from the server, - as described in [gRFC A57]. + as described in [gRFC A57]. It also includes errors reported by the + control plane via the new mechanism described in [xRFC TP3] with a + status code other than NOT_FOUND or PERMISSION_DENIED. - **Data errors:** This includes things like resource validation failures - that cause the client to NACK an update. It also includes errors - reported by the control plane via the new mechanism described in - [xRFC TP3], including the case where a subscribed resource does not - exist. + that cause the client to NACK an update and resource deletions sent + by the xDS server. It also includes errors reported by the control + plane via the new mechanism described in [xRFC TP3] with a status code + of NOT_FOUND or PERMISSION_DENIED. For either category of error, if the error occurs when we do not yet have a cached version of the resource, then we have no choice but to @@ -57,13 +60,14 @@ version of the resource, our current behavior is somewhat inconsistent and ad hoc. It is useful to consider the desired behavior here from first principles. There are two main approaches that we can take here: -1. We can ignore the error and keep using the previously cached resource. - This approach is safer from the perspective of helping to avoid an - immediate outage. However, it does rely on the control plane identifying - the problem and alerting humans to take action, or else the problem - becomes completely silent, which might simply defer the outage until - some arbitrary amount of time later when the clients are restarted, - at which point it will be harder to debug. +1. We can ignore the error and keep using the previously cached resource, + in effect treating it as a transient error. This approach is safer + from the perspective of helping to avoid an immediate outage. However, + it does rely on the control plane identifying the problem and alerting + humans to take action, or else the problem becomes completely silent, + which might simply defer the outage until some arbitrary amount of + time later when the clients are restarted, at which point it will be + harder to debug. 2. We can throw away the previously cached resource and act as if we never saw it, thus failing data plane requests. This makes the problem @@ -79,9 +83,9 @@ something that should be configurable on a per-server basis. Unfortunately, our current behavior for data errors does not follow the principles outlined above. Currently, we have special handling for the resource deletion case: by default, we throw away the previously -cached resource (i.e., approach 2 above), but we provide an option in -the bootstrap config to inhibit this behavior (see [gRFC A53]). However, -all other data errors are actually treated the same as transient errors: +cached resource (i.e., approach 2 above), but we provide an option +in the bootstrap config to inhibit this behavior (see [gRFC A53]). +However, all other data errors are actually treated as transient errors: we unconditionally stick with the previously cached resource (i.e., approach 1 above). @@ -122,68 +126,29 @@ subscribed to. ### Related Proposals: - [gRFC A53: Option for Ignoring xDS Resource Deletion][gRFC A53] - [gRFC A57: XdsClient Failure Mode Behavior][gRFC A57] +- [gRFC A40: xDS Configuration Dump via Client Status Discovery Service][gRFC A40] - [xRFC TP3: Error Propagation][xRFC TP3] [gRFC A53]: A53-xds-ignore-resource-deletion.md [gRFC A57]: A57-xds-client-failure-mode-behavior.md +[gRFC A40]: A40-csds-support.md [xRFC TP3]: https://github.com/cncf/xds/pull/107 ## Proposal This proposal has the following parts: -- changes to the resource timer behavior -- a mechanism to control data error handling behavior - changes in the XdsClient watcher APIs +- handling data errors and transient errors - support for errors provided by the xDS server as per [xRFC TP3] +- changes to the resource timer behavior - deprecating the "ignore_resource_deletion" server feature -### Changes to Resource Timer Behavior - -If we know that the xDS server will use [xRFC TP3] to inform us of -non-existing resources, then we no longer need to rely on the resource -timer to detect non-existing resources, so we can instead re-purpose -the timer to detect slow xDS servers. - -To that end, we will introduce a new server feature in the bootstrap -config called "resource_timer_is_transient_error". When this server -feature is present, there will be two changes to the behavior of the -resource timer: -- The timer will be set for 30 seconds instead of 15 seconds. -- When the timer fires, it will be treated as a transient error - instead of a data error. (See below for how this is reflected in the - watcher API.) - -This server feature should be used only in cases where the server is known -to support [xRFC TP3]. If the server feature is used when the server -does *not* support [xRFC TP3], then the client will never report a -non-existant resource; it will instead treat a timeout as a transient -error talking to the xDS server. - -Note that the client will honor [xRFC TP3] errors returned by the server -regardless of whether this server feature is present. The presence of -this server feature actually controls only whether we can *depend* on the -server to provide these errors, so that we can be more lax about the -resource timer. - -### Mechanism to Control Data Error Handling Behavior - -As described above, the behavior that we want on data errors depends -on whether we can depend on the control plane to properly alert human -operators when these errors occur. To configure this, we will introduce a -new server feature in the bootstrap config called "fail_on_data_errors", -which indicates that we cannot depend on the control plane to properly -alert human operators. - -This server feature controls the behavior for handling data errors when -we already have a cached version of the resource. By default, we will -ignore the error and continue using the cached version of the resource. -However, if this server feature is set, then we will instead discard any -previously cached version of the resource and start failing data plane -RPCs. - ### Changes to XdsClient Watcher APIs -The current `XdsClient` API has two error-handling methods: +The current `XdsClient` API has three methods: + +- `OnResourceChanged(ResourceType resource)`: Invoked whenever a new + version of the resource is received from the xDS server. - `OnError(Status status)`: Invoked for all transient errors and for some data errors (NACKs). The watcher is generally expected to ignore the @@ -199,79 +164,73 @@ The current `XdsClient` API has two error-handling methods: already have a cached version of the resource, which is where the does-not-exist timer is used. -These methods do not map well to the two error categories defined above. - -This proposal replaces those two methods with the following new methods, -which do map more cleanly to the two error categories defined above: - -- `OnTransientError(Status status)`: Will be invoked for transient errors - only. This includes the ADS channel reporting TRANSIENT_FAILURE and - the ADS stream terminating without receiving a response, as described - in [gRFC A57]. It also includes the resource timer firing if the - "resource_timer_is_transient_error" server feature is present. The - watcher is generally expected to ignore the error if it already has a - valid cached resource. - -- `OnClientDataError(Status status, bool fail_on_data_errors)`: Will be - invoked for client-generated data errors only. This is used when - we NACK a resource. The `fail_on_data_errors` parameter indicates - whether the server feature of the same name was present in the - bootstrap configuration for the xDS server that caused the data - error. If the watcher already has a valid resource, it will use the - `fail_on_data_errors` parameter to determine whether it should continue - using that resource or whether it should stop using that resource and - put itself into a failing state. - -- `OnServerDataError(Status status, bool fail_on_data_errors)`: Will be - invoked for server-generated data errors only. This is used for - [xRFC TP3] errors and when the does-not-exist timer fires when the - "resource_timer_is_transient_error" server feature is *not* present. - The `fail_on_data_errors` parameter indicates whether the server feature - of the same name was present in the bootstrap configuration for the - xDS server that caused the data error. If the watcher already has a - valid resource, it will use the two parameters to determine whether it - should continue using that resource or whether it should stop using - that resource and put itself into a failing state. See [xRFC TP3] - for guidance on what status codes should be used to decide to drop an - existing resource. - -For transient connectivity errors, including the ADS channel reporting -TRANSIENT_FAILURE and the ADS stream terminating without receiving a -response (as described in [gRFC A57]), we will call `OnTransientError()` -instead of `OnError()`, but the semantics and behavior will remain -the same. The watcher will ignore these errors if there is a previously -cached version of the resource. - -For NACKs, we will call `OnClientDataError()`. The watcher behavior will -depend on the presence of the "fail_on_data_errors" server feature in -the bootstrap configuration: if the server feature is present, we will -drop any previously cached resource and fail data plane RPCs; otherwise, -we will continue using the previously cached resource if we have one. - -For the resource timer, the behavior will depend on the presence of the -"resource_timer_is_transient_error" server feature in the bootstrap -configuration. If that server feature is present, then when the -timer fires, we will call `OnTransientError()` with a status code -of UNAVAILABLE. If the server feature is *not* present, then when -the timer fires, we will call `OnServerDataError()` with a status code -of NOT_FOUND. In either case, since we use this timer only when we do -not have a previously cached version of the resource, the result will -be the same: we will fail data plane RPCs. - -For [xRFC TP3] errors, we will call `OnServerDataError()`. The watcher -behavior will depend on the value of the "fail_on_data_errors" parameter -and the status code. In gRPC, our watchers will throw away a previously -seen resource if `fail_on_data_errors` is true and the status code is -either NOT_FOUND or PERMISSION_DENIED; in any other case, we will retain -the previously seen resource, if any. - -TODO: Do we need to batch updates to the watcher somehow for the case -where we've got a previous version of a resource cached plus an error? -Previously, that was okay, because we'd just deliver both updates to the -watcher, one at a time. But if the watcher is going to decide to throw -away the cached resource when it sees the error, maybe we need to -deliver both at once, in case a new watcher starts when we've already -got both the cached resource and the error? +These methods do not map well to the desired error handling behavior +defined above. + +This proposal replaces those methods with the following two methods: + +- `OnResourceChanged(StatusOr resource)`: Will be invoked + to notify the watcher of a new version of the resource received from + the xDS server or an error indicating the reason why the resource cannot + be obtained. Note that if an error is passed to this method after the + watcher has previously seen a valid resource, the watcher is expected + to stop using that previously delivered resource. In this case, the + XdsClient will remove the resource from its cache, so that CSDS will + not report a resource that the client is not actually using. + +- `OnAmbientError(Status status)`: Will be invoked to notify the watcher + of an error that occurs after a resource has been received that should + not modify the watcher's use of that resource but that may be useful + information about the ambient state of the XdsClient. In particular, + the watcher should NOT stop using the previously seen resource, and the + XdsClient will NOT remove the resource from its cache. However, the + error message may be useful as additional context to include in errors + that are being generated for other reasons. For example, if failing a + data plane RPC due to all endpoints being unreachable, it may be useful + to also report that the client has lost contact with the xDS server. + +### Handling Data Errors + +As described above, the behavior that we want on data errors depends +on whether we can depend on the control plane to properly alert human +operators when these errors occur. To configure this, we will introduce a +new server feature in the bootstrap config called "fail_on_data_errors", +which indicates that we cannot depend on the control plane to properly +alert human operators. + +This server feature affects all data errors, which are the following cases: +- NACKs +- resource deletions in LDS and CDS +- [xRFC TP3] errors with status NOT_FOUND or PERMISSION_DENIED + +If one of these errors occurs and there is no previously cached version +of the resource, then the XdsClient will call the watchers' +`OnResourceChanged()` method with an appropriate error. + +If one of these errors occurs when there IS a previously cached version +of the resource, then the behavior depends on whether the +"fail_on_data_errors" server feature is present in the bootstrap config. +If the server feature is present, then the XdsClient will remove the +previous version of the resource from its cache and will call the +watchers' `OnResourceChanged()` method with an appropriate error. If +the server feature is NOT present, then the XdsClient will retain the +previous version of the resource in its cache and will call the +watchers' `OnAmbientError()` method with an appropriate error. + +### Handling Transient Errors + +Transient errors include the following cases: +- the xDS channel reporting TRANSIENT_FAILURE or the ADS stream failing + without having received a response from the server, as described in + [gRFC A57] +- [xRFC TP3] errors with any status *except* NOT_FOUND or PERMISSION_DENIED + +When one of these errors occurs, if we do NOT have a previously cached +version of the resource, the XdsClient will call the watchers' +`OnResourceChanged()` method to notify them of the error. If we DO have +a previously cached version of the resource, the XdsClient will instead +call the watchers' `OnAmbientError()` method to notify them of the +error. ### Support for Errors Provided by the xDS server as per [xRFC TP3] @@ -279,14 +238,46 @@ We will add support for the new response fields added in [xRFC TP3]. There will be no configuration telling gRPC to use these fields; we will unconditionally use them if they are present. -When we receive an error for a given resource, we will cancel the resource -timer, if any, and report the error to the watchers' `OnServerDataError()` -methods. +When we receive an error for a given resource, we will do the following: +- Cancel the resource timer, if any. +- If the status code is NOT_FOUND or PERMISSION_DENIED, then the + watcher notification behavior is as described in the [Handling Data + Errors](#handling-data-errors) section. Otherwise, the watcher + notification behavior is as described in the [Handling Transient + Errors](#handling-transient-errors) section. +- If we have a previously cached version of the resource and we are not + deleting it from the cache (i.e., if either the "fail_on_data_errors" + server feature is not present in the bootstrap config or the status + code is something other than NOT_FOUND or PERMISSION_DENIED), we will + record the cache status as `RECEIVED_ERROR`, which will be reported in + CSDS. + +### Changes to Resource Timer Behavior + +If we know that the xDS server will use [xRFC TP3] to inform us of +non-existing resources, then we no longer need to rely on the resource +timer to detect non-existing resources, so we can instead re-purpose +the timer to detect slow xDS servers. + +To that end, we will introduce a new server feature in the bootstrap +config called "resource_timer_is_transient_error". When this server +feature is present, the timer will be set for 30 seconds instead of 15 +seconds. When the timer fires, however, we will still invoke the +watchers' `OnResourceChanged()` method with the error, because by +definition this timer will fire only when we do not have a version of +the resource cached. -In the XdsClient cache, when we receive an error for a resource, we will -record the error but still keep the previous version of the resource -cached. We will record the cache status as `RECEIVED_ERROR`, which will -be reported in CSDS. +This server feature should be used only in cases where the server is known +to support [xRFC TP3]. If the server feature is used when the server +does *not* support [xRFC TP3], then it will take longer for the client +to notice a non-existing resource, because the server has no way of +reporting the condition before the timer fires. + +Note that the client will honor [xRFC TP3] errors returned by the server +regardless of whether this server feature is present. The presence of +this server feature actually controls only whether we can *depend* on the +server to provide these errors, so that we can be more lax about the +resource timer. ### Deprecating "ignore_resource_deletion" Server Feature @@ -296,8 +287,8 @@ will deprecate the "ignore_resource_deletion" server feature that was added in [gRFC A53]. For existing bootstrap configs that have the "ignore_resource_deletion" -server feature, gRPC will ignore that server feature, but the new -default behavior is to ignore all data errors, including resource +server feature, gRPC will ignore that server feature, but the new default +behavior is to treat all data errors as transient, including resource deletion, so they will still get the right behavior. For existing bootstrap configs that do *not* have the @@ -311,20 +302,23 @@ to use the new "fail_on_data_errors" server feature instead. The following table shows the old and new behavior for each case. -Data Error | fail_on_data_errors Server Feature | resource_timer_is_transient_error Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior +Error | Resource Already Cached | fail_on_data_errors Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior ---------- | ---------------------------------- | --------------------------------------------------------- | ------------------------ | ------------ | ------------------------ | ------------ -NACK from client | false | | `OnError(status)` | Ignore | `OnClientDataError(status, false)` | Ignore if already have resource -NACK from client | true | | `OnError(status)` | Ignore | `OnClientDataError(status, true)` | Drop existing resource and fail RPCs -Resource timeout | false | false | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnClientDataError(Status(NOT_FOUND), false)` | Fail RPCs (no existing resource) -Resource timeout | true | false | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnClientDataError(Status(NOT_FOUND), true)` | Fail RPCs (no existing resource) -Resource timeout | false | true | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnTransientError(Status(NOT_FOUND), false)` | Fail RPCs (no existing resource) -Resource timeout | true | true | `OnResourceDoesNotExist()` | Fail RPCs (no existing resource) | `OnTransientError(Status(NOT_FOUND), true)` | Fail RPCs (no existing resource) -LDS or CDS resource deletion from server | false | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnServerDataError(status, false)` | Ignore -LDS or CDS resource deletion from server | true | | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnServerDataError(status, false)` | Drop existing resource and fail RPCs -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | | N/A | N/A | `OnServerDataError(status, false)` | Ignore -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | | N/A | N/A | `OnServerDataError(status, true)` | Drop existing resource and fail RPCs -[xRFC TP3] error with other status | false | | N/A | N/A | `OnServerDataError(status, false)` | Ignore -[xRFC TP3] error with other status | true | | N/A | N/A | `OnServerDataError(status, true)` | Ignore +xDS channel reports TRANSIENT_FAILURE | false | (any) | `OnError()` | Fail data plane RPCs | `OnResourceChanged(Status)` | Fail data plane RPCs +xDS channel reports TRANSIENT_FAILURE | true | (any) | `OnError()` | Ignore | `OnAmbientError(Status)` | Ignore +ADS stream failed without reading a response | false | (any) | `OnError()` | Fail data plane RPCs | `OnResourceChanged(Status)` | Fail data plane RPCs +ADS stream failed without reading a response | true | (any) | `OnError()` | Ignore | `OnAmbientError(Status)` | Ignore +NACK from client | false | (any) | `OnError(status)` | Fail data plane RPCs | `OnResourceChanged(Status)` | Fail data plane RPCs +NACK from client | true | false | `OnError(status)` | Ignore | `OnAmbientError(Status)` | Ignore +NACK from client | true | true | `OnError(status)` | Ignore | `OnResourceChanged(Status)` | Drop existing resource and fail RPCs +Resource timeout | false | (any) | `OnResourceDoesNotExist()` | Fail data plane RPCs | `OnResourceChanged(Status(NOT_FOUND))` | Fail data plane RPCs +LDS or CDS resource deletion from server | true | false | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnAmbientError(Status(NOT_FOUND))` | Ignore +LDS or CDS resource deletion from server | true | true | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnResourceChanged(Status(NOT_FOUND))` | Drop existing resource and fail RPCs +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | (any) | N/A | N/A | `OnResourceChanged(status)` | Fail data plane RPCs +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | false | N/A | N/A | `OnAmbientError(status)` | Ignore +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | true | N/A | N/A | `OnResourceChanged(status)` | Drop existing resource and fail RPCs +[xRFC TP3] error with other status | false | (any) | N/A | N/A | `OnResourceChanged(status)` | Fail data plane RPCs +[xRFC TP3] error with other status | true | (any) | N/A | N/A | `OnAmbientError(status)` | Ignore ### Temporary environment variable protection @@ -334,15 +328,14 @@ interop testing. This env var will guard using the new response fields from [xRFC TP3] and the new server features "fail_on_data_errors" and "resource_timer_is_transient_error". -TODO: figure out how to handle legacy "ignore_resource_deletion" server -feature for env var - ## Rationale -TODO: Anything we need to note here? - -[A discussion of alternate approaches and the trade offs, advantages, and disadvantages of the specified approach.] - +We considered an approach whereby we'd notify the watchers about whether +the error is a data error or a transient error and let the watcher make +the decision about how to respond. However, that would make the watcher +API much harder to understand. It would also break CSDS in that it +would divorce the cache status inside the XdsClient with the watcher +behavior outside of the XdsClient. ## Implementation From 1417a0d377d212e4493a25a06cfd1c5248cd9026 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 9 Dec 2024 21:40:08 +0000 Subject: [PATCH 07/20] improve table formatting --- A88-xds-data-error-handling.md | 34 +++++++++++++++++----------------- 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index dba2efc92..c62cf2442 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -302,23 +302,23 @@ to use the new "fail_on_data_errors" server feature instead. The following table shows the old and new behavior for each case. -Error | Resource Already Cached | fail_on_data_errors Server Feature | Old Watcher Notification | Old Behavior | New Watcher Notification | New Behavior ----------- | ---------------------------------- | --------------------------------------------------------- | ------------------------ | ------------ | ------------------------ | ------------ -xDS channel reports TRANSIENT_FAILURE | false | (any) | `OnError()` | Fail data plane RPCs | `OnResourceChanged(Status)` | Fail data plane RPCs -xDS channel reports TRANSIENT_FAILURE | true | (any) | `OnError()` | Ignore | `OnAmbientError(Status)` | Ignore -ADS stream failed without reading a response | false | (any) | `OnError()` | Fail data plane RPCs | `OnResourceChanged(Status)` | Fail data plane RPCs -ADS stream failed without reading a response | true | (any) | `OnError()` | Ignore | `OnAmbientError(Status)` | Ignore -NACK from client | false | (any) | `OnError(status)` | Fail data plane RPCs | `OnResourceChanged(Status)` | Fail data plane RPCs -NACK from client | true | false | `OnError(status)` | Ignore | `OnAmbientError(Status)` | Ignore -NACK from client | true | true | `OnError(status)` | Ignore | `OnResourceChanged(Status)` | Drop existing resource and fail RPCs -Resource timeout | false | (any) | `OnResourceDoesNotExist()` | Fail data plane RPCs | `OnResourceChanged(Status(NOT_FOUND))` | Fail data plane RPCs -LDS or CDS resource deletion from server | true | false | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnAmbientError(Status(NOT_FOUND))` | Ignore -LDS or CDS resource deletion from server | true | true | `OnResourceDoesNotExist()`, but skipped if "ignore_resource_deletion" server feature is present | Drop existing resource and fail RPCs | `OnResourceChanged(Status(NOT_FOUND))` | Drop existing resource and fail RPCs -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | (any) | N/A | N/A | `OnResourceChanged(status)` | Fail data plane RPCs -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | false | N/A | N/A | `OnAmbientError(status)` | Ignore -[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | true | N/A | N/A | `OnResourceChanged(status)` | Drop existing resource and fail RPCs -[xRFC TP3] error with other status | false | (any) | N/A | N/A | `OnResourceChanged(status)` | Fail data plane RPCs -[xRFC TP3] error with other status | true | (any) | N/A | N/A | `OnAmbientError(status)` | Ignore +Error | Resource Already Cached | fail_on_data_errors Server Feature | Old Behavior | New Behavior +----- | ----------------------- | ---------------------------------- | ------------ | ------------ +xDS channel reports TRANSIENT_FAILURE | false | (any) | `OnError()`, Fail data plane RPCs | `OnResourceChanged(Status)`, Fail data plane RPCs +xDS channel reports TRANSIENT_FAILURE | true | (any) | `OnError()`, Ignore | `OnAmbientError(Status)`, Ignore +ADS stream failed without reading a response | false | (any) | `OnError()`, Fail data plane RPCs | `OnResourceChanged(Status)`, Fail data plane RPCs +ADS stream failed without reading a response | true | (any) | `OnError()`, Ignore | `OnAmbientError(Status)`, Ignore +NACK from client | false | (any) | `OnError(status)`, Fail data plane RPCs | `OnResourceChanged(Status)`, Fail data plane RPCs +NACK from client | true | false | `OnError(status)`, Ignore | `OnAmbientError(Status)`, Ignore +NACK from client | true | true | `OnError(status)`, Ignore | `OnResourceChanged(Status)`, Drop existing resource and fail RPCs +Resource timeout | false | (any) | `OnResourceDoesNotExist()`, Fail data plane RPCs | `OnResourceChanged(Status(NOT_FOUND))`, Fail data plane RPCs +LDS or CDS resource deletion from server | true | false | `OnResourceDoesNotExist()`, Drop existing resource and fail RPCs (watcher notification skipped if "ignore_resource_deletion" server feature is present) | `OnAmbientError(Status(NOT_FOUND))`, Ignore +LDS or CDS resource deletion from server | true | true | `OnResourceDoesNotExist()`, Drop existing resources and fail RPCs (watcher notification skipped if "ignore_resource_deletion" server feature is present) | `OnResourceChanged(Status(NOT_FOUND))`, Drop existing resource and fail RPCs +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | (any) | N/A | `OnResourceChanged(status)`, Fail data plane RPCs +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | false | N/A | `OnAmbientError(status)`, Ignore +[xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | true | true | N/A | `OnResourceChanged(status)`, Drop existing resource and fail RPCs +[xRFC TP3] error with other status | false | (any) | N/A | `OnResourceChanged(status)`, Fail data plane RPCs +[xRFC TP3] error with other status | true | (any) | N/A | `OnAmbientError(status)`, Ignore ### Temporary environment variable protection From 1a7df04a8e582b462b336597f501b591c099c7ae Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Thu, 12 Dec 2024 21:40:35 +0000 Subject: [PATCH 08/20] env var also guards ignore_resource_deletion behavior --- A88-xds-data-error-handling.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index c62cf2442..17d5a13e3 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-12-09 +* Last updated: 2024-12-12 * Discussion at: (filled after thread exists) ## Abstract @@ -326,7 +326,8 @@ The new functionality will be guarded by the `GRPC_EXPERIMENTAL_XDS_DATA_ERROR_HANDLING` env var until it passes interop testing. This env var will guard using the new response fields from [xRFC TP3] and the new server features "fail_on_data_errors" and -"resource_timer_is_transient_error". +"resource_timer_is_transient_error". It will also guard whether we +default to ignoring resource deletion. ## Rationale From da9741950a2e63898d0d26eb20a95a054b699882 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 13 Dec 2024 21:54:32 +0000 Subject: [PATCH 09/20] add section on XdsClient metrics --- A78-grpc-metrics-wrr-pf-xds.md | 1 + A88-xds-data-error-handling.md | 21 +++++++++++++++++++-- 2 files changed, 20 insertions(+), 2 deletions(-) diff --git a/A78-grpc-metrics-wrr-pf-xds.md b/A78-grpc-metrics-wrr-pf-xds.md index e05047543..3b8136926 100644 --- a/A78-grpc-metrics-wrr-pf-xds.md +++ b/A78-grpc-metrics-wrr-pf-xds.md @@ -6,6 +6,7 @@ A78: gRPC OTel Metrics for WRR, Pick First, and XdsClient * Implemented in: * Last updated: 2024-09-24 * Discussion at: https://groups.google.com/g/grpc-io/c/A2Mqz8OMDys +* Updated by: [A88: xDS Data Error Handling](A88-xds-data-error-handling.md) ## Abstract diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 17d5a13e3..d7cea11d4 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-12-12 +* Last updated: 2024-12-13 * Discussion at: (filled after thread exists) ## Abstract @@ -127,12 +127,14 @@ subscribed to. - [gRFC A53: Option for Ignoring xDS Resource Deletion][gRFC A53] - [gRFC A57: XdsClient Failure Mode Behavior][gRFC A57] - [gRFC A40: xDS Configuration Dump via Client Status Discovery Service][gRFC A40] +- [gRFC A78: OTel Metrics for WRR, Pick First, and XdsClient][gRFC A78] - [xRFC TP3: Error Propagation][xRFC TP3] [gRFC A53]: A53-xds-ignore-resource-deletion.md [gRFC A57]: A57-xds-client-failure-mode-behavior.md [gRFC A40]: A40-csds-support.md -[xRFC TP3]: https://github.com/cncf/xds/pull/107 +[gRFC A78]: A78-grpc-metrics-wrr-pf-xds.md +[xRFC TP3]: https://github.com/cncf/xds/blob/main/proposals/TP3-xds-error-propagation.md ## Proposal @@ -320,6 +322,21 @@ LDS or CDS resource deletion from server | true | true | `OnResourceDoesNotExist [xRFC TP3] error with other status | false | (any) | N/A | `OnResourceChanged(status)`, Fail data plane RPCs [xRFC TP3] error with other status | true | (any) | N/A | `OnAmbientError(status)`, Ignore +### XdsClient Metric Labels + +[gRFC A78] defines a set of XdsClient metrics to be exported via +OpenTelemetry. One of the metric labels defined there is the +`grpc.xds.cache_state` label, which is used for the +`grpc.xds_client.resources` metric. In keeping with the new CSDS client +state introduced in [xRFC TP3], we will add two new possible values for +this label: +- `received_error`: The server sent an error for this resource. +- `received_error_but_cached`: There is a version of this resource + cached, but the server subsequently sent an error for this resource. + +By default, the second value will be seen; if the `fail_on_data_errors` +server feature is present, then the second value will be seen. + ### Temporary environment variable protection The new functionality will be guarded by the From bc9b9c2521b999a0af3c974233964d5f141613de Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 14 Dec 2024 17:55:11 +0000 Subject: [PATCH 10/20] timer status changes with server feature --- A88-xds-data-error-handling.md | 13 +++++++------ 1 file changed, 7 insertions(+), 6 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index d7cea11d4..4448049a2 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-12-13 +* Last updated: 2024-12-14 * Discussion at: (filled after thread exists) ## Abstract @@ -264,10 +264,11 @@ the timer to detect slow xDS servers. To that end, we will introduce a new server feature in the bootstrap config called "resource_timer_is_transient_error". When this server feature is present, the timer will be set for 30 seconds instead of 15 -seconds. When the timer fires, however, we will still invoke the -watchers' `OnResourceChanged()` method with the error, because by -definition this timer will fire only when we do not have a version of -the resource cached. +seconds, and when the timer fires, the resulting error will have status +UNAVAILABLE instead of NOT_FOUND. When the timer fires, however, we will +still invoke the watchers' `OnResourceChanged()` method with the error, +because by definition this timer will fire only when we do not have a +version of the resource cached. This server feature should be used only in cases where the server is known to support [xRFC TP3]. If the server feature is used when the server @@ -313,7 +314,7 @@ ADS stream failed without reading a response | true | (any) | `OnError()`, Ignor NACK from client | false | (any) | `OnError(status)`, Fail data plane RPCs | `OnResourceChanged(Status)`, Fail data plane RPCs NACK from client | true | false | `OnError(status)`, Ignore | `OnAmbientError(Status)`, Ignore NACK from client | true | true | `OnError(status)`, Ignore | `OnResourceChanged(Status)`, Drop existing resource and fail RPCs -Resource timeout | false | (any) | `OnResourceDoesNotExist()`, Fail data plane RPCs | `OnResourceChanged(Status(NOT_FOUND))`, Fail data plane RPCs +Resource timeout | false | (any) | `OnResourceDoesNotExist()`, Fail data plane RPCs | `OnResourceChanged(Status(NOT_FOUND))` (status will be UNAVAILABLE if `resource_timer_is_transient_error` server feature is present), Fail data plane RPCs LDS or CDS resource deletion from server | true | false | `OnResourceDoesNotExist()`, Drop existing resource and fail RPCs (watcher notification skipped if "ignore_resource_deletion" server feature is present) | `OnAmbientError(Status(NOT_FOUND))`, Ignore LDS or CDS resource deletion from server | true | true | `OnResourceDoesNotExist()`, Drop existing resources and fail RPCs (watcher notification skipped if "ignore_resource_deletion" server feature is present) | `OnResourceChanged(Status(NOT_FOUND))`, Drop existing resource and fail RPCs [xRFC TP3] error with status NOT_FOUND or PERMISSION_DENIED | false | (any) | N/A | `OnResourceChanged(status)`, Fail data plane RPCs From ae4458cb1a61d03a71be9fa3cf232cc223c6df40 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 16 Dec 2024 18:37:13 +0000 Subject: [PATCH 11/20] add mailing list link --- A88-xds-data-error-handling.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 4448049a2..fa3dc74ce 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -5,7 +5,7 @@ A88: xDS Data Error Handling * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: * Last updated: 2024-12-14 -* Discussion at: (filled after thread exists) +* Discussion at: https://groups.google.com/g/grpc-io/c/Gg_tItYgQoI ## Abstract From d47e5fd5ab4ad9076294490603c2b2a96aef8ba7 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 16 Dec 2024 22:03:04 +0000 Subject: [PATCH 12/20] clarify handling of xRFC TP3 errors --- A88-xds-data-error-handling.md | 21 +++++++++------------ 1 file changed, 9 insertions(+), 12 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index fa3dc74ce..008574900 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -241,18 +241,15 @@ There will be no configuration telling gRPC to use these fields; we will unconditionally use them if they are present. When we receive an error for a given resource, we will do the following: -- Cancel the resource timer, if any. -- If the status code is NOT_FOUND or PERMISSION_DENIED, then the - watcher notification behavior is as described in the [Handling Data - Errors](#handling-data-errors) section. Otherwise, the watcher - notification behavior is as described in the [Handling Transient - Errors](#handling-transient-errors) section. -- If we have a previously cached version of the resource and we are not - deleting it from the cache (i.e., if either the "fail_on_data_errors" - server feature is not present in the bootstrap config or the status - code is something other than NOT_FOUND or PERMISSION_DENIED), we will - record the cache status as `RECEIVED_ERROR`, which will be reported in - CSDS. +1. Cancel the resource timer, if any. +2. Set the cache entry's status to `RECEIVED_ERROR` and record the error, + so that that information can be reported via CSDS. +3. If the status code is NOT_FOUND or PERMISSION_DENIED and the + "fail_on_data_errors" server feature is present in the bootstrap + config, delete the existing cached resource, if any. +4. If there is an existing cached resource, call the watchers' + `OnAmbientError()` method with the error. Otherwise, call th + watchers' `OnResourceChanged()` method with the error. ### Changes to Resource Timer Behavior From ba7950e2382ed1d0928b3ec05aed7b8b038ced63 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 16 Dec 2024 22:23:34 +0000 Subject: [PATCH 13/20] add note about status code propagation to data plane RPCs --- A88-xds-data-error-handling.md | 13 +++++++++++++ 1 file changed, 13 insertions(+) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 008574900..bfe119e73 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -128,12 +128,14 @@ subscribed to. - [gRFC A57: XdsClient Failure Mode Behavior][gRFC A57] - [gRFC A40: xDS Configuration Dump via Client Status Discovery Service][gRFC A40] - [gRFC A78: OTel Metrics for WRR, Pick First, and XdsClient][gRFC A78] +- [gRFC A54: Restrict Control-Plane Status Codes][gRFC A54] - [xRFC TP3: Error Propagation][xRFC TP3] [gRFC A53]: A53-xds-ignore-resource-deletion.md [gRFC A57]: A57-xds-client-failure-mode-behavior.md [gRFC A40]: A40-csds-support.md [gRFC A78]: A78-grpc-metrics-wrr-pf-xds.md +[gRFC A54]: A54-restrict-control-plane-status-codes.md [xRFC TP3]: https://github.com/cncf/xds/blob/main/proposals/TP3-xds-error-propagation.md ## Proposal @@ -320,6 +322,17 @@ LDS or CDS resource deletion from server | true | true | `OnResourceDoesNotExist [xRFC TP3] error with other status | false | (any) | N/A | `OnResourceChanged(status)`, Fail data plane RPCs [xRFC TP3] error with other status | true | (any) | N/A | `OnAmbientError(status)`, Ignore +### Propagating Status to Data Plane RPCs + +As noted in [gRFC A54], the set of status codes that gRPC should generate +for RPCs is intentionally limited. In particular, INVALID_ARGUMENT and +NOT_FOUND are two of the codes that are reserved for application use +and should never be generated by gRPC itself. Implementors should not +blindly propagate status codes from the XdsClient watcher to data plane +RPCs; instead, they should be careful to set status codes carefully on +data plane RPCs on both the client and server side. In most cases, it +is expected that UNAVAILABLE is an appropriate code for data plane RPCs. + ### XdsClient Metric Labels [gRFC A78] defines a set of XdsClient metrics to be exported via From 6b2decc8d4f78b549ede32d4bfe2e5c24f963a8e Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Tue, 17 Dec 2024 21:14:29 +0000 Subject: [PATCH 14/20] review comments --- A88-xds-data-error-handling.md | 9 +++++---- 1 file changed, 5 insertions(+), 4 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index bfe119e73..ec907ddaa 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-12-14 +* Last updated: 2024-12-17 * Discussion at: https://groups.google.com/g/grpc-io/c/Gg_tItYgQoI ## Abstract @@ -249,8 +249,9 @@ When we receive an error for a given resource, we will do the following: 3. If the status code is NOT_FOUND or PERMISSION_DENIED and the "fail_on_data_errors" server feature is present in the bootstrap config, delete the existing cached resource, if any. -4. If there is an existing cached resource, call the watchers' - `OnAmbientError()` method with the error. Otherwise, call th +4. If there is an existing cached resource (which there will not be if + it was deleted in the previous step), call the watchers' + `OnAmbientError()` method with the error. Otherwise, call the watchers' `OnResourceChanged()` method with the error. ### Changes to Resource Timer Behavior @@ -346,7 +347,7 @@ this label: cached, but the server subsequently sent an error for this resource. By default, the second value will be seen; if the `fail_on_data_errors` -server feature is present, then the second value will be seen. +server feature is present, then the first value will be seen. ### Temporary environment variable protection From bbe760ac2918585653a84182b8354c905a7cd5b9 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Mon, 23 Dec 2024 16:46:33 +0000 Subject: [PATCH 15/20] describe how to clear an ambient error --- A88-xds-data-error-handling.md | 5 ++++- 1 file changed, 4 insertions(+), 1 deletion(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index ec907ddaa..05aa69e73 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-12-17 +* Last updated: 2024-12-23 * Discussion at: https://groups.google.com/g/grpc-io/c/Gg_tItYgQoI ## Abstract @@ -192,6 +192,9 @@ This proposal replaces those methods with the following two methods: that are being generated for other reasons. For example, if failing a data plane RPC due to all endpoints being unreachable, it may be useful to also report that the client has lost contact with the xDS server. + The error condition is considered to be cleared when either + `OnResourceChanged()` is invoked or `OnAmbientError()` is invoked + again with an OK status. ### Handling Data Errors From f8c28d9ea308a617199b2e10558fcc13a7eeb429 Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Thu, 2 Jan 2025 21:35:17 +0000 Subject: [PATCH 16/20] clarify how the resource timer affects the cache now --- A88-xds-data-error-handling.md | 17 +++++++++++++---- 1 file changed, 13 insertions(+), 4 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 05aa69e73..8d40b0f57 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2024-12-23 +* Last updated: 2025-01-02 * Discussion at: https://groups.google.com/g/grpc-io/c/Gg_tItYgQoI ## Abstract @@ -179,8 +179,9 @@ This proposal replaces those methods with the following two methods: be obtained. Note that if an error is passed to this method after the watcher has previously seen a valid resource, the watcher is expected to stop using that previously delivered resource. In this case, the - XdsClient will remove the resource from its cache, so that CSDS will - not report a resource that the client is not actually using. + XdsClient will remove the resource from its cache, so that CSDS (see + [gRFC A40]) will not report a resource that the client is not actually + using. - `OnAmbientError(Status status)`: Will be invoked to notify the watcher of an error that occurs after a resource has been received that should @@ -268,7 +269,15 @@ To that end, we will introduce a new server feature in the bootstrap config called "resource_timer_is_transient_error". When this server feature is present, the timer will be set for 30 seconds instead of 15 seconds, and when the timer fires, the resulting error will have status -UNAVAILABLE instead of NOT_FOUND. When the timer fires, however, we will +UNAVAILABLE instead of NOT_FOUND. We will also record this transient +error in the cache, so that subsequent watches started on the XdsClient +will be notified of this error immediately. However, note that this +transient error in the cache will not be visible to CSDS (see [gRFC A40]), +nor will it affect the XdsClient metrics described in [gRFC A78] -- for +both of those purposes, the cache entry will continue to be in "requested" +state. + +Note that regardless of whether this server feature is present, we will still invoke the watchers' `OnResourceChanged()` method with the error, because by definition this timer will fire only when we do not have a version of the resource cached. From 1c463a541344256402b4c716a8a00a10e8e0d19d Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Thu, 2 Jan 2025 21:37:23 +0000 Subject: [PATCH 17/20] add links to C-core impl --- A88-xds-data-error-handling.md | 6 +++++- 1 file changed, 5 insertions(+), 1 deletion(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 8d40b0f57..b1ce2f70a 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -381,7 +381,11 @@ behavior outside of the XdsClient. ## Implementation -@markdroth will implement in C-core. +C-core implementation: +- XdsClient watcher API change: https://github.com/grpc/grpc/pull/38269 +- Support for "fail_on_data_errors": https://github.com/grpc/grpc/pull/38278 +- Support for "resource_timer_is_transient_failure": https://github.com/grpc/grpc/pull/38377 +- xRFC TP3 support: TBD Will also be implemented in Java, Go, and Node. From 5e68ac316c917551f57c54b80892b68d3e40ab2e Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Fri, 3 Jan 2025 17:00:08 +0000 Subject: [PATCH 18/20] add ambient error handling --- A88-xds-data-error-handling.md | 29 +++++++++++++++++++++++++++-- 1 file changed, 27 insertions(+), 2 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index b1ce2f70a..f7ddd0e5b 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -4,7 +4,7 @@ A88: xDS Data Error Handling * Approver: @ejona86, @dfawley * Status: {Draft, In Review, Ready for Implementation, Implemented} * Implemented in: -* Last updated: 2025-01-02 +* Last updated: 2025-01-03 * Discussion at: https://groups.google.com/g/grpc-io/c/Gg_tItYgQoI ## Abstract @@ -129,6 +129,8 @@ subscribed to. - [gRFC A40: xDS Configuration Dump via Client Status Discovery Service][gRFC A40] - [gRFC A78: OTel Metrics for WRR, Pick First, and XdsClient][gRFC A78] - [gRFC A54: Restrict Control-Plane Status Codes][gRFC A54] +- [gRFC A61: IPv4 and IPv6 Dualstack Backend Support][gRFC A61] +- [gRFC A74: xDS Config Tears][gRFC A74] - [xRFC TP3: Error Propagation][xRFC TP3] [gRFC A53]: A53-xds-ignore-resource-deletion.md @@ -136,6 +138,8 @@ subscribed to. [gRFC A40]: A40-csds-support.md [gRFC A78]: A78-grpc-metrics-wrr-pf-xds.md [gRFC A54]: A54-restrict-control-plane-status-codes.md +[gRFC A61]: A61-IPv4-IPv6-dualstack-backends.md +[gRFC A74]: A74-xds-config-tears.md [xRFC TP3]: https://github.com/cncf/xds/blob/main/proposals/TP3-xds-error-propagation.md ## Proposal @@ -335,6 +339,26 @@ LDS or CDS resource deletion from server | true | true | `OnResourceDoesNotExist [xRFC TP3] error with other status | false | (any) | N/A | `OnResourceChanged(status)`, Fail data plane RPCs [xRFC TP3] error with other status | true | (any) | N/A | `OnAmbientError(status)`, Ignore +### Handling Ambient Errors + +As mentioned above, the purpose of ambient errors is to provide useful +context in error messages that are triggered for reasons not directly +related to the ambient error. For example, if we lose contact with the +xDS server, we continue to use cached resources, but if we subsequently +lose contact with all endpoints, we will fail data plane RPCs. The RPC +failure is triggered by the fact that we've lost contact with all +endpoints, but the status message should include the fact that we've +lost contact with the xDS server, since the outage may be caused by the +fact that we haven't been getting EDS updates from the xDS server. + +To support this, we will change pick_first and the petiole policies (see +[gRFC A61]) to include the `resolution_note` field in all error messages +generated by those policies. The xds resolver will set the resolution +note in the `XdsConfig` (see [gRFC A74]) to include ambient errors for +all resources (LDS, RDS, CDS, and EDS). If none of the resources +have ambient errors, the xds resolver will set the resolution note to +indicate the xDS node ID, which can be useful for debugging. + ### Propagating Status to Data Plane RPCs As noted in [gRFC A54], the set of status codes that gRPC should generate @@ -383,9 +407,10 @@ behavior outside of the XdsClient. C-core implementation: - XdsClient watcher API change: https://github.com/grpc/grpc/pull/38269 +- resolution_note handling: https://github.com/grpc/grpc/pull/38341 - Support for "fail_on_data_errors": https://github.com/grpc/grpc/pull/38278 - Support for "resource_timer_is_transient_failure": https://github.com/grpc/grpc/pull/38377 -- xRFC TP3 support: TBD +- xRFC TP3 support: https://github.com/grpc/grpc/pull/38381 Will also be implemented in Java, Go, and Node. From 85662e2541247097ba401c7c059bf8eefe783c4c Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Sat, 4 Jan 2025 00:44:43 +0000 Subject: [PATCH 19/20] describe cache state more clearly --- A88-xds-data-error-handling.md | 145 ++++++++++++++++++++------------- 1 file changed, 90 insertions(+), 55 deletions(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index f7ddd0e5b..49725b678 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -84,10 +84,11 @@ Unfortunately, our current behavior for data errors does not follow the principles outlined above. Currently, we have special handling for the resource deletion case: by default, we throw away the previously cached resource (i.e., approach 2 above), but we provide an option -in the bootstrap config to inhibit this behavior (see [gRFC A53]). -However, all other data errors are actually treated as transient errors: -we unconditionally stick with the previously cached resource (i.e., -approach 1 above). +in the bootstrap config to inhibit this behavior (see [gRFC A53]), +which confusingly makes the deletion not show up at all in CSDS (see +[gRFC A40]) or XdsClient metrics (see [gRFC A78]). However, all other +data errors are actually treated as transient errors: we unconditionally +stick with the previously cached resource (i.e., approach 1 above). This proposal changes the data error handling behavior to match the principles outlined above. @@ -184,15 +185,15 @@ This proposal replaces those methods with the following two methods: watcher has previously seen a valid resource, the watcher is expected to stop using that previously delivered resource. In this case, the XdsClient will remove the resource from its cache, so that CSDS (see - [gRFC A40]) will not report a resource that the client is not actually - using. + [gRFC A40]) and XdsClient metrics (see [gRFC A78]) will not reflect a + resource that the client is not actually using. - `OnAmbientError(Status status)`: Will be invoked to notify the watcher of an error that occurs after a resource has been received that should not modify the watcher's use of that resource but that may be useful information about the ambient state of the XdsClient. In particular, - the watcher should NOT stop using the previously seen resource, and the - XdsClient will NOT remove the resource from its cache. However, the + the watcher should *not* stop using the previously seen resource, and the + XdsClient will *not* remove the resource from its cache. However, the error message may be useful as additional context to include in errors that are being generated for other reasons. For example, if failing a data plane RPC due to all endpoints being unreachable, it may be useful @@ -210,24 +211,18 @@ new server feature in the bootstrap config called "fail_on_data_errors", which indicates that we cannot depend on the control plane to properly alert human operators. -This server feature affects all data errors, which are the following cases: +This server feature affects all data errors that can occur when a +resource is already cached, which are the following cases: - NACKs - resource deletions in LDS and CDS - [xRFC TP3] errors with status NOT_FOUND or PERMISSION_DENIED -If one of these errors occurs and there is no previously cached version -of the resource, then the XdsClient will call the watchers' -`OnResourceChanged()` method with an appropriate error. - -If one of these errors occurs when there IS a previously cached version -of the resource, then the behavior depends on whether the -"fail_on_data_errors" server feature is present in the bootstrap config. -If the server feature is present, then the XdsClient will remove the -previous version of the resource from its cache and will call the -watchers' `OnResourceChanged()` method with an appropriate error. If -the server feature is NOT present, then the XdsClient will retain the -previous version of the resource in its cache and will call the -watchers' `OnAmbientError()` method with an appropriate error. +When one of these errors occurs, if the "fail_on_data_errors" server +feature is present in the bootstrap config, we will delete the existing +cached resource, if any. Then, if there is an existing cached resource +(which there will not be if we just deleted it), call the watchers' +`OnAmbientError()` method with the error. Otherwise, call the watchers' +`OnResourceChanged()` method with the error. ### Handling Transient Errors @@ -237,12 +232,10 @@ Transient errors include the following cases: [gRFC A57] - [xRFC TP3] errors with any status *except* NOT_FOUND or PERMISSION_DENIED -When one of these errors occurs, if we do NOT have a previously cached -version of the resource, the XdsClient will call the watchers' -`OnResourceChanged()` method to notify them of the error. If we DO have -a previously cached version of the resource, the XdsClient will instead -call the watchers' `OnAmbientError()` method to notify them of the -error. +When one of these errors occurs, if there is an existing cached resource, +we will call the watchers' `OnAmbientError()` method with the error. +Otherwise, we will call the watchers' `OnResourceChanged()` method with +the error. ### Support for Errors Provided by the xDS server as per [xRFC TP3] @@ -252,8 +245,8 @@ unconditionally use them if they are present. When we receive an error for a given resource, we will do the following: 1. Cancel the resource timer, if any. -2. Set the cache entry's status to `RECEIVED_ERROR` and record the error, - so that that information can be reported via CSDS. +2. Set the cache entry's status to `RECEIVED_ERROR` and record the error + in the cache entry. 3. If the status code is NOT_FOUND or PERMISSION_DENIED and the "fail_on_data_errors" server feature is present in the bootstrap config, delete the existing cached resource, if any. @@ -273,13 +266,8 @@ To that end, we will introduce a new server feature in the bootstrap config called "resource_timer_is_transient_error". When this server feature is present, the timer will be set for 30 seconds instead of 15 seconds, and when the timer fires, the resulting error will have status -UNAVAILABLE instead of NOT_FOUND. We will also record this transient -error in the cache, so that subsequent watches started on the XdsClient -will be notified of this error immediately. However, note that this -transient error in the cache will not be visible to CSDS (see [gRFC A40]), -nor will it affect the XdsClient metrics described in [gRFC A78] -- for -both of those purposes, the cache entry will continue to be in "requested" -state. +UNAVAILABLE instead of NOT_FOUND, and the cache state will be set to +`TIMEOUT` instead of `DOES_NOT_EXIST`. Note that regardless of whether this server feature is present, we will still invoke the watchers' `OnResourceChanged()` method with the error, @@ -317,6 +305,68 @@ If the client wants to instead drop existing cached resources and start failing data plane RPCs on data errors, they may update their bootstrap to use the new "fail_on_data_errors" server feature instead. +This proposal removes the requirement from [gRFC A53] to log when +ignoring a resource deletion and then again when the resource comes back +into existence. Instead, we will reflect ignored resource deletion via +both CSDS (see [gRFC A40]) and XdsClient metrics (see [gRFC A78]) by +setting the cache entry state to `NOT_FOUND` without actually deleting +the cached resource. + +### Cache Entry States, CSDS, and XdsClient Metrics + +The cache entry in the XdsClient is used to expose the client's state +via both CSDS ([gRFC A40]) and XdsClient metrics ([gRFC A78]). +Implementations must ensure that the data exposed to watchers is +consistent with the cache, so that CSDS and our metrics accurately +represent the resources used by the client. + +Cache entries will contain both the currently used resource, if any, and +the most recent error seen, if any. A cache entry must be in one of the +following states, the last two of which are new: + +- `REQUESTED`: The client has subscribed to the resource but has not yet + received the resource. In this state, neither the resource nor the + error will be present in the cache entry. +- `DOES_NOT_EXIST`: The resource does not exist, either because the + resource timer has fired and the "resource_timer_is_transient_failure" + server feature is not present, or because of an LDS or CDS resource + deletion. The entry will contain an error indicating how the deletion + occurred. If the "fail_on_data_errors" server feature is not present, + the most recently seen resource may still be present in the cache entry. +- `ACKED`: The resource exists and is valid. In this case, the error + will be unset, and the resource will be present. +- `NACKED`: The most recent version of the resource seen from the + control plane was invalid. The error will indicate the validation + failure. If the "fail_on_data_errors" server feature is not present, + the most recently seen resource may still be present in the cache entry. +- `RECEIVED_ERROR`: The client received an [xRFC TP3] error from the + control plane. The error will be the error received from the control + plane. If the "fail_on_data_errors" server feature is not present, the + most recently seen resource may still be present in the cache entry. +- `TIMEOUT`: The client has subscribed to the resource but did not + receive it before the timer fired, and the + "resource_timer_is_transient_failure" server feature is present. The + resource will be unset and the error will indicate the timeout. + +Those states map directly to the states in the CSDS +`ClientResourceStatus` enum. + +[gRFC A78] defines a set of XdsClient metrics to be exported via +OpenTelemetry. One of the metric labels defined there is the +`grpc.xds.cache_state` label, which is used for the +`grpc.xds_client.resources` metric. In keeping with the new cache +states defined here, we will add the following new possible values for +this label: +- `timeout`: The cache entry is in TIMEOUT state. +- `does_not_exist_but_cached`: The cache entry is in DOES_NOT_EXIST + state but does contain a cached resourc (i.e., we have seen an LDS or + CDS resource deletion but have ignored it). +- `received_error`: The cache entry is in RECEIVED_ERROR state, and no + resource is cached. +- `received_error_but_cached`: The cache entry is in RECEIVED_ERROR + state, and there is a cached resource (i.e., we have ignored a data + error due to "fail_on_data_errors" not being present). + ### Summary of Behavior Changes The following table shows the old and new behavior for each case. @@ -370,21 +420,6 @@ RPCs; instead, they should be careful to set status codes carefully on data plane RPCs on both the client and server side. In most cases, it is expected that UNAVAILABLE is an appropriate code for data plane RPCs. -### XdsClient Metric Labels - -[gRFC A78] defines a set of XdsClient metrics to be exported via -OpenTelemetry. One of the metric labels defined there is the -`grpc.xds.cache_state` label, which is used for the -`grpc.xds_client.resources` metric. In keeping with the new CSDS client -state introduced in [xRFC TP3], we will add two new possible values for -this label: -- `received_error`: The server sent an error for this resource. -- `received_error_but_cached`: There is a version of this resource - cached, but the server subsequently sent an error for this resource. - -By default, the second value will be seen; if the `fail_on_data_errors` -server feature is present, then the first value will be seen. - ### Temporary environment variable protection The new functionality will be guarded by the @@ -399,9 +434,9 @@ default to ignoring resource deletion. We considered an approach whereby we'd notify the watchers about whether the error is a data error or a transient error and let the watcher make the decision about how to respond. However, that would make the watcher -API much harder to understand. It would also break CSDS in that it -would divorce the cache status inside the XdsClient with the watcher -behavior outside of the XdsClient. +API much harder to understand. It would also break CSDS and XdsClient +metrics in that it would divorce the cache status inside the XdsClient +from the watcher behavior outside of the XdsClient. ## Implementation From b98a315f73475af60dabd36bccb4df2ee74a41aa Mon Sep 17 00:00:00 2001 From: "Mark D. Roth" Date: Wed, 8 Jan 2025 20:03:21 +0000 Subject: [PATCH 20/20] add another C-core PR --- A88-xds-data-error-handling.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/A88-xds-data-error-handling.md b/A88-xds-data-error-handling.md index 49725b678..baeaa478b 100644 --- a/A88-xds-data-error-handling.md +++ b/A88-xds-data-error-handling.md @@ -442,7 +442,8 @@ from the watcher behavior outside of the XdsClient. C-core implementation: - XdsClient watcher API change: https://github.com/grpc/grpc/pull/38269 -- resolution_note handling: https://github.com/grpc/grpc/pull/38341 +- resolution_note handling: https://github.com/grpc/grpc/pull/38341 and + https://github.com/grpc/grpc/pull/38411 - Support for "fail_on_data_errors": https://github.com/grpc/grpc/pull/38278 - Support for "resource_timer_is_transient_failure": https://github.com/grpc/grpc/pull/38377 - xRFC TP3 support: https://github.com/grpc/grpc/pull/38381