A88: xDS Data Error Handling #466

Open · wants to merge 20 commits into master

Conversation

markdroth (Member)

No description provided.

markdroth marked this pull request as ready for review on December 16, 2024 18:37
markdroth requested review from ejona86 and dfawley on December 16, 2024 18:37
copybara-service bot pushed a commit to grpc/grpc that referenced this pull request Jan 2, 2025
This is the first part of implementing gRFC A88 (grpc/proposal#466).

This introduces the new watcher API but does not change any of the existing behavior.  This table summarizes the API changes and behavior for each case:

Case | Old API | New API | Behavior
------- | ---------- | ------------ | ------------
Resource timer fires | `OnResourceDoesNotExist()` | `OnResourceChanged(NOT_FOUND)` | Fail data plane RPCs
LDS or CDS resource deletion | `OnResourceDoesNotExist()` | `OnResourceChanged(NOT_FOUND)` | Drop resource and fail data plane RPCs
xDS channel reports TRANSIENT_FAILURE | `OnError()` | `OnResourceChanged(status)` if resource NOT already cached; `OnAmbientError(status)` otherwise | Continue using cached resource, if any; otherwise, fail data plane RPCs
ADS stream terminates without receiving a response | `OnError()` | `OnResourceChanged(status)` if resource NOT already cached; `OnAmbientError(status)` otherwise | Continue using cached resource, if any; otherwise, fail data plane RPCs
Invalid resource update (client NACK) | `OnError()` | `OnResourceChanged(status)` if resource NOT already cached; `OnAmbientError(status)` otherwise | Continue using cached resource, if any; otherwise, fail data plane RPCs
Valid resource update | `OnResourceChanged(resource)` | `OnResourceChanged(resource)` | Use the new resource
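
For concreteness, here is a minimal C++ sketch of the new watcher interface summarized in the table above. The type and method names are illustrative; the actual grpc-core signatures may differ.

```cpp
#include <memory>

#include "absl/status/status.h"
#include "absl/status/statusor.h"

// Placeholder for a parsed xDS resource (e.g., a Listener or Cluster);
// the real grpc-core resource types differ.
struct XdsResource {};

// Illustrative sketch of the new watcher API described in the table above.
class XdsResourceWatcherInterface {
 public:
  virtual ~XdsResourceWatcherInterface() = default;

  // Delivers either a new version of the resource or an error that applies
  // in place of any previously delivered resource (e.g., a NOT_FOUND status
  // when the does-not-exist timer fires or the resource is deleted).
  virtual void OnResourceChanged(
      absl::StatusOr<std::shared_ptr<const XdsResource>> resource) = 0;

  // Reports a transient ("ambient") error -- xDS channel in
  // TRANSIENT_FAILURE, ADS stream termination, or a NACKed update -- while a
  // previously delivered resource remains cached and continues to be used.
  virtual void OnAmbientError(absl::Status status) = 0;
};
```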

This PR also ensures that we add the node ID to all error messages emitted from the XdsClient.  We were doing this for *most* errors before, but there were a few cases missing.

Finally, this PR also changes the behavior for two error cases that we were handling outside of the XdsClient:
- Receiving a socket listener instead of an API listener on the client side.
- Receiving a RouteConfiguration that does not contain a matching VirtualHost.

In both of those cases, we previously ignored the update and stuck with the previous version of the resource. That is not safe, because there is no feedback about the problem: the XdsClient cache has accepted the new resource, so the problem will not show up in the XdsClient metrics or in CSDS, and no NACK will be sent to the control plane. With this PR, we now fail data plane RPCs in those cases instead.
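
As an illustration of the first case, here is a hedged sketch of how the client-side Listener check might move into XdsClient validation, so that a socket listener is NACKed and surfaced to watchers as an error rather than silently ignored. Names such as `ParseClientListener` and `XdsListenerResource` are placeholders, not the actual grpc-core symbols.

```cpp
#include "absl/status/status.h"
#include "absl/status/statusor.h"
#include "envoy/config/listener/v3/listener.pb.h"

// Placeholder for the parsed client-side Listener resource.
struct XdsListenerResource {
  // ... parsed ApiListener / HttpConnectionManager config would go here ...
};

// Illustrative validation: reject a Listener with no ApiListener at parse
// time, so the update is NACKed and reported via OnResourceChanged() instead
// of being dropped outside the XdsClient.
absl::StatusOr<XdsListenerResource> ParseClientListener(
    const envoy::config::listener::v3::Listener& listener) {
  if (!listener.has_api_listener()) {
    return absl::InvalidArgumentError(
        "client-side Listener does not contain an ApiListener");
  }
  // ... validate and convert the ApiListener config ...
  return XdsListenerResource{};
}
```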

Closes #38269

COPYBARA_INTEGRATE_REVIEW=#38269 from markdroth:xds_watcher_api_change 1424fc1
PiperOrigin-RevId: 711506013
copybara-service bot pushed a commit to grpc/grpc that referenced this pull request Jan 8, 2025
…_data_errors" (#38278)

Another piece of gRFC A88 (grpc/proposal#466).

Note that even with the feature disabled, this changes the way that ignored resource deletions are reflected in CSDS and XdsClient metrics.

As a side effect, also fixes one of the edge cases described in #38094.

Closes #38278

COPYBARA_INTEGRATE_REVIEW=#38278 from markdroth:xds_client_fail_on_data_errors 7d31238
PiperOrigin-RevId: 713348335
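
For context, the branch name above (`xds_client_fail_on_data_errors`) suggests the new behavior is gated by a `fail_on_data_errors` entry in the bootstrap server config. The following is a hedged sketch of such a bootstrap, embedded as a C++ string constant; the feature name is an assumption inferred from the branch name, so check gRFC A88 for the authoritative option name.

```cpp
// Example xDS bootstrap enabling the new data-error behavior for one server.
// NOTE: "fail_on_data_errors" is inferred from the branch name above and is
// an assumption, not a confirmed option name.
constexpr char kExampleBootstrap[] = R"json({
  "xds_servers": [
    {
      "server_uri": "xds-server.example.com:443",
      "channel_creds": [{"type": "google_default"}],
      "server_features": ["fail_on_data_errors"]
    }
  ],
  "node": {"id": "example-node-id"}
})json";
```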