Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework Sotw Cache interface #7

Closed

Conversation

valerian-roche
Copy link
Collaborator

Datadog fork PR matching envoyproxy#584

It has been raised that the addition of StreamState within the Cache interface is breaking the separation of concern between the server side managing the xds protocol itself and the cache which is only in charge of returning responses based on watches
PR envoyproxy#577 has been opened proposing to fully remove any client state from the CreateWatch interface

This PR is an alternate proposal based on a few concerns/aspects:

  • one of the argument on removing it is that the sotw protocol is stateless. It actually is not, and is explicitly visible in this PR: the lastResponse latched value as well as the version hack when the resource list is changed very well show that it is stateful
  • in cases outside of lds and cds, it is clearly stated that the control-plane is not expected to send back all resources when a single one change. This is highly critical in the case of ads with eds, as there is a likelihood of both having a lot of resources and a high churn. The implementation sending everything each time is technically tolerated, but is pushing on the data-plane work that could be done at low cost on the control-plane
  • the current support of both delta and sotw, for both linear and simple cache means that every change has to be done 4 times, with four very different implementations, each one supporting only a subset of use cases (e.g. linear does not support cds/lds as it doesn't handle wildcard in this case, and sotw does not work properly as resource list is improperly maintained). I strongly believed that both implementation should converge. The only remaining differences should be in the messages themselves, abstracted in the server part. Having a common interface for sotw and delta is a first step for this. It would also allow in the future to make snapshot far more efficient by supporting a linear version per snapshot

In this context, this PR is changing:

  • creates a ClientState within the cache package, driven by the data needed from the cache perspective to be able to answer a client request
  • simplifies streamState
    • makes it compatible with the new ClientState interface
    • removes the IsFirst notion, as the protocol states it should come from the nonce in the request
    • removes the duplicate/parallel knownResourceNames and use the delta one instead. It will also properly carry the version in a future PR

valerian-roche and others added 7 commits January 4, 2024 20:34
…interface is breaking the separation of concern between the server side managing the xds protocol itself and the cache which is only in charge of returning responses based on watches

PR envoyproxy#577 has been opened proposing to fully remove any client state from the CreateWatch interface

This commit is updating the Cache interface based on a few concerns/aspects:
- one of the argument on removing it is that the sotw protocol is stateless. It actually is not, and is explicitly visible in this PR: the lastResponse latched value as well as the version hack when the resource list is changed very well show that it is stateful
- in cases outside of lds and cds, it is clearly stated that the control-plane is not expected to send back all resources when a single one change. This is highly critical in the case of ads with eds, as there is a likelihood of both having a lot of resources and a high churn. The implementation sending everything each time is technically tolerated, but is pushing on the data-plane work that could be done at low cost on the control-plane
- the current support of both delta and sotw, for both linear and simple cache means that every change has to be done 4 times, with four very different implementations, each one supporting only a subset of use cases (e.g. linear does not support cds/lds as it doesn't handle wildcard in this case, and sotw does not work properly as resource list is improperly maintained). I strongly believed that both implementation should converge. The only remaining differences should be in the messages themselves, abstracted in the server part. Having a common interface for sotw and delta is a first step for this. It would also allow in the future to make snapshot far more efficient by supporting a linear version per snapshot

In this context, this commit is changing:
- creates a SubscriptionState within the cache package, driven by the data needed from the cache perspective to be able to answer a client request
- simplifies streamState
  - makes it compatible with the new SubscriptionState interface
  - removes the IsFirst notion, as the protocol states it should come from the nonce in the request
  - removes the duplicate/parallel knownResourceNames and use the delta one instead. It will also properly carry the version in a future PR

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
Rename KnownResources to ACKedResources to better reflect the change

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
…ches subscribing to multiple resources

Properly return the request in sotw responses to allow proper handling in callbacks

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
…near cache

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
[sotw][linear] Fix missing watch cleanup in linear cache for sotw watches subscribing to multiple resources
Co-authored-by: Antoine Tollenaere <atollena@gmail.com>
Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
[sotw][issue-540] Return full state when applicable for watches in linear cache
// The versions are:
// - delta protocol: version of the specific resource set in the response
// - sotw protocol: version of the global response when the resource was last ACKed
ReturnedResources() map[string]string

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not AckedResources?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now set the resources when sending the reply. Given the nonce behavior and the lack of support for NACK (handling it without further work would just create an infinite loop), there is no real point in distinguishing them in sotw for now as we explicitly ignore staled requests

}
}

if sub.IsWildcard() && len(req.ResourceNames) == 0 && len(sub.SubscribedResources()) == 0 {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy from this convo: #6 (comment)

Does that properly handle the case of unsubscribing from explicit wildcard? Probably this is never used, but still I think this violates the protocol spec and should be implemented properly. For example two subsequent requests:

Req(resources: ["*"]) -> explicit wildcard subscription -> IsWildcard() = true
...
Req(resources: []) -> IsWilcard = true, req.ResourceName = 0, GetSubscribedResources() == 0,
This will fall into this if statement, even though it should not be considered wildcard and instead should be considered unsubscribing from wildcard.

There seem to be no way to implement the legacy wildcard behavior without storing an additional boolean in what you called SubscriptionState:

For historical reasons, if the client sends a request for a given resource type but has never explicitly subscribed to any resource names

You can't know that without storing whether you've ever seen a resource name on the stream, can you? Without storing this explicitly, there is a high chance that you end up mixing up legacy/implicit and explicit wildcards.

Member
Author
@valerian-roche valerian-roche 7 hours ago
I agree that we do not support the case when the subscription was explicit wildcard and now requests nothing (not wildcard nor resources). The comment below mentions that we do not support this explicit case currently (in sotw or in delta), but I also don't see much value here. I can add this if preferred though

Copy link

@atollena atollena Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it seems a matter of adhering to the protocol, so I think this should arguably be supported, even if data planes do not use this yet, it is not great to ignore this case.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll update to handle this case properly

@@ -101,8 +100,8 @@ type DeltaResponseWatch struct {
// Response is the channel to push the delta responses to
Response chan DeltaResponse

// VersionMap for the stream
StreamState stream.StreamState
// State for the type subscription

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Godoc format please, e.g.:

Suggested change
// State for the type subscription
// Subscription stores the current client subscriptions.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

// so we set the initial resource versions if we have any.
// We also set the stream as wildcard based on its legacy meaning (no resource name sent in resource_names_subscribe).
// If the state starts with this legacy mode, adding new resources will not unsubscribe from wildcard.
// We also set the subscription as wildcard based on its legacy meaning (no resource name sent in resource_names_subscribe).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the place where you know whether the stream should use legacy behavior or not. Implementing this properly doesn't seem like a big deal...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done now within the subscription

},
})
// Create a buffered channel the size of the known resource types.
respChan := make(chan cache.Response, types.UnknownType)

process := func(resp cache.Response) error {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just ditch the process variable here, if I read correctly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes it's now only send. Will cleanup

@@ -63,7 +52,7 @@ func (s *server) processADS(sw *streamWrapper, reqCh chan *discovery.DiscoveryRe
case <-s.ctx.Done():
return nil
// We only watch the multiplexed channel since all values will come through from process.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(update this comment if you take my suggestion to remove process)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up, as well as process

if w.nonce != "" && req.GetResponseNonce() != w.nonce {
// The request does not match the stream nonce, ignore it as per
// https://www.envoyproxy.io/docs/envoy/v1.28.0/api-docs/xds_protocol#resource-updates
// Ignore this request and wait for the next one

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind adding a link to envoyproxy/envoy#10363 which describes an (unlikely) race that may happen due to this behavior?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. To be honest the change mentioned in the PR would require some potentially major changes, especially if we want to keep the strict ordering required by ADS

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. I think just acknowledging the race is OK, I don't think it can realistically trigger in our environment.

}

w.nonce = out.Nonce
// ToDo(valerian-roche): properly return the resources actually sent to the client

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: others use TODO, not ToDo. fwiw I've never seen ToDo written by anyone else that yourself in my whole career, so it almost makes it redundant with your name :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to TODO

// Only process if we have an existing watch otherwise go ahead and create.
if err := processAllExcept(typeURL); err != nil {
return err
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I couldn't quite grasp that part. IIUC here you are making sure there are no outstanding responses for other resource type URL? Why is this needed? Reading the comment above processAllExept and here didn't really elucidate. You are also discarding responses of the type currently being processed. Maybe I just need to spend a bit more trying to understand this.

Also, why do we use a variable for processAllExcept, since it's only used here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is part of the ordered ADS implementation done in another PR. It handles all the previous one to ensure they are in order and the new one is not answered first. I don't remember the exact reason why the current type is just dropped though. It might be to avoid a potential deadlock in the old model, but not sure

@atollena
Copy link

I didn't have time to fully review yet, sorry. Will continue tomorrow.

@valerian-roche valerian-roche force-pushed the vr/interface-dd branch 2 times, most recently from a44daa2 to 53fab26 Compare January 18, 2024 04:07
Copy link
Collaborator Author

@valerian-roche valerian-roche left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have replied to most comments, feel free to do a more thorough review. I will have to split this PR given the scope of changes now, especially with the sub naming alignment touching a lot of pieces

if w.nonce != "" && req.GetResponseNonce() != w.nonce {
// The request does not match the stream nonce, ignore it as per
// https://www.envoyproxy.io/docs/envoy/v1.28.0/api-docs/xds_protocol#resource-updates
// Ignore this request and wait for the next one
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. To be honest the change mentioned in the PR would require some potentially major changes, especially if we want to keep the strict ordering required by ADS

@@ -101,8 +100,8 @@ type DeltaResponseWatch struct {
// Response is the channel to push the delta responses to
Response chan DeltaResponse

// VersionMap for the stream
StreamState stream.StreamState
// State for the type subscription
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

// so we set the initial resource versions if we have any.
// We also set the stream as wildcard based on its legacy meaning (no resource name sent in resource_names_subscribe).
// If the state starts with this legacy mode, adding new resources will not unsubscribe from wildcard.
// We also set the subscription as wildcard based on its legacy meaning (no resource name sent in resource_names_subscribe).
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done now within the subscription

@@ -63,7 +52,7 @@ func (s *server) processADS(sw *streamWrapper, reqCh chan *discovery.DiscoveryRe
case <-s.ctx.Done():
return nil
// We only watch the multiplexed channel since all values will come through from process.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cleaned up, as well as process

}

w.nonce = out.Nonce
// ToDo(valerian-roche): properly return the resources actually sent to the client
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to TODO

…vior

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
@valerian-roche valerian-roche marked this pull request as draft January 25, 2024 15:08
@valerian-roche valerian-roche force-pushed the dd/sotw-fixes branch 2 times, most recently from 835ef87 to 8d3db09 Compare February 5, 2024 17:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants