Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rework Cache interface #584

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

valerian-roche
Copy link
Contributor

It has been raised that the addition of StreamState within the Cache interface is breaking the separation of concern between the server side managing the xds protocol itself and the cache which is only in charge of returning responses based on watches
PR #577 has been opened proposing to fully remove any client state from the CreateWatch interface

This PR is an alternate proposal based on a few concerns/aspects:

  • one of the argument on removing it is that the sotw protocol is stateless. It actually is not, and is explicitly visible in this PR: the lastResponse latched value as well as the version hack when the resource list is changed very well show that it is stateful
  • in cases outside of lds and cds, it is clearly stated that the control-plane is not expected to send back all resources when a single one change. This is highly critical in the case of ads with eds, as there is a likelihood of both having a lot of resources and a high churn. The implementation sending everything each time is technically tolerated, but is pushing on the data-plane work that could be done at low cost on the control-plane
  • the current support of both delta and sotw, for both linear and simple cache means that every change has to be done 4 times, with four very different implementations, each one supporting only a subset of use cases (e.g. linear does not support cds/lds as it doesn't handle wildcard in this case, and sotw does not work properly as resource list is improperly maintained). I strongly believed that both implementation should converge. The only remaining differences should be in the messages themselves, abstracted in the server part. Having a common interface for sotw and delta is a first step for this. It would also allow in the future to make snapshot far more efficient by supporting a linear version per snapshot

In this context, this PR is changing:

  • creates a ClientState within the cache package, driven by the data needed from the cache perspective to be able to answer a client request
  • simplifies streamState
    • makes it compatible with the new ClientState interface
    • removes the IsFirst notion, as the protocol states it should come from the nonce in the request
    • removes the duplicate/parallel knownResourceNames and use the delta one instead. It will also properly carry the version in a future PR

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale label Sep 24, 2022
@alecholmez alecholmez self-requested a review October 28, 2022 14:36
@alecholmez alecholmez self-assigned this Oct 28, 2022
@alecholmez
Copy link
Contributor

Hey @valerian-roche can you resolve those merge conflicts so I can start taking a look at this?

@valerian-roche
Copy link
Contributor Author

Hey @valerian-roche can you resolve those merge conflicts so I can start taking a look at this?

Hey @alecholmez
Sorry for the delay, rebased on upstream. As the change is only in a log line I rebased the commits rather than merged on top

@valerian-roche
Copy link
Contributor Author

This PR will need to be rebased on main once #657 is merged

@valerian-roche
Copy link
Contributor Author

I rebased this PR on the latest changes (especially the rollback of #615 which was blocking it)

Copy link
Contributor

@jpeach jpeach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a couple of drive-by notes.

I'd also suggest adding more commentary to the commit message to describe the overall change, why the change is needed and the effects of the change on the system (particularly for the rework commit). The reader kinda has to try to stitch that together there from the PR description, and a bit of hand-holing would be helpful :)

GetSubscribedResources() map[string]struct{}

// IsWildcard returns whether the client has a wildcard watch.
// This considers subtilities related to the current migration of wildcard definition within the protocol.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// This considers subtilities related to the current migration of wildcard definition within the protocol.
// This considers subtleties related to the current migration of wildcard definitions within the protocol.

It would also be useful to direct the reader to a description of these subtleties so they can understand what issues they might encounter.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add links for those

@@ -50,7 +70,7 @@ type ConfigWatcher interface {
//
// Cancel is an optional function to release resources in the producer. If
// provided, the consumer may call this function multiple times.
CreateWatch(*Request, stream.StreamState, chan Response) (cancel func())
CreateWatch(*Request, ClientState, chan Response) (cancel func())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are breaking compatibility here, we should return (cancel func(), err error) (here, and in other API that creates a watch) so that failures to create a watch can be reported correctly. See discussion in #689.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will do

Copy link
Contributor

@alecholmez alecholmez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some minor thoughts. IMO we should really linger on what we want for this interface here. I like the idea, this is going to be a big part of the state machine between the client/server.

// Though the methods may return mutable parts of the state for performance reasons,
// the cache is expected to consider this state as immutable and thread safe between a watch creation and its cancellation
type ClientState interface {
// GetKnownResources returns the list of resources the clients has ACKed and their associated version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// GetKnownResources returns the list of resources the clients has ACKed and their associated version.
// GetKnownResources returns a list of resources that clients have ACK'd and their associated version.

// The versions are:
// - delta protocol: version of the specific resource set in the response
// - sotw protocol: version of the global response when the resource was last ACKed
GetKnownResources() map[string]string
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm beginning to think that using map[string]string everywhere is asking for trouble. How do you feel about defining a data structure that holds some of this info? We could guard it and make it thread safe too. If we're controlling this interface it might be nice to provide some guard rails for consuming users

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do agree. This PR attempted to rework the interface with the smallest possible impact elsewhere, but as we are breaking backward compatibility we might want to go all-in and make it better to fit the model we're attempting to achieve

// This allows proper implementation of stateful aspects of the protocol (e.g. returning only some updated resources)
// Though the methods may return mutable parts of the state for performance reasons,
// the cache is expected to consider this state as immutable and thread safe between a watch creation and its cancellation
type ClientState interface {
Copy link
Contributor

@alecholmez alecholmez Jun 5, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if we just called this State? Or maybe something more abstract? To me it this has always been a data structure that bridges the clients and servers and their state management. My original thought behind StreamState was that it was ambiguous enough to be the middle ground between the client/server but contextual enough to know it's on an individual gRPC stream.

If we want to call it ClientState I'm cool with that, just a thought though.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a strong opinion on the naming here. This state represents the state of a given stream for a given object type, so StreamState (though it might be overloaded) might be another possibility

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I renamed it to SubscriptionState, but I'm absolutely open to rename it

// GetSubscribedResources returns the list of resources currently subscribed to by the client for the type.
// For delta it keeps track of subscription updates across requests
// For sotw it is a normalized view of the last request resources
GetSubscribedResources() map[string]struct{}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I think I remember why this is needed as a map. Wondering if we that's actually necessary in the cache... I'll take a look

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean here. A main aspect this PR fixes is the previous reuse of KnownResources as SubscribedResources. This creates multiple edge cases and even more issues in the context of the new wildcard model.
I am even thinking that we may want to go to a more advanced model distinguishing NamedResources and WildcardResources in the context of the new wildcard model where odcds can subscribe to wildcard and explicit resources. We currently don't allow Cache writers to take decisions based on this, either in the server API (apart from intercepting requests) or the cache

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The split between IsWildcard and GetSubscribedResources should now allow to remove the old assumption of "no resources" == "wildcard" which is no longer true. Also a request with IsWildcard set to true and explicit resources in GetSubscribedResources carries the meaning of "client requested wildcard and explicit resources", and a cache implementation can act on this

@valerian-roche valerian-roche force-pushed the vr/interface branch 2 times, most recently from adc9151 to 57978c5 Compare June 7, 2023 02:27
@valerian-roche
Copy link
Contributor Author

Rebased the PR on the ads sotw changes. Will address further comments and update the PR

@valerian-roche valerian-roche force-pushed the vr/interface branch 5 times, most recently from 66dce37 to db72c18 Compare June 10, 2023 00:43
…interface is breaking the separation of concern between the server side managing the xds protocol itself and the cache which is only in charge of returning responses based on watches

PR envoyproxy#577 has been opened proposing to fully remove any client state from the CreateWatch interface

This commit is updating the Cache interface based on a few concerns/aspects:
- one of the argument on removing it is that the sotw protocol is stateless. It actually is not, and is explicitly visible in this PR: the lastResponse latched value as well as the version hack when the resource list is changed very well show that it is stateful
- in cases outside of lds and cds, it is clearly stated that the control-plane is not expected to send back all resources when a single one change. This is highly critical in the case of ads with eds, as there is a likelihood of both having a lot of resources and a high churn. The implementation sending everything each time is technically tolerated, but is pushing on the data-plane work that could be done at low cost on the control-plane
- the current support of both delta and sotw, for both linear and simple cache means that every change has to be done 4 times, with four very different implementations, each one supporting only a subset of use cases (e.g. linear does not support cds/lds as it doesn't handle wildcard in this case, and sotw does not work properly as resource list is improperly maintained). I strongly believed that both implementation should converge. The only remaining differences should be in the messages themselves, abstracted in the server part. Having a common interface for sotw and delta is a first step for this. It would also allow in the future to make snapshot far more efficient by supporting a linear version per snapshot

In this context, this commit is changing:
- creates a SubscriptionState within the cache package, driven by the data needed from the cache perspective to be able to answer a client request
- simplifies streamState
  - makes it compatible with the new SubscriptionState interface
  - removes the IsFirst notion, as the protocol states it should come from the nonce in the request
  - removes the duplicate/parallel knownResourceNames and use the delta one instead. It will also properly carry the version in a future PR

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
Rename KnownResources to ACKedResources to better reflect the change

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
@valerian-roche
Copy link
Contributor Author

I fully rebased the PR and replied to comments. I'd like this or #718 to move forward to be able to push major fixes on sotw and sotw-ads which are critical for xds-grpc

…vior

Signed-off-by: Valerian Roche <valerian.roche@datadoghq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants