
CM: Add on-demand cluster discovery functionality #15857

Closed
wants to merge 45 commits into from

Conversation


@krnowak krnowak commented Apr 6, 2021

Commit Message:
The OdCdsApi interface is introduced as the API used for on-demand discovery. Its implementation is provided using the CdsApiHelper class, so OdCdsApiImpl handles a discovered cluster the same way CdsApiImpl does. On the ClusterManagerImpl side, a discovery manager (ClusterDiscoveryManager) is added to help deduplicate requests for the same cluster coming from within the same worker thread. Requests coming from different worker threads are further deduplicated in ClusterManagerImpl in the main thread. Each unique request for a cluster also receives a timeout to catch the case where a discovery fails, allowing the worker threads to handle the failure.

This also adds an explicit wildcard mode to the xDS protocol, to allow handling both wildcard and on-demand requests for the same resource type on the same stream.

Additional Description:

Risk Level:
High. A new feature not wired up anywhere yet. Changes in wildcard mode.

Testing:
As a part of #15523.

Docs Changes:
API docs, updated the xds-protocol docs about explicit wildcard mode.

Release Notes:

  • xds: added the explicit wildcard mode, which allows subscribing to specific resource names on top of a wildcard subscription. This means that the resource name * is reserved. This may mean that in some cases the initial wildcard subscription request on the stream will not be empty, but will have a non-empty list of resources with a special name among them. See :ref:`wildcard mode description <xds_protocol_resource_hints>` and :ref:`unsubscribing from resources <xds_protocol_unsubscribing_from_resources>` for details.

On-demand cluster discovery would be release-notes-worthy, but since it's not wired anywhere yet, it can't be used.

Platform Specific Features:
N/A

[Optional Runtime guard:]
[Optional Fixes #Issue]
[Optional Deprecated:]
[Optional API Considerations:]

Each thread using this functionality will need to instantiate its
own OdCdsApi instance and pass it, together with a callback, to
ClusterManager's requestOnDemandClusterDiscovery. The cluster manager
uses the passed OdCdsApi to perform the discovery. When the discovery
is finished, ClusterManager invokes the callback in the context of the
requesting thread.

OdCdsApiImpl starts the subscription on the first update request.
Subsequent requests use the on-demand update functionality of the
subscription. That way, subscriptions are only started when actually
needed.
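
A minimal sketch of what a worker-thread call site might look like under this description. The callback type, its ClusterDiscoveryStatus argument, and the helper name are assumptions based on the names used elsewhere in this PR, not the exact signatures:

// Hypothetical call site in a worker thread (names are illustrative).
void requestClusterOnDemand(Upstream::ClusterManager& cm, Upstream::OdCdsApiSharedPtr odcds,
                            const std::string& cluster_name) {
  // The callback is kept alive by the caller and invoked in this thread's
  // context once the cluster becomes available or the discovery expires.
  auto callback = std::make_shared<Upstream::ClusterDiscoveryCallback>(
      [](Upstream::ClusterDiscoveryStatus status) {
        // Resume or fail the paused work depending on `status`.
      });
  // Dropping the returned handle withdraws this caller's interest without
  // affecting other callers waiting for the same cluster.
  auto handle = cm.requestOnDemandClusterDiscovery(odcds, cluster_name, callback);
  // Keep `handle` (and `callback`) alive for as long as the result matters.
}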

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
This is to address an issue raised during a review of the CdsApiHelper
refactor.

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
@mattklein123 mattklein123 (Member) left a comment

Starting with some interface comments/questions. Thank you!

/wait

* @return ClusterDiscoveryCallbackHandlePtr the discovery process handle.
*/
virtual ClusterDiscoveryCallbackHandlePtr
requestOnDemandClusterDiscovery(OdCdsApiSharedPtr odcds, const std::string& name,
Member

Starting from a very high level: how does one allocate an OdCdsAPI, and why does one allocate it? It seems like this should all be handled by the cluster manager? Is the issue config? Should there be a CM API to allocate an on demand handle or something like that?

Contributor Author

Starting from a very high level: how does one allocate an OdCdsAPI, and why does one allocate it?

An extension could have a config source in its configuration, so it could be used to create an OdCdsApi. Extension config creation happens in the main thread, so this looks like a good moment to create an instance. The config source could be different from cds_config in bootstrap, which may be the reason why a separate OdCdsApi needs to be allocated.

It seems like this should all be handled by the cluster manager? Is the issue config? Should there be a CM API to allocate an on demand handle or something like that?

Could be I suppose. I can see adding something like:

class OdCdsApiHandle {
public:
  virtual ~OdCdsApiHandle() = default;
};

using OdCdsApiHandleSharedPtr = std::shared_ptr<OdCdsApiHandle>;

…

virtual OdCdsApiHandleSharedPtr ClusterManager::allocateOdCdsApi(…);
virtual ClusterDiscoveryCallbackHandlePtr
  requestOnDemandClusterDiscovery(OdCdsApiHandleSharedPtr odcds, …);

So I'm not sure. It's trading an interface with one function for another interface with no functions, plus one more function in the CM.

Member

Yeah ^ is what I would do. It's cleaner because now everything is contained within the CM and the handle is effectively opaque and can be de-duped based on config.

/wait

Contributor Author

I'm halfway through the refactor, so I have a couple of notes:

  • I have kept the OdCdsApiImpl class for now, it just does not subclass the already gone OdCdsApi interface. I think I'll rename the class to OdCdsApi later, since it's not an impl of anything anymore.
  • Having an opaque OdCdsApiHandle class basically forced me to write a function like the following inside cluster_manager_impl.cc:
OdCdsApiImpl& getOdCdsApiFromHandle(OdCdsApiHandleSharedPtr const& handle) {
  auto impl = std::dynamic_pointer_cast<OdCdsApiHandleImpl>(handle);
  ASSERT(impl != nullptr);
  return impl->get();
}

which feels kind of ew, but I have seen std::dynamic_pointer_cast being used in Envoy, so I guess this is ok.

This also means that it probably became difficult to keep the unit tests I wrote (the ODCDTest class in cluster_manager_impl_test.cc), because I can't mock OdCds. Passing a mock of the handle will trigger an assertion. I'll comment them out for now.

Member

You shouldn't need any casts. Just access the functionality from the handle (i.e. move the dynamic fetch functionality to the handle).

Contributor Author

Oh. My understanding was that you wanted the handle class to be completely opaque, so no other virtual functions other than the destructor. Adding a dynamic fetch functionality to the handle means renaming the old OdCdsApi interface to OdCdsApiHandle.

Member

Basically what I'm after is for the entire functionality to be exposed by the CM in isolation. So I think API handle -> fetch -> fetch handle makes sense. The API handle should be obtained from the CM itself based on config (which should be de-duped).

Contributor Author

The only wrinkle I can see here is that you can receive an API handle from the CM and then, instead of doing cm.requestOnDemandClusterDiscovery(api_handle, …), you can also do api_handle.updateOnDemand(clusterName). So maybe expose the discovery functionality only through the handle and drop the requestOnDemandClusterDiscovery function from the CM interface? Or, more precisely: move requestOnDemandClusterDiscovery from the CM to the API handle.

Member

Yes exactly, the only CM API is to get an API handle, then the request is done through the API handle. This makes sure the entire abstraction is hidden inside the CM, and then it's just an implementation detail of code sharing between CDS and ODCDS.
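
A rough sketch of the shape being agreed on here; the pointer types match the diff snippets quoted below and changed again later in the review, so treat this as an illustration rather than the final API:

// Opaque handle obtained from the cluster manager; all on-demand discovery
// goes through it, so CDS/ODCDS code sharing stays a CM implementation detail.
class OdCdsApiHandle {
public:
  virtual ~OdCdsApiHandle() = default;

  // Files a discovery request for `name`; the callback fires in the calling
  // thread's context when the cluster is added or the request expires.
  virtual ClusterDiscoveryCallbackHandlePtr
  requestOnDemandClusterDiscovery(const std::string& name,
                                  ClusterDiscoveryCallbackWeakPtr callback) PURE;
};

// The only ODCDS-related addition to the ClusterManager interface is then a
// factory method along the lines of:
//   virtual OdCdsApiHandleSharedPtr allocateOdCdsApi(/* ODCDS config source, ... */) PURE;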

/**
* File an on-demand request for a cluster.
*/
virtual void updateOnDemand(const std::string& cluster_name) PURE;
Member

Porting my comment over from the other PR: I'm not sure why this is needed at this interface level. From my perspective everything should be encapsulated by the cluster manager. (See above comment)

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>

krnowak commented Apr 8, 2021

I refactored the code, so now we have the handle interface that exposes the discovery functionality, and CM received a function to create those handles. Currently the implementation of the function is simple - it creates a new handle every time it's called. I have kept the OdCdsApi interface so I could mock it for the cluster manager unit tests, but the interface is an implementation detail now.

The code is still building on my machine, so there may be some lingering build issues to fix and I could end up pushing a few more commits. But I think it's more or less ready for another review round.

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
@mattklein123 mattklein123 (Member) left a comment

Thanks, from an API perspective this looks right. Some high-level comments around readability:

  1. Please move the classes/function definitions into the appropriate visibility section.
  2. The code needs significantly more comments given how complicated it is (across both class level and definition level, as well as within each function). It will make reviewing easier also. So please work on that next and I can take another pass.

Thank you!

/wait

*/
virtual ClusterDiscoveryCallbackHandlePtr
requestOnDemandClusterDiscovery(const std::string& name,
ClusterDiscoveryCallbackWeakPtr callback) PURE;
Member

I think I commented on this previously, but in general it doesn't make sense to pass a weak ptr here. The callback should exist at the time of the call. I would let the callee store it as a weak ptr.

Contributor Author

Ok, I thought earlier it was only about the odcds weak pointer. Will change it.
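
A self-contained sketch of the ownership pattern being requested here; the type aliases and class name are assumptions for illustration, not Envoy's actual definitions:

#include <functional>
#include <memory>
#include <string>
#include <unordered_map>
#include <vector>

enum class ClusterDiscoveryStatus { Available, Missing };

using ClusterDiscoveryCallback = std::function<void(ClusterDiscoveryStatus)>;
using ClusterDiscoveryCallbackSharedPtr = std::shared_ptr<ClusterDiscoveryCallback>;
using ClusterDiscoveryCallbackWeakPtr = std::weak_ptr<ClusterDiscoveryCallback>;

class DiscoveryManagerSketch {
public:
  // The caller passes a shared_ptr, so the callback is guaranteed to exist at
  // the time of the call...
  void addCallback(const std::string& name, ClusterDiscoveryCallbackSharedPtr callback) {
    // ...and the callee stores only a weak_ptr, so a caller that went away
    // before the discovery finished is simply skipped on completion.
    callbacks_[name].push_back(callback);
  }

  void notify(const std::string& name, ClusterDiscoveryStatus status) {
    for (auto& weak_cb : callbacks_[name]) {
      if (auto cb = weak_cb.lock()) {
        (*cb)(status);
      }
    }
    callbacks_.erase(name);
  }

private:
  std::unordered_map<std::string, std::vector<ClusterDiscoveryCallbackWeakPtr>> callbacks_;
};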

*/
virtual OdCdsApiHandleSharedPtr
allocateOdCdsApi(const envoy::config::core::v3::ConfigSource& odcds_config,
const xds::core::v3::ResourceLocator* odcds_resources_locator,
Member

nit: I would use OptRef here, and clarify above that this is used, if provided, in preference to odcds_config.

Contributor Author

It is a pointer, because I copied that from CdsApiImpl. Should CdsApiImpl be fixed too?

Member

If it's easy to fix, yes, otherwise I wouldn't worry about it for now.
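
For illustration, the pointer-vs-OptRef difference looks roughly like this, assuming Envoy's OptRef helper; the surrounding signature is the one quoted above:

// Raw-pointer form (what CdsApiImpl currently uses; nullptr means "absent"):
//   const xds::core::v3::ResourceLocator* odcds_resources_locator
//
// OptRef form, making the optionality explicit in the type:
virtual OdCdsApiHandleSharedPtr
allocateOdCdsApi(const envoy::config::core::v3::ConfigSource& odcds_config,
                 OptRef<const xds::core::v3::ResourceLocator> odcds_resources_locator,
                 ProtobufMessage::ValidationVisitor& validation_visitor) PURE;

// Call sites then read:
//   if (odcds_resources_locator.has_value()) {
//     useLocator(odcds_resources_locator.ref());  // preferred over odcds_config
//   }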

source/common/upstream/cluster_manager_impl.h (outdated comment, resolved)
private:
ClusterDiscoveryManager& parent_;
ClusterDiscoveryCallbackWeakPtr cb_;
std::string name_;
Member

const

@@ -285,6 +286,42 @@ class ClusterManagerImpl : public ClusterManager, Logger::Loggable<Logger::Id::u
ClusterUpdateCallbacksHandlePtr
addThreadLocalClusterUpdateCallbacks(ClusterUpdateCallbacks&) override;

class OdCdsApiHandleImpl : public std::enable_shared_from_this<OdCdsApiHandleImpl>,
Member

Class definitions go at the top of the visibility section. I think this implementation detail can also be private.

Contributor Author

Can this be at least protected, so unit test could use it?

Member

You can make the test a friend if you need to test private details, though optimally the tests would just go through the main CM.

@krnowak krnowak (Contributor Author) Apr 9, 2021

I wanted to make it protected so that the test can construct its own handle instance containing a mock odcds instead of the real one. That way, when calling handle->requestOnDemandClusterDiscovery(…), the real CM code is tested, but it calls the mock's odcds.updateOnDemand. I'll push my updates soon, so you will see what I mean.


private:
ClusterManagerImpl& parent_;
OdCdsApiPtr odcds_;
Member

const

Comment on lines 321 to 323
ClusterDiscoveryCallbackHandlePtr
requestOnDemandClusterDiscovery(OdCdsApiHandleImplSharedPtr odcds_handle, const std::string& name,
ClusterDiscoveryCallbackWeakPtr callback);
Member

This can be private.

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>

krnowak commented Apr 9, 2021

Updated. Other than addressing the review issues, I changed the use of deque to list (I can't remember what the reason for using deque was). That made the code a bit simpler and allowed me to move the discovery handle impl class into the discovery manager class.

Other than that - I've got a clang-tidy failure. It complains about some leak in libstdc++ code (search for M_pi): https://dev.azure.com/cncf/4684fb3d-0389-4e0b-8251-221942316e06/_apis/build/builds/71558/logs/144

Abort, Retry, Ignore? :)

@mattklein123 mattklein123 (Member) left a comment

Thanks, cool stuff. Flushing out some comments from this round.

/wait

* @param validation_visitor
* @return OdCdsApiHandleSharedPtr the ODCDS handle.
*/
virtual OdCdsApiHandleSharedPtr
Member

I think I see why you used shared_ptr here (to deal with lifetime issues), but in a perfect world, from an API perspective, this should be a unique_ptr return and any needed shared/weak ptr gymnastics should be done under the hood. From a user perspective I would expect to allocate a handle in my filter's (etc.) main-thread init, use that across different threads, and then destroy it when the filter is destroyed.

Contributor Author

Yeah, this makes sense. Thanks for the insight. I have made it a unique_ptr that holds a shared_ptr to OdCdsApi.
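
A sketch of the arrangement described in that reply. Class and member names are taken from the surrounding diff snippets, but the exact constructor and forwarding are assumptions:

// Returned to the caller as a plain unique_ptr (OdCdsApiHandlePtr), so the
// handle has clear, exclusive ownership at the call site...
class OdCdsApiHandleImpl : public OdCdsApiHandle {
public:
  OdCdsApiHandleImpl(ClusterManagerImpl& parent, OdCdsApiSharedPtr odcds)
      : parent_(parent), odcds_(std::move(odcds)) {}

  ClusterDiscoveryCallbackHandlePtr
  requestOnDemandClusterDiscovery(const std::string& name,
                                  ClusterDiscoveryCallbackSharedPtr callback) override {
    // ...while the shared OdCdsApi it holds can be captured in work posted to
    // the main thread, keeping the lifetime gymnastics under the hood.
    return parent_.requestOnDemandClusterDiscovery(odcds_, name, std::move(callback));
  }

private:
  ClusterManagerImpl& parent_;
  const OdCdsApiSharedPtr odcds_;
};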

Comment on lines 1188 to 1190
// TODO(krnowak): This timeout period is arbitrary. Should it be a parameter or do we keep it,
// but rather with a name?
timer->enableTimer(std::chrono::milliseconds(5000));
Member

Yes, the timeout should be passed as an API parameter. Also I think ClusterDiscoveryStatus should have a Timeout status to differentiate from Missing.

Contributor Author

Ok, will add the timeout parameter.

About the Missing and Timeout statuses - the thing is that the Missing status is only reported when the discovery times out, so I guess we could have Available and Timeout statuses. I'm not sure if there's a way for the gRPC subscription to tell me that the requested resource does not exist. At least I haven't seen it done in VHDS.
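
A sketch of the direction discussed here: a per-request timeout parameter plus a status that distinguishes a timed-out discovery. Naming is illustrative; the PR may settle on different spellings:

enum class ClusterDiscoveryStatus {
  // The cluster was discovered (or already existed) and is usable.
  Available,
  // The per-request timer fired before the discovery completed; with the
  // current protocol usage this is the only way a "missing" cluster surfaces.
  Timeout,
};

virtual ClusterDiscoveryCallbackHandlePtr
requestOnDemandClusterDiscovery(const std::string& name,
                                ClusterDiscoveryCallbackSharedPtr callback,
                                std::chrono::milliseconds timeout) PURE;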

Comment on lines 1168 to 1169
// If this gets called here, it means that we requested a discovery of a cluster without
// checking if that cluster is already known by cluster manager.
Member

Is this actually true? Couldn't this be a race between threads?

@krnowak krnowak (Contributor Author) Apr 12, 2021

Hm, it could be a rather long-winded race, true. Worker thread W1 checks for a cluster that is absent, gets preempted in favor of worker thread W2, which does the same and starts the discovery process. Then the main thread sends the gRPC request, receives the reply, warms up the cluster and adds it to the thread-local CMs. Then W1 gets a chance to run again, and since it decided that the cluster is missing, it starts the discovery and hits this case in the main thread. Long-winded to the point of starvation, so I'd say this is a highly unlikely possibility. But who knows what may happen under really high load. I'll update the comment.

return;
}

if (auto it = pending_cluster_creations_.find(name); it != pending_cluster_creations_.end()) {
Member

Please add debug/trace level logging throughout this code/PR. It's going to be very difficult to debug this and understand what is going on. Please also check coverage to make sure all various branches are hit in the new code.

Contributor Author

Will do.

Contributor Author

Added some, but I'm rather unsure about adding logging to the discovery manager - it's a thread local thing, so it might spam the log N times with the same message (especially when the discovery is finished and every worker thread is notified about the fact).

Member

I think it's OK to do that at trace level, and probably still worth it.

Contributor Author

Ok, will add them then.
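
For example, a per-worker trace line might look like this (ENVOY_LOG is Envoy's standard logging macro; the message text is illustrative):

// Emitted once per worker when the discovery for `name` completes; trace
// level keeps the N near-identical lines out of normal debug output.
ENVOY_LOG(trace, "odcds: running discovery callbacks for cluster {}", name);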

// invoke the cluster lifecycle callbacks, that will in turn invoke our callback.
odcds.updateOnDemand(name);
// Setup the discovery timeout timer to avoid keeping callbacks indefinitely.
auto timer_cb = Event::TimerCb([this, name] { notifyExpiredDiscovery(name); });
Member

nit: just inline this next line

Comment on lines 1204 to 1206
// Defer destroying the timer, so it's not destroyed during its callback. TimerPtr is a
// unique_ptr, which is not copyable, but std::function is copyable, we turn a move-only
// unique_ptr into a copyable shared_ptr and pass that to the std::function.
Member

None of this should be necessary. Timers are one shot and it should be safe to destroy them during callback invocation, so I think it's fine to just destroy the timer as this pending discovery expires. We shouldn't be passing timers between threads in any situation.

Contributor Author

This code is expected to be executed in the main thread, so the deferred destruction is also meant to happen in the main thread. But anyway, I'm happy to drop this - it will get rid of the clang-tidy complaint here too.
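
A sketch of the simpler arrangement being suggested, assuming the pending-discovery entry owns its Event::TimerPtr (the member and helper names are illustrative):

// Envoy timers are one-shot and may be destroyed from inside their own
// callback, so the timeout handler can directly erase the pending entry
// that owns the timer instead of deferring destruction.
auto timer = main_dispatcher.createTimer([this, name] {
  notifyExpiredDiscovery(name);
  // Destroys the TimerPtr stored in the entry; safe even though we are
  // currently executing inside that timer's callback.
  pending_cluster_creations_.erase(name);
});
timer->enableTimer(timeout);
pending_cluster_creations_[name].timer_ = std::move(timer);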

Comment on lines 1212 to 1214
if (!cluster_manager.has_value()) {
return;
}
Member

I don't think this is possible?

Contributor Author

It's an opt-ref, so I thought the check would be prudent. I'll remove it.

source/common/upstream/cluster_manager_impl.cc (outdated comment, resolved)
Comment on lines 1057 to 1059
auto cb = ClusterAddedCb([this](ThreadLocalCluster& cluster) {
processClusterName(cluster.info()->name(), ClusterDiscoveryStatus::Available);
});
Member

nit: just inline next line

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>

krnowak commented Apr 12, 2021

Updated. I'll have a look at the coverage when CI finishes it.


krnowak commented Apr 13, 2021

The tsan failure seems unrelated - two tests are flaky (//test/config_test:example_configs_test and //test/integration:integration_test), and some Python flaky-test reporting tool threw an exception.

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>

krnowak commented Apr 13, 2021

Updated, and improved coverage a bit too, but there is one place I couldn't cover with the mock-based unit tests.

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
This is mostly for testing the invoker behavior for a handle that was
destroyed when it was already prepared for processing.

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
…ogress

This led me to add rather icky workarounds to simulate multiple
threads with mocks.

Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>
Signed-off-by: Krzesimir Nowak <krzesimir@kinvolk.io>

krnowak commented May 25, 2021

@mattklein123, @htuch, @dmitri-d, @markdroth: I think it's ready for review now.

I found it a bit hard to express the semantics of the explicit wildcard mode (as I called it) in the xds-protocol docs, because it feels that the general description of the protocol is done through the SotW point of view. I'll try to make a comparison between delta and SotW here:

  • initial implicit wildcard subscription request:
    • SotW: empty request (no resource names).
    • delta: empty request (no subscribe or unsubscribe resource names).
  • initial explicit wildcard subscription request with an extra resource foo (enabling explicit wildcard mode without adding any extra resource names makes little sense, but it's doable):
    • SotW: request with names * and foo
    • delta: request with subscribe names * and foo
  • subsequent request keeping the wildcard subscription and adding a resource bar:
    • SotW: request with names *, foo, and bar.
    • delta: request with a subscribe name bar.
  • subsequent request cancelling the wildcard subscription only:
    • SotW: request with names foo and bar.
    • delta: request with unsubscribe name *.
  • subsequent request enabling the wildcard subscription again:
    • SotW: request with names *, foo and bar.
    • delta: request with subscribe name *.
  • switching to explicit wildcard mode with an extra resource foo after the initial empty request was sent to enable implicit wildcard mode:
    • SotW: request with names * and foo.
    • delta: request with subscribe name foo (note the lack of *; we are already in wildcard mode, so there is no need to repeat it).
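
For concreteness, the delta-side requests above map onto the generated xDS protos roughly like this (the field names come from envoy.service.discovery.v3.DeltaDiscoveryRequest; the type URL is just an example):

#include "envoy/service/discovery/v3/discovery.pb.h"

envoy::service::discovery::v3::DeltaDiscoveryRequest initialExplicitWildcard() {
  envoy::service::discovery::v3::DeltaDiscoveryRequest req;
  req.set_type_url("type.googleapis.com/envoy.config.cluster.v3.Cluster");
  req.add_resource_names_subscribe("*");    // opt in to wildcard explicitly
  req.add_resource_names_subscribe("foo");  // plus a specific resource
  return req;
}

envoy::service::discovery::v3::DeltaDiscoveryRequest dropWildcardKeepFoo() {
  envoy::service::discovery::v3::DeltaDiscoveryRequest req;
  req.set_type_url("type.googleapis.com/envoy.config.cluster.v3.Cluster");
  req.add_resource_names_unsubscribe("*");  // cancel only the wildcard part
  return req;
}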


krnowak commented Jun 1, 2021

@mattklein123, @htuch, @dmitri-d, @markdroth: Friendly ping for a review. :) Thanks.

I'm not sure if the failure in the windows clang_cl CI is related in any way to my changes. The failed test target was //test/integration:quic_protocol_integration_test. The names of the failed tests seem to be UpstreamProtocols/DownstreamProtocolIntegrationTest.HittingDecoderFilterLimit/IPv6_HttpDownstream_Http3Upstream and UpstreamProtocols/DownstreamProtocolIntegrationTest.VeryLargeRequestHeadersRejected/IPv6_HttpDownstream_Http3Upstream. But the test output is a bit confusing.

@mattklein123 (Member)

@krnowak, @htuch is out this week. I would suggest trying to have @dmitri-d review this week and @htuch can do the final pass next week. I think the things being worked on right now are out of scope of what I can help with.


dmitri-d commented Jun 1, 2021

I'll take a look today.

if (names_removed_.find("*") != names_removed_.end()) {
// we explicitly cancel the wildcard subscription
mode_ = WildcardMode::Disabled;
} else if (!names_added_.empty()) {
Contributor

Hmm. I'm not sure about this: an explicit mode requires presence of a * in the names_added list; we only check if that list isn't empty. If another call to updateSubscriptionInterest is made before a request to the server is made, we may end up in a wrong state: for example if names_removed is empty in that call, we'll end up in WildcardMode::Disabled state.

Contributor Author

Hmm. I'm not sure about this: an explicit mode requires presence of a * in the names_added list;

Only on the initial request (after a stream reset, for example), or on the first request where we decide to opt in to the wildcard mode. When switching from implicit to explicit mode, there is no need to send *, since the server should already know that we are in the wildcard mode. I would say that with this PR, the server should not ignore delta requests with a non-empty resource_names_subscribe list from a client that opted in to wildcard mode.

I'll need to go through the documentation I wrote to make sure that it's clear.

we only check if that list isn't empty. If another call to updateSubscriptionInterest is made before a request to the server is made, we may end up in a wrong state: for example if names_removed is empty in that call, we'll end up in WildcardMode::Disabled state.

Sorry, I'm not following here. There isn't any logic that resets us to WildcardMode::Disabled on an empty names_removed_ list. Could you clarify?


dmitri-d commented Jun 1, 2021

I think "explicit" wildcard changes make sense, but (and I hate to be this guy): could you make these changes a dedicated PR please?

@markdroth (Contributor)

Sorry for the delay on this!

I found it a bit hard to express the semantics of the explicit wildcard mode (as I called it) in the xds-protocol docs, because it feels that the general description of the protocol is done through the SotW point of view.

I don't actually think it makes sense to consider this a separate mode for wildcard queries. It really isn't a new mode; it's just a new syntax for specifying the existing functionality.

I also don't think it's necessary to describe this totally differently for SotW and delta. The syntax differs between the two protocol variants, but the semantics are ultimately the same: the client determines what set of resources it is subscribed to at any one time, and the set of currently subscribed-to resources is a property of the stream that has to be tracked by both the client and the server. In SotW, the client must include the full set in every request, whereas in delta, the client can send just changes to the existing set, but in both cases, that set is a stable property of the stream in between requests. So I think we can describe the wildcard behavior in terms of that set of subscribed resources.

I think the right way to describe this in the protocol doc is to say that one of the subscribed resources can be *, which is a wildcard subscription. The old-style way of specifying a wildcard query is still supported, but it can be described as basically a syntax alias for the same thing -- in other words, if the set of subscribed resources is empty and has never previously been non-empty, that is a special case meaning that the set of subscribed resources contains only *.

Also, I agree with the comment from @dmitri-d about moving the xDS spec changes into a separate PR.


krnowak commented Jun 7, 2021

Please see #16855 for the separate PR for the explicit wildcard mode.


github-actions bot commented Jul 7, 2021

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale label (stalebot believes this issue/PR has not been touched recently) on Jul 7, 2021

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Jul 15, 2021

krnowak commented Jul 15, 2021

I'll update and reopen it when #16855 gets merged.


krnowak commented Jul 16, 2021

I'll pick up this work in August, when I'm back from holidays.

htuch pushed a commit that referenced this pull request Feb 4, 2022
The OdCdsApi interface is introduced as the API used for on-demand discovery. Its implementation is provided using the CdsApiHelper class, so OdCdsApiImpl handles a discovered cluster the same way CdsApiImpl does. On the ClusterManagerImpl side, a discovery manager (ClusterDiscoveryManager) is added to help deduplicate requests for the same cluster coming from within the same worker thread. Requests coming from different worker threads are further deduplicated in ClusterManagerImpl in the main thread. Each unique request for a cluster also receives a timeout to catch the case where a discovery fails, allowing the worker threads to handle the failure.

This is a continuation of #15857 - I could not reopen it, so I'm opening a new PR. I used the opportunity to rebase my changes on top of main.

Risk Level:
Low. A new feature not wired up anywhere yet.

Signed-off-by: Krzesimir Nowak <knowak@microsoft.com>
joshperry pushed a commit to joshperry/envoy that referenced this pull request Feb 13, 2022
This is a continuation of envoyproxy#15857 (same commit message as above).

Signed-off-by: Krzesimir Nowak <knowak@microsoft.com>
Signed-off-by: Josh Perry <josh.perry@mx.com>