-
Notifications
You must be signed in to change notification settings - Fork 4.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make SlotImpl detachable from its owner, and add a new runOnAllThreads interface to Slot. #8135
Conversation
… to it. Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
/assign @mattklein123 |
@htuch FYI. :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, this is an interesting solution. I've flushed out some comments assuming we go with this solution.
Stepping back though, can you refresh my memory on why this complexity is needed vs. just cancelling any potential updates before we start listener shutdown which I think would be serialized correctly? Thank you!
/wait
void ConfigSubscriptionCommonBase::applyConfigUpdate(const ConfigUpdateCb& update_fn, | ||
const Event::PostCb& complete_cb) { | ||
tls_->runOnAllThreads( | ||
[update_fn](ThreadLocal::ThreadLocalObjectSharedPtr previous) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: const ThreadLocal::ThreadLocalObjectSharedPtr&
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to express the idea that: ThreadLocal::Slot::UpdateCb is expected to receive data stored in the slot, and do-whatever-necessary, then return the new/updated version data to be saved in the slot. A non-const ThreadLocalObjectSharedPtr seems better under this notion, WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
By passing by value you are essentially just forcing a copy in most cases. Personally, I think it's more common to make a new thing than use the previous, so I would pass by const-ref, and then if a copy needs to be made it can be done inside the function to put the cost there.
main_thread_dispatcher_->post([this]() { | ||
deferred_deletes_.remove_if( | ||
[](std::unique_ptr<SlotHolder>& holder) -> bool { return holder->isRecycleable(); }); | ||
if (!deferred_deletes_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Prefer a loop vs. recursion.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me update the misleading comment.
This method will reschedule another cleanup task in the main thread dispatcher Q. Not exactly recursive.
We can't use a loop here as it basically means a lock/wait mechanism (on the ongoing update callbacks been finished).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this makes more sense now. Can you add a bunch more comments inside recycle
and this function to describe what is going on and the various interleavings?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping on adding more comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oops, missed this.
I added some more comments around the function header. and also recycle.
holder.reset(); | ||
return; | ||
} | ||
deferred_deletes_.emplace_back(std::move(holder)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nothing tracks when this holder is eventually recyclable, right? Meaning, in this case it will stick around the deferred delete list until something else gets scheduled for cleanup? (Since presumably this schedule cleanup can race with the thing actually being recyclable). Maybe this is fine? Or maybe you should be checking for use count on the worker threads and scheduling cleanup if its detached and ready to be recycled (the last one)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's tracked:
- A cleanup task is scheduled when we put a new item into the deferred delete queue.
- The cleanup task will cleanup recyclable items from the Q, and reschedule another cleanup task(the recursion) if there are any non-recyclable ones.
So once an item is put into the queue, the above event-loop (1->2->2->2->....) will guarantee the items been cleaned up.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stepping back though, can you refresh my memory on why this complexity is needed vs. just cancelling any potential updates before we start listener shutdown which I think would be serialized correctly? Thank you!
/wait
I think the canceling/waiting approach requires more changes to the code base. Basically, if we want to be lock free, we'd have to split the LDS update (or Server shutdown) into 2 phases:
- set an flag or broadcast canceling signals to workers.
- Wait to be signaled that the 'shutdown' message has been received by everyone.
In the meantime, if it's a LDS update caused ListenerImpl teardown, we have to bookkeep all the TLS resources owned by "this" ListenerImpl, and only wait on them to be cleaned up, as a RDS subscription (or any other subscription is shared between Listeners).
On the other hand, this change limit changes to tls module only, least disturbance to the code base.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Basically, if we want to be lock free, we'd have to split the LDS update (or Server shutdown) into 2 phases
To be clear, this is how it already works. See ListenerManagerImpl::drainListener
which already handles draining and stopping a listener, including posting a completion back and forth. Assuming all xDS updates are stopped before removal (or drain) by definition there can be no destruction race condition. What I'm trying to get at is how hard is it really to just shut down LDS/RDS/etc. updates for a listener on the main thread before doing this? Naively it doesn't seem that hard.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I talked to @stevenzzzz offline and we are going to time box trying to implement explicit shutdown of resources prior to listener shutdown. This might be accomplished via the init manager or some similar structure. If it ends up being too hard or has complications we can come back to this.
/wait
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be in favor of not repeating something like init manager. Init manager is has turned out tobe hard to reason about, with a maze of callbacks, watchers, registration conditions, lifetime considerations and so on. History speaks for itself with all the bugs and races on this one.
I think there is a good argument to be made that the solution in this PR has benefits in that it simplifies the contract around slot ownership, e.g. the runOnAllThreads()
gets the threads out of the business of worrying about tls_
slow ownership and lifetime. It also centralizes all the complexity inside of the SlotImpl
, whereas init manager style solutions diffuse this complexity.
So, my $0.02 is that something like the solution in this PR is a nice direction; basically getting away from having threads worrying about TLS slots directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Throwing out another idea that has come up when chatting with @stevenzzzz; we could have a centralized (owned by Server::Instance
) TLS slot manager. This will be the actual owner of the memory backing all slots and it lives forever. It would hand out slot handles and then have them returned via RAII.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, my $0.02 is that something like the solution in this PR is a nice direction; basically getting away from having threads worrying about TLS slots directly.
I think with this solution there are going to be potentially hard to reason about thread interleavings, and I have no doubt we will have a different set of issues from this direction, though I agree it's a nice solution and it will work in the end. My reason for proposing the use of the existing init manager system is that it already exists. We must register things into the init manager, and I think it would be relatively trivial to also register a shutdown hook of some type at the same time.
With that said, I don't feel that strongly about it, so if you both want to continue with this solution we can do that.
we could have a centralized (owned by Server::Instance) TLS slot manager.
This is the way it already works?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good discussion here.
Personally I am not a fan of the init-manager mechanism. It's essentially a barrier that's passed around and hard to track what's been barriered and what's(should) not.
The other problem I want to call out here is that no matter which way we go, we need to let contributors know about this problem, and we should not distributed (by capturing some shared_ptr) ownership of a tls slot or its owner to worker threads.
That has the side effect of destructing the delicate dependency graph of objects in ServerImpl and cause crash during server shutdown (IIUC, per my experience though we drain the libevent message queue when shutdown a worker, we don't prevent new callbacks been posted to Dispatcher's callback queue). An example would be post a
a callback which captures a shared_ptr to workers. On destruction, The singletonManager is destructed first, and then the RdsConfigSubscription owned by RdsRouteConfigProvider will try to remove it self from the destructed RdsConfigProviderManager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The other problem I want to call out here is that no matter which way we go, we need to let contributors know about this problem, and we should not distributed (by capturing some shared_ptr) ownership of a tls slot or its owner to worker threads.
This is my point exactly. Although I agree Init/ShutdownManager is problematic in certain ways, its extremely explicit on what needs to happen. Basically, don't do any more updates. We are trading that for rules around how this can be used which are hard to understand and enforce. With that, I will stop there. @htuch I assume you want to move forward with this solution?
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
@stevenzzzz thanks for the fixes, but before we continue the review can we discuss my comment here:
I want to check and make sure we are picking the correct high level solution. |
*/ | ||
void applyConfigUpdate( | ||
const ConfigUpdateCb& update_fn, const Event::PostCb& complete_cb = []() {}) { | ||
tls_->runOnAllThreads( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we remove tls_ as a member?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow your question, may I ask why?
tls_ is per subscription in config-provider framework, depends on config type (delta/non-delta), it creates per-worker/x-worker-shared ConfigImpl, which is shared among providers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thx!
*/ | ||
void applyConfigUpdate( | ||
const ConfigUpdateCb& update_fn, const Event::PostCb& complete_cb = []() {}) { | ||
tls_->runOnAllThreads( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure I follow your question, may I ask why?
tls_ is per subscription in config-provider framework, depends on config type (delta/non-delta), it creates per-worker/x-worker-shared ConfigImpl, which is shared among providers.
holder.reset(); | ||
return; | ||
} | ||
deferred_deletes_.emplace_back(std::move(holder)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good discussion here.
Personally I am not a fan of the init-manager mechanism. It's essentially a barrier that's passed around and hard to track what's been barriered and what's(should) not.
The other problem I want to call out here is that no matter which way we go, we need to let contributors know about this problem, and we should not distributed (by capturing some shared_ptr) ownership of a tls slot or its owner to worker threads.
That has the side effect of destructing the delicate dependency graph of objects in ServerImpl and cause crash during server shutdown (IIUC, per my experience though we drain the libevent message queue when shutdown a worker, we don't prevent new callbacks been posted to Dispatcher's callback queue). An example would be post a
a callback which captures a shared_ptr to workers. On destruction, The singletonManager is destructed first, and then the RdsConfigSubscription owned by RdsRouteConfigProvider will try to remove it self from the destructed RdsConfigProviderManager.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks. Flushing out a few more comments.
/wait
main_thread_dispatcher_->post([this]() { | ||
deferred_deletes_.remove_if( | ||
[](std::unique_ptr<SlotHolder>& holder) -> bool { return holder->isRecycleable(); }); | ||
if (!deferred_deletes_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks this makes more sense now. Can you add a bunch more comments inside recycle
and this function to describe what is going on and the various interleavings?
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
/assign alyssawilk @alyssawilk PTAL, this PR is inspired by you suggested options. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM with some remaining comments.
Also, is it possible for you to add an integration test that crashes without this fix? That would be optimal. Thank you!
/wait
main_thread_dispatcher_->post([this]() { | ||
deferred_deletes_.remove_if( | ||
[](std::unique_ptr<SlotHolder>& holder) -> bool { return holder->isRecycleable(); }); | ||
if (!deferred_deletes_.empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ping on adding more comments.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM modulo clang-tidy fixes and in integration test for the original problem. Thank you!
/wait
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
So previously the scoped_rds_integration_test is actually flaky, it fails around 1/3000 as there was deletion of a RDS resource immediately after a update to it, but the race window is really small. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for iterating here and fully discussing the solution. Will defer to @alyssawilk for further review.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The Bookeeper is inspiring WRT extending the ability of Slot.
I am not sure if the SlotHolder is too complex since it is kind of re-inventing shared_ptr. What's more the Bookeeper combining with SlotHolder is duplicating the deleter of shared_ptr.
My understanding of the bookkeeper is to track
- how many runOnAllThread on the fly
- The last of Bookkeeper destruction and
runOnallThread
should destruct the real slot
I propose to replace SlotHolder by shared_ptr, and bookeeper should
1 ref the shared_ptr until destruction (shared_ptr as a member)
2 increase the ref of shared_ptr at runOnAllThread (no more complex than capture it)
3. choose one from 3.1 and 3.2
3.1 BookKeeper set the deleter of the shared_ptr to moving to defer delete queue
3.2 rewrite the BookKeeper so that runOnAllThread
enforce a oncomplete_cb to decrease the shared_ptr on main thread. See https://github.com/envoyproxy/envoy/pull/7011/files . 3.2 doesn't even need a defer delete queue.
My proposal not decrease the complexity of SlotHolder.
With 3.2 Bookkeeper::runOnAllThread could cut the overhead and the complexity of operation on deferred deletion.
WDYT?
bool isRecycleable() { return ref_count_.use_count() == 1; } | ||
|
||
const std::unique_ptr<SlotImpl> slot_; | ||
const std::shared_ptr<int> ref_count_{new int(0)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The shared_ptr as ref_count reminds me of the pattern I eliminated in envoy before. See #7011
if (deferred_deletes_.empty()) { | ||
return; | ||
} | ||
main_thread_dispatcher_->post([this]() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel like this ought to be avoidable, but perhaps I'm missing something. We're using ref_count_ not as an integer, but just to have an object with a use count right?
If we made refcount_ a class, and had Bookkeeper have a reference to it (along with runOnAllThreads calls), and the refcount destructor did a dispatcher post(delete)) couldn't we avoid this extra layer of clean up complexity? We'd still delete when both the Bookkeeper and all thread references were gone, and we'd still delete in the original thread.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So the problem with "posting back to main-thread when the propagation is done", there again is this race-window, where the posted callback in main-thread may gets run first, and then the "copy of lambda" in the worker thread may held the last reference of the refcount_.
IMHO, We should also avoid to move the ownership of a slotImpl or its owner, by posting a callback/complete_cb which shares a portion of the object as well. That may cause complication when ServerImpl tears down, I went into a crash with that, where I posted shared_ptr back to main thread dispatcher queue, during shutdown, the SingletonManager is teared down ahead of the DispatcherImpl in ServerImpl, which includes the RouteConfigProviderManager that's referenced by the RdsRouteCnofigProvider in the dispatcher-msg-queue.
The Bookeeper is inspiring WRT extending the ability of Slot.
I am not sure if the SlotHolder is too complex since it is kind of re-inventing shared_ptr. What's more the Bookeeper combining with SlotHolder is duplicating the deleter of shared_ptr.
My understanding of the bookkeeper is to track
- how many runOnAllThread on the fly
- The last of Bookkeeper destruction and
runOnallThread
should destruct the real slotI propose to replace SlotHolder by shared_ptr, and bookeeper should
1 ref the shared_ptr until destruction (shared_ptr as a member)
2 increase the ref of shared_ptr at runOnAllThread (no more complex than capture it)
3. choose one from 3.1 and 3.2
3.1 BookKeeper set the deleter of the shared_ptr to moving to defer delete queue
3.2 rewrite the BookKeeper so thatrunOnAllThread
enforce a oncomplete_cb to decrease the shared_ptr on main thread. See https://github.com/envoyproxy/envoy/pull/7011/files . 3.2 doesn't even need a defer delete queue.My proposal not decrease the complexity of SlotHolder.
With 3.2 Bookkeeper::runOnAllThread could cut the overhead and the complexity of operation on deferred deletion.WDYT?
The Bookeeper is inspiring WRT extending the ability of Slot.
I am not sure if the SlotHolder is too complex since it is kind of re-inventing shared_ptr. What's more the Bookeeper combining with SlotHolder is duplicating the deleter of shared_ptr.
My understanding of the bookkeeper is to track
- how many runOnAllThread on the fly
- The last of Bookkeeper destruction and
runOnallThread
should destruct the real slotI propose to replace SlotHolder by shared_ptr, and bookeeper should
1 ref the shared_ptr until destruction (shared_ptr as a member)
2 increase the ref of shared_ptr at runOnAllThread (no more complex than capture it)
3. choose one from 3.1 and 3.2
3.1 BookKeeper set the deleter of the shared_ptr to moving to defer delete queue
3.2 rewrite the BookKeeper so thatrunOnAllThread
enforce a oncomplete_cb to decrease the shared_ptr on main thread. See https://github.com/envoyproxy/envoy/pull/7011/files . 3.2 doesn't even need a defer delete queue.My proposal not decrease the complexity of SlotHolder.
With 3.2 Bookkeeper::runOnAllThread could cut the overhead and the complexity of operation on deferred deletion.WDYT?
Yeah, SlotHolder is basically the obj that holds a slot and its outgoing ref-count. I made a class instead using a shared pointer as I thought it may be more "readable". the ferre
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, SingletonManager early destruction. With this restriction I don't have a good opinion...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talked w/ Alyssa offline, get the idea now. I believe this matches what lamndai suggested as well.
The "schedule a cleanup callback" logic is removed and each Slot Bookkeeper fires a cleanup callback on the ref-count's destruction.
…t callback on destruction of the ref-count shared_ptr Signed-off-by: Xin Zhuang <stevenzzz@google.com>
ASSERT(std::this_thread::get_id() == main_thread_id_); | ||
ASSERT(slot != nullptr); | ||
auto* slot_addr = slot.get(); | ||
deferred_deletes_.insert({slot_addr, std::move(slot)}); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear on why we need this. Can't we have both Bookkeeper and threads have pointers to ref_count_, and have scheduleCleanup either straight up delete slot or schedule deletion via post? Why do we need the intermediate map?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
when we destruct the Bookkeeper, we can't delete the SlotImpl yet if the ref_count > 1, we need to detach the SlotImpl instance from its owner, and wait for the ref_count to go to 0 and make the post-call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We talked offline, Alyssa suggested to remove the intermediate storage (deferred_deletes_ ) as well, IIUC, by: releasing the SlotImpl from the unique_ptr, having the ref_count shared_ptr post a callback on destruction to the Main dispatcher. There is a risk of leaking at shutdown time though: in Server::InstanceImpl::terminate(), workers are stopped after main dispatcher loop is exited, if a worker posts a cleanup callback, the callback will not be run, the SlotImpl instance pointed to by the raw pointer will be leaked. So the intermediate container is kinda unavoidable.
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…achable-slot Signed-off-by: Xin Zhuang <stevenzzz@google.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM modulo the spelling error.
As an aside, I think we need to do some infrastructure work to make the cross-thread deletion need less rigmarole, but I think that's outside of the scope of this PR. Thanks for all the cleanup you were willing to do!
…achable-slot Signed-off-by: Xin Zhuang <stevenzzz@google.com>
Signed-off-by: Xin Zhuang <stevenzzz@google.com>
/wait |
…llThreads interface to Slot. (envoyproxy#8135) See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error. More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb. Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. } Risk Level: Medium Testing: unit test Docs Changes: N/A Release Notes: N/A [Optional Fixes #Issue] envoyproxy#7902 Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…llThreads interface to Slot. (envoyproxy#8135) See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error. More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb. Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. } Risk Level: Medium Testing: unit test Docs Changes: N/A Release Notes: N/A [Optional Fixes #Issue] envoyproxy#7902 Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…llThreads interface to Slot. (envoyproxy#8135) See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error. More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb. Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. } Risk Level: Medium Testing: unit test Docs Changes: N/A Release Notes: N/A [Optional Fixes #Issue] envoyproxy#7902 Signed-off-by: Xin Zhuang <stevenzzz@google.com>
…llThreads interface to Slot. (envoyproxy#8135) See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error. More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb. Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. } Risk Level: Medium Testing: unit test Docs Changes: N/A Release Notes: N/A [Optional Fixes #Issue] envoyproxy#7902 Signed-off-by: Xin Zhuang <stevenzzz@google.com> Signed-off-by: John Plevyak <jplevyak@gmail.com>
* ext_authz: add metadata_context to ext_authz filter (envoyproxy#7818) This adds the ability to specify dynamic metadata (by namespace) to send with the ext_authz check request. This allows one filter to specify information that can be then used in evaluating an authorization decision. Risk Level: Medium. Optional feature/extension of existing filter Testing: Unit testing Docs Changes: Inline in attribute_context.proto and ext_authz.proto Fixes envoyproxy#7699 Signed-off-by: Ben Plotnick <plotnick@yelp.com> * fuzz: codec impl timeout fix + speed ups (envoyproxy#7963) Some speed-ups and validations for codec impl fuzz test: * validate actions aren't empty (another approach would be to scrub / clean these) * limit actions to 1024 * require oneofs Fixes OSS-Fuzz Issue: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=16481 Testing: local asan/libfuzzer exec/sec go from 25 to 50 Signed-off-by: Asra Ali <asraa@google.com> * docs: more detail about tracking down deprecated features (envoyproxy#7972) Risk Level: n/a (docs only) Testing: n/a Docs Changes: yes Release Notes: no envoyproxy#7945 Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * Fix the alignement in optval of setsockopt when compiled with libc++. (envoyproxy#7958) Description: libc++ std::string may inline the data which results the memory is not aligned to `void*`. Use vector instead to store the optval. Detected by UBSAN with libc++ config. Preparation for envoyproxy#4251 Risk Level: Low Testing: unittest locally Docs Changes: N/A Release Notes: N/A Fixes envoyproxy#7968 Signed-off-by: Lizan Zhou <lizan@tetrate.io> * security: some intra-entity and 3rd party embargo clarifications. (envoyproxy#7977) * security: some intra-entity and 3rd party embargo clarifications. These came up in the last set of CVEs. Signed-off-by: Harvey Tuch <htuch@google.com> * protobuf: IWYU (envoyproxy#7989) Include What You Use fix for source/common/protobuf/message_validator_impl.h. Signed-off-by: Andres Guedez <aguedez@google.com> * api: add name into filter chain (envoyproxy#7966) Signed-off-by: Yuchen Dai <silentdai@gmail.com> * rds: validate config in depth before update config dump (envoyproxy#7956) Route config need deep validation for virtual host duplication check, regex check, per filter config validation etc, which PGV wasn't enough. Risk Level: Low Testing: regression test Docs Changes: N/A Release Notes: N/A Fixes envoyproxy#7939 Signed-off-by: Lizan Zhou <lizan@tetrate.io> * tls: maintain a free slot index set in TLS InstanceImpl to allocate in O(1… (envoyproxy#7979) Signed-off-by: Xin Zhuang <stevenzzz@google.com> * redis: handle invalid ip address from cluster slots and added tests (envoyproxy#7984) Signed-off-by: Henry Yang <hyang@lyft.com> * protobuf: report field numbers for unknown fields. (envoyproxy#7978) Since binary proto won't have field names, report at least the field numbers, as per https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.unknown_field_set#UnknownField. Also fix minor typo encountered while doing this work. Risk level: Low Testing: Unit tests added/updated. Fixes envoyproxy#7937 Signed-off-by: Harvey Tuch <htuch@google.com> * Content in envoy docs does not cover whole page (envoyproxy#7993) Signed-off-by: Manish Kumar <manishjpiet@gmail.com> * stats: Add option to switch between fake and real symbol-tables on the command-line. (envoyproxy#7882) * Add option to switch between fake and real symbol-tables on the command-line. Signed-off-by: Joshua Marantz <jmarantz@google.com> * api config: add build rules for go protos (envoyproxy#7987) Some BUILD files are missing build rules to generate go protos. envoyproxy/go-control-plane depends on these protos, so they should be exposed publicly. Added build rules to generate *.pb.go files. Risk Level: Low Testing: These rules were copied to google3 and tested internally. Unfortunately, I am having a bit of trouble with bazel build directly on these targets ("Package is considered deleted due to --deleted_packages"). Please let me know if there is a better way to test this change. Signed-off-by: Teju Nareddy <nareddyt@google.com> * test: don't use <experimental/filesystem> on macOS. (envoyproxy#8000) Xcode 11 requires at least macOS 10.15 (upcoming) in order to use either <experimental/filesystem> or C++17 <filesystem>. Signed-off-by: Piotr Sikora <piotrsikora@google.com> * event: adding the capability of creating an alarm with a given scope (envoyproxy#7920) Precursor to envoyproxy#7782 Adding scope tracking functionality to the basic alarm functions. Risk Level: Medium (should be a no-op but is a large enough refactor) Testing: new unit tests Docs Changes: n/a Release Notes: n/a Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * ext authz: add dns san support for ext authz service (envoyproxy#7948) Adds support for DNS SAN in ext authz peer validation Risk Level: Low Testing: Added Docs Changes: Added Release Notes: N/A Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * accesslog: don't open log file with read flag (envoyproxy#7998) Description: File access log shouldn't need read access for a file. Risk Level: Low Testing: local in mac, CI Docs Changes: Release Notes: Fixes envoyproxy#7997 Signed-off-by: Lizan Zhou <lizan@tetrate.io> * protobuf: towards unifying PGV, deprecated and unknown field validation. (envoyproxy#8002) This is part of envoyproxy#7980; basically, we want to leverage the recursive pass that already exists for the deprecated check. This PR does not implement the recursive behavior yet for unknown fields though, because there is a ton of churn, so this PR just has the mechanical bits. We switch plumbing of validation visitor into places such as anyConvert() and instead pass this to MessageUtil::validate. There are a bunch of future followups planned in additional PRs: * Combine the recursive pass for unknown/deprecated check in MessageUtil::validate(). * Add mitigation for envoyproxy#5965 by copying to a temporary before recursive expansion. * [Future] consider moving deprecated reporting into a message validation visitor handler. Risk level: Low Testing: Some new //test/common/protobuf::utility_test unit test. Signed-off-by: Harvey Tuch <htuch@google.com> * http: forwarding x-forwarded-proto from trusted proxies (envoyproxy#7995) Trusting the x-forwarded-proto header from trusted proxies. If Envoy is operating as an edge proxy but has a trusted hop in front, the trusted proxy should be allowed to set x-forwarded-proto and its x-forwarded-proto should be preserved. Guarded by envoy.reloadable_features.trusted_forwarded_proto, default on. Risk Level: Medium (L7 header changes) but guarded Testing: new unit tests Docs Changes: n/a Release Notes: inline Fixes envoyproxy#4496 Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * build: adding an option to hard-fail when deprecated config is used. (envoyproxy#7962) Adding a build option to default all deprecated protos off, and using it on the debug build. Risk Level: Low Testing: new UT Docs Changes: inline Release Notes: n/a Fixes envoyproxy#7548 Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * envoy_cc_library: add export of foo_with_external_headers (envoyproxy#8005) Add a parallel native.cc_library to envoy_cc_library for external projects that consume Envoy's libraries. This allows the consuming project to disambiguate overlapping include paths when repository overlaying is used, as it can now include envoy headers via external/envoy/... Risk Level: Low Testing: N/A Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com> * ci: add fuzz test targets to ci (envoyproxy#7949) Builds fuzz targets with asan+libfuzzer and runs them against their corpora. Our native bazel builds work, this PR integrates the asan+libfuzzer builds in to CI. The fuzz target binaries will be in your envoy docker build directory. Invoke with the following for all fuzz targets, or a specified one. ./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.fuzz' ./ci/run_envoy_docker.sh './ci/do_ci.sh bazel.fuzz //test/common/common:utility_fuzz_test' Risk level: low Signed-off-by: Asra Ali asraa@google.com Signed-off-by: Asra Ali <asraa@google.com> * tls: support BoringSSL private key async functionality (envoyproxy#6326) This PR adds BoringSSL private key API abstraction, as discussed in envoyproxy#6248. All comments and discussion is welcomed to get the API sufficient for most private key API tasks. The PR contains the proposed API and the way how it can be used from ssl_socket.h. Also there is some code showing how the PrivateKeyMethodProvider is coming from TLS certificate config. Two example private key method providers are included in the tests. Description: tls: support BoringSSL private key async functionality Risk Level: medium Testing: two basic private key provider implementation Docs Changes: TLS arch doc, cert.proto doc Signed-off-by: Ismo Puustinen <ismo.puustinen@intel.com> * use SymbolTableCreator rather than fakes in a few stray places. (envoyproxy#8006) stats: use SymbolTableCreator rather than fakes in a few stray places. (envoyproxy#8006) Signed-off-by: Joshua Marantz <jmarantz@google.com> * [router] Add SRDS configUpdate impl (envoyproxy#7451) This PR contains changes on the xRDS side for SRDS impl, cribbed from http://go/gh/stevenzzzz/pull/8/files#diff-2071ab0887162eac1fd177e89d83175a * Add onConfigUpdate impl for SRDS subscription * Remove scoped_config_manager as it's not used now. * Move ScopedConfigInfo to scoped_config_impl.h/cc * Add a hash to scopeKey and scopeKeyFragment, so we can look up scopekey by hash value in constant time when SRDS has many scopes. * Add a initManager parameter to RDS createRdsRouteConfigProvider API interface, when creating RouteConfigProvider after listener/server warmed up, we need to specify a different initManager than the one from factoryContext to avoid an assertion failure. see related:envoyproxy#7617 This PR only latches a SRDS provider into the connection manager, the "conn manager using SRDS to make route decision" plus integration tests will be covered in a following PR. Risk Level: LOW [not fully implemented]. Testing: unit tests Signed-off-by: Xin Zhuang <stevenzzz@google.com> * Fix version history (envoyproxy#8021) Follow-up for envoyproxy#7995. Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com> * tools: sync tool for envoyproxy/assignable team. (envoyproxy#8015) Bulk update of team to match envoyproxy organization. While at it, cleaned up some venv stuff in shell_utils.sh. Risk level: Low Testing: Synced 157 members from envoyproxy to envoyproxy/assignable. Signed-off-by: Harvey Tuch <htuch@google.com> * redis: fix onHostHealthUpdate got called before the cluster is resolved. (envoyproxy#8018) Signed-off-by: Henry Yang <hyang@lyft.com> * api/build: migrate UDPA proto tree to external cncf/udpa repository. (envoyproxy#8017) This is a one-time movement of all UDPA content from envoyproxy/envoy to cncf/udpa. The permanent home of UDPA will be https://github.com/cncf/udpa. Risk level: Low Testing: Added UDPA service entry to build_test. Signed-off-by: Harvey Tuch <htuch@google.com> * http: tracking active session under L7 timers (envoyproxy#7782) Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * upstream: remove thread local cluster after triggering call backs (envoyproxy#8004) Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * upstream: Introducing close_connections_on_host_set_change property (envoyproxy#7675) Signed-off-by: Kateryna Nezdolii <nezdolik@spotify.com> * upstream: delete stale TODO (envoyproxy#8028) This was fixed in envoyproxy#7877 Signed-off-by: Matt Klein <mklein@lyft.com> * Enhance comment about MonotonicTime (envoyproxy#8011) Depending on the execution environment in which envoy is being run, it is possible that some of the assumption on the clock are maybe not holding as previously commented. With some sandboxing technologies the clock does not reference the machine boot time but the sandbox boot time. This invalidates the assumtpion that the first update in the cluster_manager will most likely fall out of the windows and ends up showing a non intuitive behavior difficult to catch. This PR simply adds a comment that will allow the reader to consider this option while reading to the code. Signed-off-by: Flavio Crisciani <f.crisciani@gmail.com> * build: some missing dep fixups for Google import. (envoyproxy#8026) Signed-off-by: Harvey Tuch <htuch@google.com> * introduce safe regex matcher based on re2 engine (envoyproxy#7878) The libstdc++ std::regex implementation is not safe in all cases for user provided input. This change deprecates the used of std::regex in all user facing paths and introduces a new safe regex matcher with an explicitly configurable engine, right now limited to Google's re2 regex engine. This is not a drop in replacement for std::regex as all language features are not supported. As such we will go through a deprecation period for the old regex engine. Fixes envoyproxy#7728 Signed-off-by: Matt Klein <mklein@lyft.com> * docs: reorganize configuration tree (envoyproxy#8027) This is similar to what I did for the arch overview a while ago as this section is also getting out of control. Signed-off-by: Matt Klein <mklein@lyft.com> * build: missing regex include. (envoyproxy#8032) Signed-off-by: Harvey Tuch <htuch@google.com> * [headermap] speedup for appending data (envoyproxy#8029) For debug builds, performance testing and fuzzers reveal that when appending to a header, we scan both the existing value and the data to append for invalid characters. This PR moves the validation check to just the data that is appended, to avoid hangups on re-scanning long header values multiple times. Testing: Added corpus entry that reveals time spent in validHeaderString Signed-off-by: Asra Ali <asraa@google.com> * eds: avoid send too many ClusterLoadAssignment requests (envoyproxy#7976) During initializing secondary clusters, for each initialized cluster, a ClusterLoadAssignment request is sent to istio pilot with the cluster's name appended into request's resource_names list. With a huge number of clusters(e.g 10k clusters), this behavior slows down Envoy's initialization and consumes ton of memory. This change pauses ADS mux for ClusterLoadAssignment to avoid that. Risk Level: Medium Testing: tiny change, no test case added Fixes envoyproxy#7955 Signed-off-by: lhuang8 <lhuang8@ebay.com> * Set the bazel verison to 0.28.1 explicitly (envoyproxy#8037) In theopenlab/openlab-zuul-jobs#622 , the OpenLab add the ability to set the bazel to specific version explicitly. This patch add the bazel role for the envoy job. Signed-off-by: Yikun Jiang <yikunkero@gmail.com> * Read_policy is not set correctly. (envoyproxy#8034) Add more integration test and additional checks in the unit tests. Signed-off-by: Henry Yang <hyang@lyft.com> * admin: fix /server_info hot restart version (envoyproxy#8022) Signed-off-by: Matt Klein <mklein@lyft.com> * test: adding debug hints for integration test config failures (envoyproxy#8038) Risk Level: n/a (test only) Testing: manual Docs Changes: n/a Release Notes: n/a Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * udp_listener: refactor ActiveUdpListener creation (envoyproxy#7884) Signed-off-by: Dan Zhang <danzh@google.com> * accesslog: implement TCP gRPC access logger (envoyproxy#7941) Description: Initial implementation for TCP gRPC access logger. Risk Level: Low (extension only) Testing: integration test Docs Changes: Added Release Notes: Added Signed-off-by: Lizan Zhou <lizan@tetrate.io> * tracing: add OpenCensus agent exporter support to OpenCensus driver. (envoyproxy#8023) Signed-off-by: Emil Mikulic <g-easy@users.noreply.github.com> * Exporting platform_impl_lib headers (envoyproxy#8045) This allows consuming projects using repository overlaying to disambiguate overlapping include paths when it comes to platform_impl.h by going through envoy/external/... Addendum to envoyproxy#8005 Risk Level: Low Testing: N/A Signed-off-by: Otto van der Schaaf <oschaaf@we-amp.com> * access_log: minimal log file error handling (envoyproxy#7938) Rather than ASSERT for a reasonably common error condition (e.g. disk full) record a stat that indicates log file writing failed. Also fixes a test race condition. Risk Level: low Testing: added stats checks Docs Changes: documented new stat Release Notes: updated Signed-off-by: Stephan Zuercher <zuercher@gmail.com> * tracing: add grpc-status and grpc-message to spans (envoyproxy#7996) Signed-off-by: Caleb Gilmour <caleb.gilmour@datadoghq.com> * fuzz: add bounds to statsh flush interval (envoyproxy#8043) Add PGV bounds to the stats flush interval (greater than 1ms and less than 5000ms) to prevent Envoy from hanging from too small of a flush time. Risk Level: Low Testing: Corpus Entry added Fixes OSS-Fuzz issue https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=16300 Signed-off-by: Asra Ali <asraa@google.com> * Improve tools/stack_decode.py (envoyproxy#8041) Adjust tools/stack_decode.py to more obviously be Python 2 (not 3), and to work on stack traces that don't include the symbol names. Risk Level: Low Testing: Manually tested on a stack trace that one of our users sent us Signed-off-by: Luke Shumaker <lukeshu@datawire.io> * build: tell googletest to use absl stacktrace (envoyproxy#8047) Description: https://github.com/google/googletest/blob/d7003576dd133856432e2e07340f45926242cc3a/BUILD.bazel#L42 Risk Level: Low (test only) Testing: CI Docs Changes: Release Notes: Signed-off-by: Lizan Zhou <lizan@tetrate.io> * Update references to local scripts to enable using build container for filter repos (envoyproxy#7907) Description: This change enables using run_envoy_docker.sh to build envoy-filter-example Risk Level: Low Testing: Manually tested building envoy-filter-example using: envoy/ci/run_envoy_docker.sh './ci/do_ci.sh build' Docs Changes: N/A Release Notes: N/A Signed-off-by: Santosh Kumar Cheler <scheler@arubanetworks.com> * bazel: patch gRPC to fix Envoy builds with glibc v2.30 (envoyproxy#7971) Description: the latest glibc (v2.30) declares its own `gettid()` function (see [0]) and this creates a naming conflict in gRPC which has a function with the same name. Apply to gRPC [a patch](grpc/grpc#18950) which renames `gettid()` to `sys_gettid()`. [0] https://sourceware.org/git/?p=glibc.git;a=commit;h=1d0fc213824eaa2a8f8c4385daaa698ee8fb7c92 Risk Level: low Testing: unit tests Docs Changes: n/a Release Notes: n/a Signed-off-by: Dmitry Rozhkov <dmitry.rozhkov@linux.intel.com> * build: link C++ stdlib dynamically in sanitizer runs (envoyproxy#8019) Description: Sanitizers doesn't support static link, reverts envoyproxy#7929 and link lib(std)c++ dynamically in sanitizer runs. Addresses test issue for envoyproxy#4251. Added workaround in ASAN for envoyproxy#7647. Risk Level: Low (test only) Testing: CI, local libc++ runs Docs Changes: N/A Release Notes: N/A Fixes envoyproxy#7928 * test: cleaning up test runtime (envoyproxy#8012) Using the new runtime utility to clean up a bunch of test gorp. Yay utils! Risk Level: n/a (test only) Testing: tests pass Docs Changes: n/a Release Notes: n/a Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * test: improved coverage and handling of deprecated config (envoyproxy#8057) Making ENVOY_DISABLE_DEPRECATED_FEATURES work for unit tests without runtime configured. Fixing up a handful of unit tests to remove legacy code or use the handy DEPRECATED_FEATURE_TEST macro Adding back coverage of cors.enabled() and redis.catch_all_route() Risk Level: Low (test only) Testing: new unit tests Docs Changes: n/a Release Notes: n/a Fixes envoyproxy#8013 Fixes envoyproxy#7548 Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * [Docs typo] Remote Executioon -> Remote Execution (envoyproxy#8061) Fixes mispelling of `Executioon` -> `Execution` Signed-off-by: Colin Schoen <schoen@yelp.com> * api: Fix duplicate java_outer_classname declarations (envoyproxy#8059) The java_outer_classname is unintentionally duplicated in the new udp_listener_config and regex proto files. This changes them to unique names that match the predominant naming scheme. Signed-off-by: Bryce Anderson <banderson@twitter.com> * http: making the behavior of the response Server header configurable (envoyproxy#8014) Default behavior remains unchanged, but now Envoy can override, override iff there's no server header from upstream, or always leave the server header (or lack thereof) unmodified. Risk Level: low (config guarded change) Testing: new unit tests Docs Changes: n/a Release Notes: inline Fixes envoyproxy#6716 Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * use bazelversion for filter-example too (envoyproxy#8069) Signed-off-by: Lizan Zhou <lizan@tetrate.io> * grpc-httpjson-transcode: Update for RFC2045 support (envoyproxy#8065) RFC2045 (MIME) Base64 decoding support has been fixed upstream Description: The grpc transcoding filter has been updated to support RFC2045 (MIME) based inputs for protobuf type "Bytes". This is important since Base64 is often using the RFC2045 format for inputs. Also see: grpc-ecosystem/grpc-httpjson-transcoding#34 Risk Level: Low Testing: Integration / Manual Tests Docs Changes: N/A Release Notes: N/A Signed-off-by: Hans Viken Duedal <hans.duedal@visma.com> * stats: Clean up all calls to Scope::counter() et al in production code. (envoyproxy#7842) * Convert a few more counter() references to use the StatName interface. Signed-off-by: Joshua Marantz <jmarantz@google.com> * tls_inspector: inline the recv in the onAccept (envoyproxy#7951) Description: As discussed in envoyproxy#7864 this PR is the attempt to peek the socket at the invoke of onAccept. Usually client_hello packet should be in the buffer when tls_inspector is peeking, we could save a poll cycle for this connection. Once we agree on the solution I can apply to http_inspector as well. The expecting latency improvement especially when poll cycle is large. Benchmark: Env: hardware Intel(R) Xeon(R) CPU @ 2.20GHz envoy: concurrency = 1, tls_inspector as listener filter. One tls filter chain, and one plain text filter chain. load background: a [sniper](https://github.com/lubia/sniper) client with concurrency = 5 hitting the server with tls handshake, aiming to hit using the tls_filter chain. The qps is about 170/s Another load client hitting the plain text filter chain but would go through tls_inspector with concurrency = 1 This PR: TransactionTime: 10.3 - 11.0 ms(mean) Master TransactionTime: 12.3 - 12.8 ms(mean) Risk Level: Med (ActiveSocket code is affected to adopt the side effect of onAccept) Testing: Docs Changes: Release Notes: Fixes envoyproxy#7864 Signed-off-by: Yuchen Dai <silentdai@gmail.com> * Fixes gcc 8.3.1 build failure due to FilterChainBenchmarkFixture::SetUp hiding base-class virtual functions (envoyproxy#8071) Description: I'm seeing "bazel-out/k8-fastbuild/bin/external/com_github_google_benchmark/_virtual_includes/benchmark/benchmark/benchmark.h:1071:16: error: 'virtual void benchmark::Fixture::SetUp(benchmark::State&)' was hidden" when running tests. This resolves the issue with hiding of the base-class functions. Risk Level: low Testing: Docs Changes: Release Notes: Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com> * test: fix ups for various deprecated fields (envoyproxy#8068) Takeaways: we've lost the ability to do empty regex (which was covered in router tests and is proto constraint validated on the new safe regex) as well as negative lookahead (also covered in tests) along with a host of other things conveniently documented as not supported here: https://github.com/google/re2/wiki/Syntax Otherwise split up a bunch of tests, duplicated and tagged a bunch of tests, and cleaning up after we finally can remove deprecated fields again will be an order of magnitude easier. Also fixing a dup relnote from envoyproxy#8014 Risk Level: n/a (test only) Testing: yes. yes there is. Docs Changes: no Release Notes: no Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * include: add log dependency header to connection_handler.h (envoyproxy#8072) Signed-off-by: Teju Nareddy <nareddyt@google.com> * quiche: Update QUICHE dep (envoyproxy#8044) Update QUICHE tar ball to 4abb566fbbc63df8fe7c1ac30b21632b9eb18d0c. Add some new impl's for newly added api. Risk Level: low Testing: using quiche build in tests. Part of envoyproxy#2557 Signed-off-by: Dan Zhang <danzh@google.com> * tools: deprecated field check in Route Checker tool (envoyproxy#8058) We need a way to run the deprecated field check on the RouteConfiguration. Today the schema check tool validates the bootstrap config. This change will help achieve similar functionality on routes served from rds. Risk Level: Low Testing: Manual testing Docs Changes: included Release Notes: included Signed-off-by: Jyoti Mahapatra <jmahapatra@lyft.com> * tracing: Add support for sending data in Zipkin v2 format (envoyproxy#6985) Description: This patch supports sending a list of spans as JSON v2 and protobuf message over HTTP to Zipkin collector. [Sending protobuf](https://github.com/openzipkin/zipkin-api/blob/0.2.1/zipkin.proto) is considered to be more efficient than JSON, even compared to the v2's JSON (openzipkin/zipkin#2589 (comment)). This is an effort to rework envoyproxy#6798. The approach is by serializing the v1 model to both v2 JSON and protobuf. Risk Level: Low, since the default is still HTTP-JSON v1 based on https://github.com/openzipkin/zipkin-api/blob/0.2.2/zipkin-api.yaml. Testing: Unit testing, manual integration test with real Zipkin collector server. Docs Changes: Added Release Notes: Added Fixes: envoyproxy#4839 Signed-off-by: Dhi Aurrahman <dio@tetrate.io> Signed-off-by: José Carlos Chávez <jcchavezs@gmail.com> * Route Checker tool Fix code coverage bug in proto based schema (envoyproxy#8101) Signed-off-by: Jyoti Mahapatra <jmahapatra@lyft.com> * [hcm] Add scoped RDS routing into HCM (envoyproxy#7762) Description: add Scoped RDS routing logic into HCM. Changes include: * in ActiveStream constructor latch a ScopedConfig impl to the activeStream if SRDS is enabled * in the beginning of ActiveStream::decodeHeaders(headers, end_stream), get routeConfig from latched ScopedConfig impl. This PR is the 3rd in the srds impl PR chain: [envoyproxy#7704, envoyproxy#7451, this]. Risk Level: Medium Testing: unit test and integration tests. Release Notes: Add scoped RDS routing support into HCM. Signed-off-by: Xin Zhuang <stevenzzz@google.com> * owners: add @asraa and @lambdai to OWNERS. (envoyproxy#8110) * @asraa is joining Envoy OSS security team. * @lambdai is joining Friends of Envoy as v2 xDS point. Signed-off-by: Harvey Tuch <htuch@google.com> * protobuf: recursively validate unknown fields. (envoyproxy#8094) This PR unifies the recursive traversal of deprecated fields with that of unknown fields. It doesn't deal with moving to a validator visitor model for deprecation; this would be a nice cleanup that we track at envoyproxy#8092. Risk level: Low Testing: New nested unknown field test added. Fixes envoyproxy#7980 Signed-off-by: Harvey Tuch <htuch@google.com> * Fuzz reuse (envoyproxy#8119) This PR allows the envoy_cc_fuzz_test rule to be used when pulling in envoy. which can be useful when you're writing filters for envoy, and want to reuse the fuzzing architecture envoy has already built. other rules already allow for this (see envoy_cc_test in this same file for example). Risk Level: Low Testing: Testing the Old Rule Still Works It is possible to test the old rules still work (even without specifying a repository), by simply choosing your favorite fuzz test, and choosing to run bazel test on it. For example: bazel test //test/common/router:header_parser_fuzz_test. Any envoy_cc_fuzz_test rule should do. Testing New Rules Work I've done testing inside my own repository, but if you want to create your own test rule you can probably do the following in envoy-filter-example: Checkout envoy-filter-example, and update the envoy submodule to this pr. Follow the directions in: test/fuzz/README.md to define a envoy_cc_fuzz_test rule. Make sure to add a line for: repository = "@envoy" which is the new argument being added. You should be able to run the fuzz test. Signed-off-by: Cynthia Coan <ccoan@instructure.com> * Set INCLUDE_DIRECTORIES so libcurl can find local urlapi.h (envoyproxy#8113) Fixes envoyproxy#8112 Signed-off-by: John Millikin <jmillikin@stripe.com> * cleanup: move test utility methods in ScopedRdsIntegrationTest to base class HttpIntegrationTest (envoyproxy#8108) Fixes envoyproxy#8050 Risk Level: LOW [refactor only] Signed-off-by: Xin Zhuang <stevenzzz@google.com> * upstream: fix invalid access of ClusterMap iterator during warming cluster modification (envoyproxy#8106) Risk Level: Medium Testing: New unit test added. Fix verified via --config=asan. Signed-off-by: Andres Guedez <aguedez@google.com> * api:Add a flag to disable overprovisioning in ClusterLoadAssignment (envoyproxy#8080) * api:Add a flag to disable overprovisioning in ClusterLoadAssignment Signed-off-by: Jie Chen <jiechen@google.com> * api:Add [#next-major-version and [#not-implemented-hide to the comment for field of disable_overprovisioning in ClusterLoadAssignment Signed-off-by: Jie Chen <jiechen@google.com> * api:Refine comments for the new added bool flag as suggested. Signed-off-by: Jie Chen <jiechen@google.com> * api: clone v2[alpha] to v3alpha. (envoyproxy#8125) This patch establishes a v3alpha baseline API, by doing a simple copy of v2[alpha] dirs and some sed-style heuristic fixups of BUILD dependencies and proto package namespaces. The objective is provide a baseline which we can compare the output from tooling described in envoyproxy#8083 in later PRs, providing smaller visual diffs. The core philosophy of the API migration is that every step will be captured in a script (at least until the last manual steps), api/migration/v3alpha.sh. This script will capture deterministic migration steps, allowing v2[alpha] to continue to be updated until we finalize v3. There is likely to be significant changes, e.g. in addition to the work scoped for v3, we might want to reduce the amount of API churn by referring back to v2 protos where it makes sense. This will be done via tooling in later PRs. Part of envoyproxy#8083. Risk level: Low Testing: build @envoy_api//... Signed-off-by: Harvey Tuch <htuch@google.com> * dubbo: Fix heartbeat packet parsing error (envoyproxy#8103) Description: The heartbeat packet may carry data, and it is treated as null data when processing the heartbeat packet, causing some data to remain in the buffer. Risk Level: low Testing: Existing unit test Docs Changes: N/A Release Notes: N/A Fixes envoyproxy#7970 Signed-off-by: tianqian.zyf <tianqian.zyf@alibaba-inc.com> * stats: Shared cluster isolated stats (envoyproxy#8118) * shared the main symbol-table with the isolated stats used for cluster info. Signed-off-by: Joshua Marantz <jmarantz@google.com> * protodoc: upgrade to Python 3. (envoyproxy#8129) Risk level: Low Testing: Rebuilt docs, manual inspection of some example generated files. Signed-off-by: Harvey Tuch <htuch@google.com> * protodoc: single source-of-truth for doc protos. (envoyproxy#8132) This avoids having to list new docs protos in both docs/build.sh and api/docs/BUILD. This technical debt cleanup is helpful in v3 proto work to simplify collecting proto artifacts from a Bazel aspect. Risk level: Low Testing: docs/build.sh, visual inspection of docs. Signed-off-by: Harvey Tuch <htuch@google.com> * api: organize go_proto_libraries (envoyproxy#8003) Fixes envoyproxy#7982 Defines a package level proto library and its associated internal go_proto_library. Deletes all existing api_go_proto_library, api_go_grpc_library, and go_package annotations in protos (they are not required and pollute the sources). I deliberately avoided touching anything under udpa since it's being moved to another repository. Risk Level: low Testing: build completes Signed-off-by: Kuat Yessenov <kuat@google.com> * api: straggler v2alpha1 -> v3alpha clone. (envoyproxy#8133) These were missed in envoyproxy#8125. Signed-off-by: Harvey Tuch <htuch@google.com> * docs: remove extraneous escape (envoyproxy#8150) Old versions of bash (e.g. on macOS) don't handle ${P/:/\/} the same way as modern versions. In particular, the expanded parameter on macOS includes a backslash, causing subsequent use of the string as a filename to include a slash (/) instead of treating the slash as a directory separator. Both versions of bash accept ${P/://} as a way to substitute : with /. Verified that this change does not alter the generated docs when running under Linux. Risk Level: low Testing: generated docs under linux & macOS Signed-off-by: Stephan Zuercher <zuercher@gmail.com> * Do not 503 on Upgrade: h2c instead remove the header and ignore. (envoyproxy#7981) Description: When a request comes in on http1 with "upgrade: h2c", the current behavior is to 503. Instead we should ignore the upgrade and remove the header and continue with the request as http1. Risk Level: Medium Testing: Unit test. Hand test with ithub.com/rdsubhas/java-istio client server locally. Docs Changes: N/A Release Notes: http1: ignore and remove Upgrade: h2c. Fixes istio/istio#16391 Signed-off-by: John Plevyak <jplevyak@gmail.com> * docs: add line on installing xcode for macOS build flow (envoyproxy#8139) Because of rules_foreign_cc in bazelbuild, Envoy will not compile successfully when following the instructions in the build docs due to how the tools are referenced. One fix for this is installing Xcode from the App Store (see bazel-contrib/rules_foreign_cc#185). Risk Level: Low Testing: N/A (docs change) Docs Changes: see Description Release Notes: N/A Signed-off-by: Lisa Lu <lisalu@lyft.com> * docs: note which header expressions cannot be used for request headers (envoyproxy#8138) As discussed in envoyproxy#8127, some custom header expressions evaluate as empty when used in request headers. Risk Level: low, docs only Testing: n/a Docs Changes: updated Release Notes: n/a Signed-off-by: Stephan Zuercher <zuercher@gmail.com> * api: use traffic_direction over operation_name if specified (envoyproxy#7999) Use the listener-level field for the tracing direction over the per-filter field. Unfortunately, the per filter did not provide an "unspecified" default, so this appears to be the right approach to deprecate the per-filter field with minimal impact. Risk Level: low (uses a newly introduce field traffic_direction) Testing: unit test Docs Changes: proto docs Signed-off-by: Kuat Yessenov <kuat@google.com> * add more diagnostic logs (envoyproxy#8153) Istio sets listener filter timeout to 10ms by default but requests fail from time to tome. It's very difficult to debug. Even though downstream_pre_cx_timeout_ is exposed to track the number of timeouts, it would be better to have some debug logs. Description: add more diagnostic logs Risk Level: low Signed-off-by: crazyxy <yxyan@google.com> * http conn man: add tracing config for path length in tag (envoyproxy#8095) This PR adds a configuration option for controlling the length of the request path that is included in the HttpUrl span tag. Currently, this length is hard-coded to 256. With this PR, that length will be configurable (defaulting to the old value). Risk Level: Low Testing: Unit Docs Changes: Inline with the API proto. Documented new field. Release Notes: Added in the PR. Related issue: istio/istio#14563 Signed-off-by: Douglas Reid <douglas-reid@users.noreply.github.com> * cds: Add general-purpose LB policy configuration (envoyproxy#7744) This PR adds fields to CDS that allow for general-purpose LB policy configuration. Risk Level: Low Testing: None (but if anything is needed, please let me know) Docs Changes: Inline with API protos Release Notes: N/A Signed-off-by: Mark D. Roth <roth@google.com> * thrift_proxy: fix crash on invalid transport/protocol (envoyproxy#8143) Transport/protocol decoder errors that occur before the connection manager initializes an ActiveRPC to track the request caused a crash. Modifies the connection manager to handle this case, terminating the downstream the connection. Risk Level: low Testing: test case that triggers crash Docs Changes: n/a Release Notes: added Signed-off-by: Stephan Zuercher <zuercher@gmail.com> * api: strip gogoproto annotations (envoyproxy#8163) Remove gogoproto annotations. They can be replaced with a custom gogoproto compiler (e.g. something like https://github.com/gogo/googleapis/tree/master/protoc-gen-gogogoogleapis). I have an experimental version of it to validate that it's possible to re-apply important annotations in the compiler. Risk Level: low Testing: builds Signed-off-by: Kuat Yessenov <kuat@google.com> * hotrestart: remove dynamic_resources from server config used by hotrestart_test (envoyproxy#8162) In the server config file `test/config/integration/server.yaml` used by //test/integration:hotrestart_test, `dynamic_resources` includes `lds_config` and `cds_config` definitions, which use HTTP API to fetch config, but CDS and LDS service do not exist, so the initial fetch will be failed with a connection failure, then Envoy server continue startup. Envoy server shouldn't continue startup because connection failure, see issue envoyproxy#8046. For this test, `dynamic_resources` is not needed, this change clean it up. Signed-off-by: lhuang8 <lhuang8@ebay.com> * clang-tidy: misc-unused-using-decls (envoyproxy#8159) Description: clang-tidy check to flag unused using statements. There's a lot in test code that's just copy pasta, and it's hard to manually review whether it's being used, especially for things like using testing::_; Risk Level: low Testing: existing Docs Changes: N/A Release Notes: N/A Signed-off-by: Derek Argueta <dereka@pinterest.com> * build: curl with c-ares, nghttp2 and zlib (envoyproxy#8154) Build curl dependency with async DNS resolver c-ares avoiding potential crashes due to longjmp on modern kernels. Add zlib and nghttp2. Use Envoy's version of all of the above libraries. Signed-off-by: Taras Roshko <troshko@netflix.com> * log: add upstream TLS info (envoyproxy#7911) Description: add upstream TLS info for logging purposes Refactor SSL connection info to be a shared pointer. Use read-only interface. Cache computed values in the SSL info object (this allows transition to remove the underlying SSL object if necessary). Risk Level: medium due to use of bssl::SSL to back ConnectionInfo Testing: unit Docs Changes: none Release Notes: add upstream TLS info Signed-off-by: Kuat Yessenov <kuat@google.com> * fix windows implementation of PlatformImpl (envoyproxy#8169) Add missing destructor to class declaration. Fix copy/paste errors. These errors were apparently introduced in e1cd4cc. Risk Level: Low Testing: Passed Windows testing locally Docs Changes: n/a Release Notes: n/a Signed-off-by: William Rowe wrowe@pivotal.io Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io> * Update Opencensus SHA (envoyproxy#8173) Signed-off-by: Pengyuan Bian <bianpengyuan@google.com> * Outlier Detection: use gRPC status code for detecting failures (envoyproxy#7942) Signed-off-by: ZhouyihaiDing <ddyihai@google.com> * fix build (envoyproxy#8177) Signed-off-by: Derek Argueta <dereka@pinterest.com> * docs: improving websocket docs (envoyproxy#8156) Making it clear H2 websockets don't work by default Risk Level: n/a Testing: n/a Docs Changes: yes Release Notes: no envoyproxy#8147 Signed-off-by: Alyssa Wilk <alyssar@chromium.org> * Upstream WebAssembly VM and Null VM from envoyproxy/envoy-wasm. (envoyproxy#8020) Description: Upstream from envoyproxy/envoy-wasm the WebAssembly VM support along with the Null VM support and tests. This is the first PR dealing with WebAssembly filter support in envoy. See https://github.com/envoyproxy/envoy-wasm/blob/master/WASM.md and https://github.com/envoyproxy/envoy-wasm/blob/master/docs/root/api-v2/config/wasm/wasm.rst for details. Risk Level: Medium Testing: Unit tests. Docs Changes: N/A Release Notes: N/A Part of envoyproxy#4272 Signed-off-by: John Plevyak <jplevyak@gmail.com> * quiche: implement Envoy Quic stream and connection (envoyproxy#7721) Implement QuicStream|Session|Disptacher in Envoy. Weir up QUIC stream and connection with HCM callbacks. Risk Level: low, not in use Testing: Added unit tests for all new classes Part of envoyproxy#2557 Signed-off-by: Dan Zhang <danzh@google.com> * protodoc/api_proto_plugin: generic API protoc plugin framework. (envoyproxy#8157) Split out the generic plugin and FileDescriptorProto traversal bits from protodoc. This is in aid of the work in envoyproxy#8082 ad envoyproxy#8083, where additional protoc plugins will be responsible for v2 -> v3alpha API migrations and translation code generation. This is only the start really of the api_proto_plugin framework. I anticipate additional bits of protodoc will move here later, including field type analysis and oneof handling. In some respects, this is a re-implementation of some of https://github.com/lyft/protoc-gen-star in Python. The advantage is that this is super lightweight, has few dependencies and can be easily hacked. We also embed various bits of API business logic, e.g. annotations, in the framework (for now). Risk level: Low Testing: diff -ru against previous protodoc.py RST output, identical modulo some trivial whitespace that doesn't appear in generated HTML. There are no real tests yet, I anticipate adding some golden proto style tests. Signed-off-by: Harvey Tuch <htuch@google.com> * adaptive concurrency: Gradient algorithm implementation (envoyproxy#7908) Signed-off-by: Tony Allen <tallen@lyft.com> * ext_authz: Check for cluster before sending HTTP request (envoyproxy#8144) Signed-off-by: Dhi Aurrahman <dio@tetrate.io> * make getters const-ref (envoyproxy#8192) Description: Follow-up to envoyproxy#7911 to make cached values be exposed as const-references, saving on a copy of a string during retrieval. Risk Level: low Testing: updated mocks to return references Docs Changes: none Release Notes: none Signed-off-by: Kuat Yessenov <kuat@google.com> * test: add curl features check (envoyproxy#8194) Add a test ensuring curl was built with the expected features. Description: Add a test ensuring curl was built with the expected features. Risk Level: Low. Testing: n/a. Docs Changes: n/a. Release Notes: n/a. Signed-off-by: Taras Roshko <troshko@netflix.com> * subset lb: allow ring hash/maglev LB to work with subsets (envoyproxy#8030) * subset lb: allow ring hash/maglev LB to work with subsets Skip initializing the thread aware LB for a cluster when the subset load balancer is enabled. Also adds some extra checks for LB policies that are incompatible with the subset load balancer. Risk Level: low Testing: test additional checks Docs Changes: updated docs w.r.t subset lb compatibility Release Notes: n/a Fixes: envoyproxy#7651 Signed-off-by: Stephan Zuercher <zuercher@gmail.com> * redis: add a request time metric to redis upstream (envoyproxy#7890) Signed-off-by: Nicolas Flacco <nflacco@lyft.com> * bazel: update bazel to 0.29.1 (envoyproxy#8198) Description: Upgrade bazel to 0.29.1 and bazel-toolchains to corresponding version. Risk Level: Low Testing: CI Docs Changes: N/A Release Notes: N/A Signed-off-by: Lizan Zhou <lizan@tetrate.io> * upstream: Add ability to disable host selection during panic (envoyproxy#8024) Previously, when in a panic state, requests would be routed to all hosts. In some cases it is instead preferable to not route any requests. Add a configuration option for zone-aware load balancers which switches from routing to all hosts to no hosts. Closes envoyproxy#7550. Signed-off-by: James Forcier jforcier@grubhub.com Risk Level: Low Testing: 2 new unit tests written; manual testing Docs Changes: Note about new configuration option added Release Notes: added Signed-off-by: James Forcier <jforcier@grubhub.com> * metrics service: flush histogram buckets (envoyproxy#8180) Signed-off-by: Rama Chavali <rama.rao@salesforce.com> * tracing: fix random sample fraction percent (envoyproxy#8205) Signed-off-by: Pengyuan Bian <bianpengyuan@google.com> * stats: Add per-host memory usage test case to stats_integration_test (envoyproxy#8189) Signed-off-by: Antonio Vicente <avd@google.com> * router check tool: add flag for only printing failed tests (envoyproxy#8160) Signed-off-by: Lisa Lu <lisalu@lyft.com> * fix link to runtime docs (envoyproxy#8204) Description: Looks like the runtime docs moved under operations/. The PR fixes the link. Risk Level: low Testing: existing Docs Changes: this Release Notes: n/a Signed-off-by: Derek Argueta <dereka@pinterest.com> * config: make SlotImpl detachable from its owner, and add a new runOnAllThreads interface to Slot. (envoyproxy#8135) See the issue in envoyproxy#7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error. More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb. Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. } Risk Level: Medium Testing: unit test Docs Changes: N/A Release Notes: N/A [Optional Fixes #Issue] envoyproxy#7902 Signed-off-by: Xin Zhuang <stevenzzz@google.com> * test: remove static config from subset lb integration test (envoyproxy#8203) Build the config programmatically to make future API changes less onerous. Risk Level: low (test change only) Testing: n/a Doc Changes: n/a Release Notes: n/a Signed-off-by: Stephan Zuercher <zuercher@gmail.com> * cleanup: clarify Cluster.filters and Dispatcher::createClientConnection (envoyproxy#8186) Signed-off-by: Fred Douglas <fredlas@google.com> * redis: health check is not sending the auth command on its connection (envoyproxy#8166) Signed-off-by: Henry Yang <hyang@lyft.com> * redis: mirroring should work when default value is zero, not just greater than zero (envoyproxy#8089) Signed-off-by: Nicolas Flacco <nflacco@lyft.com> * tools: regularize pip/venv for format_python_tools.py. (envoyproxy#8176) As well as being a nice cleanup, this fixes some issues I had with local Docker use of fix_format as a non-root user. Signed-off-by: Harvey Tuch <htuch@google.com> * absl: Absl hash hook in a couple of places rather than hash functors (envoyproxy#8179) Signed-off-by: Joshua Marantz <jmarantz@google.com> * Update dependency: jwt_verify_lib (envoyproxy#8212) Signed-off-by: Daniel Grimm <dgrimm@redhat.com> * upstream: add failure percentage-based outlier detection (envoyproxy#8130) Description: Add a new outlier detection mode which compares each host's rate of request failure to a configured fixed threshold. Risk Level: Low Testing: 2 new unit tests added. Docs Changes: New mode and config options described. Release Notes: white_check_mark Fixes envoyproxy#8105 Signed-off-by: James Forcier <jforcier@grubhub.com> * Replace deprecated thread annotations macros. (envoyproxy#8237) Abseil thread annotation macros are now prefixed by ABSL_. There is no semantic change; this is just a rename. Signed-off-by: Yan Avlasov <yavlasov@google.com> * Update protoc-gen-validate (PGV) (envoyproxy#8234) This picks up fixes for the Windows build and a C preprocessor defect Signed-off-by: Yechiel Kalmenson <ykalmenson@pivotal.io> Signed-off-by: William Rowe <wrowe@pivotal.io> * upstream: use named constants for outlier detection config defaults (envoyproxy#8221) Signed-off-by: James Forcier <jforcier@grubhub.com> * server: add a post init lifecycle stage (envoyproxy#8217) Signed-off-by: Jose Nino <jnino@lyft.com> * docs: document access control conditions and attributes (envoyproxy#8230) Signed-off-by: Kuat Yessenov <kuat@google.com> * server: return processContext as optional reference (envoyproxy#8238) Signed-off-by: Elisha Ziskind <eziskind@google.com> * Update envoy.yaml in Redis proxy example (envoyproxy#8220) Description: Make Redis example use catch_all_route. Risk Level: Low. Testing: Done. docker-compose up --build brings up envoy proxy and I was able to run Redis commands using redis-cli. Signed-off-by: Raju Kadam <rkadam@atlassian.com> * quiche: implement ActiveQuicListener (envoyproxy#7896) Signed-off-by: Dan Zhang <danzh@google.com> * srds: allow SRDS pass on scope-not-found queries to filter-chain (issue envoyproxy#8236). (envoyproxy#8239) Description: Allow a no-scope request to pass through the filter chain, so that some special queries (e.g., data plane health-check ) can be processed by the customized filter-chain. By default, the behavior is the same (404). Risk Level: LOW Testing: unit test and integration test. Docs Changes: N/A Release Notes: N/A Fixes envoyproxy#8236 Signed-off-by: Xin Zhuang <stevenzzz@google.com> * Updated to new envoyproxy master branch. Signed-off-by: John Plevyak <jplevyak@gmail.com> * Remove offending go proto option. Signed-off-by: John Plevyak <jplevyak@gmail.com> * Fix format/tidy issues. Signed-off-by: John Plevyak <jplevyak@gmail.com>
Description:
See the issue in #7902, this PR is to make the SlotImpl detachable from its owner, by introducing a Booker object wraps around a SlotImpl, which bookkeeps all the on-the-fly update callbacks. And on its destruction, if there are still on-the-fly callbacks, move the SlotImpl to an deferred-delete queue, instead of destructing the SlotImpl which may cause an SEGV error.
More importantly, introduce a new runOnAllThreads(ThreadLocal::UpdateCb cb) API to Slot, which requests a Slot Owner to not assume that the Slot or its owner will out-live (in Main thread) the fired on-the-fly update callbacks, and should not capture the Slot or its owner in the update_cb.
Introducing an new runOnAllThreads(UpdateCb) api to Slot interface, for which a update callback is
expected to receive the currently stored ThreadLocal object, the callback may update or return a new ThreadLocal object that will be stored in the slot.
A Bookkeeper wraps around SlotImpl, which bookkeeps how many outgoing callbacks are on-the-fly. On its destruction been called in main thread, if the number doesn't go to zero, it will put the wrapped SlotImpl and its use count into a deffered delete queue.
In TLS InstanceImpl, add a deferred delete queue, a scheduleCleanup method will post a callback to main dispatcher. The callback deletes SlotImpl from the deferred delete queue, and reschedule another cleanup if the queue is not empty.
Picked RDS and config-providers-framework as examples to demonstrate that this change works. {i.e., changed from the runOnAllThreads(Event::PostCb) to the new runOnAllThreads(TLS::UpdateCb) interface. }
If the PR looks good, I will migrate existing (probably not all?) call-sites of runOnAllThreads to the new form in a follow up PR.
Risk Level: Medium
Testing: unit test
Docs Changes: N/A
Release Notes: N/A
[Optional Fixes #Issue] #7902