
dynamic fwd proxy: fix race condition in filter destructor #32692

Closed
wants to merge 5 commits

Conversation

@cpakulski (Contributor) commented Mar 4, 2024

Commit Message:
dynamic fwd proxy: fix race condition in filter destructor

Additional Description:
The dynamic forward proxy filter differs from other filters in that it controls an upstream cluster. The filter registers itself (its this pointer) in the ClusterManager as cluster callbacks: https://github.com/envoyproxy/envoy/blob/release/v1.29/source/extensions/filters/http/dynamic_forward_proxy/proxy_filter.h#L37.
In the filter's destructor, the TLS slot destructor removes the registered callbacks. The problem is that TLS slots are not destructed atomically: the instructions to release the slots are posted to each worker thread while the main thread continues with the filter destructor. This leaves a short window during which the filter is gone (its memory has been released) but the callbacks on each worker thread still point to the deleted object. If a worker thread iterates over the callbacks during that window, as here: https://github.com/envoyproxy/envoy/blob/release/v1.29/source/common/upstream/cluster_manager_impl.cc#L1314, it accesses a callback pointer that has already been freed.
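To make the window concrete, here is a deliberately simplified, hypothetical reconstruction of the unsafe sequence (the names and the single-queue "worker" are illustrative stand-ins, not the actual Envoy types): the destructor only queues the removal, so the worker's callback list briefly holds a pointer to freed memory.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

struct ClusterUpdateCallbacks {
  virtual ~ClusterUpdateCallbacks() = default;
  virtual void onClusterAddOrUpdate() = 0;
};

// Stand-ins for per-worker state: the callback list the worker iterates,
// and a queue of tasks the worker will run "later" (like a posted task).
std::vector<ClusterUpdateCallbacks*> worker_callbacks;
std::vector<std::function<void()>> worker_queue;

struct ProxyFilterConfig : ClusterUpdateCallbacks {
  ProxyFilterConfig() { worker_callbacks.push_back(this); } // register raw `this`
  ~ProxyFilterConfig() override {
    auto* self = this;
    // The removal is only queued; the destructor returns immediately and
    // the filter's memory is freed before the worker runs the task.
    worker_queue.push_back([self] {
      worker_callbacks.erase(
          std::remove(worker_callbacks.begin(), worker_callbacks.end(), self),
          worker_callbacks.end());
    });
  }
  void onClusterAddOrUpdate() override {}
};

int main() {
  auto* config = new ProxyFilterConfig();
  delete config; // removal queued on the worker, not yet executed
  // The race window: a cluster update arriving now iterates over a
  // dangling pointer -- this is the use-after-free described above.
  for (auto* cb : worker_callbacks) {
    cb->onClusterAddOrUpdate();
  }
}
```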

I observed a crash when reconfiguring the dynamic forward proxy filter 2 times per second; the code crashed after ~72 hours. When I reconfigured the filter 50 times per second, it crashed within 10 minutes.

The fix is to introduce ordering into the destructor. When the callbacks are added on each thread, a counter is incremented; when the callbacks are deregistered, the counter is decremented on each thread. The destructor waits on a condition variable until the this pointer has been removed from all threads and the counter drops to zero.
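A minimal sketch of that ordering, assuming a simplified registration model (the class and method names below are hypothetical, not the actual code in this PR): each worker decrements a shared counter once it has dropped its pointer to the filter, and the destructor blocks on a condition variable until the counter reaches zero.

```cpp
#include <condition_variable>
#include <mutex>

class ProxyFilterConfig {
public:
  // Main thread: called as the callbacks are registered on each worker.
  void onWorkerRegistered() {
    std::lock_guard<std::mutex> lock(mutex_);
    ++registered_;
  }

  // Worker thread: called once this worker has removed its callback pointer.
  void onWorkerDeregistered() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (--registered_ == 0) {
      all_removed_.notify_one();
    }
  }

  ~ProxyFilterConfig() {
    // Deregistration has already been posted to every worker (not shown).
    // Block until each worker has dropped its pointer; only then may the
    // filter's memory be released.
    std::unique_lock<std::mutex> lock(mutex_);
    all_removed_.wait(lock, [this] { return registered_ == 0; });
  }

private:
  std::mutex mutex_;
  std::condition_variable all_removed_;
  int registered_{0};
};
```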

Risk Level: Low
Testing: Manual. It is difficult to simulate thread switching in unit/integration tests.
Docs Changes: No
Release Notes: No
Platform Specific Features: No
Fixes #32427

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski (Contributor, Author):

/retest

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>

CC @envoyproxy/coverage-shephards: FYI only for changes made to (test/per_file_coverage.sh).
envoyproxy/coverage-shephards assignee is @alyssawilk

🐱

Caused by: #32692 was synchronized by cpakulski.


Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski (Contributor, Author):

/retest

@alyssawilk (Contributor) left a comment:

Couple of thoughts.
First, please don't publish crash fixes without checking with envoy-setec. I think this is hard enough to exploit that we don't have to treat it as a zero day, but let's err on the side of caution.
Second, this looks like it may be a generic problem: having any "main thread" object register for thread-local callbacks seems like a recipe for deletion races. I'd lean towards a solution that works beyond the proxy filter. I'd be inclined to change the function signature for adding the callback to take a shared pointer (here, to a wrapper shim that we can detach from the DFP config) such that the shim self-deletes once no worker threads point to it (sketched below).
Thoughts?
/wait
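A minimal sketch of the shared-pointer shim idea, under the assumption that the callback-registration signature is changed to accept a shared_ptr (all names below are hypothetical, not the actual Envoy API): workers share ownership of the shim, the config detaches from it on destruction, and the shim self-deletes once the last worker releases its reference.

```cpp
#include <atomic>
#include <memory>

struct ClusterUpdateCallbacks {
  virtual ~ClusterUpdateCallbacks() = default;
  virtual void onClusterAddOrUpdate() = 0;
};

// The shim may outlive the filter config because the workers hold shared
// ownership of it, so no raw `this` pointer can dangle.
class CallbacksShim : public ClusterUpdateCallbacks {
public:
  void detach() { detached_ = true; } // the config is gone; drop further events
  void onClusterAddOrUpdate() override {
    if (detached_) {
      return;
    }
    // ... forward the update to the config, which is still alive here ...
  }

private:
  std::atomic<bool> detached_{false};
};

class ProxyFilterConfig {
public:
  ProxyFilterConfig() : shim_(std::make_shared<CallbacksShim>()) {
    // Hypothetical signature change: the cluster manager would store the
    // shared_ptr, so each worker keeps the shim alive while it may call it.
    // cluster_manager.addThreadLocalClusterUpdateCallbacks(shim_);
  }
  ~ProxyFilterConfig() {
    shim_->detach(); // safe: the shim is deleted when the last worker lets go
  }

private:
  std::shared_ptr<CallbacksShim> shim_;
};
```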

@cpakulski (Contributor, Author):

@alyssawilk Thanks a lot for reviewing.

> First, please don't publish crash fixes without checking with envoy-setec. I think this is hard enough to exploit that we don't have to treat it as a zero day, but let's err on the side of caution.

Unless I am missing something, I believe the protocol for reporting crashes was followed. The crash was originally reported to the setec group, and the reply was to log it as a public issue: #32427.

> Second, this looks like it may be a generic problem: having any "main thread" object register for thread-local callbacks seems like a recipe for deletion races. I'd lean towards a solution that works beyond the proxy filter. I'd be inclined to change the function signature for adding the callback to take a shared pointer (here, to a wrapper shim that we can detach from the DFP config) such that the shim self-deletes once no worker threads point to it.
> Thoughts?

That sounds like a good idea; let me experiment with it. The other option would be to modify the TLS slot destructor to wait for all slots to be destructed before continuing on the main thread.
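For that second option, a sketch of what a blocking slot shutdown could look like, using a trivial stand-in for posting work to workers (the helpers below are hypothetical, not the actual Envoy ThreadLocal API):

```cpp
#include <cstddef>
#include <functional>
#include <latch>
#include <thread>

// Trivial stand-in for posting a task to each worker thread (in Envoy this
// would go through each worker's dispatcher).
void postToEachWorker(std::size_t worker_count, const std::function<void()>& task) {
  for (std::size_t i = 0; i < worker_count; ++i) {
    std::thread(task).detach(); // fire-and-forget, like a posted task
  }
}

// Blocking slot destruction: post the per-thread cleanup, then wait until
// every worker has actually run it, so the main thread cannot continue
// (and free the owning object) while callbacks are still reachable.
void blockingSlotDestruction(std::size_t worker_count,
                             const std::function<void()>& releaseSlotOnWorker) {
  std::latch done(static_cast<std::ptrdiff_t>(worker_count));
  postToEachWorker(worker_count, [&] {
    releaseSlotOnWorker(); // drop this worker's slot contents
    done.count_down();     // signal the waiting main thread
  });
  done.wait(); // proceed only after all workers have finished
}
```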

@cpakulski (Contributor, Author):

@alyssawilk I created another PR, #32798, which solves the problem at a much lower level and is probably more elegant.

github-actions bot commented Apr 8, 2024

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

github-actions bot added the "stale" label (stalebot believes this issue/PR has not been touched recently) on Apr 8, 2024
@cpakulski (Contributor, Author):

Closing. Fixed in #33303.

cpakulski closed this on Apr 10, 2024
Labels
stale (stalebot believes this issue/PR has not been touched recently), waiting
Successfully merging this pull request may close these issues.

Envoy occasionally segfaults from v1.27 in ProxyFilterConfig::onClusterAddOrUpdate