Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dynamic fwd proxy: fix race condition in filter destructor #32692

Closed
wants to merge 5 commits into from

Merge remote-tracking branch 'upstream/main' into issue/32427_1

713cbad
Select commit
Loading
Failed to load commit list.
Closed

dynamic fwd proxy: fix race condition in filter destructor #32692

Merge remote-tracking branch 'upstream/main' into issue/32427_1
713cbad
Select commit
Loading
Failed to load commit list.
CI (Envoy) / Mobile/CC skipped Mar 7, 2024 in 0s

Check was skipped

This check was not triggered in this CI run

Details

Request (pr/32692/main@713cbad)

cpakulski @cpakulski 713cbad #32692 merge main@0bd73e3

dynamic fwd proxy: fix race condition in filter destructor

Commit Message:
dynamic fwd proxy: fix race condition in filter destructor

Additional Description:
Dynamic fwd proxy is specific comparing to other filters because it controls upstream cluster. The filter registers itself (this pointer) in ClusterManager as cluster callbacks: https://github.com/envoyproxy/envoy/blob/release/v1.29/source/extensions/filters/http/dynamic_forward_proxy/proxy_filter.h#L37.
In the filter's destructor, tls slot destructor will remove the registered callbacks. The problem is that tls slots are not destructed atomically. Instructions to release slots are posted on each thread and main thread continues with filter destructor. This means that there is a short period of time when the filter is gone (memory was released) but callbacks on each worker thread still point to the deleted callbacks. If at that time a worker thread iterates over callbacks, like here: https://github.com/envoyproxy/envoy/blob/release/v1.29/source/common/upstream/cluster_manager_impl.cc#L1314, it will access a callback pointer which has already been released.

I observed a crash when I reconfigured a dynamic fwd proxy 2 times per second. The code crashed after ~72 hrs. When I reconfigure the filter 50 timer per second the code crashes within 10 minutes.

The fix is to introduce some order when destructor is called. When callbacks are added to each thread, the added counter is increased. That counter is decremented on each thread when callbacks are deregistered. The destructor waits on condition variable until this pointer has been removed from all threads and the counter drops to zero.

Risk Level: Low
Testing: Manual. It is difficult to simulate thread switching in unit/integration tests.
Docs Changes: No
Release Notes: No
Platform Specific Features: No
Fixes #32427

Environment

Request variables

Key Value
ref 49b10e5f2b49cb9cc2de36302ebd1e549b4d914a
sha 713cbad
pr 32692
base-sha 0bd73e3
actor cpakulski @cpakulski
message dynamic fwd proxy: fix race condition in filter destructor...
started 1709775037.880623
target-branch main
trusted false
Build image

Container image/s (as used in this CI run)

Key Value
default envoyproxy/envoy-build-ubuntu:0ca52447572ee105a4730da5e76fe47c9c5a7c64
mobile envoyproxy/envoy-build-ubuntu:mobile-0ca52447572ee105a4730da5e76fe47c9c5a7c64
Version

Envoy version (as used in this CI run)

Key Value
major 1
minor 30
patch 0
dev true