Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DFP: move CM callbacks to thread local objects #33303

Merged
merged 3 commits into from
Apr 10, 2024

Conversation

cpakulski
Copy link
Contributor

@cpakulski cpakulski commented Apr 3, 2024

Commit Message:
DFP: move CM callbacks to thread local objects
Additional Description:
This is third attempt to fix #32427. (The previous are #32692 and #32798).
The solution simply moves callbacks out of main object to thread local objects. Thread local objects do not hold any references to the main object, so both object types are independent and may have different lifespans.

This fixes the reported crash, but is not a generic solution.
I also played with combination of shared/weak pointers and modified signature of the function registering callbacks in CM. That also fixed the problem, but affected other modules (CM itself, aggregate cluster, redis proxy and udp proxy).
At this moment I believe that it is possible to build a generic mechanism which will take care of difference in lifespans of parent object and callbacks, but it will take me few weeks to build and test one and adjust other modules. I will open another PR when such framework is ready for review.

Risk Level: Low
Testing: Manual. Code which used to crash within 15 minutes runs without a crash.
Docs Changes: No
Release Notes: No
Platform Specific Features: No
Fixes #32427

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
…ain DFP config object..

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Copy link

As a reminder, PRs marked as draft will not be automatically assigned reviewers,
or be handled by maintainer-oncall triage.

Please mark your PR as ready when you want it to be reviewed!

🐱

Caused by: #33303 was opened by cpakulski.

see: more, trace.

@cpakulski
Copy link
Contributor Author

/retest

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
@cpakulski cpakulski marked this pull request as ready for review April 4, 2024 13:41
Copy link
Contributor

@alyssawilk alyssawilk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh! Well that's a better workaround than either of us had thought of before. Very very nice find!
Question, would it be possible to add a rapid config reload test which would flake before this change, or is that relatively implausible?
LGTM and throwing over to Matt for second pass either way

@cpakulski
Copy link
Contributor Author

Thanks for reviewing @alyssawilk.

As for some sort of test, it is not trivial with the current code. The most problematic part is achieving certainty that code is robust when the test passes. I use 8 worker threads and reload config 100 times per second. Unit/integration tests use max 2 threads. Such test could run for many minutes and not crash. The proper way would be to orchestrate specific thread switching (for example pend on mutex, delete a resource on other thread and release mutex, etc). While my memory is fresh, I am willing to write a wrapper framework, which will take care of relationship between TLS object and parent object and invoke callbacks only when parent is valid. As part of that framework I can add some hooks (only in _DEBUG build), so test routines can orchestrate certain actions on each thread and delete parent object and then wake up worker thread.

Copy link
Member

@mattklein123 mattklein123 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice LGTM!

@mattklein123
Copy link
Member

While my memory is fresh, I am willing to write a wrapper framework, which will take care of relationship between TLS object and parent object and invoke callbacks only when parent is valid. As part of that framework I can add some hooks (only in _DEBUG build), so test routines can orchestrate certain actions on each thread and delete parent object and then wake up worker thread.

Note that we have this: https://github.com/envoyproxy/envoy/blob/main/source/common/common/thread_synchronizer.h

Is there any way to use this to add a test or do you want to work on this as a follow up?

/wait-any

@cpakulski
Copy link
Contributor Author

cpakulski commented Apr 9, 2024

@mattklein123 Thanks for reviewing! I honestly think that writing a test for this particular case is not trivial. Additionally, there are 4-5 other filters which use similar pattern of registering callbacks in CM, so it would be nice to add tests for them as well. I think the the best way forward is to merge this PR as is. I know that this bug affects some users and they are waiting for the fix. I will backport it to the last 4 releases.
I will also open an issue to track development of generic framework (as I described above) and will use thread synchronizer (thanks for idea!) to test that framework. The scope of that generic framework is going to be fairly small and writing a test with a specific thread switching sequence should be much simpler.

@mattklein123
Copy link
Member

@mattklein123 Thanks for reviewing! I honestly think that writing a test for this particular case is not trivial. Additionally, there are 4-5 other filters which use similar pattern of registering callbacks in CM, so it would be nice to add tests for them as well. I think the the best way forward is to merge this PR as is. I know that this bug affects some users and they are waiting for the fix. I will backport it to the last 4 releases.
I will also open an issue to track development of generic framework (as I described above) and will use thread synchronizer (thanks for idea!) to test that framework. The scope of that generic framework is going to be fairly small and writing a test with a specific thread switching sequence should be much simpler.

Sounds good!

@mattklein123 mattklein123 merged commit 59da9ee into envoyproxy:main Apr 10, 2024
52 checks passed
@alyssawilk
Copy link
Contributor

+1, sounds great. Thanks for all your work here @cpakulski !

cpakulski added a commit to cpakulski/envoy that referenced this pull request Apr 10, 2024
…33303)

Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
cpakulski added a commit to cpakulski/envoy that referenced this pull request Apr 10, 2024
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
cpakulski added a commit to cpakulski/envoy that referenced this pull request Apr 10, 2024
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
phlax pushed a commit that referenced this pull request Apr 12, 2024
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
phlax pushed a commit that referenced this pull request Apr 12, 2024
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
phlax pushed a commit that referenced this pull request Apr 12, 2024
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
alyssawilk pushed a commit to alyssawilk/envoy that referenced this pull request Apr 29, 2024
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Envoy occasionally segfaults from v1.27 in ProxyFilterConfig::onClusterAddOrUpdate
3 participants