DFP: move CM callbacks to thread local objects #33303
Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
…ain DFP config object.. Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
/retest
Oh! Well that's a better workaround than either of us had thought of before. Very very nice find!
Question, would it be possible to add a rapid config reload test which would flake before this change, or is that relatively implausible?
LGTM and throwing over to Matt for second pass either way
Thanks for reviewing, @alyssawilk. As for a test, it is not trivial with the current code. The hardest part is being certain that the code is robust when the test passes. I use 8 worker threads and reload the config 100 times per second, while unit/integration tests use at most 2 threads. Such a test could run for many minutes and not crash. The proper way would be to orchestrate specific thread switching (for example, pend on a mutex, delete a resource on another thread, release the mutex, etc.). While my memory is fresh, I am willing to write a wrapper framework that takes care of the relationship between the TLS object and its parent object and invokes callbacks only when the parent is valid. As part of that framework I can add some hooks (only in _DEBUG builds), so test routines can orchestrate certain actions on each thread, delete the parent object, and then wake up the worker thread.
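The kind of orchestration described above (park the worker, delete the resource on another thread, then wake the worker) could be sketched roughly as follows. This is only an illustration using the C++ standard library; `Parent`, `orchestratedDemo`, and the two promises are hypothetical names, and none of this is Envoy code or the proposed framework.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <thread>

// Hypothetical parent object whose lifetime the worker must not assume.
struct Parent {
  int value{42};
};

// Returns true if the worker, once woken, observed that the parent was gone.
bool orchestratedDemo() {
  auto parent = std::make_shared<Parent>();
  std::weak_ptr<Parent> weak = parent;

  std::promise<void> worker_ready;   // worker -> main: "I reached the sync point"
  std::promise<void> parent_deleted; // main -> worker: "parent is deleted, continue"
  std::future<void> ready = worker_ready.get_future();
  std::future<void> deleted = parent_deleted.get_future();
  bool saw_destroyed = false;

  std::thread worker([&] {
    worker_ready.set_value(); // announce that the worker is parked
    deleted.wait();           // pend until main has deleted the parent
    // The callback body would run here; it must check the parent first.
    saw_destroyed = (weak.lock() == nullptr);
  });

  ready.wait();               // wait until the worker is at the sync point
  parent.reset();             // delete the parent on the main thread
  parent_deleted.set_value(); // wake up the worker
  worker.join();
  return saw_destroyed;
}
```

Because the interleaving is forced rather than left to the scheduler, a test built this way would fail or pass deterministically instead of depending on how long it runs.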
Nice LGTM!
Note that we have this: https://github.com/envoyproxy/envoy/blob/main/source/common/common/thread_synchronizer.h Is there any way to use this to add a test or do you want to work on this as a follow up? /wait-any
@mattklein123 Thanks for reviewing! I honestly think that writing a test for this particular case is not trivial. Additionally, there are 4-5 other filters which use a similar pattern of registering callbacks in CM, so it would be nice to add tests for them as well. I think the best way forward is to merge this PR as is. I know that this bug affects some users and they are waiting for the fix. I will backport it to the last 4 releases.
Sounds good!
+1, sounds great. Thanks for all your work here @cpakulski!
…33303) Signed-off-by: Christoph Pakulski <paker8848@gmail.com>
Commit Message:
DFP: move CM callbacks to thread local objects
Additional Description:
This is the third attempt to fix #32427 (the previous attempts are #32692 and #32798).
The solution simply moves the callbacks out of the main object and into thread local objects. Thread local objects do not hold any references to the main object, so the two object types are independent and may have different lifespans.
This fixes the reported crash, but is not a generic solution.
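A minimal, single-threaded sketch of the idea follows. It is not the code from this PR: `WorkerClusterManager`, `ThreadLocalCallbacks`, `Worker`, `MainConfig`, and `runDemo` are hypothetical stand-ins, and handing ownership to a `Worker` struct only loosely models thread local storage. The point it illustrates is that the callback captures only the thread local object, so it stays valid after the main object is destroyed.

```cpp
#include <cassert>
#include <functional>
#include <list>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for a per-worker cluster manager that stores callbacks.
class WorkerClusterManager {
public:
  using Callback = std::function<void(const std::string&)>;
  using Handle = std::list<Callback>::iterator;

  Handle addCallback(Callback cb) { return callbacks_.insert(callbacks_.end(), std::move(cb)); }
  void removeCallback(Handle h) { callbacks_.erase(h); }
  void onClusterAdded(const std::string& name) {
    for (auto& cb : callbacks_) cb(name);
  }

private:
  std::list<Callback> callbacks_;
};

// Thread local object: owns its own registration and keeps only its own state.
// It holds no pointer or reference to the main config object.
class ThreadLocalCallbacks {
public:
  explicit ThreadLocalCallbacks(WorkerClusterManager& cm)
      : cm_(cm), handle_(cm_.addCallback([this](const std::string& name) {
          seen_.push_back(name);
        })) {}
  ~ThreadLocalCallbacks() { cm_.removeCallback(handle_); }
  ThreadLocalCallbacks(const ThreadLocalCallbacks&) = delete;
  ThreadLocalCallbacks& operator=(const ThreadLocalCallbacks&) = delete;

  const std::vector<std::string>& seen() const { return seen_; }

private:
  WorkerClusterManager& cm_;
  std::vector<std::string> seen_;
  WorkerClusterManager::Handle handle_;
};

// Hypothetical worker: tls is declared after cm so it is destroyed first.
struct Worker {
  WorkerClusterManager cm;
  std::unique_ptr<ThreadLocalCallbacks> tls;
};

// Main object: installs the per-worker objects but registers no callbacks itself.
class MainConfig {
public:
  explicit MainConfig(std::vector<Worker>& workers) {
    for (auto& w : workers) w.tls = std::make_unique<ThreadLocalCallbacks>(w.cm);
  }
};

// Destroys the main object first, then fires callbacks on every worker.
int runDemo() {
  std::vector<Worker> workers(2);
  {
    MainConfig config(workers);
  } // main object is gone here; the thread local objects live on
  int total = 0;
  for (auto& w : workers) {
    w.cm.onClusterAdded("cluster_a");
    total += static_cast<int>(w.tls->seen().size());
  }
  return total;
}
```

In this sketch `runDemo()` returns 2, one callback invocation per worker, even though `MainConfig` no longer exists when the callbacks fire.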
I also experimented with a combination of shared/weak pointers and a modified signature of the function that registers callbacks in CM. That also fixed the problem, but affected other modules (CM itself, aggregate cluster, redis proxy and udp proxy).
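For comparison, the general shared/weak pointer pattern can be sketched as below. This is an assumption-laden illustration of the technique, not the change that was tried against CM: `Registry`, `Parent`, `registerWith`, and `weakDemo` are hypothetical names, and the real registration signature is not shown in this PR description.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback registry standing in for the cluster manager.
class Registry {
public:
  using Callback = std::function<void(const std::string&)>;
  void add(Callback cb) { callbacks_.push_back(std::move(cb)); }
  void fire(const std::string& name) {
    for (auto& cb : callbacks_) cb(name);
  }

private:
  std::vector<Callback> callbacks_;
};

// Hypothetical parent object; must be owned by a std::shared_ptr.
class Parent : public std::enable_shared_from_this<Parent> {
public:
  explicit Parent(int& counter) : counter_(counter) {}

  void registerWith(Registry& registry) {
    // The callback keeps only a weak reference, so it never extends the
    // parent's lifetime and can detect that the parent has been destroyed.
    std::weak_ptr<Parent> weak = shared_from_this();
    registry.add([weak](const std::string& name) {
      if (auto self = weak.lock()) {
        self->onCluster(name); // parent is alive for the duration of the call
      }
      // Otherwise the parent is gone and the callback is skipped.
    });
  }

private:
  void onCluster(const std::string&) { ++counter_; }
  int& counter_;
};

// Fires once while the parent is alive and once after it has been destroyed.
int weakDemo() {
  int counter = 0;
  Registry registry;
  auto parent = std::make_shared<Parent>(counter);
  parent->registerWith(registry);
  registry.fire("cluster_a"); // invoked: counter becomes 1
  parent.reset();             // destroy the parent
  registry.fire("cluster_b"); // skipped: weak.lock() fails
  return counter;
}
```

Here `weakDemo()` returns 1: the second `fire` finds the parent expired and does nothing instead of touching freed memory.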
At this moment I believe that it is possible to build a generic mechanism which will take care of the difference in lifespans of the parent object and its callbacks, but it will take me a few weeks to build and test one and to adjust the other modules. I will open another PR when such a framework is ready for review.
Risk Level: Low
Testing: Manual. Code which used to crash within 15 minutes runs without a crash.
Docs Changes: No
Release Notes: No
Platform Specific Features: No
Fixes #32427