Fix race condition in Salt loader. #50655
Conversation
This commit introduces salt.utils.msgpack and modifies all places in the code that either use json or msgpack to use salt.utils.json or salt.utils.msgpack, respectively. While this change itself does not have any effect, it is important to allow for centrally dealing with objects that cannot be directly serialized via json or msgpack.
There was a race condition in the salt loader when injecting global values (e.g. "__pillar__" or "__salt__") into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads, or the pillar data seen by a thread might even change spuriously.

There have been earlier attempts to fix this problem (saltstack#27937, saltstack#29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules. These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance. When there is more than one loader, there is more than one thread-local variable, and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on saltstack#39670.

This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: the value injected into the modules now is a proxy that internally uses a thread-local variable to decide to which object it points. This means that when loading a module again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change. This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessarily know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach used by the two earlier patches.

Unfortunately, the standard JSON and Msgpack serialization code cannot handle proxied objects, so they have to be unwrapped before passing them to that code. The salt.utils.json and salt.utils.msgpack modules have been modified to take care of unwrapping objects that are proxied using the ThreadLocalProxy.
When accessing msgpack.Unpacker, we cannot use salt.utils.msgpack, but we do not have to either.
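To illustrate that point (a minimal sketch, assuming a reasonably recent msgpack-python): Unpacker only deserializes incoming bytes, which never contain ThreadLocalProxy objects, so the plain library class can be used directly.

```python
import io

import msgpack

# Deserialization never encounters ThreadLocalProxy objects (proxies only matter
# when packing injected globals such as pillar data), so the plain library class
# is sufficient here.
buf = io.BytesIO()
buf.write(msgpack.packb({'role': 'web'}))
buf.write(msgpack.packb({'role': 'db'}))
buf.seek(0)

unpacker = msgpack.Unpacker(buf, raw=False)
for obj in unpacker:
    print(obj)  # {'role': 'web'} then {'role': 'db'}
```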
The jenkins/pr/py2-centos-7 run succeeds now. The other ones still fail, but I do not think that this is related to my changes. @cachedout: Could you maybe have a look and tell me whether any of the failed tests might be related to my changes?

Just a few comments about what's different in comparison to the earlier versions of this PR: Instead of trying to "unproxy" potentially proxied objects before passing them to serialization code, I now install a handler that is called when the serialization code (json or msgpack) encounters an object of an unsupported type. This is much cleaner than the earlier approach.

There are three commits: The first commit introduces salt.utils.msgpack and switches code over to the wrapper modules. The second commit includes the actual thread-safety fix; it introduces the ThreadLocalProxy. The third commit basically reverts some changes from the first commit, because there are a few places where using the wrapper modules is not necessary.
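As a rough sketch of that handler-based approach (simplified, not Salt's actual salt.utils.json code), the standard json module exposes a `default=` hook that is invoked only for objects the serializer cannot handle natively, which is exactly when a proxied object shows up; the ThreadLocalProxy class below is a stand-in for illustration only:

```python
import json
import threading


class ThreadLocalProxy(object):
    '''Simplified stand-in for the proxy type, just enough for this sketch.'''

    def __init__(self, value):
        self._local = threading.local()
        self._local.value = value

    def unproxy(self):
        return self._local.value


def _default(obj):
    # The serializer calls this hook only for objects it cannot handle natively,
    # which is exactly when a proxied object is encountered.
    if isinstance(obj, ThreadLocalProxy):
        return obj.unproxy()
    raise TypeError('{0} is not serializable'.format(type(obj)))


def dumps(obj, **kwargs):
    # Thin wrapper in the spirit of salt.utils.json: callers keep the usual
    # json.dumps interface, proxies get unwrapped on demand.
    kwargs.setdefault('default', _default)
    return json.dumps(obj, **kwargs)


print(dumps({'pillar': ThreadLocalProxy({'role': 'web'})}))
# -> {"pillar": {"role": "web"}}
```

msgpack-python's packb/Packer accepts the same kind of `default=` callback, so a msgpack wrapper can follow the same pattern.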
@smarsching Something went quite wrong with the test suite there. I'll re-run these.
I'm a little concerned about this failure:
@cachedout You were right to be concerned, but for the wrong reasons. 😜 The code in salt.serializers.yamlex would indeed fail if a proxied object was passed to it. I pushed a commit that fixes this. I do not think that there currently is any code that tries to serialize a proxied object to YAML, but it is still better to be prepared for that case.

However, this is not the reason why the test fails. I cannot say for sure why it fails, but I can say for sure that it is not because of a proxy object: The test fails even when run on the vanilla develop branch, so it is a preexisting bug. What I found during debugging is that it only fails when PyYAML is installed with bindings to libyaml (C code). That is probably a reason why it only fails in some of the environments. I could not reproduce the problem in my own Ubuntu 16.04 test environment until I specifically installed PyYAML with the C extensions.
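For anyone trying to reproduce this: PyYAML advertises whether it was built against libyaml, and the C-accelerated loader classes only exist in that case, which is presumably why the failure only shows up in some environments. A quick check (standard PyYAML behaviour):

```python
import yaml

# True only when PyYAML was built against the libyaml C library.
print('libyaml bindings available:', yaml.__with_libyaml__)

try:
    # The C-accelerated classes only exist when libyaml is available; code that
    # prefers them takes a slightly different path than the pure-Python classes.
    from yaml import CSafeLoader as Loader
except ImportError:
    from yaml import SafeLoader as Loader

print(yaml.load('a: 1', Loader=Loader))  # {'a': 1}
```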
We think we have this yamlex issue addressed here: #50741 |
@cachedout Thanks for the heads-up. It looks like that PR has been merged, but not yet into the develop branch. I could certainly cherry-pick the changes in order to check that they fix the failing tests in this branch, but I am afraid this might cause merge conflicts later. What do you think is the best option to get these changes into my branch for testing them?
@smarsching We'll get the merge-forward done ASAP. Let's plan on waiting for that and then I'll just come over to this PR and rebase it to include the changes. Hopefully the tests all pass after that. :)
The merge forward is done. I updated the branch and if tests pass we can merge.
module's variable without acquiring the lock and only acquires the lock
if a new proxy has to be created and injected.
'''
from salt.utils.thread_local_proxy import ThreadLocalProxy
Why the re-import?
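For context, the docstring in this hunk describes a double-checked pattern: look at the module's variable without taking the lock, and only lock when a new proxy actually has to be created and injected. Roughly along these lines (names are illustrative, not the real loader API):

```python
import threading

_inject_lock = threading.Lock()


def inject_proxy(module_globals, name, value, proxy_factory):
    # Fast path: check the module's variable without acquiring the lock.
    proxy = module_globals.get(name)
    if proxy is None:
        # Slow path: only acquire the lock if a new proxy has to be created
        # and injected, and re-check after acquiring it.
        with _inject_lock:
            proxy = module_globals.get(name)
            if proxy is None:
                proxy = proxy_factory(value)
                module_globals[name] = proxy
    return proxy
```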
@s0undt3ch do you think these test failures are related?
They seem to be, yes.
This PR fixes a very legit issue, but I don't know why it is causing the states to fail like this.
The failing state tests are
Both are failing on
@thatch45 removing the NamespacedDict wrapper seems to address the issue. I don't really see what purpose it was originally serving; ContextDict and ThreadLocalProxy seem to aim to achieve similar things w.r.t. locking access to thread-local variables. Will check in on Jenkins in the morning. @cachedout saw you in the git blame for NamespacedDictWrapper, do you remember any context for what the original intent was?
Alright, from reading further it seems the idea was to carry the ref of the root stack context instead of the inner dicts for use with tornado.stack_context. Will try to sort out why that's not playing nice with ThreadLocalProxy.
That goes back all the way to @jacksontj IIRC.
When the LazyLoader was getting instantiated from inside already lazy-loaded modules (i.e. state), pillar/grain (and probably other) globals were already populated. In this case they weren't getting dereferenced properly when wrapped via the NamespacedDictWrapper, causing an infinite loop. This should fix that scenario. Fixes saltstack#50655 (comment)
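To illustrate the failure mode described above (purely illustrative; this is not Salt's NamespacedDictWrapper code): a wrapper that keeps delegating lookups without ever dereferencing down to a plain value can end up chasing itself indefinitely.

```python
class Wrapper(object):
    '''Toy wrapper that forwards item lookups to whatever it wraps.'''

    def __init__(self, wrapped):
        self._wrapped = wrapped

    def __getitem__(self, key):
        return self._wrapped[key]


pillar = Wrapper({'role': 'web'})
print(pillar['role'])        # fine: one level of delegation, then a real dict

# If an already-wrapped global is wrapped again without being dereferenced
# first, the delegation chain can loop back on itself:
broken = Wrapper(None)
broken._wrapped = broken
# broken['role']             # RecursionError: maximum recursion depth exceeded
```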
What does this PR do?
This PR is a new version of #39948, #48598, and #50648 so the description has been copied from there. This PR will hopefully resolve the problems that existed in the original PRs. The code is identical to #50648, but this PR is based on the develop branch instead of the 2018.3.3 branch because Jenkins obviously cannot run tests for branches based on the 2018.3.3 branch.
There was a race condition in the salt loader when injecting global values (e.g. __pillar__ or __salt__) into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads or the pillar data seen by a thread might even change spuriously.
There have been earlier attempts to fix this problem (#27937, #29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules.
These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance. When there is more than one loader, there is more than one thread-local variable, and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on #39670.
This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: The value injected into the modules now is a proxy that internally uses a thread-local variable to decide to which object it points. This means that when loading a module again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change.
This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessarily know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach used by the two earlier patches.
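A minimal sketch of the idea (simplified; not the actual salt.utils.thread_local_proxy implementation): the injected value is a single proxy object, and each thread controls which real object that proxy resolves to.

```python
import threading


class ThreadLocalProxy(object):
    '''Simplified sketch: each thread decides what the proxy points to.'''

    def __init__(self, initial):
        self._fallback = initial          # seen by threads that never re-point it
        self._local = threading.local()
        self._local.value = initial

    def set_reference(self, value):
        # Re-loading a module through another loader in this thread only
        # changes this thread's view of the injected global.
        self._local.value = value

    def _resolve(self):
        return getattr(self._local, 'value', self._fallback)

    def __getitem__(self, key):
        return self._resolve()[key]


# The loader injects one proxy; different threads point it at their own data.
__pillar__ = ThreadLocalProxy({'role': 'web'})


def worker():
    __pillar__.set_reference({'role': 'db'})
    assert __pillar__['role'] == 'db'     # this thread sees only its own pillar


t = threading.Thread(target=worker)
t.start()
t.join()
assert __pillar__['role'] == 'web'        # the main thread is unaffected
```

The real proxy has to forward far more of the mapping protocol (and cooperate with the serialization unwrapping described above), but the thread-local indirection is the core of the fix.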
What issues does this PR fix or reference?
This PR fixes problems that surfaced when developing the parallel runners feature (#39670), but is also related to problems reported earlier (#23373). The problems described in #29028 might also be related, though this needs further confirmation.
Previous Behavior
Changes to pillar data in one thread might have influenced pillar data in a different thread. Whether or when this problem actually appeared was unpredictable (due to the nature of a race condition).
New Behavior
Changes to pillar data in one thread will never affect the pillar data in a different thread.
Tests written?
Yes
Regression Potential
The change to the loader code itself is rather small, but it affects a central component, so if there is a problem, it might affect nearly all components.