
Fix race condition in Salt loader. #50655

Merged: 15 commits, Jan 22, 2019

Conversation

smarsching (Contributor)
What does this PR do?

This PR is a new version of #39948, #48598, and #50648, so the description has been copied from there. This PR will hopefully resolve the problems that existed in the original PRs. The code is identical to #50648, but this PR is based on the develop branch instead of the 2018.3.3 branch, because Jenkins obviously cannot run tests for branches based on 2018.3.3.

There was a race condition in the Salt loader when injecting global values (e.g. __pillar__ or __salt__) into modules. One effect of this race condition was that in a setup with multiple threads, some threads might see pillar data intended for other threads, or the pillar data seen by a thread might even change spuriously.

There have been earlier attempts to fix this problem (#27937, #29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules.

These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance. When there is more than one loader, there is more than one thread-local variable, and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on #39670.
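
To see why (a minimal sketch with hypothetical names, not Salt's actual loader code): the injected module-level variable is shared by every thread, so re-pointing it at a second loader's thread-local storage changes what all threads see at once.

```python
import threading

class TLSView(object):
    '''Dict-like view backed by a single loader's thread-local storage.'''
    def __init__(self, tls):
        self._tls = tls
    def __getitem__(self, key):
        return self._tls.data[key]

class OldStyleLoader(object):
    '''Sketch of the pre-patch approach: one thread-local per loader.'''
    def __init__(self):
        self._tls = threading.local()
    def load(self, module, pillar):
        self._tls.data = pillar
        # Module globals are process-wide: when a *second* loader re-loads
        # the module, this assignment silently switches every thread over to
        # the second loader's thread-local storage, and the race resurfaces.
        module.__pillar__ = TLSView(self._tls)
```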

This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: the value injected into the modules is now a proxy that internally uses a thread-local variable to decide which object it points to. This means that when loading a module again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change.

This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessarily know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach used by the two earlier patches.
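
A minimal sketch of the proxy idea (illustrative only, not the actual salt.utils.thread_local_proxy implementation; a real version must forward many more special methods and needs a fallback for threads that never set a referent):

```python
import threading

class ThreadLocalProxy(object):
    '''Proxy whose referent is resolved per thread at access time.'''

    def __init__(self, obj):
        # object.__setattr__ avoids triggering attribute proxying on self.
        object.__setattr__(self, '_tls', threading.local())
        ThreadLocalProxy.set_reference(self, obj)

    @staticmethod
    def set_reference(proxy, obj):
        # Rebinds the referent for the current thread only.
        object.__getattribute__(proxy, '_tls').obj = obj

    @staticmethod
    def get_reference(proxy):
        return object.__getattribute__(proxy, '_tls').obj

    def __getattr__(self, name):
        return getattr(ThreadLocalProxy.get_reference(self), name)

    def __getitem__(self, key):
        return ThreadLocalProxy.get_reference(self)[key]

# Two threads rebinding the same proxy never see each other's data:
pillar = ThreadLocalProxy({'role': 'master'})

def worker(role):
    ThreadLocalProxy.set_reference(pillar, {'role': role})
    assert pillar['role'] == role  # isolated per thread

threads = [threading.Thread(target=worker, args=(r,)) for r in ('db', 'web')]
for t in threads:
    t.start()
for t in threads:
    t.join()
```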

What issues does this PR fix or reference?

This PR fixes problems that surfaced when developing the parallel runners feature (#39670), but is also related to problems reported earlier (#23373). The problems described in #29028 might also be related, though this needs further confirmation.

Previous Behavior

Changes to pillar data in one thread might have influenced pillar data in a different thread. Whether or when this problem actually appeared was unpredictable (due to the nature of a race condition).

New Behavior

Changes to pillar data in one thread will never affect the pillar data in a different thread.

Tests written?

Yes

Regression Potential

The change to the loader code itself is rather small, but it affects a central component, so if there is a problem, the impact might extend to nearly all components.

This commit introduces salt.utils.msgpack and modifies all places in the
code that use either json or msgpack to use salt.utils.json or
salt.utils.msgpack, respectively.

While this change does not have any effect by itself, it is important
because it allows for centrally dealing with objects that cannot be
directly serialized via json or msgpack.

Unfortunately, the standard JSON and Msgpack serialization code cannot
handle proxied objects, so they have to be unwrapped before being passed
to that code.

The salt.utils.json and salt.utils.msgpack modules have been modified to
take care of unwrapping objects that are proxied using the
ThreadLocalProxy.
smarsching requested a review from a team as a code owner on November 27, 2018 20:41
ghost requested review from a team on November 27, 2018 20:41
When accessing msgpack.Unpacker, we cannot use salt.utils.msgpack, but
we do not have to either.
smarsching (Contributor, Author) commented Nov 27, 2018

The jenkins/pr/py2-centos-7 run succeeds now. The other ones still fail, but I do not think that this is related to my changes.

@cachedout: Could you maybe have a look and tell me whether any of the failed tests might be related to my changes?

Just a few comments about what's different in comparison to the earlier versions of this PR:

Instead of trying to "unproxy" potentially proxied objects before passing them to serialization code, I now install a handler that is called when the serialization code (json or msgpack) encounters an object of an unsupported type. This is much cleaner than the unproxy_recursive function from the original PR (which I have now removed). More importantly, it fixes the problem that unproxy_recursive was not working as expected in some cases.
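
Roughly, the handler works like the standard default= hook that both json and msgpack expose for objects they cannot serialize natively (a sketch reusing the hypothetical ThreadLocalProxy from above; the real wrappers live in salt.utils.json and salt.utils.msgpack):

```python
import json

def _json_default(obj):
    # Called by the serializer only for objects it cannot handle natively;
    # unwrap proxies to the current thread's referent, fail loudly otherwise.
    if isinstance(obj, ThreadLocalProxy):
        return ThreadLocalProxy.get_reference(obj)
    raise TypeError('{0!r} is not serializable'.format(obj))

data = {'pillar': ThreadLocalProxy({'role': 'web'})}
print(json.dumps(data, default=_json_default))  # {"pillar": {"role": "web"}}

# msgpack's packb() accepts an equivalent default= callback:
#   msgpack.packb(data, default=_msgpack_default)
```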

There are three commits: The first commit introduces salt.utils.msgpack (which is similar to salt.utils.json) so that there is a single place where the serialization code can be intercepted.

The second commit includes the actual thread-safety fix. It introduces salt.utils.thread_local_proxy which is then used by the Salt loader to proxy injected objects and modifies salt.utils.json and salt.utils.msgpack to unproxy such objects when a proxy is encountered during serialization.

The third commit basically reverts some changes from the first commit. There are a few places where using the msgpack module directly is actually correct, so I made sure that msgpack is not (completely) replaced with salt.utils.msgpack in those places.

cachedout (Contributor)

@smarsching Something went quite wrong with the test suite there. I'll re-run these.

smarsching (Contributor, Author) commented Nov 29, 2018

@cachedout You were right to be concerned, but for the wrong reasons. 😜

The code in salt.serializers.yamlex would indeed fail if a proxied object was passed to it. I pushed a commit that fixes this. I do not think that there currently is any code that tries to serialize a proxied object to YAML, but it is still better to be prepared for that case.

However, this is not the reason why the test fails. I cannot say for sure why it fails, but I can say for sure that it is not because of a proxy object: The test fails even when run on the vanilla develop branch, so it is a preexisting bug.

What I found during debugging is that it only fails when PyYAML is installed with bindings to libyaml (C code). That is probably the reason why it only fails in some of the environments. I could not reproduce the problem in my own Ubuntu 16.04 test environment until I specifically installed PyYAML with the C extensions.
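
For anyone reproducing this: PyYAML exposes whether it was built against libyaml, so a quick check is:

```python
import yaml

# True only if PyYAML was compiled with the libyaml C extension;
# the failing tests reportedly only occur in that configuration.
print(getattr(yaml, '__with_libyaml__', False))
```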

cachedout (Contributor)

We think we have this yamlex issue addressed here: #50741

smarsching (Contributor, Author)

@cachedout Thanks for the heads-up. It looks like that PR has been merged into the fluorine branch, not the develop branch, so merging the changes from the develop branch into my branch probably will not help.

I could certainly cherry-pick the changes in order to check that they fix the failing tests in this branch, but I am afraid this might cause merge conflicts later.

What do you think is the best option to get these changes into my branch for testing them?

cachedout (Contributor)

@smarsching We'll get the merge-forward done ASAP. Let's plan on waiting for that and then I'll just come over to this PR and rebase it to include the changes. Hopefully the tests all pass after that. :)

thatch45 (Contributor) commented Jan 4, 2019

The merge forward is done. I updated the branch, and if tests pass we can merge.

module's variable without acquiring the lock and only acquires the lock
if a new proxy has to be created and injected.
'''
from salt.utils.thread_local_proxy import ThreadLocalProxy
Collaborator (review comment)

Why the re-import?
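
For context, the excerpt above describes a double-checked pattern for injecting the proxy; a minimal sketch with hypothetical names (reusing the sketched ThreadLocalProxy, not Salt's actual loader code):

```python
import threading

_inject_lock = threading.Lock()

def inject_global(module, name, value):
    # Fast path: read the module's variable without acquiring the lock.
    proxy = getattr(module, name, None)
    if not isinstance(proxy, ThreadLocalProxy):
        with _inject_lock:
            # Re-check under the lock: another thread may have injected a
            # proxy between the unlocked read and acquiring the lock.
            proxy = getattr(module, name, None)
            if not isinstance(proxy, ThreadLocalProxy):
                proxy = ThreadLocalProxy(value)
                setattr(module, name, proxy)
                return
    # A proxy already exists: rebind it for the current thread only.
    ThreadLocalProxy.set_reference(proxy, value)
```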

thatch45 (Contributor) commented Jan 5, 2019

@s0undt3ch do you think these test failures are related?

s0undt3ch (Collaborator)

They seem to, yes.
There are state module test failures on several of the tested platforms. It really seems related.

thatch45 (Contributor) commented Jan 5, 2019

This PR fixes a very legit issue, but I don't know why it is causing the states to fail like this.

DmitryKuzmenko (Contributor)

The failing state tests are

integration.modules.test_state.StateModuleTest.test_exclude
integration.modules.test_state.StateModuleTest.test_issue_2068_template_st

Both are failing on develop too. At least on my machine.
Analyzing.

mattp- (Contributor) commented Feb 5, 2019

@thatch45 removing the NamespacedDict wrapper seems to address the issue. I don't really see what purpose it was originally serving; ContextDict and ThreadLocalProxy seem to aim to achieve similar things w.r.t. locking access to thread-local variables. Will check in on Jenkins in the morning. @cachedout, I saw you in the git blame for NamespacedDictWrapper; do you remember any context for what the original intent was?

mattp- (Contributor) commented Feb 5, 2019

Alright, from reading further it seems the idea was to carry the ref of the root stack context instead of the inner dicts, for use with tornado.stackcontext. Will try to sort out why that's not playing nice with ThreadLocalProxy.

cachedout (Contributor)

That goes back all the way to @jacksontj IIRC.

mattp- added a commit to bloomberg/salt that referenced this pull request Feb 5, 2019
When LazyLoader was getting instantiated from inside already lazy-loaded
modules (i.e. state), pillar/grain (and probably other) globals were
already populated. In this case they weren't getting dereferenced
properly when wrapped via the NamespacedDict wrapper, causing an
infinite loop. This should fix that scenario.

Fixes saltstack#50655 (comment)
sbrennan4 pushed a commit to sbrennan4/salt that referenced this pull request Sep 17, 2019
Akm0d mentioned this pull request Oct 4, 2019
oeuftete pushed a commit to oeuftete/salt that referenced this pull request Dec 4, 2019
Akm0d pushed a commit to Akm0d/salt that referenced this pull request Jan 14, 2020
smarsching deleted the thread-safe-loader-3 branch on December 9, 2022