Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] thread safety in the loader namespacing issues #57765

Closed
sagetherage opened this issue Jun 23, 2020 · 4 comments
Closed

[BUG] thread safety in the loader namespacing issues #57765

sagetherage opened this issue Jun 23, 2020 · 4 comments
Assignees
Labels
Bug broken, incorrect, or confusing behavior Magnesium Mg release after Na prior to Al severity-critical top severity, seen by most users, serious issues
Milestone

Comments

@sagetherage
Copy link
Contributor

Description
Many issues exist in the salt loader and thread safety must be addressed

WIP PR: #56513 -- may not be the implementation we want, unknown at this time

From Closed PR from Akmod: #54878
Merge PR from develop #50655 which was a new version of #39948, #48598, and #50648 so the description has been copied from there. This PR will hopefully resolve the problems that existed in the original PRs. The code is identical to #50648, but this PR is based on the develop branch instead of the 2018.3.3 branch because Jenkins obviously cannot run tests for branches based on the 2018.3.3 branch.

There was a race condition in the salt loader when injecting global values (e.g. "pillar" or "salt") into modules. One effect of this race condition was that in a setup with multiple threads, some threads may see pillar data intended for other threads or the pillar data seen by a thread might even change spuriously.

There have been earlier attempts to fix this problem (#27937, #29397). These patches tried to fix the problem by storing the dictionary that keeps the relevant data in a thread-local variable and referencing this thread-local variable from the variables that are injected into the modules.

These patches did not fix the problem completely because they only work when a module is loaded through a single loader instance only. When there is more than one loader, there is more than one thread-local variable and the variable injected into a module is changed to point to another thread-local variable when the module is loaded again. Thus, the problem resurfaced while working on #39670.

This patch attempts to solve the problem from a slightly different angle, complementing the earlier patches: The value injected into the modules now is a proxy that internally uses a thread-local variable to decide to which object it points. This means that when loading a module
again through a different loader (possibly passing different pillar data), the data is actually only changed in the thread in which the loader is used. Other threads are not affected by such a change.

This means that it will work correctly in the current situation where loaders are possibly created by many different modules and these modules do not necessary know in which context they are executed. Thus it is much more flexible and reliable than the more explicit approach
used by the two earlier patches.
What issues does this PR fix or reference?

This PR fixes problems that surfaced when developing the parallel runners feature (#39670), but is also related to problems reported earlier (#23373). The problems described in #29028 might also be related, though this needs further confirmation.
Previous Behavior

Changes to pillar data in one thread might have influenced pillar data in a different thread. Whether or when this problem actually appeared was unpredictable (due to the nature of a race condition).
New Behavior

Changes to pillar data in one thread will never affect the pillar data in a different thread.
Setup
(Please provide relevant configs and/or SLS files (be sure to remove sensitive info).

Steps to Reproduce the behavior
(Include debug logs if possible and relevant)

Expected behavior
A clear and concise description of what you expected to happen.

Screenshots
If applicable, add screenshots to help explain your problem.

Versions Report

salt --versions-report (Provided by running salt --versions-report. Please also mention any differences in master/minion versions.)
PASTE HERE

Additional context
Add any other context about the problem here.

@sagetherage sagetherage added Bug broken, incorrect, or confusing behavior Magnesium Mg release after Na prior to Al labels Jun 23, 2020
@sagetherage sagetherage added this to the Approved milestone Jun 23, 2020
@sagetherage sagetherage added the severity-critical top severity, seen by most users, serious issues label Jun 25, 2020
@ari
Copy link
Contributor

ari commented Jun 25, 2020

If this is the problem affecting us (and I think it is), the issue got much worse in our system either when we upgraded to 3000 or from python 3.6 to 3.7. jinja variables are randomly wrong all through our pillar and it is causing major issues.

@ari
Copy link
Contributor

ari commented Jun 25, 2020

Unfortunately my attempt to work around the problem failed like this:

17:32:23,580 [WARNING ] The 'worker_threads' setting in '/usr/local/etc/salt/master' cannot be lower than 3. Resetting it to the default value of 3.

Is there a way to make salt_master obey going single threaded?

@ari
Copy link
Contributor

ari commented Jul 1, 2020

I see this ticket has Magnesium (is that 3002?) labelled. Which I guess means it will be released in Q4. I suggest that the problems this causes are huge. For us, this took our SaaS service offline for 15 minutes. For others, it will cause leakage of certificates and other critical data to minions that should not have it.

Even if the fix needs to wait a while, could you please look at what mitigations might be possible, such as running salt master single threaded, in the short term. The problem is trivial to reproduce for us, so let me know if you need more info. A combination of pillar which uses jinja 'import from' and a minion id will make it happen.

@sagetherage sagetherage modified the milestones: Approved, Magnesium Jul 14, 2020
@dwoz
Copy link
Contributor

dwoz commented Sep 4, 2020

Closing this because we have not been able to reproduce any security concerns.

We will track the progress of fixing general loader threading issues here: #58369

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug broken, incorrect, or confusing behavior Magnesium Mg release after Na prior to Al severity-critical top severity, seen by most users, serious issues
Projects
None yet
Development

No branches or pull requests

3 participants