[BUG] Pillar not available randomly #57696

Closed
angeloudy opened this issue Jun 17, 2020 · 14 comments
Labels: Bug (broken, incorrect, or confusing behavior), fixed-pls-verify (fix is linked, bug author to confirm fix), Magnesium (Mg release, after Na prior to Al), Pillar, severity-critical (top severity, seen by most users, serious issues)

@angeloudy (Contributor)

Description
My pillar sometimes becomes unavailable. It happens randomly.
salt-call pillar.get mypillar returns empty, then after a few seconds it returns my data again.

I used the following script to catch the times when the pillar is not available:

for i in {1..10000}; do
  TIMESTAMP=$(date)
  salt-call pillar.get mypillar 2>/dev/null | grep -q myvalue || echo "getting empty pillar at ${TIMESTAMP}"
  sleep 1
done

After running for a few hours, the script returned

getting empty pillar at Wed 17 Jun 2020 12:56:18 AEST
getting empty pillar at Wed 17 Jun 2020 13:40:48 AEST
getting empty pillar at Wed 17 Jun 2020 14:38:19 AEST
getting empty pillar at Wed 17 Jun 2020 15:38:58 AEST
getting empty pillar at Wed 17 Jun 2020 17:41:06 AEST
getting empty pillar at Wed 17 Jun 2020 19:35:52 AEST
getting empty pillar at Wed 17 Jun 2020 19:40:04 AEST
getting empty pillar at Wed 17 Jun 2020 19:51:37 AEST
getting empty pillar at Wed 17 Jun 2020 21:40:23 AEST
getting empty pillar at Wed 17 Jun 2020 21:55:28 AEST
getting empty pillar at Wed 17 Jun 2020 22:07:34 AEST
getting empty pillar at Wed 17 Jun 2020 22:09:33 AEST
getting empty pillar at Wed 17 Jun 2020 22:21:39 AEST
getting empty pillar at Wed 17 Jun 2020 22:30:36 AEST
getting empty pillar at Wed 17 Jun 2020 22:37:42 AEST

It's so inconsistent that I'm not sure how to debug it.

Setup:

Salt Version:
           Salt: 3000.3

Dependency Versions:
           cffi: 1.14.0
       cherrypy: Not Installed
       dateutil: 2.8.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.2
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: 1.4.6
      pycparser: 2.20
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.7.7 (default, May 13 2020, 19:19:21)
   python-gnupg: Not Installed
         PyYAML: 5.3.1
          PyZMQ: 19.0.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.1

System Versions:
           dist:
         locale: UTF-8
        machine: amd64
        release: 12.1-RELEASE-p1
         system: FreeBSD
        version: Not Installed

@angeloudy angeloudy added the Bug broken, incorrect, or confusing behavior label Jun 17, 2020
@krionbsd (Contributor)

Thank you for the report. Could you provide more information? Did you try to trace network connectivity? Is there any reason to call pillar every second?

@angeloudy (Contributor, Author)

@krionbsd
There are no connectivity issues, because I also tried with a minion that runs on the same machine as the master. I run the check every second to see how often it happens.
We noticed this issue because quite often our state.apply fails, but succeeds on retry.

@krionbsd (Contributor)

Did you run the state with -l debug?
Please read https://docs.saltstack.com/en/latest/topics/troubleshooting/index.html carefully and try those recipes.
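For reference, a debug run along those lines might look like the sketch below. This is only an illustration: mypillar is the key from this report, the log file paths are arbitrary, and -l debug / state.apply are the standard salt-call invocations.

# capture full debug output for the pillar lookup and the failing highstate
salt-call -l debug pillar.get mypillar 2>&1 | tee /tmp/pillar-debug.log
salt-call -l debug state.apply 2>&1 | tee /tmp/state-debug.log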

@xeacott xeacott added the info-needed waiting for more info label Jun 18, 2020
@xeacott xeacott added this to the Blocked milestone Jun 18, 2020
@angeloudy (Contributor, Author)

@krionbsd I've already tried that, with no luck, because it happens so randomly that I can't reproduce it by hand. And I think debugging the state.apply is useless because it fails due to lack of pillar data.

@angeloudy (Contributor, Author)

After some testing, I found that most of the pillars that go missing have Jinja conditionals around them.
For example:

{% set region = grains['id'].split('.')[0] %}
{% if region == "test" %}
mykey: https://xxxxx
{% else %}
mykey: http://192.168.xx:8307/usi
{% endif %}

These kinds of pillars go missing randomly.
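A quick way to narrow this down further might be to log the grain the conditional branches on next to the pillar value, so an empty result can be tied to either a wrong grain or a failed render. This is only a sketch: mykey stands in for one of the conditional pillar keys, and it assumes the newline_values_only outputter is available on this Salt version.

# mykey is a placeholder for one of the Jinja-conditional pillar keys
for i in {1..1000}; do
  TS=$(date)
  ID=$(salt-call --local grains.get id --out=newline_values_only 2>/dev/null)
  VAL=$(salt-call pillar.get mykey --out=newline_values_only 2>/dev/null)
  if [ -z "${VAL}" ]; then
    echo "${TS}: grain id=${ID}, pillar mykey came back empty"
  fi
  sleep 1
done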

@ari (Contributor) commented Jun 25, 2020

I'm pretty sure I've narrowed down the problem. We have code like this:

  {% from "roles.sls" import roles, region with context %}
  {% do salt.log.info("*** Minion {}. Region: {}. Roles: {}".format(grains.id, region, roles)) %}
  {% for r in roles %}
    - roles.{{ r }}
  {% endfor %}

And the problem is that the variable "roles" is randomly picking up values from other minions. So this is not only a serious bug, but also a significant security risk.
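One way to check for that kind of cross-minion leakage from the master side is to refresh pillar for all minions, dump everything, and compare successive dumps; values that jump between minion-specific results point at contamination during pillar compilation. A sketch only: the '*' target and dump paths are placeholders.

# take two full pillar snapshots a few minutes apart and compare them
salt '*' saltutil.refresh_pillar
salt '*' pillar.items --out=yaml > /tmp/pillar-dump-1.yaml
sleep 300
salt '*' saltutil.refresh_pillar
salt '*' pillar.items --out=yaml > /tmp/pillar-dump-2.yaml
diff /tmp/pillar-dump-1.yaml /tmp/pillar-dump-2.yaml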

@ari (Contributor) commented Jun 25, 2020

Possibly this is related to #57765

@angeloudy (Contributor, Author)

I just found that every time you access a variable from the imported file, it re-renders the Jinja in that file.
For example, if I move {% do salt.log.info("*** Minion {}. Region: {}. Roles: {}".format(grains.id, region, roles)) %} into roles.sls, I can see it logs multiple times; if the minion has 4 roles, I can see the log 4 times.

I don't see why this is necessary. It's a waste of resources and makes it more prone to race conditions.

@ari (Contributor) commented Jun 28, 2020

@angeloudy I think you are not correct about the way that variable is accessed. We call that import 4 times. Anyhow, let's focus on the real bug and not cloud the issue with other things. We are getting data from other minions in this variable. That breaks everything.

@sagetherage sagetherage added needs-triage and removed info-needed waiting for more info labels Aug 19, 2020
@sagetherage sagetherage removed this from the Blocked milestone Aug 19, 2020
@sagetherage sagetherage assigned Ch3LL and unassigned xeacott Sep 1, 2020
@Ch3LL (Contributor) commented Sep 1, 2020

Can you try this fix, #56513, and see if that helps? Thanks.

@Ch3LL Ch3LL added severity-critical top severity, seen by most users, serious issues fixed-pls-verify fix is linked, bug author to confirm fix and removed needs-triage labels Sep 1, 2020
@Ch3LL Ch3LL added this to the Magnesium milestone Sep 1, 2020
@ari (Contributor) commented Sep 1, 2020

Unfortunately, with my recent workload I'm unlikely to get much time to work on this any time soon. But it looks 90% likely that the threading issue in the fix you point to is the cause of the issues here, so I'm happy for you to close this ticket.

@Ch3LL (Contributor) commented Sep 2, 2020

I have not been able to replicate this, so I would like to see if @angeloudy has any luck.

@sagetherage sagetherage added the Magnesium Mg release after Na prior to Al label Sep 10, 2020
@sagetherage (Contributor)

@angeloudy are you able to test this?

@sagetherage sagetherage removed this from the Magnesium milestone Nov 3, 2020
@sagetherage sagetherage modified the milestones: Approved, Magnesium Nov 3, 2020
@sagetherage (Contributor)

OK, closing the ticket. Please mention me if this arises again or isn't fully fixed.
