[BUG] Pillar not available randomly #57696

Closed
angeloudy opened this issue Jun 17, 2020 · 14 comments
Labels: Bug (broken, incorrect, or confusing behavior), fixed-pls-verify (fix is linked, bug author to confirm fix), Magnesium (Mg release, after Na prior to Al), Pillar, severity-critical (top severity, seen by most users, serious issues)

@angeloudy (Contributor)

Description
My pillar sometimes becomes unavailable. It happens randomly.
salt-call pillar.get mypillar returns empty, then after a few seconds it returns my data again.

I used the following script to catch the times when the pillar is not available:

for i in {1..10000}; do
  TIMESTAMP=$(date)
  salt-call pillar.get mypillar 2>/dev/null | grep -q myvalue || echo "getting empty pillar at ${TIMESTAMP}"
  sleep 1
done

After running for a few hours, the script returned

getting empty pillar at Wed 17 Jun 2020 12:56:18 AEST
getting empty pillar at Wed 17 Jun 2020 13:40:48 AEST
getting empty pillar at Wed 17 Jun 2020 14:38:19 AEST
getting empty pillar at Wed 17 Jun 2020 15:38:58 AEST
getting empty pillar at Wed 17 Jun 2020 17:41:06 AEST
getting empty pillar at Wed 17 Jun 2020 19:35:52 AEST
getting empty pillar at Wed 17 Jun 2020 19:40:04 AEST
getting empty pillar at Wed 17 Jun 2020 19:51:37 AEST
getting empty pillar at Wed 17 Jun 2020 21:40:23 AEST
getting empty pillar at Wed 17 Jun 2020 21:55:28 AEST
getting empty pillar at Wed 17 Jun 2020 22:07:34 AEST
getting empty pillar at Wed 17 Jun 2020 22:09:33 AEST
getting empty pillar at Wed 17 Jun 2020 22:21:39 AEST
getting empty pillar at Wed 17 Jun 2020 22:30:36 AEST
getting empty pillar at Wed 17 Jun 2020 22:37:42 AEST

It's so inconsistent that I'm not sure how to debug it.

Setup:

Salt Version:
           Salt: 3000.3

Dependency Versions:
           cffi: 1.14.0
       cherrypy: Not Installed
       dateutil: 2.8.1
      docker-py: Not Installed
          gitdb: Not Installed
      gitpython: Not Installed
         Jinja2: 2.11.2
        libgit2: Not Installed
       M2Crypto: Not Installed
           Mako: Not Installed
   msgpack-pure: Not Installed
 msgpack-python: 0.6.2
   mysql-python: 1.4.6
      pycparser: 2.20
       pycrypto: 2.6.1
   pycryptodome: Not Installed
         pygit2: Not Installed
         Python: 3.7.7 (default, May 13 2020, 19:19:21)
   python-gnupg: Not Installed
         PyYAML: 5.3.1
          PyZMQ: 19.0.1
          smmap: Not Installed
        timelib: Not Installed
        Tornado: 4.5.3
            ZMQ: 4.3.1

System Versions:
           dist:
         locale: UTF-8
        machine: amd64
        release: 12.1-RELEASE-p1
         system: FreeBSD
        version: Not Installed

@angeloudy angeloudy added the Bug broken, incorrect, or confusing behavior label Jun 17, 2020
@krionbsd (Contributor)

Thank you for the report. Could you provide more information? Did you try to trace network connectivity? Is there any reason to call pillar every second?

@angeloudy (Contributor, Author)

@krionbsd
There are no connectivity issues, because I also tried with a minion that runs on the same machine as the master. I run the check every second to see how often it happens.
We noticed this issue because quite often our state.apply fails, but succeeds on retry.

@krionbsd (Contributor)

Did you run the state with -l debug?
Please read https://docs.saltstack.com/en/latest/topics/troubleshooting/index.html carefully and try those recipes.
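For reference, a debug run along those lines might look like the sketch below. This is only an illustration: mypillar is the key from this report, the log file paths are arbitrary, and -l debug / state.apply are the standard salt-call invocations.

# capture full debug output for the pillar lookup and the failing highstate
salt-call -l debug pillar.get mypillar 2>&1 | tee /tmp/pillar-debug.log
salt-call -l debug state.apply 2>&1 | tee /tmp/state-debug.log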

@xeacott xeacott added the info-needed waiting for more info label Jun 18, 2020
@xeacott xeacott added this to the Blocked milestone Jun 18, 2020
@angeloudy (Contributor, Author)

@krionbsd I've already tried that, with no luck, because it happens so randomly that I can't reproduce it by hand. And I think debugging the state.apply is useless because it fails due to lack of pillar data.

@angeloudy (Contributor, Author)

After some testing, I found that most of the pillars that go missing have Jinja conditionals around them.
For example:

{% set region = grains['id'].split('.')[0] %}
{% if region == "test" %}
mykey: https://xxxxx
{% else %}
mykey: http://192.168.xx:8307/usi
{% endif %}

These kinds of pillars go missing randomly.
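A quick way to narrow this down further might be to log the grain the conditional branches on next to the pillar value, so an empty result can be tied to either a wrong grain or a failed render. This is only a sketch: mykey stands in for one of the conditional pillar keys, and it assumes the newline_values_only outputter is available on this Salt version.

# mykey is a placeholder for one of the Jinja-conditional pillar keys
for i in {1..1000}; do
  TS=$(date)
  ID=$(salt-call --local grains.get id --out=newline_values_only 2>/dev/null)
  VAL=$(salt-call pillar.get mykey --out=newline_values_only 2>/dev/null)
  if [ -z "${VAL}" ]; then
    echo "${TS}: grain id=${ID}, pillar mykey came back empty"
  fi
  sleep 1
done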

@ari (Contributor) commented Jun 25, 2020

I'm pretty sure I've narrowed down the problem. We have code like this:

  {% from "roles.sls" import roles, region with context %}
  {% do salt.log.info("*** Minion {}. Region: {}. Roles: {}".format(grains.id, region, roles)) %}
  {% for r in roles %}
    - roles.{{ r }}
  {% endfor %}

And the problem is that the variable "roles" is randomly picking up values from other minions. So this is not only a serious bug, but also a significant security risk.
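One way to check for that kind of cross-minion leakage from the master side is to refresh pillar for all minions, dump everything, and compare successive dumps; values that jump between minion-specific results point at contamination during pillar compilation. A sketch only: the '*' target and dump paths are placeholders.

# take two full pillar snapshots a few minutes apart and compare them
salt '*' saltutil.refresh_pillar
salt '*' pillar.items --out=yaml > /tmp/pillar-dump-1.yaml
sleep 300
salt '*' saltutil.refresh_pillar
salt '*' pillar.items --out=yaml > /tmp/pillar-dump-2.yaml
diff /tmp/pillar-dump-1.yaml /tmp/pillar-dump-2.yaml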

@ari (Contributor) commented Jun 25, 2020

Possibly this is related to #57765

@angeloudy (Contributor, Author)

I just found that every time you access a variable from the imported file, it re-renders the Jinja in that file.
For example, if I move {% do salt.log.info("*** Minion {}. Region: {}. Roles: {}".format(grains.id, region, roles)) %} into roles.sls, I can see it logs multiple times; if the minion has 4 roles, I can see the log 4 times.

I don't see why this is necessary. It's a waste of resources and makes it more prone to race conditions.

@ari (Contributor) commented Jun 28, 2020

@angeloudy I think you are not correct about the way that variable is accessed. We call that import 4 times. Anyhow, let's focus on the real bug and not cloud the issue with other things. We are getting data from other minions in this variable. That breaks everything.

@sagetherage sagetherage added needs-triage and removed info-needed waiting for more info labels Aug 19, 2020
@sagetherage sagetherage removed this from the Blocked milestone Aug 19, 2020
@sagetherage sagetherage assigned Ch3LL and unassigned xeacott Sep 1, 2020
@Ch3LL (Contributor) commented Sep 1, 2020

Can you try this fix, #56513, and see if that helps? Thanks.

@Ch3LL Ch3LL added severity-critical top severity, seen by most users, serious issues fixed-pls-verify fix is linked, bug author to confirm fix and removed needs-triage labels Sep 1, 2020
@Ch3LL Ch3LL added this to the Magnesium milestone Sep 1, 2020
@ari (Contributor) commented Sep 1, 2020

Unfortunately, with my recent workload I'm unlikely to get much time to work on this any time soon. But it looks 90% likely that the threading issue in the fix you point to is the cause of the issues here, so I'm happy for you to close this ticket.

@Ch3LL (Contributor) commented Sep 2, 2020

I have not been able to replicate this, so I would like to see if @angeloudy has any luck.

@sagetherage sagetherage added the Magnesium Mg release after Na prior to Al label Sep 10, 2020
@sagetherage (Contributor)

@angeloudy are you able to test this?

@sagetherage sagetherage removed this from the Magnesium milestone Nov 3, 2020
@sagetherage sagetherage modified the milestones: Approved, Magnesium Nov 3, 2020
@sagetherage (Contributor)

OK, closing the ticket. Please mention me if this arises again or isn't fully fixed.
