Work around a supervisord bug causing the provisioning to hang intermittently. #2557
Conversation
Looks reasonable to me, but I'm not the most qualified to review this - I wouldn't know if e.g. the …
What version of celery are you running? We ran into a similar error with an older version of celery. Celery would deadlock, and our worker timeout was so high that it was causing issues.
@feanil This problem occurred when running the edx_platform.yml playbook, so it was using whatever version of celery edx-platform installs (3.1.18 at the moment).
@bradenmacdonald The edxapp_worker group is defined by iterating over …
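(For illustration only: the sketch below shows roughly what iterating over per-worker definitions looks like in an Ansible task. The variable and program names are assumptions, not the repository's actual code.)

```yaml
# Hypothetical sketch: manage each worker program individually by looping over a
# list of worker definitions; every item maps to one program inside the
# edxapp_worker supervisor group. "edxapp_workers" and "item.queue" are assumed names.
- name: Ensure each edxapp worker is started
  supervisorctl:
    name: "edxapp_worker:{{ item.queue }}"
    state: started
  with_items: "{{ edxapp_workers }}"
```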
Force-pushed from bea92b4 to e8e3154
Thanks for the pull request, @smarnach! It looks like you're a member of a company that does contract work for edX. If you're doing this work as part of a paid contract with edX, you should talk to edX about who will review this pull request. If this work is not part of a paid contract with edX, then you should ensure that there is an OSPR issue to track this work in JIRA, so that we don't lose track of your pull request. To automatically create an OSPR issue for this pull request, just visit this link: https://openedx-webhooks.herokuapp.com/github/process_pr?repo=edx%2Fconfiguration&number=2557
The issue linked above (Supervisor/supervisor#131) was closed, and the changes were released in Supervisor 3.2.0 (November 30, 2015).
@feanil Could you please take another look at this? We just ran into this problem again. I can't really tell whether it's caused by Supervisor or Celery, but this fix seems to resolve it. Would it be possible to merge this just for its merit of simplifying the code? ;-)
Force-pushed from e8e3154 to bbc294f
I just rebased on top of current master.
Taking a look now. I'll let you know once we have tested it here.
Looks good to me. @fredsmith second review?
@fredsmith ^ friendly ping re: providing a second review. @feanil Thanks for the review!
👍 |
Work around a supervisord bug causing the provisioning to hang intermittently.
We've experienced an intermittent problem with supervisorctl hanging when restarting the edxapp celery workers on our sandboxes. When the problem occurred, the Ansible task that starts (or checks) the workers hung indefinitely on one of the workers. After logging in to the machine, running `supervisorctl status` showed the affected process in the state "STOPPING". The corresponding child process of supervisord sometimes had zombie child processes, and the supervisorctl process was hanging. Killing either supervisorctl or the right child process of supervisord resolved the situation (and caused the Ansible task to fail). The exact symptoms of the problem varied across multiple executions with exactly the same configuration, and one incarnation of the problem was reported on the openedx-ops mailing list a while ago (I don't have the link at the moment).

Similar problems with supervisord have been reported on superuser.com, various Linux distribution bug trackers and the bug tracker of supervisord itself, e.g. this bug. To me, it looks like some kind of deadlock, either in the communication between supervisorctl and supervisord or in the communication between supervisord and the celery worker. Some sources indicate that the problem is fixed in newer versions of supervisord.
We usually only encountered the problem intermittently when under memory pressure, but during the tests for #2527 it occurred consistently even with swap enabled, which gave me the opportunity to test work-arounds. The patch in this PR resulted in the provisioning consistently working, while it consistently failed without it. Since it is also a code simplification, I hope it qualifies for inclusion in spite of the elusive nature of the bug it works around.
(Chesterton's fence: support for group names was only added to Ansible in version 1.6, which explains why the current code is written the way it is.)
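As a rough sketch of the simplification this makes possible (not the actual diff; the variable names below are assumptions), Ansible's supervisorctl module can address the whole group in a single call by using the trailing-colon group syntax it has supported since version 1.6:

```yaml
# Minimal sketch: start (or check) every program in the edxapp_worker group with
# one supervisorctl call instead of looping over the individual workers.
- name: Ensure all edxapp workers are started
  supervisorctl:
    name: "edxapp_worker:"                       # trailing colon addresses the group
    state: started
    supervisorctl_path: "{{ supervisor_ctl }}"   # assumed variable name
    config: "{{ supervisor_cfg }}"               # assumed variable name
```

Addressing the group directly removes the per-worker loop, which is the code simplification referred to above.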