This repository has been archived by the owner on May 6, 2024. It is now read-only.

Work around a supervisord bug causing the provisioning to hang intermittently. #2557

Conversation

smarnach
Contributor

@smarnach smarnach commented Dec 6, 2015

We've experienced an intermittent problem with supervisorctl hanging when restarting the edxapp celery workers on our sandboxes. When the problem occurred, the Ansible task that starts (or checks) the workers hung indefinitely on one of the workers. When logging in to the machine, supervisorctl status showed the affected process in the state "STOPPING". The corresponding child process of supervisord sometimes had zombie child processes of its own, and the supervisorctl process was hanging. Killing either supervisorctl or the stuck child process of supervisord resolved the situation (and caused the Ansible task to fail). The exact symptoms varied across multiple executions with exactly the same configuration, and one incarnation of the problem was reported on the openedx-ops mailing list a while ago (I don't have the link at the moment).
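For reference, the manual diagnosis described above looked roughly like this (a sketch of the kind of commands used, not output from an actual incident; the PID is a placeholder):

```sh
# Show the state of the processes managed by supervisord; the affected
# worker program was stuck in the STOPPING state.
sudo supervisorctl status

# Inspect the process tree under supervisord to spot defunct (zombie) children.
ps -ef --forest | grep -A 10 supervisord

# Killing either the hanging supervisorctl client or the stuck child process
# of supervisord unblocked the situation (and failed the Ansible task).
sudo kill <pid-of-stuck-child>
```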

Similar problems with supervisord have been reported on superuser.com, various Linux distribution bug trackers and the bug tracker of supervisord itself, e.g. this bug. To me, it looks like some kind of deadlock, either in the communication between supervisorctl and supervisord or in the communication between supervisord and the celery worker. Some sources indicate that the problem is fixed in newer versions of supervisord.

We usually encountered the problem only intermittently and under memory pressure, but during the tests for #2527 it occurred consistently even with swap enabled, which gave me the opportunity to test work-arounds. The provisioning consistently succeeded with the patch in this PR and consistently failed without it. Since the patch is also a code simplification, I hope it qualifies for inclusion in spite of the elusive nature of the bug it works around.

(Chesterton's fence: Support for group names was added to Ansible in version 1.6, which explains the current version of the code.)
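For illustration, the change is roughly of the following shape (a sketch only, not the actual diff; apart from the edxapp_workers list and the edxapp_worker: group name, the variable names and the per-program name pattern shown here are placeholders):

```yaml
# Before (pre-Ansible-1.6 style): one supervisorctl call per worker,
# looping over the edxapp_workers list.
- name: Ensure edxapp workers are started
  supervisorctl:
    name: "edxapp_worker:{{ item.queue }}"   # per-program name; pattern is illustrative
    state: started
    supervisorctl_path: "{{ supervisor_ctl }}"
    config: "{{ supervisor_cfg }}"
  with_items: "{{ edxapp_workers }}"

# After: a single call against the whole supervisord group (note the trailing
# colon), which Ansible's supervisorctl module supports since version 1.6.
- name: Ensure edxapp workers are started
  supervisorctl:
    name: "edxapp_worker:"
    state: started
    supervisorctl_path: "{{ supervisor_ctl }}"
    config: "{{ supervisor_cfg }}"
```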

@bradenmacdonald
Contributor

Looks reasonable to me, but I'm not the most qualified to review this - I wouldn't know if e.g. the edxapp_worker: group ever contains other worker definitions beyond those included in edxapp_workers (I'm assuming it doesn't).

@feanil
Contributor

feanil commented Dec 8, 2015

What version of celery are you running? We ran into a similar error with an older version of celery: Celery would deadlock, and our worker timeout is so high that it was causing issues.

@smarnach
Contributor Author

smarnach commented Dec 8, 2015

@feanil This problem occurred when running the edx_platform.yml playbook, so it was using whatever version of celery edx-platform installs (3.1.18 at the moment).

@smarnach
Contributor Author

smarnach commented Dec 8, 2015

@bradenmacdonald The edxapp_worker group is defined by iterating over edxapp_workers, so it will indeed never contain any additional workers beyond those in that list.
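For context, the generated supervisord configuration is along these lines (a rough sketch assuming a Jinja2 template that iterates over edxapp_workers; the program names, worker attributes, and command line are illustrative, not the actual template):

```ini
; One [program:...] section is rendered per entry of edxapp_workers ...
{% for worker in edxapp_workers %}
[program:{{ worker.queue }}]
command=... celery worker --queues={{ worker.queue }} --concurrency={{ worker.concurrency }}
{% endfor %}

; ... and all of them are collected into the edxapp_worker group, so acting on
; "edxapp_worker:" covers exactly the programs generated from edxapp_workers.
[group:edxapp_worker]
programs={% for worker in edxapp_workers %}{{ worker.queue }}{% if not loop.last %},{% endif %}{% endfor %}
```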

@smarnach smarnach force-pushed the smarnach/supervisord-workaround branch from bea92b4 to e8e3154 on December 11, 2015 16:51
@openedx-webhooks

Thanks for the pull request, @smarnach! It looks like you're a member of a company that does contract work for edX. If you're doing this work as part of a paid contract with edX, you should talk to edX about who will review this pull request. If this work is not part of a paid contract with edX, then you should ensure that there is an OSPR issue to track this work in JIRA, so that we don't lose track of your pull request.

To automatically create an OSPR issue for this pull request, just visit this link: https://openedx-webhooks.herokuapp.com/github/process_pr?repo=edx%2Fconfiguration&number=2557

@mnaberez

> Similar problems with supervisord have been reported on superuser.com, various Linux distribution bug trackers and the bug tracker of supervisord itself, e.g. this bug. To me, it looks like some kind of deadlock, either in the communication between supervisorctl and supervisord or in the communication between supervisord and the celery worker. Some sources indicate that the problem is fixed in newer versions of supervisord.

The issue linked above (Supervisor/supervisor#131) was closed and the changes were released in Supervisor 3.2.0 (November 30, 2015).

@smarnach
Contributor Author

@feanil Could you please take another look into this? We just ran into this problem again. I can't really tell whether it's caused by Supervisor or Celery, but this fix seems to resolve it. Would it be possible to merge this just for its merits in simplifying the code? ;-)

@smarnach smarnach force-pushed the smarnach/supervisord-workaround branch from e8e3154 to bbc294f on February 15, 2016 13:22
@smarnach
Contributor Author

I just rebased on top of current master.

@feanil
Contributor

feanil commented Feb 16, 2016

Taking a look now. I'll let you know once we have tested it here.

@feanil
Contributor

feanil commented Feb 16, 2016

Looks good to me. @fredsmith second review?

@bradenmacdonald
Contributor

@fredsmith ^ friendly ping re: providing a second review.

@feanil Thanks for the review!

@fredsmith
Contributor

👍

fredsmith pushed a commit that referenced this pull request Feb 29, 2016
Work around a supervisord bug causing the provisioning to hang intermittently.
@fredsmith fredsmith merged commit b3a7dc0 into openedx-unsupported:master Feb 29, 2016
@bradenmacdonald bradenmacdonald deleted the smarnach/supervisord-workaround branch February 29, 2016 16:56