This repository has been archived by the owner on May 6, 2024. It is now read-only.

Work around a supervisord bug causing the provisioning to hang intermittently. #2557

Conversation

smarnach
Contributor

@smarnach smarnach commented Dec 6, 2015

We've experienced an intermittent problem with supervisorctl hanging when restarting the edxapp celery workers on our sandboxes. When the problem occurred, the Ansible task that starts (or checks) the workers hung indefinitely on one of the workers. When logging in to the machine, supervisorctl status showed the affected process in the state "STOPPING". The corresponding child process of supervisord sometimes had zombie child processes of its own, and the supervisorctl process was hanging. Killing either supervisorctl or the stuck child process of supervisord resolved the situation (and caused the Ansible task to fail). The exact symptoms varied across multiple executions with exactly the same configuration, and one incarnation of the problem was reported on the openedx-ops mailing list a while ago (I don't have the link at the moment).
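For reference, the manual diagnosis described above looked roughly like this (a sketch of the kind of commands used, not output from an actual incident; the PID is a placeholder):

```sh
# Show the state of the processes managed by supervisord; the affected
# worker program was stuck in the STOPPING state.
sudo supervisorctl status

# Inspect the process tree under supervisord to spot defunct (zombie) children.
ps -ef --forest | grep -A 10 supervisord

# Killing either the hanging supervisorctl client or the stuck child process
# of supervisord unblocked the situation (and failed the Ansible task).
sudo kill <pid-of-stuck-child>
```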

Similar problems with supervisord have been reported on superuser.com, various Linux distribution bug trackers and the bug tracker of supervisord itself, e.g. this bug. To me, it looks like some kind of deadlock, either in the communication between supervisorctl and supervisord or in the communication between supervisord and the celery worker. Some sources indicate that the problem is fixed in newer versions of supervisord.

We usually encountered the problem only intermittently and under memory pressure, but during the tests for #2527 it occurred consistently even with swap enabled, which gave me the opportunity to test work-arounds. The provisioning consistently succeeded with the patch in this PR and consistently failed without it. Since the patch is also a code simplification, I hope it qualifies for inclusion in spite of the elusive nature of the bug it works around.

(Chesterton's fence: Support for group names was added to Ansible in version 1.6, which explains the current version of the code.)
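For illustration, the change is roughly of the following shape (a sketch only, not the actual diff; apart from the edxapp_workers list and the edxapp_worker: group name, the variable names and the per-program name pattern shown here are placeholders):

```yaml
# Before (pre-Ansible-1.6 style): one supervisorctl call per worker,
# looping over the edxapp_workers list.
- name: Ensure edxapp workers are started
  supervisorctl:
    name: "edxapp_worker:{{ item.queue }}"   # per-program name; pattern is illustrative
    state: started
    supervisorctl_path: "{{ supervisor_ctl }}"
    config: "{{ supervisor_cfg }}"
  with_items: "{{ edxapp_workers }}"

# After: a single call against the whole supervisord group (note the trailing
# colon), which Ansible's supervisorctl module supports since version 1.6.
- name: Ensure edxapp workers are started
  supervisorctl:
    name: "edxapp_worker:"
    state: started
    supervisorctl_path: "{{ supervisor_ctl }}"
    config: "{{ supervisor_cfg }}"
```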

@bradenmacdonald
Contributor

Looks reasonable to me, but I'm not the most qualified to review this - I wouldn't know if e.g. the edxapp_worker: group ever contains other worker definitions beyond those included in edxapp_workers (I'm assuming it doesn't).

@feanil
Contributor

feanil commented Dec 8, 2015

What version of celery are you running? We ran into a similar error with an older version of celery: Celery would deadlock, and our worker timeout is so high that it was causing issues.

@smarnach
Contributor Author

smarnach commented Dec 8, 2015

@feanil This problem occurred when running the edx_platform.yml playbook, so it was using whatever version of celery edx-platform installs (3.1.18 at the moment).

@smarnach
Contributor Author

smarnach commented Dec 8, 2015

@bradenmacdonald The edxapp_worker group is defined by iterating over edxapp_workers, so it will indeed never contain any additional workers beyond those in that list.
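For context, the generated supervisord configuration is along these lines (a rough sketch assuming a Jinja2 template that iterates over edxapp_workers; the program names, worker attributes, and command line are illustrative, not the actual template):

```ini
; One [program:...] section is rendered per entry of edxapp_workers ...
{% for worker in edxapp_workers %}
[program:{{ worker.queue }}]
command=... celery worker --queues={{ worker.queue }} --concurrency={{ worker.concurrency }}
{% endfor %}

; ... and all of them are collected into the edxapp_worker group, so acting on
; "edxapp_worker:" covers exactly the programs generated from edxapp_workers.
[group:edxapp_worker]
programs={% for worker in edxapp_workers %}{{ worker.queue }}{% if not loop.last %},{% endif %}{% endfor %}
```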

@smarnach smarnach force-pushed the smarnach/supervisord-workaround branch from bea92b4 to e8e3154 on December 11, 2015 16:51
@openedx-webhooks

Thanks for the pull request, @smarnach! It looks like you're a member of a company that does contract work for edX. If you're doing this work as part of a paid contract with edX, you should talk to edX about who will review this pull request. If this work is not part of a paid contract with edX, then you should ensure that there is an OSPR issue to track this work in JIRA, so that we don't lose track of your pull request.

To automatically create an OSPR issue for this pull request, just visit this link: https://openedx-webhooks.herokuapp.com/github/process_pr?repo=edx%2Fconfiguration&number=2557

@mnaberez

> Similar problems with supervisord have been reported on superuser.com, various Linux distribution bug trackers and the bug tracker of supervisord itself, e.g. this bug. To me, it looks like some kind of deadlock, either in the communication between supervisorctl and supervisord or in the communication between supervisord and the celery worker. Some sources indicate that the problem is fixed in newer versions of supervisord.

The issue linked above (Supervisor/supervisor#131) was closed and the changes were released in Supervisor 3.2.0 (November 30, 2015).

@smarnach
Contributor Author

@feanil Could you please take another look into this? We just ran into this problem again. I can't really tell whether it's caused by Supervisor or Celery, but this fix seems to resolve it. Would it be possible to merge this just for its merits in simplifying the code? ;-)

@smarnach smarnach force-pushed the smarnach/supervisord-workaround branch from e8e3154 to bbc294f on February 15, 2016 13:22
@smarnach
Contributor Author

I just rebased on top of current master.

@feanil
Contributor

feanil commented Feb 16, 2016

Taking a look now. I'll let you know once we have tested it here.

@feanil
Contributor

feanil commented Feb 16, 2016

Looks good to me. @fredsmith second review?

@bradenmacdonald
Contributor

@fredsmith ^ friendly ping re: providing a second review.

@feanil Thanks for the review!

@fredsmith
Contributor

👍

fredsmith pushed a commit that referenced this pull request Feb 29, 2016
Work around a supervisord bug causing the provisioning to hang intermittently.
@fredsmith fredsmith merged commit b3a7dc0 into openedx-unsupported:master Feb 29, 2016
@bradenmacdonald bradenmacdonald deleted the smarnach/supervisord-workaround branch February 29, 2016 16:56