This repository has been archived by the owner on May 6, 2024. It is now read-only.

Configurable timeouts for celery workers #2993

Merged

e-kolpakov
Contributor

@e-kolpakov e-kolpakov commented Apr 29, 2016

Description: This PR makes edxapp celery worker timeouts configurable (both global default and per-worker).
Background: Provisions often fail with

2016-03-14 10:15:31-0700 | INFO | instance.models.instance  | instance=sandbox | TASK: [edxapp | ensure edxapp_workers has started] ****************************
2016-03-14 10:15:31-0700 | INFO | instance.models.instance  | instance=sandbox | failed: [149.202.174.205] => {"failed": true}
2016-03-14 10:15:31-0700 | INFO | instance.models.instance  | instance=sandbox | msg: edxapp_worker:cms_default_1: ERROR (abnormal termination)
2016-03-14 10:15:31-0700 | INFO | instance.models.instance  | instance=sandbox | 
2016-03-14 10:15:31-0700 | INFO | instance.models.instance  | instance=sandbox | 
2016-03-14 10:15:31-0700 | INFO | instance.models.instance  | instance=sandbox | FATAL: all hosts have already failed -- aborting

or simply time out around the same lines. Investigation suggests that edxapp_worker:cms_default_1 fails to restart properly (or fails to signal supervisor that it has restarted). Unlike the other attempts to fix this (listed below), lowering the restart timeout seems to reliably fix or work around the problem.
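To make the "global default plus per-worker override" idea concrete, here is a minimal Python sketch of the fallback logic. It is an illustration only: the actual change lives in the role's Ansible variables and supervisor config template, which this page shows only in part. The function names, the worker dictionaries, and the per-worker `stopwaitsecs` key are assumptions made for the example. The default value 432000 comes from the diff in this PR, and `stopwaitsecs` is the supervisor program option that controls how long supervisor waits for a process to stop.

```python
# Hypothetical names throughout: the real role defines these values in Ansible
# variables and a supervisor config template, not in Python.
DEFAULT_STOPWAITSECS = 432000  # default shown in this PR's diff (5 days, in seconds)


def resolve_stopwaitsecs(worker, default=DEFAULT_STOPWAITSECS):
    """Return the worker's own timeout if it sets one, else the global default."""
    return worker.get("stopwaitsecs", default)


def render_program_option(worker, default=DEFAULT_STOPWAITSECS):
    """Render the supervisor option line for a single worker."""
    return f"stopwaitsecs={resolve_stopwaitsecs(worker, default)}"


# Illustrative worker definitions; the keys are assumptions for this sketch.
workers = [
    {"service_variant": "cms", "queue": "default", "stopwaitsecs": 30},  # per-worker override
    {"service_variant": "lms", "queue": "default"},  # falls back to the global default
]

for worker in workers:
    print(worker["service_variant"], render_program_option(worker))
```

With this shape, lowering the timeout for a single misbehaving worker does not require changing the default that every other worker uses.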
Related:
First noticed: comment thread.
Previous attempts to fix the issue: #2557, #2871, #2875

JIRA: https://openedx.atlassian.net/browse/OSPR-1252
Email threads: https://groups.google.com/forum/#!topic/openedx-ops/5hblv9LGeR8 - potentially related.

Author concerns:

The major concern is that the problem this patch aims to fix or work around is intermittent: about 50% of builds succeeded without the patch. So it may be that the odds were in my favor this time, and none of the deployments I did to verify the fix ever actually triggered the root cause.

@openedx-webhooks

Thanks for the pull request, @e-kolpakov! It looks like you're a member of a company that does contract work for edX. If you're doing this work as part of a paid contract with edX, you should talk to edX about who will review this pull request. If this work is not part of a paid contract with edX, then you should ensure that there is an OSPR issue to track this work in JIRA, so that we don't lose track of your pull request.

To automatically create an OSPR issue for this pull request, just visit this link: https://openedx-webhooks.herokuapp.com/github/process_pr?repo=edx%2Fconfiguration&number=2993

@openedx-webhooks

Thanks for the pull request, @e-kolpakov! I've created OSPR-1252 to keep track of it in JIRA. JIRA is a place for product owners to prioritize feature reviews by the engineering development teams.

Feel free to add as much of the following information to the ticket:

  • supporting documentation
  • edx-code email threads
  • timeline information ("this must be merged by XX date", and why that is)
  • partner information ("this is a course on edx.org")
  • any other information that can help Product understand the context for the PR

All technical communication about the code itself will still be done via the GitHub pull request interface. As a reminder, our process documentation is here.

If you like, you can add yourself to the AUTHORS file for this repo, though that isn't required. Please see the CONTRIBUTING file for more information.

@openedx-webhooks openedx-webhooks added the open-source-contribution (PR author is not from Axim or 2U) and needs triage labels Apr 29, 2016
@e-kolpakov e-kolpakov force-pushed the ekolpakov/celery-fix branch from b26a42b to e6f33a8 Compare May 2, 2016 09:41
@smarnach
Contributor

smarnach commented May 2, 2016

Changes look good to me. 👍 once the sandbox provisioning using the test branch succeeds. Once we have the option to configure the timeout, we can still decide on our side what a good timeout to use would be.

@@ -599,6 +599,7 @@ edxapp_git_ssh: "/tmp/edxapp_git_ssh.sh"
edxapp_legacy_course_data_dir: "{{ edxapp_app_dir }}/data"

edxapp_workers: "{{ EDXAPP_CELERY_WORKERS }}"
edxapp_worker_default_stopwaitsecs: 432000
Contributor

Since it is meant to be externally configurable let's make this an ALL_CAPS variable name.

@e-kolpakov e-kolpakov force-pushed the ekolpakov/celery-fix branch from e6f33a8 to 230be33 Compare May 10, 2016 10:49
@e-kolpakov
Contributor Author

@feanil thank you for the review - note addressed.

@feanil
Contributor

feanil commented May 10, 2016

@edx/devops change looks good to me, can I get a second review?

@maxrothman
Contributor

👍

@e-kolpakov e-kolpakov merged commit a94a256 into openedx-unsupported:master May 11, 2016
@bradenmacdonald bradenmacdonald deleted the ekolpakov/celery-fix branch May 12, 2016 16:53
5 participants