This repository has been archived by the owner on May 6, 2024. It is now read-only.

Merge dogwood to master #2772

Merged
merged 9 commits into master from ned/merge-dogwood-to-master
Feb 12, 2016

Conversation

@nedbat (Contributor) commented Feb 11, 2016

The dogwood-specific detail about box defaults is not merged.

@openedx-webhooks

PR Cover Letter

  • A detailed description of the changes in the body of the PR

Reviewers

If you've been tagged for review, please check your corresponding box once you've given the 👍.

Post-review

  • Squash commits

@nedbat (Contributor, Author) commented Feb 11, 2016

@feanil This is almost all the rest of the Dogwood changes.

@feanil (Contributor) commented Feb 12, 2016

👍

nedbat added a commit that referenced this pull request Feb 12, 2016
@nedbat nedbat merged commit adae6cf into master Feb 12, 2016
@nedbat nedbat deleted the ned/merge-dogwood-to-master branch February 12, 2016 18:54
smarnach added a commit to open-craft/configuration that referenced this pull request Feb 18, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
@smarnach (Contributor)

This PR broke sandbox builds for us. Restarting the certs service now fails:

2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | TASK: [certs | restart certs] *************************************************
2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | failed: [149.202.188.208] => {"failed": true}
2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | msg: certs: stopped
2016-02-19 00:26:27+0100 | ERROR | instance.models.instance  | instance=master.sandbox | You should consider upgrading via the 'pip install --upgrade pip' command.
2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | certs: ERROR (spawn error)

This is the log output for the certs service:

Traceback (most recent call last):
  File "/edx/app/certs/certificates/certificate_agent.py", line 197, in <module>
    main()
  File "/edx/app/certs/certificates/certificate_agent.py", line 58, in main
    settings.QUEUE_USER, settings.QUEUE_PASS)
  File "/edx/app/certs/certificates/openedx_certificates/queue_xqueue.py", line 28, in __init__
    self._login()
  File "/edx/app/certs/certificates/openedx_certificates/queue_xqueue.py", line 40, in _login
    'password': self.queue_pass})
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/sessions.py", line 498, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/adapters.py", line 375, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=18040): Max retries exceeded with url: /xqueue/login/ (Caused by <class 'socket.error'>: [Errno 111] Connection refused)

Reverting this PR fixed the problem for us.

@nedbat (Contributor, Author) commented Feb 19, 2016

@smarnach That's odd, because much of the change in this PR was to fix a problem where certs wouldn't start because xqueue wasn't ready yet, which looks like exactly what's happening here.

@smarnach (Contributor)

@nedbat Running the automated installation exactly as documented on a freshly installed Ubuntu 12.04 box results in the same error, so I think this is a genuine problem.

We can work around the problem by reverting this PR, so we don't need a fix urgently. However, we probably don't want to leave it in its current state. How should we proceed?

@benpatterson Just for your information, here is another problem with sandbox deployment that wasn't caught by tests on your side.

@nedbat (Contributor, Author) commented Feb 19, 2016

@smarnach I ran those instructions today, and it worked. It sounds like we have a race condition at work here. But the code changed specifically to fix this problem. The old way would use notify to restart servers, and the cert and xqueue restarts both happened very close together, which caused the problem.

The new code restarts each server at the end of its own role, which should separate them more in time. Perhaps we need more separation?
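
(To make the comparison concrete, here is a minimal sketch of the two restart patterns being described. The task names, file paths, and layout are illustrative only, not copied from the actual roles in this repository.)

# Old pattern: the role notifies a handler, and all notified handlers run
# together at the end of the play, so the certs and xqueue restarts landed
# back to back.
- name: write xqueue supervisor config  # hypothetical task
  template:
    src: xqueue.conf.j2
    dest: /edx/app/supervisor/conf.d/xqueue.conf
  notify: restart xqueue

# New pattern: an explicit restart task at the end of each role, so each
# service is restarted as soon as its own role finishes, spreading the
# restarts out in time.
- name: restart xqueue
  supervisorctl:
    name: "{{ item }}"
    state: restarted
  with_items:
    - xqueue
    - xqueue_consumer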

@feanil any thoughts?

@smarnach (Contributor)

@nedbat Just as additional data points, we experienced the error in our automated sandbox deployment in 9 out of 9 attempts. Then I tried it manually on a freshly provisioned Vagrant instance using the ubuntu/precise64 box with the same result. So if this is a race condition, I seem to have rather bad Karma. :)
We might also have done something slightly different. Did you use the master branch for both edx-platform and configuration, i.e. did you leave OPENEDX_RELEASE unset when running the deployment scripts in the above instructions?

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Feb 22, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 2, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 2, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 2, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 3, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
jbzdak pushed a commit to open-craft/configuration that referenced this pull request Mar 14, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
@antoviaque (Contributor)

@smarnach Has this been solved or discussed further somewhere else?

@smarnach (Contributor)

@antoviaque I currently don't know whether we still experience this particular bug. Until last week, we were able to successfully provision sandboxes with the branch that reverts this pull request; then sandbox provisioning started failing again, but due to a different issue. I haven't tested specifically whether this is still a problem (and there haven't been any further discussions).

@nedbat (Contributor, Author) commented Mar 15, 2016

What's the latest error?

@smarnach (Contributor)

@nedbat It's a longer-standing issue with Celery workers hanging when being restarted by supervisorctl. There seems to be an (intermittently occurring) deadlock either in supervisord or in the Celery workers. The symptoms are that the playbook either hangs indefinitely when trying to restart one of the Celery workers, or it errors out completely.

For a while, the changes in #2557 solved the problem for us, but we are now hitting the problem again consistently, in every single deployment attempt. In one respect this is better than the intermittent failures, since we can now actually debug the problem, but I haven't been able to pin it down so far. My current theory is that Celery is the culprit (mostly because Celery's billiard library does horrible things with mixing threads and fork() that definitely aren't POSIX compliant and, in my opinion, can't work reliably).

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 16, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
@bradenmacdonald (Contributor)

@nedbat I can now confirm we are still experiencing the "msg: certs: stopped" / "Max retries exceeded with url: /xqueue/login/" error on our sandbox builds if we use the master configuration version.

What mechanism is used to delay "restart certs" until xqueue is responding to connections? At a quick glance, I don't see that in the code, and the "restart xqueue" task is certainly not waiting for xqueue to listen for connections, since it seems to finish immediately.

Ansible log (check out the timestamps):


2016-03-15 22:38:50 | TASK: [xqueue | restart xqueue] ***********************************************
2016-03-15 22:38:50 | changed: [149.202.177.56] => (item=xqueue)
2016-03-15 22:38:50 | changed: [149.202.177.56] => (item=xqueue_consumer)
2016-03-15 22:38:50 | 
2016-03-15 22:38:51 | TASK: [certs | create application user] ***************************************
2016-03-15 22:38:51 | changed: [149.202.177.56]

... SNIP ...

2016-03-15 22:39:43 | TASK: [certs | restart certs] *************************************************
2016-03-15 22:39:43 | failed: [149.202.177.56] => {"failed": true}
2016-03-15 22:39:43 | msg: certs: stopped
2016-03-15 22:39:43 | You should consider upgrading via the 'pip install --upgrade pip' command.

Edit: Ah, I see now there is no "wait until xqueue is ready" mechanism. Sounds like we need to add an explicit check, or further increase the delay between these tasks.
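
(A rough sketch of what such an explicit check could look like, assuming the xqueue port from the traceback above; the task name and timeout value are made up.)

# Run this before the "certs | restart certs" task so the certificate agent's
# login to /xqueue/login/ does not race the xqueue restart.
- name: wait for xqueue to accept connections  # hypothetical task name
  wait_for:
    host: localhost
    port: 18040
    timeout: 300
# wait_for only checks that the socket is open; a stricter check could poll
# the login URL with the uri module until it answers.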

@nedbat (Contributor, Author) commented Mar 16, 2016

@bradenmacdonald let's open a new issue about this so we can get devops eyes on it.

Also, how do you get timestamps on the ansible output? :)

@bradenmacdonald (Contributor)

@nedbat I opened https://openedx.atlassian.net/browse/CRI-56 for now, since I wasn't sure which project to use (DEVOPS doesn't accept "Bug" reports). Feel free to move that issue if necessary.

Timestamps are not from ansible but from the “OpenCraft Instance Manager” software which builds our sandboxes; it timestamps the logs from each instance as they stream back to the central console.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 17, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 17, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.
omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 19, 2016
…e-dogwood-to-master"

This reverts commit adae6cf, reversing
changes made to 43e2f6f.