Merge dogwood to master #2772

nedbat · 2016-02-11T21:17:16Z

The dogwood-specific detail about box defaults is not merged.

openedx-webhooks · 2016-02-11T21:17:20Z

PR Cover Letter

A detailed description of the changes in the body of the PR

Reviewers

If you've been tagged for review, please check your corresponding box once you've given the 👍.

Code review RFE - a one liner defining Ansible and a link to its homepage would improve this readme #1 @feanil

Post-review

Squash commits

nedbat · 2016-02-11T21:32:13Z

@feanil almost all the rest of the Dogwood changes.

feanil · 2016-02-12T15:29:45Z

👍

Merge dogwood to master

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

smarnach · 2016-02-19T15:12:49Z

This PR broke sandbox builds for us. Restarting the certs service now fails:

2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | TASK: [certs | restart certs] *************************************************
2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | failed: [149.202.188.208] => {"failed": true}
2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | msg: certs: stopped
2016-02-19 00:26:27+0100 | ERROR | instance.models.instance  | instance=master.sandbox | You should consider upgrading via the 'pip install --upgrade pip' command.
2016-02-19 00:26:27+0100 | INFO | instance.models.instance  | instance=master.sandbox | certs: ERROR (spawn error)

This is the log output for the certs service:

Traceback (most recent call last):
  File "/edx/app/certs/certificates/certificate_agent.py", line 197, in <module>
    main()
  File "/edx/app/certs/certificates/certificate_agent.py", line 58, in main
    settings.QUEUE_USER, settings.QUEUE_PASS)
  File "/edx/app/certs/certificates/openedx_certificates/queue_xqueue.py", line 28, in __init__
    self._login()
  File "/edx/app/certs/certificates/openedx_certificates/queue_xqueue.py", line 40, in _login
    'password': self.queue_pass})
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/sessions.py", line 498, in post
    return self.request('POST', url, data=data, **kwargs)
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/sessions.py", line 456, in request
    resp = self.send(prep, **send_kwargs)
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/sessions.py", line 559, in send
    r = adapter.send(request, **kwargs)
  File "/edx/app/certs/venvs/certs/local/lib/python2.7/site-packages/requests/adapters.py", line 375, in send
    raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='localhost', port=18040): Max retries exceeded with url: /xqueue/login/ (Caused by <class 'socket.error'>: [Errno 111] Connection refused)

Reverting this PR fixed the problem for us.

nedbat · 2016-02-19T17:49:32Z

@smarnach odd, because much of the change in this PR was to fix a problem where certs wouldn't start because xqueue wasn't ready yet, which looks like what's happening here.

smarnach · 2016-02-19T19:47:13Z

@nedbat Running the automated installation exactly as documented on a freshly installed Ubuntu 12.04 box results in the same error, so I think this is a genuine problem.

We can work around the problem by reverting this PR, so we don't need a fix urgently. However, we probably don't want to leave it in the state it currently is. How should we proceed?

@benpatterson Just for your information, here is another problem with sandbox deployment that wasn't caught by tests on your side.

nedbat · 2016-02-19T21:49:48Z

@smarnach I ran those instructions today, and it worked. It sounds like we have a race condition at work here. But the code changed specifically to fix this problem. The old way would use notify to restart servers, and the cert and xqueue restarts both happened very close together, which caused the problem.

The new code restarts each server at the end of its own role, which should separate them more in time. Perhaps we need more separation?

@feanil any thoughts?

smarnach · 2016-02-19T22:07:37Z

@nedbat Just as additional data points, we experienced the error in our automated sandbox deployment in 9 out of 9 attempts. Then I tried it manually on a freshly provisioned Vagrant instance using the ubuntu/precise64 box with the same result. So if this is a race condition, I seem to have rather bad Karma. :)
We might also have done something slightly different. Did you use the master branch for both edx-platform and configuration, i.e. did you leave OPENEDX_RELEASE unset when running the deployment scripts in the above instructions?

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

antoviaque · 2016-03-15T12:15:01Z

@smarnach Has this been solved or discussed further somewhere else?

smarnach · 2016-03-15T13:13:50Z

@antoviaque I currently don't know whether we still experience this particular bug. Until last week, we were able to successfully provision sandboxes with the branch that reverts this pull request; then sandbox provisioning started failing again, but due to a different issue. I haven't tested specifically whether this is still a problem (and there haven't been any further discussions).

nedbat · 2016-03-15T15:35:25Z

What's the latest error?

smarnach · 2016-03-15T16:14:47Z

@nedbat It's a longer-standing issue with Celery workers hanging when being restarted by supervisorctl. There seems to be an (intermittently occurring) deadlock either in supervisord or in the Celery workers. The symptoms are that the playbook either hangs indefinitely when trying to restart one of the Celery workers, or it errors out completely.

For a while, the changes in #2557 solved the problem for us, but currently we experience the problem consistently in every single deployment attempt again. In one respect, this is better than the intermittent failures, since we can now actually debug the problem, but I wasn't able to pin it down so far. My current theory is that Celery is the culprit (mostly because Celery Billiard is doing horrible things with mixing threads and fork() that definitely aren't POSIX compliant and in my opinion can't work reliably).

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

bradenmacdonald · 2016-03-16T17:31:26Z

@nedbat I can now confirm we are still experiencing the msg: certs: stopped / Max retries exceeded with url: /xqueue/login/ error on our sandbox builds if we use the master configuration version.

What mechanism is used to delay restart certs until xqueue is responding to connections? At a quick glance, I don't see that in the code, and the "restart xqueue" task is certainly not waiting for xqueue to listen to connections since it seems to finish immediately.

Ansible log (check out the timestamps):


2016-03-15 22:38:50 | TASK: [xqueue | restart xqueue] ***********************************************
2016-03-15 22:38:50 | changed: [149.202.177.56] => (item=xqueue)
2016-03-15 22:38:50 | changed: [149.202.177.56] => (item=xqueue_consumer)
2016-03-15 22:38:50 | 
2016-03-15 22:38:51 | TASK: [certs | create application user] ***************************************
2016-03-15 22:38:51 | changed: [149.202.177.56]

... SNIP ...

2016-03-15 22:39:43 | TASK: [certs | restart certs] *************************************************
2016-03-15 22:39:43 | failed: [149.202.177.56] => {"failed": true}
2016-03-15 22:39:43 | msg: certs: stopped
2016-03-15 22:39:43 | You should consider upgrading via the 'pip install --upgrade pip' command.

Edit: Ah, I see now there is no "wait until xqueue is ready" mechanism. Sounds like we need to add an explicit check, or further increase the delay between these tasks.

nedbat · 2016-03-16T19:49:29Z

@bradenmacdonald let's open a new issue about this so we can get devops eyes on it.

Also, how do you get timestamps on the ansible output? :)

bradenmacdonald · 2016-03-16T23:09:36Z

@nedbat I opened https://openedx.atlassian.net/browse/CRI-56 for now, since I wasn't sure which project to use (DEVOPS doesn't accept "Bug" reports). Feel free to move that issue if necessary.

Timestamps are not from ansible but from the “OpenCraft Instance Manager” software which builds our sandboxes; it timestamps the logs from each instance as they stream back to the central console.

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

Ned Batchelder added 9 commits February 11, 2016 15:59

Don't check for vagrant user, that doesn't work on non-Vagrant installs

c4f7a23

Convert the 'restart certs' handler into a normal task

8bc98db

Convert the 'restart xqueue' handler into a normal task

70e779e

Add the forums migration for teams discussion filtering

19e0127

Add two more management commands, and make manage.py invocation uniform

41f4aa2

Run the fixups after we get to the final version of the code.

6e59c51

In both devstack and fullstack, pip needs to run as edxapp

72a5d96

Dogwood final box names

9a024a4

Remove the Cypress box names

5a80d2e

nedbat added a commit that referenced this pull request Feb 12, 2016

Merge pull request #2772 from edx/ned/merge-dogwood-to-master

adae6cf

Merge dogwood to master

nedbat merged commit adae6cf into master Feb 12, 2016

nedbat deleted the ned/merge-dogwood-to-master branch February 12, 2016 18:54

smarnach added a commit to open-craft/configuration that referenced this pull request Feb 18, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

26e3ad9

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Feb 22, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

ded2c40

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 2, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

aa3c0e0

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 2, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

038d486

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 2, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

c797d06

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 3, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

243f72b

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

jbzdak pushed a commit to open-craft/configuration that referenced this pull request Mar 14, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

fc6bb81

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 16, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

75a94e8

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 17, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

6751d46

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 17, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

e47958b

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

omarkhan pushed a commit to open-craft/configuration that referenced this pull request Mar 19, 2016

Revert "Merge pull request openedx-unsupported#2772 from edx/ned/merg…

71bc30d

…e-dogwood-to-master" This reverts commit adae6cf, reversing changes made to 43e2f6f.

smarnach mentioned this pull request Apr 27, 2016

Restart nginx after copying the configuration. #2983

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Merge dogwood to master #2772

Merge dogwood to master #2772

nedbat commented Feb 11, 2016

openedx-webhooks commented Feb 11, 2016

nedbat commented Feb 11, 2016

feanil commented Feb 12, 2016

smarnach commented Feb 19, 2016

nedbat commented Feb 19, 2016

smarnach commented Feb 19, 2016

nedbat commented Feb 19, 2016

smarnach commented Feb 19, 2016

antoviaque commented Mar 15, 2016

smarnach commented Mar 15, 2016

nedbat commented Mar 15, 2016

smarnach commented Mar 15, 2016

bradenmacdonald commented Mar 16, 2016

nedbat commented Mar 16, 2016

bradenmacdonald commented Mar 16, 2016

Merge dogwood to master #2772

Merge dogwood to master #2772

Conversation

nedbat commented Feb 11, 2016

openedx-webhooks commented Feb 11, 2016

PR Cover Letter

Reviewers

Post-review

nedbat commented Feb 11, 2016

feanil commented Feb 12, 2016

smarnach commented Feb 19, 2016

nedbat commented Feb 19, 2016

smarnach commented Feb 19, 2016

nedbat commented Feb 19, 2016

smarnach commented Feb 19, 2016

antoviaque commented Mar 15, 2016

smarnach commented Mar 15, 2016

nedbat commented Mar 15, 2016

smarnach commented Mar 15, 2016

bradenmacdonald commented Mar 16, 2016

nedbat commented Mar 16, 2016

bradenmacdonald commented Mar 16, 2016