Auto stop-restart #2809
Conversation
Force-pushed from bf0a9a1 to bff62fe.
GitHub is having a strange moment and not accepting new changes to this branch 🪲 so I will close and re-open this PR.

GitHub is recovering.
Force-pushed from bff62fe to 1e874b3.
Force-pushed from 1e874b3 to 2b2f3a6.
Some initial comments. Can talk through them in person tomorrow.
lib/cylc/cfgspec/glbl_cfg.py (outdated):
```python
        CONFIG_INSTANCE[0] = GlobalConfig()
        return CONFIG_INSTANCE[0]
    else:
        return GlobalConfig()
```
Can we just call `.load` on the default instance?
We don't want to change the config that the suite was run with; Cylc does not yet possess the ability to adapt to a dynamic config. So the safer option is to return a new config object rather than updating the old one.
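For illustration, a minimal sketch of the pattern under discussion — the names here (`get_global_config`, `_CACHED_INSTANCE`) are illustrative, not the exact Cylc API:

```python
# Illustrative sketch only: a cached "default" config plus a fresh-load
# escape hatch, mirroring the behaviour discussed above.
_CACHED_INSTANCE = None


class GlobalConfig(object):
    def load(self):
        # In real Cylc this re-reads the site/user global config files.
        pass


def get_global_config(cached=True):
    """Return the global configuration.

    cached=True reuses the instance the suite was started with;
    cached=False loads a brand-new instance, leaving the original
    untouched (Cylc cannot yet adapt to a dynamic config).
    """
    global _CACHED_INSTANCE
    if cached:
        if _CACHED_INSTANCE is None:
            _CACHED_INSTANCE = GlobalConfig()
            _CACHED_INSTANCE.load()
        return _CACHED_INSTANCE
    # Fresh object: safe for inspecting new on-disk settings (e.g.
    # condemned hosts) without mutating the running suite's config.
    config = GlobalConfig()
    config.load()
    return config
```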
```python
# proc will start with current env (incl CYLC_HOME etc)
proc = Popen(
    cmd + ['--host=%s' % new_host],
```
Would a `cylc restart` without `--host=HOST` work here? In which case, we would no longer need the `HostAppointer`.
Yes, we could do it that way; the reasons I went down this route are (see the sketch after this list):
- We can make sure that there is an available host to restart on before shutting down the suite.
- We can log the host the suite will attempt to restart on.
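A rough sketch of that rationale — `select_restart_host`, `auto_restart`, and the host lists are illustrative names, not the real `HostAppointer` interface; only the `cylc restart --host` call mirrors the diff above:

```python
# Illustrative sketch: pick and verify a restart host *before* shutting
# down, so the shutdown can be aborted and the target host logged.
import random
from subprocess import Popen


def select_restart_host(run_hosts, condemned_hosts):
    """Return a usable restart host, or None if none is available."""
    candidates = [host for host in run_hosts if host not in condemned_hosts]
    return random.choice(candidates) if candidates else None


def auto_restart(suite, run_hosts, condemned_hosts):
    # 1. Make sure there is an available host before shutting down.
    new_host = select_restart_host(run_hosts, condemned_hosts)
    if new_host is None:
        # Nowhere to go: keep running (and log a critical message).
        return False
    # 2. Log the host the suite will attempt to restart on.
    print('Suite will attempt to restart on %s' % new_host)
    # ... orderly shutdown would happen here ...
    # proc will start with current env (incl CYLC_HOME etc)
    Popen(['cylc', 'restart', suite, '--host=%s' % new_host])
    return True
```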
A few more comments. The new tests are all failing in my environment at the moment. I'll take a look to see what the issues may be from my side.
Force-pushed from 2b2f3a6 to 3a68810.
De-conflicted, feedback addressed, new test added.
Force-pushed from c287956 to 3705f50.
A few more minor comments. I am going to see if I can break it now.
Force-pushed from 3705f50 to 20d05c8.
Tested again on our HPC, all good. @kinow may have a look too, but I think we're happy at this end. We'll leave it to the Met Office crew to make the final merge decision though, as you have the most critical heavy duty use case for this feature!
Didn't need to use NFS, as any shared file system appears to work with Cylc (hooray!).

1. Docker

Used the same images for distributed Cylc (via SSH), with mapped volumes.
Then also copied the Docker container IDs into the `run hosts` list in `global.rc`.
And finally, using the test suite registered as stop-restart:

```
2018-11-21T23:58:54Z DEBUG - Loading site/user global config files
2018-11-21T23:58:54Z DEBUG - Reading file /root/.cylc/global.rc
2018-11-21T23:58:54Z DEBUG - Loading site/user global config files
2018-11-21T23:58:54Z DEBUG - Reading file /root/.cylc/global.rc
._.
| | The Cylc Suite Engine [7.7.1-398-g7e472]
._____._. ._| |_____. Copyright (C) 2008-2018 NIWA
| .___| | | | | .___| & British Crown (Met Office) & Contributors.
| !___| !_! | | !___. _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____! This program comes with ABSOLUTELY NO WARRANTY;
.___! | see `cylc warranty`. It is free software, you
!_____! are welcome to redistribute it under certain
2018-11-21T23:58:54Z DEBUG - creating suite run directory: /root/cylc-run/stop-restart
2018-11-21T23:58:54Z DEBUG - creating suite log directory: /root/cylc-run/stop-restart/log/suite
2018-11-21T23:58:54Z DEBUG - creating suite job log directory: /root/cylc-run/stop-restart/log/job
2018-11-21T23:58:54Z DEBUG - creating suite config log directory: /root/cylc-run/stop-restart/log/suiterc
2018-11-21T23:58:54Z DEBUG - creating suite work directory: /root/cylc-run/stop-restart/work
2018-11-21T23:58:54Z DEBUG - creating suite share directory: /root/cylc-run/stop-restart/share
*** listening on https://a5442ac9d102:43023/ ***
To view suite server program contact information:
$ cylc get-suite-contact stop-restart
Other ways to see if the suite is still running:
$ cylc scan -n 'stop-restart' a5442ac9d102
$ cylc ping -v --host=a5442ac9d102 stop-restart
 $ ps -opid,args 15214 # on a5442ac9d102
```

At this point, the suite is running on the jump box in Docker (a5442ac9d102). Then I condemned that host 👿, and tailing the log I could see it was sent to the other host (9ccdbf43ca1f) 👏 👏 👏

One thing that I noticed was that, while the suite was migrating, `cylc scan` briefly showed nothing running:

```
root@a5442ac9d102:/opt/cylc# cylc scan
stop-restart root@a5442ac9d102:43023
root@a5442ac9d102:/opt/cylc# cylc scan
root@a5442ac9d102:/opt/cylc# cylc scan
stop-restart root@9ccdbf43ca1f:43052
root@a5442ac9d102:/opt/cylc# cylc scan
stop-restart root@9ccdbf43ca1f:43052
```

It is also possible to confirm in the sqlite database that the host value changed during the execution. I am using sqlitebrowser, which is one of the most useful tools I've learned about recently. As it is possible to see in the screenshot [screenshot of the suite database omitted], the checkpoints table also holds the task pool state. Quite fun to play with this change @oliver-sanders :D 👏 👏 👏

**Note**: I got into a bit of a dead-lock... I noticed that after a few minutes the suite kept being pushed from one host to another... then I realized I had localhost in my `condemned hosts` list.
I guess that's just the interval between shut down and restart, and …
@kinow -
Good spotting! State checkpoints just record the state of all tasks in the pool of tasks that the server program is aware of (at time of checkpointing), and are used to restart the workflow in that state. Your suite isn't a cycling one, so that means all tasks in the entire workflow. The evolving latest/up-to-date workflow state (which is the default for restarting from) is held in the "task_pool" table. Particular named (or ID'd) checkpoints can also be stored to allow restart from an earlier state; those are held in the "task_pool_checkpoints" table. We automatically write out a checkpoint at restart, to allow going back to the original restart state in case of disaster - that's what you're seeing in the checkpoints table above.
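For anyone who wants to poke at the same thing without sqlitebrowser, a minimal sketch using only the standard library — the table names are the ones described above, but the database path is just the example from this Docker test:

```python
import sqlite3

# Example path from the Docker test above; adjust to your own run directory.
conn = sqlite3.connect('/root/cylc-run/stop-restart/log/db')

# Latest, continually-evolving workflow state (the default restart source).
for row in conn.execute('SELECT * FROM task_pool'):
    print(row)

# Stored checkpoints, e.g. the one written automatically at restart.
for row in conn.execute('SELECT * FROM task_pool_checkpoints'):
    print(row)
```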
2. HPC

Using HPC, managed to not break my whole environment, and set up the latest Cylc version, checked out this PR, then set up some `global.rc` settings. Started the suite on one node, then during its execution marked that node as condemned 👿 and it was successfully moved to the next host in the list 🎉 Exactly the same behaviour as noted above for Docker. For a while during the stop/start I had a few errors like this:

```
Traceback (most recent call last):
File "/bruno/cylc/lib/cylc/network/httpserver.py", line 158, in start
cherrypy.engine.start()
File "/bruno/cylc/lib/cherrypy/process/wspbus.py", line 256, in start
raise e_info
ChannelFailures: IOError("Port 43069 not free on 'hosthost.niwa.co.nz'",)
```

But that was most likely because I started the server again too fast (though `cylc scan` was saying there was nothing running, and the suite log said it stopped successfully...). So +1 from me, and once again, kudos @oliver-sanders 👏 👏 👏
@hjoliver I'm going to have to have a chat with my friend CUG again, and re-read your comment to make sure I understood it well :) But I assume, then, that it's OK to have those values in that DB table?
Yes, it's expected - you saw the static restart checkpoint state. The "task_pool" table holds the latest (and continually evolving) checkpoint state.
To attempt to clarify this - that checkpoint is stored at restart. If you look before then, it won't be in the DB. It allows you to "redo" the restart from scratch in case you screw it up somehow (e.g. by some kind of misdirected manual intervention in the suite state).
Oohh, I think I'm starting to understand how it will work. Sounds useful to suite operational teams. Last question, promise. I accidentally used a host string without commas:

```
run hosts = a5442ac9d102 ea21caacc34a
```

It tried using the whole value as a single host name; I think host names are not allowed to have spaces. If that's correct, would it make sense later to have a check to validate that the setting value contains only valid host names (syntax-wise, not related to DNS/network)? I think it could be useful for suite operators (or for newbie users, like me 😬)
Yes, probably a good idea. There might be a few other config items that should not have spaces either, which could indicate omission of the comma list-separator.
@kinow - what do you mean by that? (The server should be restarting automatically...)
Sorry, wasn't really clear in the comment. While I was testing, I would run … and I saw that exception twice, after running …. After a while the suite completed, and there were no more errors 🤔 even though I had that exception in my terminal.
Hi @kinow, I've just had another look at the logic again: https://github.com/cylc/cylc/blob/master/lib/cylc/rundb.py#L178 — if you want to look at the latest checkpoint task pool, you should filter for …. (Conscious that we should give #2483 some priority - so we don't end up with this sort of confusion.)
Force-pushed from 89241a0 to 0ef194e.
I think you are right, and this is easy to get wrong, so I've put a check in place which will raise an error if a hostname is specified with a space. I've not put in any more advanced checking (e.g. …). I've left it with hostname lists for now; it should be easy to extend this later.
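A minimal sketch of the kind of check described — illustrative, not the actual Cylc implementation:

```python
# Illustrative sketch: reject any "host" that contains whitespace, since
# that almost always means a missing comma in a list setting.


def validate_host_list(hosts):
    """Raise ValueError if any entry looks like a fused list."""
    for host in hosts:
        if ' ' in host.strip():
            raise ValueError(
                'Invalid host "%s": host names cannot contain spaces '
                '(did you forget a comma?)' % host)
    return hosts


# e.g. the mistake above, "run hosts = a5442ac9d102 ea21caacc34a",
# parses as a single item and is rejected here.
try:
    validate_host_list(['a5442ac9d102 ea21caacc34a'])
except ValueError as exc:
    print(exc)
```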
@kinow Your cherrypy errors should be safe. On suite start-up, it will shuffle the allowed port range and then try to start a web server on the first unused port of the shuffled list. If a port is already used, it will print this error and retry using the next port, until it is able to start up on an unused port or until it has exhausted all the numbers in the range. Having said that, it may be best if we are able to prevent the printing of the error message, as it is not very useful. (I can't remember if it only appears at debug level or not; in which case, perhaps it does not really matter.)
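For reference, a toy sketch of that shuffle-and-retry behaviour — assumed logic and names, not the actual `httpserver.py` code:

```python
# Illustrative sketch: shuffle the configured port range, then try ports
# in turn until one binds.
import random
import socket


def find_free_port(port_range):
    ports = list(port_range)
    random.shuffle(ports)  # spread suites across the allowed range
    for port in ports:
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.bind(('', port))
        except socket.error:
            # Equivalent of the "Port N not free" message: harmless,
            # just move on to the next candidate.
            continue
        finally:
            sock.close()
        return port
    raise RuntimeError('No free port in range')


print(find_free_port(range(43001, 43101)))
```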
Who is going to do the honours and press the Merge button?

Me!! 💥

Woohoo!
Implement functionality to allow suites to automatically migrate away from specified servers.
Closes #988
Spawned #2843, #2799
Until #2799 is done, this functionality is implemented with the caveat that suites which cannot be safely restarted will not attempt to restart (but will log critical messages to alert the user; do we want to look into automatically sending emails as well?).
TODO:

- `cylc restart` (resolve "incompatible setting stop_point" issue)

Ready for initial review.