Auto stop-restart #2809

Merged
merged 10 commits into from
Nov 22, 2018

Conversation

oliver-sanders
Member

@oliver-sanders oliver-sanders commented Oct 19, 2018

Implement functionality to allow suites to automatically migrate away from specified servers.

Closes #988
Spawned #2843, #2799

Until #2799 is resolved, this functionality is implemented with the caveat that suites which cannot be safely restarted will not attempt to restart (they will, however, log critical messages to alert the user; do we want to look into automatically sending emails as well?).

TODO:

  • Don't auto stop-restart if the suite is already stopping.
  • Elegantly handle local jobs running under background or at.
  • Enable auto stop-restart for suites running under cylc restart (resolve "incompatible setting stop_point" issue)

Ready for initial review

@oliver-sanders oliver-sanders added this to the next release milestone Oct 19, 2018
@oliver-sanders oliver-sanders self-assigned this Oct 19, 2018
@cylc cylc deleted a comment Oct 21, 2018
@cylc cylc deleted a comment Oct 21, 2018
@oliver-sanders oliver-sanders force-pushed the jump-ship branch 2 times, most recently from bf0a9a1 to bff62fe on October 22, 2018 08:44
@oliver-sanders
Member Author

GitHub is having a strange moment and not accepting new changes to this branch 🪲 so I will close and re-open this PR.

@oliver-sanders
Member Author

GitHub is recovering :octocat:

@cylc cylc deleted a comment Oct 22, 2018
@cylc cylc deleted a comment Oct 22, 2018
@cylc cylc deleted a comment Oct 23, 2018
@cylc cylc deleted a comment Oct 23, 2018
@cylc cylc deleted a comment Oct 24, 2018
Contributor

@matthewrmshin matthewrmshin left a comment

Some initial comments. Can talk through them in person tomorrow.

        CONFIG_INSTANCE[0] = GlobalConfig()
        return CONFIG_INSTANCE[0]
    else:
        return GlobalConfig()
Contributor

Can we just call .load on the default instance?

Member Author

We don't want to change the config that the suite was run with; Cylc does not yet possess the ability to adapt to a dynamic config, so the safer option is to return a new config object rather than updating the old one.
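
Purely as a sketch of the pattern being defended here (the get_global_config name and the cached flag are illustrative, not the real cylc API; the stand-in GlobalConfig class just makes the snippet self-contained):

# A minimal sketch, assuming a cached-vs-fresh split: leave the config the
# suite was started with untouched and return a new object instead of
# calling .load() on the cached instance.

class GlobalConfig:
    """Stand-in for the real global config class (parses global.rc)."""

    def __init__(self):
        # the real class would parse the site/user global.rc files here
        self.settings = {}

CONFIG_INSTANCE = [None]

def get_global_config(cached=True):
    """Return the global configuration object.

    cached=True  -> the single instance created when the suite started.
    cached=False -> a brand new, freshly parsed instance, leaving the
                    cached one (and so the running suite) untouched.
    """
    if cached:
        if CONFIG_INSTANCE[0] is None:
            CONFIG_INSTANCE[0] = GlobalConfig()
        return CONFIG_INSTANCE[0]
    else:
        return GlobalConfig()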


# proc will start with current env (incl CYLC_HOME etc)
proc = Popen(
    cmd + ['--host=%s' % new_host],
Contributor

Would a cylc restart without --host=HOST work here? In which case, we would no longer need the HostAppointer.

Member Author

Yes, we could do it that way; the reasons I went down this route are (rough sketch after this list):

  1. We can make sure that there is an available host to restart on before shutting down the suite.
  2. We can log the host the suite will attempt to restart on.
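
A rough sketch of that flow (illustrative only; appoint_host and the surrounding names are assumptions rather than the exact cylc internals):

import logging
from subprocess import Popen

LOG = logging.getLogger('suite')

def auto_stop_restart(restart_cmd, appointer):
    """Illustrative only: pick the destination host before shutting down,
    so we can (1) refuse to stop if no host is available and (2) log where
    the suite will attempt to restart.

    `appointer` stands in for the HostAppointer; `appoint_host()` is an
    assumed method name. `restart_cmd` is the restart command line,
    e.g. ['cylc', 'restart', 'my-suite'].
    """
    new_host = appointer.appoint_host()
    if not new_host:
        LOG.critical('no available host to restart on; not shutting down')
        return None
    LOG.info('suite will attempt to restart on %s', new_host)
    # proc will start with the current environment (incl. CYLC_HOME etc.)
    return Popen(restart_cmd + ['--host=%s' % new_host])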

Contributor

@matthewrmshin matthewrmshin left a comment

A few more comments. The new tests are all failing in my environment at the moment. I'll take a look to see what the issues may be from my side.

@oliver-sanders
Member Author

De-conflicted, feedback addressed, new test added.

Contributor

@matthewrmshin matthewrmshin left a comment

A few more minor comments. I am going to see if I can break it now.

@cylc cylc deleted a comment Nov 21, 2018
Member

@hjoliver hjoliver left a comment

Tested again on our HPC, all good. @kinow may have a look too, but I think we're happy at this end. We'll leave it to the Met Office crew to make the final merge decision though, as you have the most critical heavy duty use case for this feature!

@kinow
Member

kinow commented Nov 22, 2018

Didn't need to use NFS, as any shared file system appears to work with Cylc (hooray!).

1. Docker

Used the same images for distributed Cylc (via SSH), with mapped volumes.

ssh-keygen -t rsa -f ./id_rsa -N "" -q
CYLC_SSH_PUBKEY=$(cat id_rsa.pub) docker-compose up --build

# access the jump box
docker exec -ti distributed_cylc-jump-box_1 /bin/bash
# edited /etc/hosts in jump box, to workaround an issue submitting jobs

Then I also copied the Docker container IDs into the global.rc:

[cylc]
    health check interval = PT10S
    [[events]]
        abort on inactivity = True
        abort on timeout = True
        inactivity = PT1M
        timeout = PT1M
[suite servers]
    auto restart delay = PT10S
    #condemned hosts = a5442ac9d102, ea21caacc34a
    #condemned hosts = a5442ac9d102
    # a5442ac9d102 is the jump box
    run hosts = a5442ac9d102, ea21caacc34a

And finally, using the test suite registered as stop-restart: cylc run --debug stop-restart.

2018-11-21T23:58:54Z DEBUG - Loading site/user global config files
2018-11-21T23:58:54Z DEBUG - Reading file /root/.cylc/global.rc
2018-11-21T23:58:54Z DEBUG - Loading site/user global config files
2018-11-21T23:58:54Z DEBUG - Reading file /root/.cylc/global.rc
            ._.                                                       
            | |           The Cylc Suite Engine [7.7.1-398-g7e472]    
._____._. ._| |_____.           Copyright (C) 2008-2018 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY;
      .___! |          see `cylc warranty`.  It is free software, you 
      !_____!           are welcome to redistribute it under certain  
2018-11-21T23:58:54Z DEBUG - creating suite run directory: /root/cylc-run/stop-restart
2018-11-21T23:58:54Z DEBUG - creating suite log directory: /root/cylc-run/stop-restart/log/suite
2018-11-21T23:58:54Z DEBUG - creating suite job log directory: /root/cylc-run/stop-restart/log/job
2018-11-21T23:58:54Z DEBUG - creating suite config log directory: /root/cylc-run/stop-restart/log/suiterc
2018-11-21T23:58:54Z DEBUG - creating suite work directory: /root/cylc-run/stop-restart/work
2018-11-21T23:58:54Z DEBUG - creating suite share directory: /root/cylc-run/stop-restart/share

*** listening on https://a5442ac9d102:43023/ ***

To view suite server program contact information:
 $ cylc get-suite-contact stop-restart

Other ways to see if the suite is still running:
 $ cylc scan -n 'stop-restart' a5442ac9d102
 $ cylc ping -v --host=a5442ac9d102 stop-restart
 $ ps -opid,args 15214  # on a5442ac9d102

At this point, the suite is running on the jump box in Docker (a5442ac9d102). Then I condemned that host 👿 and, tailing the log, could see it was moved to the other host (9ccdbf43ca1f) 👏 👏 👏

One thing that I noticed was that while doing the cylc scan to see where the suite was running, there was a short time where it could not locate the suite. I believe this could be a problem if someone was relying on that command for monitoring it. See command output below.

root@a5442ac9d102:/opt/cylc# cylc scan 
stop-restart root@a5442ac9d102:43023
root@a5442ac9d102:/opt/cylc# cylc scan 
root@a5442ac9d102:/opt/cylc# cylc scan 
stop-restart root@9ccdbf43ca1f:43052
root@a5442ac9d102:/opt/cylc# cylc scan 
stop-restart root@9ccdbf43ca1f:43052

It is also possible to confirm in the sqlite database that the host value changed during the execution. I am using sqlitebrowser, which is one of the most useful tools I've learned about recently.

[screenshot]

As can be seen in the screenshot, task_foo01 to task_foo04 were executed on the jump box, but task_foo005 and later were executed on the slave via SSH.

Now it's interesting that task_states lists all tasks, from task_foo01 to task_foo025, as succeeded.

[screenshot]

But task_pool_checkpoints has succeeded only up to task_foo004. Q: does that matter? Unfortunately I haven't learned all of the tables used in Cylc yet, nor much about checkpoints (I believe it's like a checkpoint in continuation passing style programs, so that a workflow/process can be restarted at a given point), so I am not sure whether it is relevant.

[screenshot]

Quite fun to play with this change @oliver-sanders :D 👏 👏 👏

Note

I got into a bit of a dead-lock... I noticed that after a few minutes the suite kept being pushed from one host to another... then I realized I had localhost in run hosts, and also in condemned hosts. I quickly fixed that, but thought it interesting that it was possible to play this ping-pong by accident :)

@hjoliver
Member

One thing that I noticed was that while doing the cylc scan to see where the suite was running, there was a short time where it could not locate the suite. I believe this could be a problem if someone was relying on that command for monitoring it.

I guess that's just the interval between shutdown and restart, and cylc scan only sees running suite server programs. So that's fine IMO.
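
If anyone does want to poll cylc scan for monitoring, a small sketch (Python; the scan output format and the exact thresholds are assumptions) that only raises the alarm after the suite has been missing for several consecutive polls, so the stop/restart window is tolerated:

import subprocess
import time

def suite_visible(name):
    """Assumption: `cylc scan` lists each running suite on its own line,
    starting with the suite name."""
    try:
        out = subprocess.check_output(
            ['cylc', 'scan'], universal_newlines=True)
    except subprocess.CalledProcessError:
        return False
    return any(line.startswith(name) for line in out.splitlines())

def monitor(name, misses_allowed=5, interval=10):
    """Alert only after `misses_allowed` consecutive empty polls, so the
    short gap between shutdown and auto restart is not reported."""
    misses = 0
    while True:
        misses = 0 if suite_visible(name) else misses + 1
        if misses > misses_allowed:
            print('suite %s appears to have stopped for good' % name)
            return
        time.sleep(interval)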

@hjoliver
Member

@kinow -

But task_pool_checkpoints has succeeded only up to task_foo004. Q: does that matter?...

Good spotting! State checkpoints just record the state of all tasks in the pool of tasks that the server program is aware of (at time of checkpointing), and are used to restart the workflow in that state. Your suite isn't a cycling one, so that means all tasks in the entire workflow. The evolving latest/up-to-date workflow state (which is the default for restarting from) is held in the "task_pool" table. Particular named (or ID'd) checkpoints can also be stored to allow restart from an earlier state; those are held in the "task_pool_checkpoints" table. We automatically write out a checkpoint at restart, to allow going back to the original restart state in case of disaster - that's what you're seeing in the checkpoints table above.

@kinow
Member

kinow commented Nov 22, 2018

2. HPC

On the HPC I managed not to break my whole environment: I set up the latest Cylc version, checked out this PR, then set up some .bashrc envs (thanks for the session today teaching me some new tricks @hjoliver!).

Started the suite in one node, then during its execution marked that node as condemned 👿 it was successfully moved to the next host in the list 🎉

Exactly the same behaviour as noted above for Docker. For a while cylc scan returns nothing (tested with while true; do if [ $(cylc scan | wc -l) -eq 0 ]; then echo "Nothing is running" && sleep 2; else sleep 2; fi; done). And in the SQLite DB, the task_pool_checkpoints kept the records of succeeded tasks only up to the point when I transferred the suite over to another host.

During the stop/start I had a few of these:

Traceback (most recent call last):
  File "/bruno/cylc/lib/cylc/network/httpserver.py", line 158, in start
    cherrypy.engine.start()
  File "/bruno/cylc/lib/cherrypy/process/wspbus.py", line 256, in start
    raise e_info
ChannelFailures: IOError("Port 43069 not free on 'hosthost.niwa.co.nz'",)

But most likely because I started the server too fast (though cylc scan was saying there was nothing running, and the suite log said it stopped successfully...).

So +1 from me, and once again, kudos @oliver-sanders 👏 👏 👏

@kinow
Member

kinow commented Nov 22, 2018

@hjoliver I'm going to have to have a chat with my friend CUG again, and re-read your comment to make sure I understood it well :) but I assume then that it's OK to have those values in that DB table?

@hjoliver
Member

Yes, it's expected - you saw the static restart checkpoint state. The "task_pool" holds the latest (and continually evolving) checkpoint state.

@hjoliver
Member

And in the SQLite DB, the task_pool_checkpoints kept the records of succeeded tasks only up to the point when I transferred the suite over to another host.

To attempt to clarify this - that checkpoint is stored at restart. If you look before then, it won't be in the DB. It allows you to "redo" the restart from scratch in case you screw it up somehow (e.g. by some kind of misdirected manual intervention in the suite state).

@kinow
Member

kinow commented Nov 22, 2018

Oohh, I think I'm starting to understand how it will work. Sounds useful to suite operational teams.

Last question, promise. I accidentally used a host string without commas:

    run hosts = a5442ac9d102 ea21caacc34a

It tried using the value a5442ac9d102 ea21caacc34a as a hostname in some part of the code (HostAppointer, I think?); I then found that in the logs and fixed my settings file.

I think host names are not allowed to have spaces. If that's correct, would it make sense later to have a check to validate that that setting value contains only valid host names (syntax-wise, not related to DNS/network)?

I think it could be useful for suite operators (or for newbie users, like me 😬 )

@hjoliver
Member

If that's correct, would it make sense later to have a check to validate that that setting value contains only valid host names (syntax-wise,...

Yes, probably a good idea. There might be a few other config items that should not have spaces either, which could indicate omission of the comma list-separator.

@hjoliver
Member

@kinow -

But most likely because I started the server too fast ...

what do you mean by that? (the server should be restarting automatically ...).

@kinow
Member

kinow commented Nov 22, 2018

@hjoliver

But most likely because I started the server too fast ...

what do you mean by that? (the server should be restarting automatically ...).

Sorry, I wasn't really clear in the comment. While I was testing, I would run cylc run --debug --verbose stop-restart, and as it would take a while to complete, if I needed to test a different scenario I would just cylc stop stop-restart, wait until the output of cylc cat-log said it had succeeded in stopping the suite... then call cylc run again.

I saw that exception twice, after running cylc run... the second time I paid closer attention, and noticed that in the logs the suite appeared to have started successfully.

After a while the suite completed, and there were no more errors 🤔 even though I had that exception in my terminal.

@matthewrmshin
Contributor

matthewrmshin commented Nov 22, 2018

Hi @kinow, I've just had another look at the logic:

https://github.com/cylc/cylc/blob/master/lib/cylc/rundb.py#L178

If you want to look at the latest checkpoint task pool, you should filter for id=0. Sorry this is not obvious.

(Conscious that we should give #2483 some priority - so we don't end up with this sort of confusion.)
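
To make that concrete, a hedged example of inspecting the suite database with Python's sqlite3 module (the log/db path follows the standard cylc-run layout seen in the suite log above; treat it as an assumption for other setups):

import sqlite3

# Assumed location of the public suite database under the standard
# cylc-run layout shown earlier in this thread.
conn = sqlite3.connect('/root/cylc-run/stop-restart/log/db')

# Per the comment above, id=0 is the latest checkpoint task pool; other
# ids hold stored checkpoints such as the one written at restart.
for row in conn.execute(
        'SELECT * FROM task_pool_checkpoints WHERE id = 0'):
    print(row)

conn.close()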

@oliver-sanders
Member Author

I think host names are not allowed to have spaces

I think you are right, and this is easy to get wrong, so I've put a check in place which will raise an error if a hostname is specified with a space. I've not put in any more advanced checking (e.g. ^\w+$) as unicode is creeping into hostnames, so I'd rather not write an entire hostname validator.

There might be a few other config items that should not have spaces either

I've left it with hostname lists for now; it should be easy to extend this later.
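
Not the actual change in this PR, just a sketch of the kind of check described (the function and exception names here are illustrative):

class GlobalConfigError(ValueError):
    """Raised for a clearly malformed host name in the configuration."""

def check_host_names(host_names):
    """Reject host names containing whitespace, which almost always means
    a missing comma in a host list. Deliberately no fuller validation,
    since unicode hostnames make a complete validator impractical."""
    for name in host_names:
        if not name or any(char.isspace() for char in name):
            raise GlobalConfigError(
                'invalid host name (missing comma in the list?): %r' % name)
    return host_names

# e.g. check_host_names(['a5442ac9d102 ea21caacc34a'])  # raises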

@matthewrmshin
Contributor

@kinow Your cherrypy errors should be safe. On suite start up, it will shuffle the allowed port ranges and then try to start a web server on the first unused port of the shuffled list. If a port is already used, it will print this error, and retry using the next port, until it is able to start up on an unused port or until it has exhausted all the numbers in the range.

Having said that, it may be best if we can prevent the printing of the error message, as it is not very useful. (I can't remember whether it only appears at debug level; if so, perhaps it does not really matter.)
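
A condensed sketch of the start-up behaviour described above (illustrative only, not the real httpserver.py code): shuffle the configured port range and keep trying ports until one binds or the range is exhausted.

import random
import socket

def start_on_free_port(ports, bind_and_start):
    """Try the configured ports in shuffled order. `bind_and_start` stands
    in for starting the web server on a given port and is expected to
    raise OSError if the port is already in use."""
    candidates = list(ports)
    random.shuffle(candidates)
    for port in candidates:
        try:
            return bind_and_start(port)
        except OSError as exc:
            # this corresponds to the (harmless) "Port N not free" message
            print('Port %d not free, trying the next one (%s)' % (port, exc))
    raise RuntimeError('no free port in the configured range')

def bind_and_start(port):
    """Minimal stand-in for the server start, just to make this runnable."""
    sock = socket.socket()
    sock.bind(('', port))  # raises OSError if the port is in use
    sock.listen(1)
    return sock

# e.g. server_socket = start_on_free_port(range(43001, 43101), bind_and_start)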

@matthewrmshin
Contributor

Who is going to do the honour and press the Merge button?

@cylc cylc deleted a comment Nov 22, 2018
@cylc cylc deleted a comment Nov 22, 2018
@cylc cylc deleted a comment from matthewrmshin Nov 22, 2018
@hjoliver
Member

Who is going to do the honour and press the Merge button?

Me!! 💥

@hjoliver hjoliver merged commit 85632de into cylc:master Nov 22, 2018
@oliver-sanders
Member Author

Woohoo!
