Auto stop-restart #2809

Merged
merged 10 commits into from
Nov 22, 2018

Conversation

oliver-sanders
Member

@oliver-sanders oliver-sanders commented Oct 19, 2018

Implement functionality to allow suites to automatically migrate away from specified servers.

Closes #988
Spawned #2843, #2799

Until #2799 is resolved, this functionality is implemented with the caveat that suites which cannot be safely restarted will not attempt to restart (they will, however, log critical messages to alert the user; do we want to look into automatically sending emails as well?).

TODO:

  • Don't auto stop-restart if the suite is already stopping.
  • Elegantly handle local jobs running under background or at.
  • Enable auto stop-restart for suites running under cylc restart (resolve "incompatible setting stop_point" issue)

Ready for initial review

@oliver-sanders oliver-sanders added this to the next release milestone Oct 19, 2018
@oliver-sanders oliver-sanders self-assigned this Oct 19, 2018
@cylc cylc deleted a comment Oct 21, 2018
@cylc cylc deleted a comment Oct 21, 2018
@oliver-sanders oliver-sanders force-pushed the jump-ship branch 2 times, most recently from bf0a9a1 to bff62fe on October 22, 2018 08:44
@oliver-sanders
Member Author

GitHub is having a strange moment and not accepting new changes to this branch 🪲 so I will close and re-open this PR.

@oliver-sanders
Member Author

GitHub is recovering :octocat:

@cylc cylc deleted a comment Oct 22, 2018
@cylc cylc deleted a comment Oct 22, 2018
@cylc cylc deleted a comment Oct 23, 2018
@cylc cylc deleted a comment Oct 23, 2018
@cylc cylc deleted a comment Oct 24, 2018
Contributor

@matthewrmshin matthewrmshin left a comment

Some initial comments. Can talk through them in person tomorrow.

        CONFIG_INSTANCE[0] = GlobalConfig()
        return CONFIG_INSTANCE[0]
    else:
        return GlobalConfig()
Contributor

Can we just call .load on the default instance?

Member Author

We don't want to change the config that the suite was run with; Cylc does not yet possess the ability to adapt to a dynamic config, so the safer option is to return a new config object rather than updating the old one.
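
Purely as a sketch of the pattern being defended here (the get_global_config name and the cached flag are illustrative, not the real cylc API; the stand-in GlobalConfig class just makes the snippet self-contained):

# A minimal sketch, assuming a cached-vs-fresh split: leave the config the
# suite was started with untouched and return a new object instead of
# calling .load() on the cached instance.

class GlobalConfig:
    """Stand-in for the real global config class (parses global.rc)."""

    def __init__(self):
        # the real class would parse the site/user global.rc files here
        self.settings = {}

CONFIG_INSTANCE = [None]

def get_global_config(cached=True):
    """Return the global configuration object.

    cached=True  -> the single instance created when the suite started.
    cached=False -> a brand new, freshly parsed instance, leaving the
                    cached one (and so the running suite) untouched.
    """
    if cached:
        if CONFIG_INSTANCE[0] is None:
            CONFIG_INSTANCE[0] = GlobalConfig()
        return CONFIG_INSTANCE[0]
    else:
        return GlobalConfig()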


# proc will start with current env (incl CYLC_HOME etc)
proc = Popen(
    cmd + ['--host=%s' % new_host],
Contributor

Would a cylc restart without --host=HOST work here? In which case, we would no longer need the HostAppointer.

Member Author

Yes, we could do it that way; the reasons I went down this route are (rough sketch after this list):

  1. We can make sure that there is an available host to restart on before shutting down the suite.
  2. We can log the host the suite will attempt to restart on.
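
A rough sketch of that flow (illustrative only; appoint_host and the surrounding names are assumptions rather than the exact cylc internals):

import logging
from subprocess import Popen

LOG = logging.getLogger('suite')

def auto_stop_restart(restart_cmd, appointer):
    """Illustrative only: pick the destination host before shutting down,
    so we can (1) refuse to stop if no host is available and (2) log where
    the suite will attempt to restart.

    `appointer` stands in for the HostAppointer; `appoint_host()` is an
    assumed method name. `restart_cmd` is the restart command line,
    e.g. ['cylc', 'restart', 'my-suite'].
    """
    new_host = appointer.appoint_host()
    if not new_host:
        LOG.critical('no available host to restart on; not shutting down')
        return None
    LOG.info('suite will attempt to restart on %s', new_host)
    # proc will start with the current environment (incl. CYLC_HOME etc.)
    return Popen(restart_cmd + ['--host=%s' % new_host])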

Contributor

@matthewrmshin matthewrmshin left a comment

A few more comments. The new tests are all failing in my environment at the moment. I'll take a look to see what the issues may be from my side.

@oliver-sanders
Member Author

De-conflicted, feedback addressed, new test added.

Contributor

@matthewrmshin matthewrmshin left a comment

A few more minor comments. I am going to see if I can break it now.

@cylc cylc deleted a comment Nov 21, 2018
Member

@hjoliver hjoliver left a comment

Tested again on our HPC, all good. @kinow may have a look too, but I think we're happy at this end. We'll leave it to the Met Office crew to make the final merge decision though, as you have the most critical heavy duty use case for this feature!

@kinow
Member

kinow commented Nov 22, 2018

Didn't need to use NFS, as any shared file system appears to work with Cylc (hooray!).

1. Docker

Used the same images for distributed Cylc (via SSH), with mapped volumes.

ssh-keygen -t rsa -f ./id_rsa -N "" -q
CYLC_SSH_PUBKEY=$(cat id_rsa.pub) docker-compose up --build

# access the jump box
docker exec -ti distributed_cylc-jump-box_1 /bin/bash
# edited /etc/hosts in jump box, to workaround an issue submitting jobs

Then I also copied the Docker container IDs into the global.rc:

[cylc]
    health check interval = PT10S
    [[events]]
        abort on inactivity = True
        abort on timeout = True
        inactivity = PT1M
        timeout = PT1M
[suite servers]
    auto restart delay = PT10S
    #condemned hosts = a5442ac9d102, ea21caacc34a
    #condemned hosts = a5442ac9d102
    # a5442ac9d102 is the jump box
    run hosts = a5442ac9d102, ea21caacc34a

And finally, using the test suite registered as stop-restart: cylc run --debug stop-restart.

2018-11-21T23:58:54Z DEBUG - Loading site/user global config files
2018-11-21T23:58:54Z DEBUG - Reading file /root/.cylc/global.rc
2018-11-21T23:58:54Z DEBUG - Loading site/user global config files
2018-11-21T23:58:54Z DEBUG - Reading file /root/.cylc/global.rc
            ._.                                                       
            | |           The Cylc Suite Engine [7.7.1-398-g7e472]    
._____._. ._| |_____.           Copyright (C) 2008-2018 NIWA          
| .___| | | | | .___|   & British Crown (Met Office) & Contributors.  
| !___| !_! | | !___.  _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
!_____!___. |_!_____!  This program comes with ABSOLUTELY NO WARRANTY;
      .___! |          see `cylc warranty`.  It is free software, you 
      !_____!           are welcome to redistribute it under certain  
2018-11-21T23:58:54Z DEBUG - creating suite run directory: /root/cylc-run/stop-restart
2018-11-21T23:58:54Z DEBUG - creating suite log directory: /root/cylc-run/stop-restart/log/suite
2018-11-21T23:58:54Z DEBUG - creating suite job log directory: /root/cylc-run/stop-restart/log/job
2018-11-21T23:58:54Z DEBUG - creating suite config log directory: /root/cylc-run/stop-restart/log/suiterc
2018-11-21T23:58:54Z DEBUG - creating suite work directory: /root/cylc-run/stop-restart/work
2018-11-21T23:58:54Z DEBUG - creating suite share directory: /root/cylc-run/stop-restart/share

*** listening on https://a5442ac9d102:43023/ ***

To view suite server program contact information:
 $ cylc get-suite-contact stop-restart

Other ways to see if the suite is still running:
 $ cylc scan -n 'stop-restart' a5442ac9d102
 $ cylc ping -v --host=a5442ac9d102 stop-restart
 $ ps -opid,args 15214  # on a5442ac9d102

At this point, the suite is running on the jump box in Docker (a5442ac9d102). Then I condemned that host 👿 and, tailing the log, could see it was moved to the other host (9ccdbf43ca1f) 👏 👏 👏

One thing that I noticed was that while doing the cylc scan to see where the suite was running, there was a short time where it could not locate the suite. I believe this could be a problem if someone was relying on that command for monitoring it. See command output below.

root@a5442ac9d102:/opt/cylc# cylc scan 
stop-restart root@a5442ac9d102:43023
root@a5442ac9d102:/opt/cylc# cylc scan 
root@a5442ac9d102:/opt/cylc# cylc scan 
stop-restart root@9ccdbf43ca1f:43052
root@a5442ac9d102:/opt/cylc# cylc scan 
stop-restart root@9ccdbf43ca1f:43052

It is also possible to confirm in the sqlite database that the host value changed during the execution. I am using sqlitebrowser, which is one of the most useful tools I've learned about recently.

[screenshot]

As can be seen in the screenshot, task_foo01 to task_foo04 were executed on the jump box, but task_foo005 and later were executed on the slave via SSH.

Now it's interesting that task_states lists all tasks, from task_foo01 to task_foo025, as succeeded.

[screenshot]

But task_pool_checkpoints has succeeded only up to task_foo004. Q: does that matter? Unfortunately I haven't learned all of the tables used in Cylc yet, nor much about checkpoints (I believe it's like a checkpoint in continuation passing style programs, so that a workflow/process can be restarted at a given point), so I am not sure whether it is relevant.

[screenshot]

Quite fun to play with this change @oliver-sanders :D 👏 👏 👏

Note

I got into a bit of a dead-lock... I noticed that after a few minutes the suite kept being pushed from one host to another... then I realized I had localhost in run hosts, and also in condemned hosts. I quickly fixed that, but thought it interesting that it was possible to play this ping-pong by accident :)

@hjoliver
Member

One thing that I noticed was that while doing the cylc scan to see where the suite was running, there was a short time where it could not locate the suite. I believe this could be a problem if someone was relying on that command for monitoring it.

I guess that's just the interval between shutdown and restart, and cylc scan only sees running suite server programs. So that's fine IMO.
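
If anyone does want to poll cylc scan for monitoring, a small sketch (Python; the scan output format and the exact thresholds are assumptions) that only raises the alarm after the suite has been missing for several consecutive polls, so the stop/restart window is tolerated:

import subprocess
import time

def suite_visible(name):
    """Assumption: `cylc scan` lists each running suite on its own line,
    starting with the suite name."""
    try:
        out = subprocess.check_output(
            ['cylc', 'scan'], universal_newlines=True)
    except subprocess.CalledProcessError:
        return False
    return any(line.startswith(name) for line in out.splitlines())

def monitor(name, misses_allowed=5, interval=10):
    """Alert only after `misses_allowed` consecutive empty polls, so the
    short gap between shutdown and auto restart is not reported."""
    misses = 0
    while True:
        misses = 0 if suite_visible(name) else misses + 1
        if misses > misses_allowed:
            print('suite %s appears to have stopped for good' % name)
            return
        time.sleep(interval)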

@hjoliver
Member

@kinow -

But task_pool_checkpoints has succeeded only up to task_foo004. Q: does that matter?...

Good spotting! State checkpoints just record the state of all tasks in the pool of tasks that the server program is aware of (at time of checkpointing), and are used to restart the workflow in that state. Your suite isn't a cycling one, so that means all tasks in the entire workflow. The evolving latest/up-to-date workflow state (which is the default for restarting from) is held in the "task_pool" table. Particular named (or ID'd) checkpoints can also be stored to allow restart from an earlier state; those are held in the "task_pool_checkpoints" table. We automatically write out a checkpoint at restart, to allow going back to the original restart state in case of disaster - that's what you're seeing in the checkpoints table above.

@kinow
Member

kinow commented Nov 22, 2018

2. HPC

On the HPC I managed not to break my whole environment: I set up the latest Cylc version, checked out this PR, then set up some .bashrc envs (thanks for the session today teaching me some new tricks @hjoliver!).

Started the suite in one node, then during its execution marked that node as condemned 👿 it was successfully moved to the next host in the list 🎉

Exactly the same behaviour as noted above for Docker. For a while cylc scan returns nothing (tested with while true; do if [ $(cylc scan | wc -l) -eq 0 ]; then echo "Nothing is running" && sleep 2; else sleep 2; fi; done). And in the SQLite DB, the task_pool_checkpoints kept the records of succeeded tasks only up to the point when I transferred the suite over to another host.

During the stop/start I had a few of these:

Traceback (most recent call last):
  File "/bruno/cylc/lib/cylc/network/httpserver.py", line 158, in start
    cherrypy.engine.start()
  File "/bruno/cylc/lib/cherrypy/process/wspbus.py", line 256, in start
    raise e_info
ChannelFailures: IOError("Port 43069 not free on 'hosthost.niwa.co.nz'",)

But most likely because I started the server too fast (though cylc scan was saying there was nothing running, and the suite log said it stopped successfully...).

So +1 from me, and once again, kudos @oliver-sanders 👏 👏 👏

@kinow
Member

kinow commented Nov 22, 2018

@hjoliver I'm going to have to have a chat with my friend CUG again, and re-read your comment to make sure I understood it well :) but I assume then that it's OK to have those values in that DB table?

@hjoliver
Member

Yes, it's expected - you saw the static restart checkpoint state. The "task_pool" holds the latest (and continually evolving) checkpoint state.

@hjoliver
Member

And in the SQLite DB, the task_pool_checkpoints kept the records of succeeded tasks only up to the point when I transferred the suite over to another host.

To attempt to clarify this - that checkpoint is stored at restart. If you look before then, it won't be in the DB. It allows you to "redo" the restart from scratch in case you screw it up somehow (e.g. by some kind of misdirected manual intervention in the suite state).

@kinow
Member

kinow commented Nov 22, 2018

Oohh, I think I'm starting to understand how it will work. Sounds useful to suite operational teams.

Last question, promise. I accidentally used a host string without commas:

    run hosts = a5442ac9d102 ea21caacc34a

It tried using the value a5442ac9d102 ea21caacc34a as a hostname in some part of the code (HostAppointer, I think?); I then found that in the logs and fixed my settings file.

I think host names are not allowed to have spaces. If that's correct, would it make sense later to have a check to validate that that setting value contains only valid host names (syntax-wise, not related to DNS/network)?

I think it could be useful for suite operators (or for newbie users, like me 😬 )

@hjoliver
Member

If that's correct, would it make sense later to have a check to validate that that setting value contains only valid host names (syntax-wise,...

Yes, probably a good idea. There might be a few other config items that should not have spaces either, which could indicate omission of the comma list-separator.

@hjoliver
Member

@kinow -

But most likely because I started the server too fast ...

what do you mean by that? (the server should be restarting automatically ...).

@kinow
Member

kinow commented Nov 22, 2018

@hjoliver

But most likely because I started the server too fast ...

what do you mean by that? (the server should be restarting automatically ...).

Sorry, I wasn't really clear in the comment. While I was testing, I would run cylc run --debug --verbose stop-restart, and as it would take a while to complete, if I needed to test a different scenario I would just cylc stop stop-restart, wait until the output of cylc cat-log said it had succeeded in stopping the suite... then call cylc run again.

I saw that exception twice, after running cylc run... the second time I paid closer attention, and noticed that in the logs the suite appeared to have started successfully.

After a while the suite completed, and there were no more errors 🤔 even though I had that exception in my terminal.

@matthewrmshin
Contributor

matthewrmshin commented Nov 22, 2018

Hi @kinow, I've just had another look at the logic:

https://github.com/cylc/cylc/blob/master/lib/cylc/rundb.py#L178

If you want to look at the latest checkpoint task pool, you should filter for id=0. Sorry this is not obvious.

(Conscious that we should give #2483 some priority - so we don't end up with this sort of confusion.)
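
To make that concrete, a hedged example of inspecting the suite database with Python's sqlite3 module (the log/db path follows the standard cylc-run layout seen in the suite log above; treat it as an assumption for other setups):

import sqlite3

# Assumed location of the public suite database under the standard
# cylc-run layout shown earlier in this thread.
conn = sqlite3.connect('/root/cylc-run/stop-restart/log/db')

# Per the comment above, id=0 is the latest checkpoint task pool; other
# ids hold stored checkpoints such as the one written at restart.
for row in conn.execute(
        'SELECT * FROM task_pool_checkpoints WHERE id = 0'):
    print(row)

conn.close()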

@oliver-sanders
Member Author

I think host names are not allowed to have spaces

I think you are right, and this is easy to get wrong, so I've put a check in place which will raise an error if a hostname is specified with a space. I've not put in any more advanced checking (e.g. ^\w+$) as unicode is creeping into hostnames, so I'd rather not write an entire hostname validator.

There might be a few other config items that should not have spaces either

I've left it with hostname lists for now; it should be easy to extend this later.
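
Not the actual change in this PR, just a sketch of the kind of check described (the function and exception names here are illustrative):

class GlobalConfigError(ValueError):
    """Raised for a clearly malformed host name in the configuration."""

def check_host_names(host_names):
    """Reject host names containing whitespace, which almost always means
    a missing comma in a host list. Deliberately no fuller validation,
    since unicode hostnames make a complete validator impractical."""
    for name in host_names:
        if not name or any(char.isspace() for char in name):
            raise GlobalConfigError(
                'invalid host name (missing comma in the list?): %r' % name)
    return host_names

# e.g. check_host_names(['a5442ac9d102 ea21caacc34a'])  # raises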

@matthewrmshin
Contributor

@kinow Your cherrypy errors should be safe. On suite start up, it will shuffle the allowed port ranges and then try to start a web server on the first unused port of the shuffled list. If a port is already used, it will print this error, and retry using the next port, until it is able to start up on an unused port or until it has exhausted all the numbers in the range.

Having said that, it may be best if we can prevent the printing of the error message, as it is not very useful. (I can't remember whether it only appears at debug level; if so, perhaps it does not really matter.)
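
A condensed sketch of the start-up behaviour described above (illustrative only, not the real httpserver.py code): shuffle the configured port range and keep trying ports until one binds or the range is exhausted.

import random
import socket

def start_on_free_port(ports, bind_and_start):
    """Try the configured ports in shuffled order. `bind_and_start` stands
    in for starting the web server on a given port and is expected to
    raise OSError if the port is already in use."""
    candidates = list(ports)
    random.shuffle(candidates)
    for port in candidates:
        try:
            return bind_and_start(port)
        except OSError as exc:
            # this corresponds to the (harmless) "Port N not free" message
            print('Port %d not free, trying the next one (%s)' % (port, exc))
    raise RuntimeError('no free port in the configured range')

def bind_and_start(port):
    """Minimal stand-in for the server start, just to make this runnable."""
    sock = socket.socket()
    sock.bind(('', port))  # raises OSError if the port is in use
    sock.listen(1)
    return sock

# e.g. server_socket = start_on_free_port(range(43001, 43101), bind_and_start)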

@matthewrmshin
Contributor

Who is going to do the honour and press the Merge button?

@cylc cylc deleted a comment Nov 22, 2018
@cylc cylc deleted a comment Nov 22, 2018
@cylc cylc deleted a comment from matthewrmshin Nov 22, 2018
@hjoliver
Member

Who is going to do the honour and press the Merge button?

Me!! 💥

@hjoliver hjoliver merged commit 85632de into cylc:master Nov 22, 2018
@oliver-sanders
Member Author

Woohoo!
