Auto stop-restart #2809

Merged
merged 10 commits into from
Nov 22, 2018
61 changes: 61 additions & 0 deletions doc/src/cylc-user-guide/cug.tex
Original file line number Diff line number Diff line change
Expand Up @@ -8221,6 +8221,67 @@ \subsection{Disaster Recovery}
a DB, or else write a back-up utility using the official sqlite backup API:
\url{http://www.sqlite.org/backup.html}.}

\subsection{Auto Stop-Restart}
\label{auto-stop-restart}

Cylc can automatically stop suites running on a particular host and,
optionally, restart them on a different host.
This is useful if a host needs to be taken off-line, e.g.\ for
scheduled maintenance.

This functionality is configured via the following site configuration settings:
\begin{myitemize}
\item \lstinline=[suite servers]auto restart delay=
\item \lstinline=[suite servers]condemned hosts=
\item \lstinline=[suite servers]run hosts=
\end{myitemize}

The auto stop-restart feature has two modes:

\begin{description}
\item[Normal Mode] \hfill

When a host is added to the \lstinline=condemned hosts= list, any suites
running on that host will automatically shut down and then restart, selecting
a new host from \lstinline=run hosts=.

For safety, before attempting to stop the suite, Cylc will first wait for any
jobs running locally (under background or at) to complete.

{\em In order for Cylc to be able to successfully restart suites, the
\lstinline=run hosts= must all be on a shared filesystem.}

\item[Force Mode] \hfill

If a host in the \lstinline=condemned hosts= list is suffixed with an
exclamation mark, Cylc will stop the suite without attempting to restart it,
and any local jobs (running under background or at) will be left running.
\end{description}

For example, in the following configuration any suites running on
\lstinline=foo= will attempt to restart on \lstinline=pub=, whereas any suites
running on \lstinline=bar= will stop immediately, making no attempt to restart.

\begin{lstlisting}
[suite servers]
run hosts = pub
condemned hosts = foo, bar!
\end{lstlisting}
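
As a rough illustration of how the two modes could be told apart when reading the \lstinline=condemned hosts= list, the Python sketch below splits entries on the trailing exclamation mark. The function name `split_condemned_hosts` is hypothetical and is not part of Cylc; it only mirrors the behaviour described above.

```python
def split_condemned_hosts(entries):
    """Split condemned host entries into (normal, force) lists.

    Hypothetical helper, not part of Cylc. An entry ending in "!" is
    treated as force mode (stop, no restart); any other entry is treated
    as normal mode (stop, then restart on one of the run hosts).
    """
    normal, force = [], []
    for entry in entries:
        host = entry.strip()
        if not host:
            # Ignore empty items such as those left by a trailing comma.
            continue
        if host.endswith('!'):
            force.append(host[:-1].strip())
        else:
            normal.append(host)
    return normal, force


# The example setting "condemned hosts = foo, bar!" as a list of items.
normal_hosts, force_hosts = split_condemned_hosts('foo, bar!'.split(','))
```

With the example configuration, `foo` ends up in the normal-mode list and `bar` in the force-mode list.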

To prevent large numbers of suites from attempting to restart simultaneously,
the \lstinline=auto restart delay= setting defines a period of time in seconds.
Each suite will wait for a random period of time between zero and
\lstinline=auto restart delay= seconds before attempting to stop and restart.
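
The staggering described above can be sketched in a few lines of Python. This is a minimal illustration, not Cylc's implementation: the function name `pick_restart_delay` and its `rng` parameter are assumptions made for the example.

```python
import random


def pick_restart_delay(auto_restart_delay, rng=random):
    """Return a random wait, in seconds, between 0 and auto_restart_delay.

    Hypothetical helper, not part of Cylc. A delay of zero (or less)
    means "do not wait".
    """
    if auto_restart_delay <= 0:
        return 0.0
    # uniform() draws a float between the two bounds.
    return rng.uniform(0, auto_restart_delay)


# Each suite would draw its own delay, spreading restarts over the window.
delay = pick_restart_delay(60)
```

Because every suite draws independently, restarts triggered by the same configuration change are spread across the configured window rather than happening at once.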

At present the auto stop-restart functionality can only operate provided
that the user has not specified any behaviour which is not preserved by
\lstinline=cylc restart= (e.g.\ a user-specified hold point or run mode).
Cylc will not attempt to auto stop-restart suites for which this is the case,
but will log a critical error message to alert the user. This caveat will be
removed in a future version.

See the \lstinline=[suite servers]= configuration section
(\ref{global-suite-servers}) for more details.

\section{Suite Storage, Discovery, Revision Control, and Deployment}
\label{SuiteStorageEtc}
26 changes: 26 additions & 0 deletions doc/src/cylc-user-guide/siterc.tex
Original file line number Diff line number Diff line change
Expand Up @@ -752,12 +752,38 @@ \subsection{[hosts]}

\subsection{[suite servers] }

\label{global-suite-servers}

Configure allowed suite hosts and ports for starting up (running or
restarting) suites and enabling them to be detected whilst running via
utilities such as \lstinline=cylc gscan=. Additionally configure host
selection settings specifying how to determine the most suitable run host at
any given time from those configured.

\subsubsection[auto restart delay]{[suite servers] \textrightarrow auto restart delay}

Relates to Cylc's auto stop-restart mechanism (see~\ref{auto-stop-restart}).
When a suite is set to automatically shut down and restart, it will first wait
a random period of time between zero and \lstinline=auto restart delay=
seconds before beginning the process. This prevents large numbers
of suites from restarting simultaneously.

\begin{myitemize}
\item {\em type:} integer
\item {\em default:} \lstinline=0=
\end{myitemize}

\subsubsection[condemned hosts]{[suite servers] \textrightarrow condemned hosts}

Hosts specified in \lstinline=condemned hosts= will not be considered as suite
run hosts. If suites are already running on \lstinline=condemned hosts=, they
will be automatically shut down and restarted (see~\ref{auto-stop-restart}).

\begin{myitemize}
\item {\em type:} comma-separated list of host names and/or IP addresses.
\item {\em default:} (none)
\end{myitemize}

\subsubsection[run hosts]{[suite servers] \textrightarrow run hosts }

A list of allowed suite run hosts. One of these hosts will be appointed for
4 changes: 2 additions & 2 deletions lib/cylc/cfgspec/glbl_cfg.py
Original file line number Diff line number Diff line change
Expand Up @@ -18,7 +18,7 @@
 """Allow lazy loading of `cylc.cfgspec.globalcfg`."""


-def glbl_cfg():
+def glbl_cfg(cached=True):
     """Load and return the global configuration singleton instance."""
     from cylc.cfgspec.globalcfg import GlobalConfig
-    return GlobalConfig.get_inst()
+    return GlobalConfig.get_inst(cached=cached)
20 changes: 17 additions & 3 deletions lib/cylc/cfgspec/globalcfg.py
Original file line number Diff line number Diff line change
Expand Up @@ -245,6 +245,8 @@
'run ports': [VDR.V_INTEGER_LIST, range(43001, 43101)],
'scan hosts': [VDR.V_STRING_LIST],
'scan ports': [VDR.V_INTEGER_LIST, range(43001, 43101)],
'condemned hosts': [VDR.V_STRING_LIST],
'auto restart delay': [VDR.V_INTERVAL],
'run host select': {
'rank': [VDR.V_STRING, 'random', 'load:1', 'load:5', 'load:15',
'memory', 'disk-space'],
Expand Down Expand Up @@ -424,9 +426,21 @@ class GlobalConfig(ParsecConfig):
USER_CONF_DIR_2 = os.path.join(os.environ['HOME'], '.cylc')

     @classmethod
-    def get_inst(cls):
-        """Return the singleton instance."""
-        if not cls._DEFAULT:
+    def get_inst(cls, cached=True):
+        """Return a GlobalConfig instance.
+
+        Args:
+            cached (bool):
+                If cached create if necessary and return the singleton
+                instance, else return a new instance.
+        """
+        if not cached:
+            # Return an up-to-date global config without affecting the
+            # singleton.
+            new_instance = cls(SPEC, upg, validator=cylc_config_validate)
+            new_instance.load()
+            return new_instance
+        elif not cls._DEFAULT:
             cls._DEFAULT = cls(SPEC, upg, validator=cylc_config_validate)
             cls._DEFAULT.load()
         return cls._DEFAULT
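
The effect of the new `cached` argument can be seen in isolation with a stand-in class. The sketch below is not Cylc code: `Config` and its `loads` counter are hypothetical and exist only to show that an uncached call builds and loads a fresh object while leaving the shared instance untouched.

```python
class Config(object):
    """Stand-in for a config class; this is not Cylc's GlobalConfig."""

    _DEFAULT = None
    loads = 0  # how many times load() has run, for illustration only

    def load(self):
        # A real implementation would read configuration files here.
        Config.loads += 1

    @classmethod
    def get_inst(cls, cached=True):
        if not cached:
            # Fresh, up-to-date instance; the singleton is not replaced.
            inst = cls()
            inst.load()
            return inst
        if cls._DEFAULT is None:
            cls._DEFAULT = cls()
            cls._DEFAULT.load()
        return cls._DEFAULT


shared = Config.get_inst()
fresh = Config.get_inst(cached=False)
assert Config.get_inst() is shared  # cached calls reuse the singleton
assert fresh is not shared          # uncached calls build a new object
```

A caller that needs to notice on-disk changes, such as a newly condemned host, would presumably use the uncached form, while everything else keeps sharing the singleton.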