restart: polling of local jobs when suite restarted on different host #2843

Closed
oliver-sanders opened this issue Nov 7, 2018 · 4 comments

@oliver-sanders
Member

When a suite is restarted on a different host, jobs that were running locally on the original host may get marked as failed.

The following test (which currently fails) outlines the issue:

#!/bin/bash
# THIS FILE IS PART OF THE CYLC SUITE ENGINE.
# Copyright (C) 2008-2018 NIWA & British Crown (Met Office) & Contributors.
# 
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.
#-------------------------------------------------------------------------------
. "$(dirname "$0")/test_header"
#-------------------------------------------------------------------------------
export CYLC_TEST_HOST=$( \
    cylc get-global-config -i '[test battery]remote host with shared fs' \
    2>'/dev/null')
if [[ -z "${CYLC_TEST_HOST}" ]]; then
    skip_all '"[test battery]remote host with shared fs": not defined'
fi
set_test_number 3
#-------------------------------------------------------------------------------
# Test Cylc's handling of local jobs when the suite is restarted on a different
# host.
TEST_DIR="$HOME/cylc-run/" init_suite "${TEST_NAME_BASE}" <<< '
[scheduling]
    [[dependencies]]
        graph = """
            foo:start => bar
            foo & bar => pub
        """
[runtime]
    [[foo]]
        # innocent bystander
        script = sleep 40
    [[bar]]
        script = """
            function poll() {
                local TIMEOUT="$(($(date +%s) + 60))" # wait 1 minute
                while (($(date +%s) < TIMEOUT)) && eval "$@"; do
                    sleep 1
                done
            }

            cylc stop "${CYLC_SUITE_NAME}" --now
            poll test -f "${CYLC_SUITE_RUN_DIR}/.service/contact"
            sleep 1
            cylc restart "${CYLC_SUITE_NAME}" --host="'"${CYLC_TEST_HOST}"'"
        """
'

cylc run "${SUITE_NAME}" --host='localhost'

# wait for suite to stop
FILE=$(cylc cat-log "${SUITE_NAME}" -m p |xargs readlink -f)
log_scan "${TEST_NAME_BASE}-stop" "${FILE}" 20 1 \
    'Suite shutting down - REQUEST(NOW)'

# wait for suite to restart
poll ! test -f "${SUITE_RUN_DIR}/.service/contact"
sleep 2

# wait for pub to succeed and the restarted suite to shut down automatically
FILE=$(cylc cat-log "${SUITE_NAME}" -m p |xargs readlink -f)
log_scan "${TEST_NAME_BASE}-restart" "${FILE}" 60 1 \
    '\[pub.1\] -(current:running)> succeeded' \
    'Suite shutting down - AUTOMATIC'

cylc stop "${SUITE_NAME}" --now --now
poll test -f "${SUITE_RUN_DIR}/.service/contact"
sleep 1
purge_suite "${SUITE_NAME}"
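
Note that the `poll` helper used in this test loops while its command succeeds, so `poll test -f FILE` waits for FILE to disappear and `poll ! test -f FILE` waits for it to appear. A minimal self-contained sketch of that behaviour, reusing the function body from the `bar` task above. The temporary file and the background `rm` are assumptions made purely for illustration; they stand in for the suite contact file and a stopping suite:

```shell
#!/bin/bash
# Same logic as the poll function in the test above: keep looping
# (for up to 60 seconds) while the given command succeeds.
poll() {
    local TIMEOUT="$(($(date +%s) + 60))"
    while (($(date +%s) < TIMEOUT)) && eval "$@"; do
        sleep 1
    done
}

# Illustrative stand-in for the suite contact file (hypothetical path).
CONTACT="$(mktemp)"

# Remove the file after 2 seconds, roughly as a stopping suite would.
(sleep 2; rm -f "${CONTACT}") &

# Returns once the file is gone (or once the 60 second timeout is reached).
poll test -f "${CONTACT}"

# Make sure the background subshell has finished before continuing.
wait
```

Because the loop condition is "command succeeded", a leading `!` inverts what is being waited for, which is why the test uses `poll ! test -f ...` to wait for the restart.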
@oliver-sanders oliver-sanders added the bug Something is wrong :( label Nov 7, 2018
@oliver-sanders oliver-sanders added this to the soon milestone Nov 7, 2018
@oliver-sanders
Member Author

See also: #2809 (comment)

@matthewrmshin
Contributor

Remark:

  • For background and at jobs, we want to mark them as running on the original suite host:
    • The job runs as a process on the original suite host.
  • For jobs submitted to a cluster via a batch scheduler, we want to continue to mark them as localhost jobs:
    • The job is unlikely to be running as a process on the original suite host.
    • This assumes that the new suite host has similar access to the cluster's batch scheduler client for polling the job.
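
The rule in this remark could be sketched as a small shell function. This is an illustration only, not Cylc's implementation: the function name `job_poll_host`, its arguments, and the example host name `hostA` are all assumptions introduced here.

```shell
#!/bin/bash
# Hypothetical helper (not part of Cylc): decide which host a restarted
# suite should use to poll a job that was originally a "localhost" job.
#   $1: batch system of the job (e.g. background, at, slurm)
#   $2: host the suite was running on when the job was submitted
job_poll_host() {
    local batch_sys="$1"
    local orig_suite_host="$2"
    case "${batch_sys}" in
        background|at)
            # The job is a process on the original suite host, so it has
            # to be polled there.
            echo "${orig_suite_host}"
            ;;
        *)
            # Assumes the new suite host has similar access to the
            # cluster's batch scheduler client, so the job can still be
            # treated as a localhost job.
            echo 'localhost'
            ;;
    esac
}

job_poll_host background hostA   # prints: hostA
job_poll_host slurm hostA        # prints: localhost
```

The default branch encodes the stated assumption about shared batch scheduler access; if that assumption did not hold for a site, the batch scheduler case would need the same treatment as background and at jobs.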

@oliver-sanders oliver-sanders mentioned this issue Nov 8, 2018
3 tasks
@hjoliver
Member

@matthewrmshin @oliver-sanders - can this be closed now? (by #2849)

@matthewrmshin matthewrmshin modified the milestones: soon, cylc-7.8.0 Nov 27, 2018
@matthewrmshin matthewrmshin self-assigned this Nov 27, 2018
@matthewrmshin
Contributor

Yes. Closed.
