Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PBS 13 "multi-cluster" support. #2877

Merged
merged 7 commits into from
Nov 26, 2018
Merged

Conversation

hjoliver
Copy link
Member

@hjoliver hjoliver commented Nov 23, 2018

Close #2779

New batch system handler lib/cylc/batch_sys_handlers/pbs_multi_cluster.py docstring:

Support PBS clients that front heterogeneous clusters where the Job ID returned
by qsub is <id>.<server>. PBS 13 qstat and qdel need <id>.<server>@<server>.
From PBS 14, the standard cylc PBS module works ("@<server>" is not needed).

So this PBS handler writes "job_id@server" to the job status file, and appends
"@server" to Job IDs returned by qstat, to matched the stored IDs.

This should be an easy review: you might not be able to test it, but you can tell that it won't affect any existing cases.

Not assigning @kinow to review as he doesn't have access to PBS. I've tested this on an Azure cluster with PBS, set up to mimic this "multi cluster" heterogeneous environment.

I don't think I can make an integration test for this - it requires a non-trivial cluster environment, and an old version of PBS Pro. Obviously could make a small unit test for the new function, once our new unit testing machinery is in place. [unit tests added thanks to @kinow]

@matthewrmshin - please assign another reviewer from your team...

@hjoliver hjoliver added this to the next release milestone Nov 23, 2018
@hjoliver hjoliver self-assigned this Nov 23, 2018
@hjoliver hjoliver changed the title Pbs 13 "multi-cluster" support. PBS 13 "multi-cluster" support. Nov 23, 2018
@kinow
Copy link
Member

kinow commented Nov 23, 2018

I used PBS Torque in the past, and have a Docker container that I used some time ago. Still keen to spin it up and see what's the output of... qstat? Can't remember the commands well.

Maybe we could add a test too?

#!/usr/bin/env python2

# THIS FILE IS PART OF THE CYLC SUITE ENGINE.
# Copyright (C) 2008-2018 NIWA & British Crown (Met Office) & Contributors.
#
# This program is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program.  If not, see <http://www.gnu.org/licenses/>.

import unittest

from cylc.batch_sys_handlers.pbs_multi_cluster import *


def get_test_filter_poll_many_output():
    return [
        ("ignore\nignore\n123.localhost ignored", ["123.localhost@localhost"]),
        ("\n\n#\n#\n10.samba", ["10.samba@samba"]),
        (
            """
            a
            a
            12.localhost ____ # this is ignored
            09.jd-01
            """,
            [
                "12.localhost@localhost",
                "09.jd-01@jd-01",
            ]
        ),
        ("\n\n1.localhost", [])
    ]


def get_test_filter_poll_many_output_invalid():
    return [
        # error raised due to multiple `.` chars
        ("a\nb\n1.wd-00.terra", ValueError),
        # could not find a `.`
        ("a\nb\nnot_an_id", ValueError)
    ]


def get_test_manip_job_id():
    return [
        ("1.localhost", "1.localhost@localhost"),
        ("10000.jd-01", "10000.jd-01@jd-01")
    ]


def get_test_manip_job_id_invalid():
    return [
        ("1.jd-01.terra.luna", ValueError),
        ("103", ValueError)
    ]


class TestPBSMultiCluster(unittest.TestCase):

    def test_filter_poll_many_output(self):
        """Basic tests for filter_poll_many_output."""
        for out, expected in get_test_filter_poll_many_output():
            job_ids = PBSMulticlusterHandler.filter_poll_many_output(out)
            self.assertEqual(expected, job_ids)

    def test_filter_poll_many_output_invalid(self):
        """Test filter_poll_many_output with invalid values."""
        for out, ex in get_test_filter_poll_many_output_invalid():
            with self.assertRaises(ex):
                PBSMulticlusterHandler.filter_poll_many_output(out)

    def test_manip_job_id(self):
        """Basic tests for manip_job_id."""
        for job_id, expected in get_test_manip_job_id():
            mod_job_id = PBSMulticlusterHandler.manip_job_id(job_id)
            self.assertEqual(expected, mod_job_id)

    def test_manip_job_id_invalid(self):
        """Basic tests for manip_job_id with invalid values."""
        for job_id, ex in get_test_manip_job_id_invalid():
            with self.assertRaises(ex):
                PBSMulticlusterHandler.manip_job_id(job_id)

    def test_export_handler(self):
        import cylc.batch_sys_handlers.pbs_multi_cluster as m
        self.assertTrue(hasattr(m, 'BATCH_SYS_HANDLER'))

Writing the test, I noticed that it doesn't accept host names with dots... can't recall what the output looked like? @hjoliver Q is it valid: 1234@host.domain.bla? Or are server names also just simple characters, with no dots?

@kinow
Copy link
Member

kinow commented Nov 23, 2018

That test could be something like lib/cylc/tests/test_pbs_multi_cluster.py, and be executed once in place with:

$ PYTHONPATH=$(pwd -P)/lib pytest lib/cylc/tests/test_pbs_multi_cluster.py 
=============================================================================================== test session starts ===============================================================================================
platform linux2 -- Python 2.7.15rc1, pytest-3.9.1, py-1.7.0, pluggy-0.8.0
rootdir: /home/kinow/Development/python/workspace/cylc, inifile:
plugins: cov-2.6.0
collected 5 items                                                                                                                                                                                                 

lib/cylc/tests/test_pbs_multi_cluster.py .....                                                                                                                                                              [100%]

============================================================================================ 5 passed in 0.03 seconds =============================================================================================

And provide 100% coverage for the new code 🎉

@kinow
Copy link
Member

kinow commented Nov 23, 2018

Testing PBS + Cylc with Docker is doable, but would take a couple of days at least to find the right setup. However, I could submit a PBS job with agaveapi/torque.

[testuser@docker ~]$ qsub /home/testuser/torque.submit 
0.docker.example.com

So I think the hostname part may have multiple dots... so maybe?

diff --git a/lib/cylc/batch_sys_handlers/pbs_multi_cluster.py b/lib/cylc/batch_sys_handlers/pbs_multi_cluster.py
index 747b05375..ca33f274f 100644
--- a/lib/cylc/batch_sys_handlers/pbs_multi_cluster.py
+++ b/lib/cylc/batch_sys_handlers/pbs_multi_cluster.py
@@ -38,7 +38,7 @@ class PBSMulticlusterHandler(PBSHandler):
         lines = out.split('\n')
         for line in lines[2:]:
             job = line.split()[0]
-            _, server = job.split('.')
+            _, server = job.split('.', 1)
             job_ids.append(job + '@' + server)
         return job_ids

@hjoliver
Copy link
Member Author

@kinow - thanks for the review! You're dead right about the hostname dots - oops, I'd better fix that. Also thanks for the unit tests, I'll put them in...

@hjoliver
Copy link
Member Author

(Well, actually I'm not sure about the hostname, I was going by the info given to me on the real system this targets, which I don't have access to, and the mimic setup on Azure, ... so it's possible that PBS Pro 13 isn't using use a fully qualified name in this context. But best to be safe I guess.)

@hjoliver
Copy link
Member Author

Unit tests added. (As I recall we're not explicitly hooking these into cylc test-battery now - right?)

@kinow
Copy link
Member

kinow commented Nov 23, 2018

Unit tests added.

You may have to adjust the tests if you change the behaviour for the dots in the host name. I've added a test to validate that there.

(As I recall we're not explicitly hooking these into cylc test-battery now - right?)

Yup, for now the tests are not executed (they were not executed before as well I think). But you can run manually. The other PR to add coverage reports is taking care to call all tests, test-battery/unit.

@hjoliver
Copy link
Member Author

You may have to adjust the tests if you change the behaviour for the dots in the host name.

Yes I did that

they were not executed before as well I think

we were doing this for the few-ish existing unit tests:

$ git grep 'unit tests' tests
tests/api-suite-info/04-api-suite-info-unit-tests.t:# Run suite info API unit tests.
tests/cylc.scheduler_cli/00-host-appointer.t:# Run unit tests to test HostAppointer class for selecting hosts.
tests/graph_parser/00-unittests.t:# Run graph parser unit tests.
tests/param_expand/00-unittests.t:# Run graph parser unit tests.
tests/suite-host-self-id/01-unit.t:# Run graph parser unit tests.

... might need to remove those now?

@kinow
Copy link
Member

kinow commented Nov 23, 2018

Oh, you were right @hjoliver ! Had a look at the coverage branch, and hadn't removed all the tests from test-battery. Now cylc travis-battery should be running slightly faster (though probably still not noticeable 😢 ).

Copy link
Contributor

@matthewrmshin matthewrmshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. 2 comments on style.

"""For job_id's of the form "id.server", return job_id@server."""
out = out.strip()
job_ids = []
lines = out.split('\n')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lines = out.splitlines()

Copy link
Member

@oliver-sanders oliver-sanders left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't test this but it looks good as far as I can see.

@hjoliver
Copy link
Member Author

@matthewrmshin @oliver-sanders - you'd better take another quick look, since I addressed your feedback. @kinow - note I've changed your suggested unit tests a bit: assuming qstat always prints two header lines, then one line for each job ID; and unmatched IDs are returned as-is rather than raising an exception.

@codecov
Copy link

codecov bot commented Nov 24, 2018

Codecov Report

Merging #2877 into master will increase coverage by 0.03%.
The diff coverage is 93.75%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2877      +/-   ##
==========================================
+ Coverage   65.22%   65.26%   +0.03%     
==========================================
  Files         106      107       +1     
  Lines       15113    15133      +20     
  Branches     3319     3323       +4     
==========================================
+ Hits         9857     9876      +19     
- Misses       5256     5257       +1
Impacted Files Coverage Δ
lib/cylc/batch_sys_handlers/pbs_multi_cluster.py 100% <100%> (ø)
lib/cylc/batch_sys_manager.py 53.68% <50%> (-0.02%) ⬇️
lib/cylc/task_job_mgr.py 62.47% <0%> (+0.32%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 1ae0028...356dd39. Read the comment docs.

@hjoliver
Copy link
Member Author

(94% patch coverage (one line missed) ... to get to 100% probably requires extensive unit testing of batch_sys_manager.py, with mock(?) ... later...)

@matthewrmshin
Copy link
Contributor

(Yes, the logic in cylc.batch_sys_manager is tested enough under site tests - but with PBS, Slurm, etc tests being skipped in the Travis CI environment, the coverage stats suffer. Mock will help.)

Copy link
Contributor

@matthewrmshin matthewrmshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good.

@matthewrmshin
Copy link
Contributor

@sadielbartholomew please sanity check and merge as @oliver-sanders is on leave today.

Copy link
Collaborator

@sadielbartholomew sadielbartholomew left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sanity check passed. Great!

@sadielbartholomew sadielbartholomew merged commit daf7dee into cylc:master Nov 26, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants