Host select #3489

Merged
merged 11 commits into from
Mar 30, 2020

Conversation

oliver-sanders
Member

@oliver-sanders oliver-sanders commented Jan 28, 2020

Closes #3391, #3392, #3398, #3233
Addresses cylc/cylc-admin#65 (the [platform groups] part)
Addresses #2199 (the rose host-select part)

Headline:

Abstract the existing suite host selection infrastructure to allow us to use it for the new [platform groups] feature which is effectively the migration of rose host-select functionality.

Overview:

  • Abstract the pre-existing suite host selection interface:
    • This effectively migrates rose host-select functionality to Cylc.
  • Upgrade the threshold / ranking system to use psutil expressions (host-metrics: use psutil #3391).
    • This means we are no longer responsible for calculating memory usage etc.
    • This opens up the full functionality of psutil for extra configurability.
    • Compatibility improvements:
      • Now compatible with all Linux distros and kernel versions (not scraping /proc/meminfo any more)
      • BSD compatibility (including Darwin/MacOS).
      • Just for laughs psutil even works on Windows!
  • Also:
    • Improve CLI output in the "no hosts available" scenario to help users understand what this means.
    • Convert from a stateful class to a pure-functional implementation.
    • Unit-test the hell out of it (100% coverage when a remote test host is provided).
    • Add a new function for mocking the global config file (pytest: add test decorator for fiddling the return values of glbl_cfg #3392).
    • Fixes a bug with cylc.flow.remote:run_cmd and the stdin option (unit-tested).
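
A helper for mocking the global config, as mentioned in the list above, could look roughly like the sketch below. This is only an illustration: `config_module`, `get_run_hosts` and `mock_glbl_cfg` are stand-ins defined in the snippet, not the actual cylc.flow code or the decorator added in this PR.

```python
from contextlib import contextmanager
from unittest import mock
import types

# Hypothetical stand-in for a module exposing a global-config getter.
config_module = types.SimpleNamespace(
    glbl_cfg=lambda: {'suite hosts': {'run hosts': ['localhost']}}
)


def get_run_hosts():
    """Code under test: reads its settings via the global config getter."""
    return config_module.glbl_cfg()['suite hosts']['run hosts']


@contextmanager
def mock_glbl_cfg(fake_config):
    """Temporarily make glbl_cfg() return ``fake_config``."""
    with mock.patch.object(
        config_module, 'glbl_cfg', return_value=fake_config
    ):
        yield
```

Inside `with mock_glbl_cfg({...}):` the code under test sees the fake values, and the original getter is restored on exit, so tests do not depend on whatever config file is installed on the machine.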

Open Questions

  • The select_host method will only consider hosts which are visible from the current platform (because it first gets the fqdn of each host). This could be made optional either now or later if necessary.

Example: Cylc Suite Host Select:

~/.cylc/flow/8.0a1

[suite hosts]
    run hosts = foo, bar, baz
    condemned hosts = baz
    thresholds = """
        # 15min load average must be lower than 5
        getloadavg()[2] < 5

        # must have at least 1GB of ram to spare
        virtual_memory().available > 1000000000

        # rank by available memory
        virtual_memory().available
    """
>>> from cylc.flow.host_select import select_suite_host
>>> select_suite_host()
('foo', 'abc.foo.fqdn')
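
A multi-line string like the one in the config above mixes pass/fail thresholds with a value to rank by. One way to tell them apart is sketched below using the standard `ast` module. The rule used here (a comparison is a threshold, anything else is a ranking value) is an assumption for illustration, and `parse_ranking_string` is a hypothetical name, not necessarily what the PR implements.

```python
import ast


def parse_ranking_string(ranking_string):
    """Split a ranking string into threshold and ranking expressions.

    Assumption for this sketch: a comparison is a threshold (pass/fail)
    and any other expression is a value to rank hosts by.
    """
    thresholds = []
    rankings = []
    for line in ranking_string.splitlines():
        line = line.strip()
        if not line or line.startswith('#'):
            continue  # skip blank lines and comments
        expr = ast.parse(line, mode='eval').body
        if isinstance(expr, ast.Compare):
            thresholds.append(line)
        else:
            rankings.append(line)
    return thresholds, rankings


thresholds, rankings = parse_ranking_string('''
    # must have at least 1GB of ram to spare
    virtual_memory().available > 1000000000

    # rank by available memory
    virtual_memory().available
''')
```

Parsing only inspects the syntax, so psutil does not need to be installed where the string is read; the expressions would only be evaluated on the hosts being ranked.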

Example: Abstract Host Select Interface:

>>> from cylc.flow.host_select import select_host
>>> select_host(
...     ['foo', 'bar', 'baz'],
...     ranking_string='''
...         virtual_memory().available > 123456789123456789
...     '''
... )
Traceback (most recent call last):
...
cylc.flow.exceptions.HostSelectException: Could not select host from:
    foo:
        virtual_memory().available > 123456789123456789: False
    bar:
        virtual_memory().available > 123456789123456789: False
    baz:
        virtual_memory().available > 123456789123456789: False
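
For reference, a message shaped like the traceback above can be built from per-host results along these lines. `HostSelectError` is a hypothetical stand-in written for this sketch; it is not the real `cylc.flow.exceptions.HostSelectException`, and the `{host: {expression: outcome}}` layout is assumed.

```python
class HostSelectError(Exception):
    """Hypothetical stand-in for cylc.flow.exceptions.HostSelectException."""

    def __init__(self, results):
        # results: {host: {expression: outcome}}
        self.results = results
        super().__init__(self._format())

    def _format(self):
        """List every host with the outcome of each expression."""
        lines = ['Could not select host from:']
        for host, checks in self.results.items():
            lines.append(f'    {host}:')
            for expression, outcome in checks.items():
                lines.append(f'        {expression}: {outcome}')
        return '\n'.join(lines)
```

Reporting each expression next to its outcome lets a user see which threshold ruled a host out, rather than only learning that no host was available.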

TODO:

(pending cursory approval)

  • Changelog
  • Docs

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Appropriate tests are included (unit and/or functional).
  • Appropriate change log entry included.
  • (master branch) I have opened a documentation PR at cylc/cylc-doc/pull/XXXX.

@oliver-sanders oliver-sanders force-pushed the host-select branch 2 times, most recently from d206297 to cf7b351 Compare January 28, 2020 21:12
@oliver-sanders oliver-sanders self-assigned this Jan 28, 2020
@oliver-sanders oliver-sanders added this to the cylc-8.0.0 milestone Jan 28, 2020
@oliver-sanders
Member Author

@dpmatthews you may want to take a cursory scan of this.

@hjoliver
Member

hjoliver commented Jan 28, 2020

Nice improvement 🎉

One user-facing change here is that you need decent Python (and psutil) chops to get the new "threshold string" right, but we can document how to do it, and make sure that Python errors are properly flagged (you've probably done the latter already, but I haven't looked yet...).

@oliver-sanders
Member Author

you need decent Python (and psutil) chops to get the new "threshold string" right, but we can document how to do it

So we could actually write an upgrade to do this easily enough but as it's a site config for admin use I think a quick example and a link to the psutil docs should suffice.

you've probably done the latter already

Any non-zero exit code gets treated the same way, the psutil stuff all happens in the remote subprocess so can't leak into the executing program.
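
The isolation described here can be sketched as follows. The snippet runs a local child Python process as a stand-in for the remote command, and `evaluate_in_subprocess` is a hypothetical helper, not the PR's code; the point is only that any failure shows up as a non-zero exit code instead of an exception in the caller.

```python
import subprocess
import sys


def evaluate_in_subprocess(expression):
    """Evaluate ``expression`` in a child Python process.

    Returns the printed result as a string, or None if the child exited
    with a non-zero code for any reason (bad syntax, unknown name...).
    """
    proc = subprocess.run(
        [sys.executable, '-c', f'print({expression})'],
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        universal_newlines=True,
    )
    if proc.returncode != 0:
        return None  # every failure is handled the same way
    return proc.stdout.strip()
```

A syntax error and an undefined name both come back as `None` here, which mirrors the "treated the same way" behaviour: the caller never has to handle exceptions raised by the user's expression.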

@dpmatthews
Contributor

Looks good. I assume this will replace [suite servers][run host select]rank & [suite servers][run host select]thresholds.

I don't think thresholds is the right name for the new setting - maybe rank method?

I assume selection will always choose the highest returned value? (need to remember this when ranking by load for example)

@oliver-sanders
Member Author

Yep, will rename. Higher is better, so for server load take the reciprocal (1 / load); will document.
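
To illustrate the "higher is better" rule with made-up numbers: ranking by raw load would favour the busiest host, so the value needs inverting. `pick_host` and the load figures below are hypothetical and only demonstrate the arithmetic.

```python
def pick_host(metrics, rank):
    """Return the host whose ``rank(metric)`` value is highest."""
    return max(metrics, key=lambda host: rank(metrics[host]))


# Hypothetical 15-minute load averages per host.
load = {'foo': 2.0, 'bar': 0.5, 'baz': 4.0}

# Ranking by raw load would pick the busiest host...
busiest = pick_host(load, rank=lambda value: value)
# ...so invert it to prefer the least loaded one.
quietest = pick_host(load, rank=lambda value: 1 / value)
```

Note that a load of exactly zero would raise `ZeroDivisionError` with a plain reciprocal, so a real ranking expression may want to guard against that, or negate the load instead.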

The only thing I’ve not built in yet is the ability to specify your own command or script, e.g. to rank by queue length, perhaps a later job.

@oliver-sanders
Member Author

Travis is consistently failing for:

./flakytests/restart/39-auto-restart-no-suitable-host.t (Wstat: 256 Tests: 0 Failed: 0)
./flakytests/restart/40-auto-restart-force-stop.t

I cannot get these tests to fail locally (NIWA or MO) so will have to push up some experimental debug commits. In the meantime, this is ready for review.

@oliver-sanders
Member Author

Test now passing as of f42ff04.

The issue is that I was using the host FQDN returned by host selection rather than the specified host name (e.g. localhost) which was causing a hostname verification issue on Travis.
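
A minimal sketch of the idea behind that fix: keep a mapping from each resolved FQDN back to the name as specified, and hand both back to the caller. The function names and the injected `get_fqdn` / `choose` callables are hypothetical; in practice `get_fqdn` could be something like the standard library's `socket.getfqdn`.

```python
def resolve_hosts(hosts, get_fqdn):
    """Map each resolved FQDN back to the host name as it was specified."""
    return {get_fqdn(host): host for host in hosts}


def select(hosts, get_fqdn, choose):
    """Return (specified_name, fqdn) for the chosen host."""
    by_fqdn = resolve_hosts(hosts, get_fqdn)
    fqdn = choose(sorted(by_fqdn))
    return by_fqdn[fqdn], fqdn
```

Returning the pair matches the `('foo', 'abc.foo.fqdn')` shape shown in the PR description, and lets the caller keep using the specified name (e.g. localhost) wherever hostname verification expects it.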

'memory', 'disk-space'],
'thresholds': [VDR.V_STRING],
},
'thresholds': [VDR.V_STRING]
Member

See Dave's comments?

from cylc.flow.remote import remote_cylc_cmd, run_cmd


def select_suite_host(cached=True):
Member

Is this extra layer of function necessary? It's only adding 1 arg and 1 line to the wrapped function.

Member Author

The purpose of this method is to extract the required configurations from the global conf so that the interface is independent of the configuration file. For example when I rename the thresholds configuration I will only have to update the code in one place.
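
The layering described in that reply might be sketched like this. Everything here is simplified for illustration: `GLOBAL_CONFIG` is a hypothetical stand-in for the parsed global config, the selection logic is stubbed out, and the `blacklist` parameter is an assumption rather than the confirmed signature.

```python
# Hypothetical stand-in for the parsed global configuration.
GLOBAL_CONFIG = {
    'suite hosts': {
        'run hosts': ['foo', 'bar', 'baz'],
        'condemned hosts': ['baz'],
        'thresholds': 'virtual_memory().available',
    }
}


def select_host(hosts, ranking_string=None, blacklist=None):
    """Config-agnostic interface (selection logic stubbed out here)."""
    blacklist = blacklist or []
    candidates = [host for host in hosts if host not in blacklist]
    return candidates[0]


def select_suite_host(config=GLOBAL_CONFIG):
    """Thin wrapper: the only place that knows the config key names."""
    section = config['suite hosts']
    return select_host(
        section['run hosts'],
        ranking_string=section['thresholds'],
        blacklist=section['condemned hosts'],
    )
```

With this split, renaming the `thresholds` setting only touches `select_suite_host`, while `select_host` stays reusable for other callers such as platform groups.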

@@ -0,0 +1,155 @@
"""Test the cylc.flow.host_select module with remote hosts.

NOTE: These tests require a remote host to work with and are skipped
Member

I think that you should offer a pointer on how to go about providing a remote host.

Member Author

Uses the same system as the functional tests (see git grep 'remote host with shared fs').

@oliver-sanders oliver-sanders force-pushed the host-select branch 3 times, most recently from 66b82ea to 411cc83 Compare March 15, 2020 23:47
@hjoliver
Member

Codacy Here is an overview of what got changed by this pull request:

Complexity increasing per file
==============================
- cylc/flow/host_select.py  15
- cylc/flow/scripts/cylc_psutil.py  4
         

Complexity decreasing per file
==============================
+ cylc/flow/scheduler_cli.py  -1
         

See the complete overview on Codacy

@oliver-sanders
Member Author

oliver-sanders commented Mar 16, 2020

The coverage score appears to be wrong :(

When I run the test locally:

$ pytest --cov cylc/flow/tests/test_host_select*
...
cylc/flow/host_select.py                              151     11     96      0    95.55%

Looking at the CodeCov results it's got all of the doctests, and maybe one of the unittests?

@kinow
Member

kinow commented Mar 16, 2020

But codecov only shows the method as being called once.

I am not sure how well codecov is able to understand how many times code is called. That number 1, I think, is the number of builds received by Codecov from Travis that have coverage for that function.

If you go to Codecov, then open Pulls, select the Host Select one, and look at "Build", you can click on "Download" to see the .txt received by Codecov (which contains all the coverage data, including XML output of coverage.py).

I had a look, and only one of these files had any coverage for host_select.py (this one - search for host_select.py a few times).

image

Looking at the CodeCov results it's got all of the doctests, and maybe one of the unittests?

Maybe coverage.py and codecov don't compute the coverage in the same way? I tried running it in PyCharm but two tests failed. Will have a better look tomorrow as I had issues with Codecov a few days ago and want to investigate if our reports are missing something 👍

@kinow
Member

kinow commented Mar 17, 2020

@oliver-sanders just tried locally

$ pytest cylc/flow/tests/test_host_select* --cov=cylc
platform linux -- Python 3.7.3, pytest-5.3.5, py-1.8.1, pluggy-0.13.1 -- /home/kinow/Development/python/workspace/cylc-flow/venv/bin/python
cachedir: .pytest_cache
rootdir: /home/kinow/Development/python/workspace/cylc-flow, inifile: pytest.ini
plugins: cov-2.8.1
collected 12 items / 2 skipped / 10 selected                                                                                                                                                                      

cylc/flow/tests/test_host_select.py::test_hostname_checking PASSED                                                                                                                                          [  8%]
cylc/flow/tests/test_host_select.py::test_localhost PASSED                                                                                                                                                  [ 16%]
cylc/flow/tests/test_host_select.py::test_unique PASSED                                                                                                                                                     [ 25%]
cylc/flow/tests/test_host_select.py::test_filter PASSE....
...
...
cylc/flow/host_select.py                              151     11     96      0    95.55%

Then coverage xml produced an XML with line-rate="0.9272":

image

Looking again at the data uploaded to Codecov for this branch, it instead contains line-rate="0.7881".

Still doesn't explain what's wrong, but at least we know that Codecov is just reporting back the number that was uploaded. The problem may be in our Travis CI set up I guess.
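
For anyone wanting to repeat this comparison without reading the XML by eye, the per-file `line-rate` can be pulled out with the standard library. The sample below is a trimmed, made-up document in the Cobertura-style layout that `coverage xml` writes; the values are illustrative and `line_rate` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Trimmed, made-up sample in the Cobertura layout written by `coverage xml`.
SAMPLE = """<coverage line-rate="0.9">
  <packages>
    <package name="cylc.flow">
      <classes>
        <class filename="cylc/flow/host_select.py" line-rate="0.9272"/>
        <class filename="cylc/flow/remote.py" line-rate="0.8"/>
      </classes>
    </package>
  </packages>
</coverage>"""


def line_rate(xml_text, filename):
    """Return the line-rate recorded for ``filename``, or None if absent."""
    root = ET.fromstring(xml_text)
    for node in root.iter('class'):
        if node.get('filename') == filename:
            return float(node.get('line-rate'))
    return None
```

Running the same function over the locally generated report and the one downloaded from Codecov would show whether the two uploads disagree for a given file.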

@oliver-sanders
Member Author

The problem may be in our Travis CI set up I guess.

Ok, either way I don't think it's the fault of this PR, thanks for looking into this. Must prioritise GH actions...

@kinow
Member

kinow commented Mar 17, 2020

Started running locally the same commands as Travis. Stopping after each command and checking command line output locally and on Travis unit tests.

After a while, found one difference that could explain the coverage being lower than we expected, @oliver-sanders. One test is skipped on Travis.

image

Not sure if that's enough to justify the decrease from ~95% to ~85%, but at least we have something to blame.

@oliver-sanders
Member Author

I knew about that skip (had to skip the test to get it to pass on Travis - DNS issues) but, as you say, it shouldn't account for such a large difference. When you look at the lines tested on codacy it's clear that it's completely missing tests altogether.

@hjoliver
Member

Where are we at with this? Still looking for the reason for weird coverage results? (I'm trying to catch up on stuff...)

@oliver-sanders
Member Author

Where are we at with this?

  • The branch is good to go
  • The project wide coverage currently doesn't capture cylc subcommands due to the command re-invocation (will be fixed by unify cylc cli (under click?) #3525).
  • There seems to be a small discrepancy between the coverage for the unit tests added in this PR and the coverage being measured by Travis. Perhaps a test setup issue; we need to get going with actions anyway, so perhaps build coverage back in then.

@oliver-sanders
Member Author

Poke

Member

@hjoliver hjoliver left a comment

Couple of minor comments, but tested and pretty much approved. Very sophisticated 🤔 (use of ast ... jeez, I have some catching up to do).

Co-Authored-By: Hilary James Oliver <hilary.j.oliver@gmail.com>
Member

@hjoliver hjoliver left a comment

👍

@hjoliver
Member

Travis passed already, I just committed a smaller conflict resolution to this branch via the GH UI (setup.py only) so ignoring the travis re-build. Can't cancel it though, the control buttons have disappeared again 😬

Development

Successfully merging this pull request may close these issues.

host-metrics: use psutil
5 participants