Init remote via multiprocessing pool #2468

Merged

Conversation

matthewrmshin
Contributor

@matthewrmshin matthewrmshin commented Nov 2, 2017

Supersede #2400.

Introduce new commands to initialise/tidy the suite directory hierarchy for remote task runs. These are launched on a remote owner@host using the cylc.remote remote run mechanism. (Should tick a box in #2302.) We'll now have a remote initialisation command for each (host, owner) combo, which runs via the multi-processing pool. Submission of tasks for a (host, owner) combo is deferred until the command completes, with no hold-up to the main loop.
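
A minimal sketch of the deferral idea in the paragraph above, not the code in this PR. `RemoteInitTracker`, `ready_to_submit`, the status constants and the use of `ThreadPool` in place of cylc's own process pool are all assumptions made for illustration.

```python
from multiprocessing.pool import ThreadPool

REMOTE_INIT_DONE = "DONE"      # hypothetical status values
REMOTE_INIT_FAILED = "FAILED"


class RemoteInitTracker:
    """Track one initialisation command per (host, owner) pair."""

    def __init__(self, pool, init_func):
        self.pool = pool
        self.init_func = init_func   # callable(host, owner) -> bool
        self.remote_init_map = {}    # (host, owner) -> None or a status

    def status(self, host, owner):
        """Return the status, or None while the command is still pending."""
        key = (host, owner)
        if key not in self.remote_init_map:
            # First request for this pair: mark as pending, then queue the
            # command so that the caller is never blocked.
            self.remote_init_map[key] = None
            self.pool.apply_async(
                self.init_func, (host, owner),
                callback=lambda ok, key=key: self._done(key, ok),
                error_callback=lambda exc, key=key: self._done(key, False))
        return self.remote_init_map[key]

    def _done(self, key, ok):
        self.remote_init_map[key] = (
            REMOTE_INIT_DONE if ok else REMOTE_INIT_FAILED)


def ready_to_submit(tracker, tasks):
    """Split (name, host, owner) tasks into ready and deferred names."""
    ready, deferred = [], []
    for name, host, owner in tasks:
        if tracker.status(host, owner) == REMOTE_INIT_DONE:
            ready.append(name)
        else:
            deferred.append(name)  # pending or failed: try again later
    return ready, deferred
```

A main loop would call `ready_to_submit` on each pass: tasks for a remote that is still initialising stay deferred, and tasks for other remotes are not held up.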

Commands to select the remote host for tasks are also launched via the multi-processing pool. If multiple ready tasks have the same host select command string, the command will only be run once for the current batch of tasks. When the ready tasks have consumed the results, the logic will reset the results. This change will allow the main loop to be responsive while host select commands are running.

Update log and runtime database only before the jobs-submit command is launched.
Cleaner separation between job file write logic and job submission logic. (Still more to do, however.)

Address host initialisation/preparation part of #2292. (Host selection part of #2292 likely to be considered/superseded by #2199.)

Close #2292.

TO BE FULLY SITE TESTED.

@matthewrmshin matthewrmshin added the efficiency For notable efficiency improvements label Nov 2, 2017
@matthewrmshin matthewrmshin added this to the next release milestone Nov 2, 2017
@matthewrmshin matthewrmshin self-assigned this Nov 2, 2017
@matthewrmshin
Contributor Author

(The branch is the same as the original in #2400, but re-based against latest master.)

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from 75615d7 to a9eea1a Compare November 3, 2017 09:06
@hjoliver
Member

hjoliver commented Nov 5, 2017

(conflicts)

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from a9eea1a to b619620 Compare November 6, 2017 08:47
@matthewrmshin
Contributor Author

Re-based.

Member

@hjoliver hjoliver left a comment

Looks good, tests as working.

@hjoliver
Member

hjoliver commented Nov 6, 2017

Are you still intending to do this: #2400 (review) ? (maybe it's not necessary to, essentially, check that pool commands really do happen in the background).

@matthewrmshin
Contributor Author

(I am having another look at the logic, and I think something is not quite right, so please hold off reviewing for now.)

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from b619620 to 8d61b30 Compare November 8, 2017 15:48
@matthewrmshin matthewrmshin changed the title Init host refactor use multiproc pool Init remote via multiprocessing pool Nov 8, 2017
@matthewrmshin
Contributor Author

@hjoliver A question for you. For historical reasons, the remote initialisation logic copies the python/ sub-directory of the suite to the remote task host. The use of the python/ sub-directory is apparently deprecated according to the documentation. Should we continue to copy this sub-directory to the remote task host? Or should we be copying lib/python/ (and bin/ and other things) instead?

@hjoliver
Member

hjoliver commented Nov 8, 2017

@matthewrmshin good question. As I recall, the deprecation was to do with what is automatically added to sys.path in the scheduler. I don't recall what our policy is, if anything, (I can't look right now), on what is installed from suite dir to job hosts? But without some way of configuring that (migration of rose suite-run will solve this I guess) I suppose we should either copy nothing (i.e. it is up to the user) or everything?

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from 8d61b30 to e19b4e8 Compare November 9, 2017 11:26
@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch 6 times, most recently from 0bd8979 to de79487 Compare November 11, 2017 07:44
@matthewrmshin
Contributor Author

(The Codacy issues are nonsense in the context of this change.)

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from de79487 to 9b00e55 Compare November 13, 2017 10:33
@matthewrmshin
Contributor Author

@hjoliver This is now ready for another look.

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from 9b00e55 to abe4776 Compare November 13, 2017 12:42
@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from abe4776 to 106e8ed Compare November 13, 2017 12:51
@matthewrmshin
Contributor Author

(Finally got Codacy to not complain.)

@hjoliver
Member

I'm hoping to re-review this tomorrow (Tuesday).

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from 8a64c7f to 0d35bbb Compare November 24, 2017 20:58
@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from 0d35bbb to 62f018f Compare January 5, 2018 09:39
@matthewrmshin
Contributor Author

Branch re-based and de-conflicted. Codacy failure is not applicable in this case - the logic has been moved from one module to a new one - and STDIN is already being redirected from /dev/null.

@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from cdb0b50 to bd50196 Compare January 8, 2018 09:16
dvalters and others added 5 commits January 9, 2018 08:51
New commands to initialise/tidy suite directory hierarchy for remote
task runs. Commands now run in the background, and are launched on remote
hosts using the `cylc.remote` remote run mechanism.
(Tick a box in cylc#2302.)

Update log and runtime database only before the real submission command.
Task host select commands are now done via the process pool. If multiple
ready tasks have the same host select command string, the command will
only be run once for the current batch of tasks. When the ready tasks
have consumed the results, the logic will reset the results. This change
will allow the main loop to be responsive while host select commands are
running.
@matthewrmshin matthewrmshin force-pushed the init-host-refactor-use-multiproc-pool branch from bd50196 to a6fc760 Compare January 9, 2018 08:51
@matthewrmshin
Contributor Author

(Again, the Codacy issues are nonsense in the context of this change. The problem logic is just moved from an old module to a new module.)

Member

@hjoliver hjoliver left a comment

This looks pretty good, but I still need to do some testing (tomorrow)

@@ -422,6 +424,8 @@ comsum['broadcast'] = 'Change suite [runtime] settings on the fly'
comsum['jobs-kill'] = '(Internal) Kill task jobs'
comsum['jobs-poll'] = '(Internal) Retrieve status for task jobs'
comsum['jobs-submit'] = '(Internal) Submit task jobs'
comsum['remote-init'] = '(Internal) Initialise a task remote'
comsum['remote-tidy'] = '(Internal) Tidy a task remote'
Member

We should document this terminology, which I like, and which I guess comes from task [[[remote]]], and use it consistently - perhaps in another PR though.

I've mostly been using the uglier "task job host account" or similar, I think. Something like this?: a "task remote" is a user account, other than the suite host account, where a task job is submitted to run. It can be on the suite host machine or another machine.

Contributor Author

Can we do this in a follow-on PR? (This PR has lasted long enough for more changes.)

(The [user@]host syntax is sometimes called an authority - see https://en.wikipedia.org/wiki/URL for example - but everyone is confused by the phrase.)

Member

Yes, that's fine - and sorry for the review delay on this!

Re-submit only waiting tasks.
bin/cylc-submit Outdated
task_job_mgr.prep_submit_task_jobs(suite, itasks, dry_run=True)
while waiting_tasks:
prep_tasks, bad_tasks = task_job_mgr.prep_submit_task_jobs(
suite, itasks, dry_run=True)
Member

should this be waiting_tasks rather than itasks?

if len(ctx.cmd_kwargs['stdin_file_paths']) > 1:
    stdin_file = TemporaryFile()
    for file_path in ctx.cmd_kwargs['stdin_file_paths']:
        stdin_file.write(open(file_path, 'rb').read())
Member

If these stdin files are likely to be large it might be worth writing line by line to avoid loading the whole file into memory.
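
One way to address this concern without reading line by line is to copy in fixed-size chunks with `shutil.copyfileobj`. The sketch below assumes a hypothetical helper name, `concat_to_tempfile`, and a hypothetical chunk size; it is not the code in this PR.

```python
import shutil
from tempfile import TemporaryFile


def concat_to_tempfile(file_paths, chunk_size=64 * 1024):
    """Concatenate files into one temporary file, one chunk at a time."""
    stdin_file = TemporaryFile()
    for file_path in file_paths:
        with open(file_path, "rb") as handle:
            # At most chunk_size bytes are held in memory at any time.
            shutil.copyfileobj(handle, stdin_file, chunk_size)
    stdin_file.seek(0)  # rewind so that a reader starts at the beginning
    return stdin_file
```

Using `with` also closes each source file promptly, instead of leaving that to the garbage collector.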

Contributor Author

I think the main usage here is job files, so they are not likely to be big enough to cause issues.

Member

OK.

bin/cylc-submit Outdated
@@ -126,7 +136,12 @@ def main():
'Unable to prepare job file for %s' % itask.identity)
ret_code = 1
else:
task_job_mgr.submit_task_jobs(suite, itasks)
while waiting_tasks:
for itask in task_job_mgr.submit_task_jobs(suite, itasks):
Member

should this be waiting_tasks rather than itasks?

'remote-host-select', cmd, env=dict(os.environ)),
self._remote_host_select_callback, [cmd_str])
self.remote_host_str_map[cmd_str] = None
return self.remote_host_str_map[cmd_str]
Member

Should this return here?

If called for the first time the is_remote_host logic is bypassed, but if the result is cached the logic is run?

Contributor Author

@matthewrmshin matthewrmshin Jan 11, 2018

Yes, returning None here. The command has only been put into the process pool, so we'll have to wait for it to return.

The follow-on logic is only run for a straightforward host string, or for a host string returned by a successful/cached host select command.
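
The pattern described here can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `HostSelector`, `submit_command` and `reset` are hypothetical names, and failure handling is left out.

```python
class HostSelector:
    """Run each distinct host select command once per batch of tasks."""

    def __init__(self, submit_command):
        # submit_command(cmd_str, callback) is assumed to queue cmd_str in
        # the background and to call callback(cmd_str, host) once it is done.
        self.submit_command = submit_command
        self.remote_host_str_map = {}  # cmd_str -> None (pending) or host

    def get_host(self, cmd_str):
        """Return the selected host, or None if the result is not ready."""
        if cmd_str not in self.remote_host_str_map:
            self.remote_host_str_map[cmd_str] = None
            self.submit_command(cmd_str, self._callback)
        return self.remote_host_str_map[cmd_str]

    def _callback(self, cmd_str, host):
        self.remote_host_str_map[cmd_str] = host

    def reset(self):
        """Drop completed results so the next batch selects afresh."""
        for cmd_str, host in list(self.remote_host_str_map.items()):
            if host is not None:
                del self.remote_host_str_map[cmd_str]
```

A caller that receives None simply leaves the task waiting and asks again on a later pass of the main loop; any follow-on logic only runs once a real host string is available.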

Member

OK.

pass # Not yet initialised
else:
if status == REMOTE_INIT_FAILED:
del self.remote_init_map[(host, owner)] # reset to allow retry
Member

Do we need a mechanism to prevent a potential infinite loop of failed remote_init attempts?

Contributor Author

Yes, we should add this functionality when we tackle #2315.
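
For reference, one possible shape for such a guard is a per-remote failure counter. This is purely a hypothetical sketch: `RetryLimiter`, `handle_failed_init` and `max_failures` are invented names, and nothing here is part of this PR or of #2315.

```python
from collections import Counter

REMOTE_INIT_FAILED = "FAILED"  # hypothetical status value


class RetryLimiter:
    """Count failures per (host, owner) and cap the number of attempts."""

    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = Counter()

    def may_retry(self, host, owner):
        """Record one failure; return True if another attempt is allowed."""
        self.failures[(host, owner)] += 1
        return self.failures[(host, owner)] < self.max_failures


def handle_failed_init(remote_init_map, limiter, host, owner):
    """Reset a failed entry to allow a retry, unless the limit is reached."""
    if limiter.may_retry(host, owner):
        del remote_init_map[(host, owner)]  # reset to allow retry
        return True
    # Keep the failed entry in place so that no new attempt is started.
    return False
```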

Member

Noted.

self.suite, 'suite run directory', host, owner))
self.proc_pool.put_command(
SuiteProcContext(
'remote-init', cmd, stdin_file_paths=[tmphandle.name]),
Member

Would it be possible to scrap the need for tar files and do something like stdin_file_paths=[path for path, _ in items] at this end (maybe | xargs on the other side ...)?

Contributor Author

@matthewrmshin matthewrmshin Jan 11, 2018

No, I think usage of a TAR file is the cleanest implementation. It allows us to transfer a whole set of files (file names, modes, potentially binary contents, etc) via a single remote command on a single SSH session. #2302 demands that we don't use a shell on the remote side.
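
To illustrate the point, here is a minimal sketch of packing files into a TAR archive on one side and unpacking it from a stream on the other, with no shell involved. `pack_items` and `unpack_stream` are hypothetical helpers written for this example, not functions from this PR.

```python
import tarfile
from tempfile import NamedTemporaryFile


def pack_items(items):
    """Write (local_path, archive_name) pairs into a temporary TAR file."""
    tmphandle = NamedTemporaryFile()
    with tarfile.open(fileobj=tmphandle, mode="w") as tarhandle:
        for path, arcname in items:
            # File names, modes and (possibly binary) contents all travel
            # inside the one archive.
            tarhandle.add(path, arcname=arcname)
    tmphandle.seek(0)  # rewind so the archive can be fed to a command
    return tmphandle


def unpack_stream(stream, dest_dir):
    """Extract a TAR archive read sequentially from a stream, e.g. STDIN."""
    # Mode "r|" reads the archive as a non-seekable stream. Only use this
    # on archives from a trusted source.
    with tarfile.open(fileobj=stream, mode="r|") as tarhandle:
        tarhandle.extractall(dest_dir)
```

On the remote side, a command could call `unpack_stream(sys.stdin.buffer, dest_dir)`, so a single SSH session carries the whole set of files.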

Member

OK.

items = []
comm_meth = GLOBAL_CFG.get_host_item(
'task communication method', host, owner)
LOG.debug('comm_meth=%s' % comm_meth)
Member

Is this needed any more?

Contributor Author

OK for debugging purposes. I am sure we'll revisit this when we tackle #2214.

Member

@oliver-sanders oliver-sanders left a comment

Hunky dory!

@oliver-sanders oliver-sanders merged commit 5add31e into cylc:master Jan 12, 2018
@matthewrmshin matthewrmshin deleted the init-host-refactor-use-multiproc-pool branch January 12, 2018 15:17
Labels
efficiency For notable efficiency improvements
4 participants