-
Notifications
You must be signed in to change notification settings - Fork 94
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Init remote via multiprocessing pool #2468
Init remote via multiprocessing pool #2468
Conversation
(The branch is the same as the original in #2400, but re-based against latest master.) |
75615d7
to
a9eea1a
Compare
(conflicts) |
a9eea1a
to
b619620
Compare
Re-based. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, tests as working.
Are you still intending to do this: #2400 (review) ? (maybe it's not necessary to, essentially, check that pool commands really do happen in the background). |
(I am having another look at the logic, and I think something is not quite right, so please hold off reviewing for now.) |
b619620
to
8d61b30
Compare
@hjoliver A question for you. For historical reason, the remote initialisation logic copies the |
@matthewrmshin good question. As I recall, the deprecation was to do with what is automatically added to sys.path in the scheduler. I don't recall what our policy is, if anything, (I can't look right now), on what is installed from suite dir to job hosts? But without some way of configuring that (migration of rose suite-run will solve this I guess) I suppose we should either copy nothing (i.e. it is up to the user) or everything? |
8d61b30
to
e19b4e8
Compare
0bd8979
to
de79487
Compare
(The Codacy issues are nonsense in the context of this change.) |
de79487
to
9b00e55
Compare
@hjoliver This is now ready for another look. |
9b00e55
to
abe4776
Compare
abe4776
to
106e8ed
Compare
(Finally got Codacy to not complain.) |
106e8ed
to
8a64c7f
Compare
I'm hoping to re-review this tomorrow (Tuesday). |
8a64c7f
to
0d35bbb
Compare
0d35bbb
to
62f018f
Compare
Branch re-based and de-conflicted. Codacy failure is not applicable in this case - the logic has been moved from one module to a new one - and STDIN is already being redirected from |
cdb0b50
to
bd50196
Compare
New commands to initialise/tidy suite directory hierarchy for remote task runs. Commands now run in the background, and launched on remote hosts as using `cylc.remote` remote run mechanism. (Tick a box in cylc#2302.) Update log and runtime database only before the real submission command.
Task host select commands are now done via the process pool. If multiple ready tasks have the same host select command string, the command will only be run once for the current batch of tasks. When the ready tasks have consumed the results, the logic will reset the results. This change will allow the main loop to be responsive while host select commands are running.
bd50196
to
a6fc760
Compare
(Again, the Codacy issues are nonsense in the context of this change. The problem logic is just moved from an old module to a new module.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good, but I still need to do some testing (tomorrow)
@@ -422,6 +424,8 @@ comsum['broadcast'] = 'Change suite [runtime] settings on the fly' | |||
comsum['jobs-kill'] = '(Internal) Kill task jobs' | |||
comsum['jobs-poll'] = '(Internal) Retrieve status for task jobs' | |||
comsum['jobs-submit'] = '(Internal) Submit task jobs' | |||
comsum['remote-init'] = '(Internal) Initialise a task remote' | |||
comsum['remote-tidy'] = '(Internal) Tidy a task remote' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should document this terminology, which I like, and which I guess comes from task [[[remote]]]
, and use it consistently - perhaps in another PR though.
I've mostly been using the uglier "task job host account" or similar, I think. Something like this?: a "task remote" is a user account, other than the suite host account, where a task job is submitted to run. It can be on the suite host machine or another machine.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we do this in a follow-on PR? (This PR has lasted long enough for more changes.)
(The [user@]host
syntax is sometimes called an authority - see https://en.wikipedia.org/wiki/URL for example - but everyone are confused by the phrase.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's fine - and sorry for the review delay on this!
Re-submit only waiting tasks.
bin/cylc-submit
Outdated
task_job_mgr.prep_submit_task_jobs(suite, itasks, dry_run=True) | ||
while waiting_tasks: | ||
prep_tasks, bad_tasks = task_job_mgr.prep_submit_task_jobs( | ||
suite, itasks, dry_run=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be waiting_tasks
rather than itasks
?
if len(ctx.cmd_kwargs['stdin_file_paths']) > 1: | ||
stdin_file = TemporaryFile() | ||
for file_path in ctx.cmd_kwargs['stdin_file_paths']: | ||
stdin_file.write(open(file_path, 'rb').read()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these stdin files are likely to be large it might be worth writing line by line to avoid loading the whole file into memory.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the main usage here is job files, so no likely to be big enough to cause issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
bin/cylc-submit
Outdated
@@ -126,7 +136,12 @@ def main(): | |||
'Unable to prepare job file for %s' % itask.identity) | |||
ret_code = 1 | |||
else: | |||
task_job_mgr.submit_task_jobs(suite, itasks) | |||
while waiting_tasks: | |||
for itask in task_job_mgr.submit_task_jobs(suite, itasks): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should this be waiting_tasks
rather than itasks
?
'remote-host-select', cmd, env=dict(os.environ)), | ||
self._remote_host_select_callback, [cmd_str]) | ||
self.remote_host_str_map[cmd_str] = None | ||
return self.remote_host_str_map[cmd_str] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this return here?
If called for the first time the is_remote_host
logic is bypassed but if the result is cashed the logic is run?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, returning None
here. The command has only been put into the process pool, so we'll have to wait for it to return.
The follow-on logic is only run for a straightforward host string, or for a host string returned by a successful/cached host select command.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
pass # Not yet initialised | ||
else: | ||
if status == REMOTE_INIT_FAILED: | ||
del self.remote_init_map[(host, owner)] # reset to allow retry |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need a mechanism to prevent the potential for an infinite loop of failed remote_init attempts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, we should add this functionality when we tackle #2315.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted.
self.suite, 'suite run directory', host, owner)) | ||
self.proc_pool.put_command( | ||
SuiteProcContext( | ||
'remote-init', cmd, stdin_file_paths=[tmphandle.name]), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would it be possible to scrap the need for tar files and do something like stdin_file_paths=[path for path, _ in items]
at this end (maybe | xargs
on the other side ...)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, I think usage of a TAR file is the cleanest implementation. It allows us to transfer a whole set of files (file names, modes, potentially binary contents, etc) via a single remote command on a single SSH session. #2302 demands that we don't use a shell on the remote side.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK.
items = [] | ||
comm_meth = GLOBAL_CFG.get_host_item( | ||
'task communication method', host, owner) | ||
LOG.debug('comm_meth=%s' % comm_meth) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this needed any more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK for debug purpose. I am sure we'll revisit this when we tackle #2214.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hunky dory!
Supersede #2400.
Introduce new commands to initialise/tidy suite directory hierarchy for remote task runs. These are launched on a remote
owner@host
using thecylc.remote
remote run mechanism. (Should tick a box in #2302.) We'll now have a remote initialisation command for each(host, owner)
combo, which runs using via the multi-processing pool. Submission of tasks of a(host, owner)
combo are deferred until the command completes, with no hold-up to the main loop.Commands to select the remote host for tasks are also launched via the multi-processing pool. If multiple ready tasks have the same host select command string, the command will only be run once for the current batch of tasks. When the ready tasks have consumed the results, the logic will reset the results. This change will allow the main loop to be responsive while host select commands are running.
Update log and runtime database only before the jobs-submit command is launched.
Cleaner separation between job file write logic and job submission logic. (Still more to do, however.)
Address host initialisation/preparation part of #2292. (Host selection part of #2292 likely to be considered/superseded by #2199.)Close #2292.
TO BE FULLY SITE TESTED.