
[splash] need workaround to efficiently read job stdio output from large jobs #1420

Closed
garlick opened this issue Apr 3, 2018 · 5 comments

@garlick
Member

garlick commented Apr 3, 2018

From conversation with @trws:

Scaling issues when running large jobs with stdio redirected to a file (or interpreted by the front end) have been discussed in #1406.

A workaround for this poor scaling is to avoid stdio redirection during the job (e.g. do not use the -O file option to flux-submit), and then copy the job output from the KVS to a file afterwards. Currently this would be done with something like:

```bash
#!/bin/bash -e

JOBID=$(flux submit "$@" | sed -e 's/Submitted jobid //')
# wait for job to complete
flux wreck attach "${JOBID}" >"output.${JOBID}"
```

This has the advantage of not competing with the big job for KVS and messaging resources, but it still does not scale well for a large number of tasks because flux wreck attach sets up the same set of parallel kz watchers that are used for in-job redirection.

One thought is to modify flux wreck attach so that it walks all existing data first, and only sets up the iowatchers if the job has not reached a terminal state.
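The proposed change to flux wreck attach can be sketched as a small shell simulation. Everything here is hypothetical: each kz stream is modeled as a directory of numbered block files (000000, 000001, ...), the job state is passed in as a plain string, and the terminal state names complete, failed and cancelled are assumptions rather than the real wreck states. The sketch reads every existing block first and only reports that watchers would be needed when the job has not reached a terminal state.

```shell
#!/bin/bash
# Hypothetical model, not the real KVS layout: each stream is a directory
# of numbered block files (000000, 000001, ...), and the job state is a
# plain string. The terminal state names below are assumptions.

# drain_existing DIR: print every block already present, in order,
# stopping at the first missing index.
drain_existing() {
    local dir=$1 i=0
    while [ -f "$dir/$(printf '%06d' "$i")" ]; do
        cat -- "$dir/$(printf '%06d' "$i")"
        i=$((i + 1))
    done
}

# attach_job STATE DIR...: walk all existing data in every stream first,
# then decide whether any watchers are needed at all.
attach_job() {
    local state=$1 dir
    shift
    for dir in "$@"; do
        drain_existing "$dir"
    done
    case $state in
        complete | failed | cancelled)
            return 0 # terminal state: no watchers needed
            ;;
        *)
            echo "job still $state: would install watchers on $# stream(s)" >&2
            return 1
            ;;
    esac
}
```

For a completed job this never gets past the initial walk, which is the case the post-job copy in the workaround script cares about.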

garlick self-assigned this Apr 3, 2018
@garlick
Member Author

garlick commented Apr 3, 2018

I'm working on a PR to libkz that defers installation of the KVS watch until all the data already in the KVS has been read (with continuations). If EOF is encountered, then the watch is avoided entirely.

This should allow flux wreck attach to extract stdio from a completed job without internally creating any kvs watchers.

@garlick
Member Author

garlick commented Apr 9, 2018

This is mostly done with the merge of #1424. However, there was also a request by @trws for a flag to ignore EOF and terminate a read when the last block is read, so holding it open pending that.

@grondo
Contributor

grondo commented Apr 11, 2018

> This is mostly done with the merge of #1424. However, there was also a request by @trws for a flag to ignore EOF and terminate a read when the last block is read, so holding it open pending that.

I ran into something similar with the job error log streams, which aren't typically terminated with EOF.
It would be nice if there were a way to do this from within a reactor loop, but there is probably no clean way to do that. I therefore resorted to a plain kz_open and looping on kz_read until it returns EAGAIN. Unfortunately, this is completely anathema to the way flux-wreck attach works (and the ioplex stuff), so there might need to be a completely new command (or some sort of no-async mode for ioplex objects).

@garlick
Member Author

garlick commented Apr 12, 2018

I was thinking of something like a KZ_FLAGS_NOFOLLOW or similar flag for kz_open that would cause kz_read to return EOF once the last "block" is consumed.
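A minimal sketch of the proposed NOFOLLOW semantics, written as a shell simulation rather than against the libkz API: stream_read and read_all are hypothetical names, a stream is a directory of numbered block files, and an eof marker file stands in for the writer's EOF. Return status 1 plays the role of EOF and status 2 the role of EAGAIN, i.e. the point where a following reader would install a KVS watch.

```shell
#!/bin/bash
# Hypothetical model, not the libkz API: a stream is a directory of
# numbered block files (000000, 000001, ...) plus an "eof" marker file that
# the writer creates when it closes the stream.

# stream_read DIR INDEX NOFOLLOW
#   status 0: block INDEX exists and was printed
#   status 1: end of stream (writer sent EOF, or NOFOLLOW=1 and no more blocks)
#   status 2: no more blocks yet (the EAGAIN case); a following reader
#             would install a watch here and wait
stream_read() {
    local dir=$1 blk
    blk="$dir/$(printf '%06d' "$2")"
    if [ -f "$blk" ]; then
        cat -- "$blk"
        return 0
    fi
    if [ -e "$dir/eof" ] || [ "$3" = 1 ]; then
        return 1
    fi
    return 2
}

# read_all DIR NOFOLLOW: read blocks in order; return 1 at end of stream,
# or 2 if the reader would have to wait for more data.
read_all() {
    local i=0
    while :; do
        stream_read "$1" "$i" "$2" || return
        i=$((i + 1))
    done
}
```

With NOFOLLOW unset, a stream that lacks EOF (such as the error log streams mentioned above) ends in status 2; with it set, the same stream simply ends after its last existing block.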

@grondo
Contributor

grondo commented Apr 12, 2018

OK, that sounds reasonable. I initially didn't understand how the latest flux_kvs_lookup_get had any context about whether it had fetched the last block in the fully asynchronous mode (before it went back to sleep in the reactor). However, looking again, I see it is obvious in lookup_next, so I can implement that instead of working around it in the client code.

grondo added a commit to grondo/flux-core that referenced this issue Apr 12, 2018
Add -n, --no-follow option to flux-wreck attach to set the "nofollow"
option on all kz io streams. This will result in attach exiting after
all existing output blocks have been read, so that attach will not
block, even if some or all kz streams do not yet have EOF.

Fixes flux-framework#1420
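With the -n, --no-follow option described in the commit message above, the copy step of the original workaround script might look like the following. This is a sketch: save_job_output is a hypothetical helper, and it assumes the job has already completed, since --no-follow stops after the blocks that exist at the moment attach runs.

```shell
#!/bin/bash
# save_job_output JOBID [OUTFILE]   (hypothetical helper)
# Copy the stdio of an already-completed job from the KVS into OUTFILE
# (default: output.JOBID). --no-follow makes attach exit once all existing
# output blocks are read, even if some streams never saw EOF.
save_job_output() {
    local jobid=$1 outfile=${2:-output.$1}
    flux wreck attach --no-follow "$jobid" >"$outfile"
}
```

Because attach no longer waits on streams without EOF, the helper cannot hang on a finished job, but it also will not pick up output written after it starts.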
2 participants