
job-info: stream appended data, do not return all data in 1 giant blob #6516

Open
chu11 opened this issue Dec 16, 2024 · 0 comments · May be fixed by #6518

chu11 commented Dec 16, 2024

(Similar to the work in #6444)

When a job is completed, there are presumably no more updates to its eventlogs, so any watch request simply does a kvs-lookup on the data. This can be very slow if there is a large amount of data. For example, think of flux job attach JOBID on a job that is completed and has a bajillion lines of output. The kvs-lookup will return all of the standard output in 1 big reply.

Instead, we should stream this data no differently than if the job were live, to improve turnaround time. Right now flux job attach JOBID has the appearance of a hang in this scenario.

Note that this started with the work in #6456, but I decided it should be split off into a separate issue.
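For reference, the behavior we want to mimic for inactive jobs is the streaming watch already used for live jobs. Below is a minimal, illustrative sketch (not flux-core source) of a caller watching an output eventlog with the existing FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND flags; the KVS key path and error handling are simplified for illustration.

```c
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <flux/core.h>

/* Called each time the broker sends a response for the watch. */
static void watch_cb (flux_future_t *f, void *arg)
{
    const char *chunk;

    if (flux_kvs_lookup_get (f, &chunk) < 0) {
        if (errno == ENODATA) {              /* stream has ended */
            flux_future_destroy (f);
            return;
        }
        fprintf (stderr, "lookup: %s\n", strerror (errno));
        exit (1);
    }
    /* With FLUX_KVS_WATCH_APPEND, responses carry appended data rather
     * than re-sending the whole (potentially huge) value every time. */
    fputs (chunk, stdout);
    flux_future_reset (f);                   /* arm for the next response */
}

int main (void)
{
    flux_t *h;
    flux_future_t *f;
    /* hypothetical eventlog key, for illustration only */
    const char *key = "job.0000.0123.guest.output";
    int flags = FLUX_KVS_WATCH | FLUX_KVS_WATCH_APPEND;

    if (!(h = flux_open (NULL, 0)))
        exit (1);
    if (!(f = flux_kvs_lookup (h, NULL, flags, key)))
        exit (1);
    if (flux_future_then (f, -1., watch_cb, NULL) < 0)
        exit (1);
    if (flux_reactor_run (flux_get_reactor (h), 0) < 0)
        exit (1);
    flux_close (h);
    return 0;
}
```

A plain watch like this only ends (with ENODATA) after the caller cancels it via flux_kvs_lookup_cancel(); the point of this issue is to get the same chunked delivery for an inactive job's data, with the stream ending automatically once all existing data has been sent.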

@chu11 chu11 self-assigned this Dec 16, 2024
chu11 added a commit to chu11/flux-core that referenced this issue Dec 16, 2024
Problem: If a job is inactive, all data in an eventlog will be returned
as a single response during an eventlog watch.  This is because we know for
a fact that the data should never change after the job is inactive.

If this data, such as job standard output, is very large, this lookup
can be very slow.  In some cases, the use of something like `flux job attach`
can have the appearance of a hang because the standard output response is
taking so long to look up and return.

Solution:

When a job is inactive and the user wants to watch a job eventlog,
do not respond with all of the data in a single response.  Instead stream
the response back just as if the job were active.

Utilize the FLUX_KVS_WATCH_APPEND_ONCE flag to ensure the stream ends once
all data in the KVS has been streamed.  Update all variables, functions, etc.
from "lookup" to "watch".

Fixes flux-framework#6516
@chu11 chu11 linked a pull request Dec 16, 2024 that will close this issue
chu11 added a commit to chu11/flux-core that referenced this issue Dec 17, 2024
Problem: If a job is inactive, all data in an eventlog will be retrieved
as a single response during an eventlog watch.  This is because we know for
a fact that the data should never change after the job is inactive.

If this data, such as job standard output, is very large, this lookup
can be very slow.  In some cases, the use of something like `flux job attach`
can have the appearance of a hang because the standard output response is
taking so long to look up and return.

Solution:

When a job is inactive and the user wants to watch a job eventlog,
do not retrieve all of the data.  Instead, retrieve the data via an internal
eventlog watch, but have the eventlog watch use the new FLUX_KVS_STREAM
flag.

Update all variables, functions, etc. from "lookup" to "watch".

Fixes flux-framework#6516
chu11 added a commit to chu11/flux-core that referenced this issue Dec 18, 2024
Problem: If a job is inactive, all data in an eventlog will be retrieved
as a single response during an eventlog watch.  This is because we know for
a fact that the data should never change after the job is inactive.

If this data, such as job standard output, is very large, this lookup
can be very slow.  In some cases, the use of something like `flux job attach`
can have the appearance of a hang because the standard output response is
taking so long to look up and return.

Solution:

When a job is inactive and the user wants to watch a job eventlog,
do not retrieve all of the data.  Instead, retrieve the data via an internal
eventlog watch, but have the eventlog watch use the new FLUX_KVS_STREAM
flag.

Fixes flux-framework#6516
chu11 added a commit to chu11/flux-core that referenced this issue Dec 20, 2024
Problem: If a job is inactive, all data in an eventlog will be retrieved
from the KVS in a single lookup.  This is because we know the data should never
change after the job is inactive.

If this data is very large, this lookup can be slow.  In some cases, the
use of something like `flux job attach` can have the appearance of a hang
because the standard output response is taking so long to look up and return.

Solution:

When a job is inactive and the user wants to watch a job eventlog,
do not retrieve all of the data from the KVS in a single lookup.  Instead,
use the FLUX_KVS_STREAM flag to retrieve the data in smaller chunks.  This data
will be internally read and parsed no differently than when the job is active.

Fixes flux-framework#6516
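
Assuming the new FLUX_KVS_STREAM flag behaves as described in the commit message above (the current value is delivered in chunks and the stream terminates with ENODATA once everything has been sent), a caller-side sketch might look like the following; the helper name, key handling, and error handling are illustrative, not part of the actual change.

```c
#include <errno.h>
#include <stdio.h>
#include <flux/core.h>

/* Stream an inactive job's eventlog in chunks instead of one giant
 * lookup.  Assumes FLUX_KVS_STREAM delivers the current value in
 * chunks and terminates the stream with ENODATA, per the commit
 * message above.  Returns 0 on success, -1 on error. */
static int dump_eventlog (flux_t *h, const char *key)
{
    flux_future_t *f;
    const char *chunk;
    int rc = -1;

    if (!(f = flux_kvs_lookup (h, NULL, FLUX_KVS_STREAM, key)))
        return -1;
    while (flux_kvs_lookup_get (f, &chunk) == 0) {
        fputs (chunk, stdout);    /* process one chunk of the eventlog */
        flux_future_reset (f);    /* wait for the next chunk */
    }
    if (errno == ENODATA)         /* normal end of stream */
        rc = 0;
    flux_future_destroy (f);
    return rc;
}
```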