Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

job-info: stream events even if job is inactive #6518

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

chu11
Copy link
Member

@chu11 chu11 commented Dec 16, 2024

Problem: If a job is inactive, all data in an eventlog will be returned
as a single response during an eventlog watch. This is because we know for
a fact that the data should never change after the job is inactive.

If this data, such as job standard output, is very large, this lookup
can be very slow. In some cases, the use of something like flux job attach
can have the apperance of a hang because the standard output response is
taking so long to lookup and return.

Solution:

When a job is inactive and the user wants to watch a job eventlog,
do not respond with all of the data in a single response. Instead stream
the response back just as if the job were active.

Utilize the FLUX_KVS_WATCH_APPEND_ONCE to ensure the stream ends once all
data in the KVS is streamed. Update all variables, functions, etc. from
"lookup" to "watch".

Fixes #6516


I probably need to add some new tests but wanted to throw this up here for now.

Side note on the "WATCH_APPEND_ONCE" flag, I could make this generic and apply it to all watch types (i.e. even non-appends) but didn't feel it was needed elsewhere so didn't do that.

@garlick
Copy link
Member

garlick commented Dec 16, 2024

Naive question: could the client just do a FLUX_KVS_WATCH_APPEND instead of the lookup and then just cancel it when the clean event is read?

@chu11
Copy link
Member Author

chu11 commented Dec 16, 2024

oh i forgot, doing

> flux run flux lptest 2000000 80
<snip all that output>
> time flux job attach `flux job last`

on master the flux job attach takes about 17 seconds (seen as slow as 20 seconds). On this branch it is now down to about 13 seconds on average on corona (seen as fast as 10 seconds). And the noticeable "pause" at the beginning is no longer there.

@chu11
Copy link
Member Author

chu11 commented Dec 16, 2024

Naive question: could the client just do a FLUX_KVS_WATCH_APPEND instead of the lookup and then just cancel it when the clean event is read?

Perhaps my description of what I'm trying to solve isn't clear. We already know for a fact that the 'clean' event has arleady happened. The job is already done / inactive.

So the issue is if you do a FLUX_KVS_WATCH_APPEND instead of a lookup on a job that is inactive/done, how do you know when to issue the cancel? You don't really know.

So that's why I wanted to put the "WATCH_APPEND_ONCE" flag in on the kvs-watch side to finish the stream when it knows it is done.

@garlick
Copy link
Member

garlick commented Dec 16, 2024

Oh duh, sorry, I was thinking about watching the primary eventlog which always ends in clean but this is about the output eventlog, which doesn't have a definitive end of file.

Since the user is just passing this to flux_kvs_lookup(), maybe the flag name and description should be divorced from the concept of "watch" (even though internally it uses the watch service), e.g.

FLUX_KVS_STREAM
    Return a potentially large value in multiple responses terminated by an ENODATA error response.

@chu11
Copy link
Member Author

chu11 commented Dec 16, 2024

Since the user is just passing this to flux_kvs_lookup(), maybe the flag name and description should be divorced from the concept of "watch" (even though internally it uses the watch service), e.g.

Ahhh, I like this idea. Lemme see how to go about this.

@chu11 chu11 force-pushed the issue6516_job_info_stream_appends branch from 215ba81 to f6d916b Compare December 17, 2024 18:03
@chu11
Copy link
Member Author

chu11 commented Dec 17, 2024

still need to add tests but went with the proposed FLUX_KVS_STREAM (still need tests & documentation)

@garlick
Copy link
Member

garlick commented Dec 18, 2024

Nice! Just a thought, but FLUX_KVS_STREAM is a pretty big feature on its own so maybe it should be proposed in a standalone PR with its own docs and tests? Then this one could make the job-info changes?

@chu11 chu11 force-pushed the issue6516_job_info_stream_appends branch from f6d916b to 9d02627 Compare December 18, 2024 18:15
@chu11 chu11 changed the title WIP: job-info: stream events even if job is inactive job-info: stream events even if job is inactive Dec 18, 2024
@chu11
Copy link
Member Author

chu11 commented Dec 18, 2024

just re-pushed, splitting off the FLUX_KVS_STREAM option into #6523. Removing WIP.

Edit: actually, I'll re-put WIP, just so #6523 is reviewed first.

@chu11 chu11 changed the title job-info: stream events even if job is inactive WIP: job-info: stream events even if job is inactive Dec 18, 2024
Problem: If a job is inactive, all data in an eventlog will be retrieved
from the KVS in a single lookup.  This is because we know the data should never
change after the job is inactive.

If this data is very large, this lookup can be slow.  In some cases, the
use of something like `flux job attach` can have the appearance of a hang
because the standard output response is taking so long to lookup and return.

Solution:

When a job is inactive and the user wants to watch a job eventlog,
do not retrieve all of the data from the KVS in a single lookup.  Instead,
use the FLUX_KVS_STREAM flag to retrieve the data in smaller chunks.  This data
will be internally read and parsed no differently than when the job is active.

Fixes flux-framework#6516
Problem: Internally a "job-info.eventlog-watch" is done where it
used to do a "job-info.lookup".  Some older variables, functions, etc.
now seem to be misnamed.

Rename "main_namespace_lookup" variables, functions, etc. to
"main_namespace_watch".
@chu11 chu11 force-pushed the issue6516_job_info_stream_appends branch from 9d02627 to 0118856 Compare December 20, 2024 18:39
@chu11 chu11 changed the title WIP: job-info: stream events even if job is inactive job-info: stream events even if job is inactive Dec 20, 2024
@chu11
Copy link
Member Author

chu11 commented Dec 20, 2024

rebased and re-pushed now that #6523 is in, just tweaked some commit messages.

Copy link

codecov bot commented Dec 20, 2024

Codecov Report

Attention: Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.62%. Comparing base (e10d676) to head (0118856).

Files with missing lines Patch % Lines
src/modules/job-info/guest_watch.c 93.33% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master    #6518      +/-   ##
==========================================
- Coverage   83.63%   83.62%   -0.02%     
==========================================
  Files         522      522              
  Lines       87805    87809       +4     
==========================================
- Hits        73434    73427       -7     
- Misses      14371    14382      +11     
Files with missing lines Coverage Δ
src/modules/job-info/watch.c 66.39% <100.00%> (+0.69%) ⬆️
src/modules/job-info/guest_watch.c 74.93% <93.33%> (+0.20%) ⬆️

... and 7 files with indirect coverage changes

* main namespace (main_namespace_lookup()). This is the "easy" case
* main namespace (main_namespace_watch()). This is the "easy" case
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wondering about this name change - once the guest info is copied to the primary namespace, it is static, but "watch" implies that it is changing.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, i did go back and forth on it. I went with watch b/c I go through job-info.eventlog-watch instead of calling kvs.lookup (similarly guest_namespace_watch also goes through job-info.eventlog-watch).

I did consider going with stream instead, but that sort of describes internal knowledge of what job-info.eventlog-watch is doing.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

job-info: stream appended data, do not return all data in 1 giant blob
2 participants