job-info: stream events even if job is inactive #6518

chu11 · 2024-12-16T19:54:22Z

Problem: If a job is inactive, all data in an eventlog will be returned
as a single response during an eventlog watch. This is because we know for
a fact that the data should never change after the job is inactive.

If this data, such as job standard output, is very large, this lookup
can be very slow. In some cases, the use of something like flux job attach
can have the apperance of a hang because the standard output response is
taking so long to lookup and return.

Solution:

When a job is inactive and the user wants to watch a job eventlog,
do not respond with all of the data in a single response. Instead stream
the response back just as if the job were active.

Utilize the FLUX_KVS_WATCH_APPEND_ONCE to ensure the stream ends once all
data in the KVS is streamed. Update all variables, functions, etc. from
"lookup" to "watch".

Fixes #6516

I probably need to add some new tests but wanted to throw this up here for now.

Side note on the "WATCH_APPEND_ONCE" flag, I could make this generic and apply it to all watch types (i.e. even non-appends) but didn't feel it was needed elsewhere so didn't do that.

garlick · 2024-12-16T20:06:35Z

Naive question: could the client just do a FLUX_KVS_WATCH_APPEND instead of the lookup and then just cancel it when the clean event is read?

chu11 · 2024-12-16T20:09:21Z

oh i forgot, doing

> flux run flux lptest 2000000 80
<snip all that output>
> time flux job attach `flux job last`

on master the flux job attach takes about 17 seconds (seen as slow as 20 seconds). On this branch it is now down to about 13 seconds on average on corona (seen as fast as 10 seconds). And the noticeable "pause" at the beginning is no longer there.

chu11 · 2024-12-16T20:13:29Z

Naive question: could the client just do a FLUX_KVS_WATCH_APPEND instead of the lookup and then just cancel it when the clean event is read?

Perhaps my description of what I'm trying to solve isn't clear. We already know for a fact that the 'clean' event has arleady happened. The job is already done / inactive.

So the issue is if you do a FLUX_KVS_WATCH_APPEND instead of a lookup on a job that is inactive/done, how do you know when to issue the cancel? You don't really know.

So that's why I wanted to put the "WATCH_APPEND_ONCE" flag in on the kvs-watch side to finish the stream when it knows it is done.

garlick · 2024-12-16T20:47:26Z

Oh duh, sorry, I was thinking about watching the primary eventlog which always ends in clean but this is about the output eventlog, which doesn't have a definitive end of file.

Since the user is just passing this to flux_kvs_lookup(), maybe the flag name and description should be divorced from the concept of "watch" (even though internally it uses the watch service), e.g.

FLUX_KVS_STREAM
    Return a potentially large value in multiple responses terminated by an ENODATA error response.

chu11 · 2024-12-16T22:02:17Z

Since the user is just passing this to flux_kvs_lookup(), maybe the flag name and description should be divorced from the concept of "watch" (even though internally it uses the watch service), e.g.

Ahhh, I like this idea. Lemme see how to go about this.

chu11 · 2024-12-17T18:03:48Z

still need to add tests but went with the proposed FLUX_KVS_STREAM (still need tests & documentation)

garlick · 2024-12-18T17:42:56Z

Nice! Just a thought, but FLUX_KVS_STREAM is a pretty big feature on its own so maybe it should be proposed in a standalone PR with its own docs and tests? Then this one could make the job-info changes?

chu11 · 2024-12-18T18:16:17Z

just re-pushed, splitting off the FLUX_KVS_STREAM option into #6523. Removing WIP.

Edit: actually, I'll re-put WIP, just so #6523 is reviewed first.

Problem: If a job is inactive, all data in an eventlog will be retrieved from the KVS in a single lookup. This is because we know the data should never change after the job is inactive. If this data is very large, this lookup can be slow. In some cases, the use of something like `flux job attach` can have the appearance of a hang because the standard output response is taking so long to lookup and return. Solution: When a job is inactive and the user wants to watch a job eventlog, do not retrieve all of the data from the KVS in a single lookup. Instead, use the FLUX_KVS_STREAM flag to retrieve the data in smaller chunks. This data will be internally read and parsed no differently than when the job is active. Fixes flux-framework#6516

Problem: Internally a "job-info.eventlog-watch" is done where it used to do a "job-info.lookup". Some older variables, functions, etc. now seem to be misnamed. Rename "main_namespace_lookup" variables, functions, etc. to "main_namespace_watch".

chu11 · 2024-12-20T18:42:45Z

rebased and re-pushed now that #6523 is in, just tweaked some commit messages.

codecov · 2024-12-20T19:12:12Z

Codecov Report

Attention: Patch coverage is 94.73684% with 2 lines in your changes missing coverage. Please review.

Project coverage is 83.62%. Comparing base (e10d676) to head (0118856).

Files with missing lines	Patch %	Lines
src/modules/job-info/guest_watch.c	93.33%	2 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##           master    #6518      +/-   ##
==========================================
- Coverage   83.63%   83.62%   -0.02%     
==========================================
  Files         522      522              
  Lines       87805    87809       +4     
==========================================
- Hits        73434    73427       -7     
- Misses      14371    14382      +11

Files with missing lines	Coverage Δ
src/modules/job-info/watch.c	`66.39% <100.00%> (+0.69%)`	⬆️
src/modules/job-info/guest_watch.c	`74.93% <93.33%> (+0.20%)`	⬆️

... and 7 files with indirect coverage changes

garlick · 2024-12-20T19:38:02Z

src/modules/job-info/guest_watch.c

- *    main namespace (main_namespace_lookup()).  This is the "easy" case
+ *    main namespace (main_namespace_watch()).  This is the "easy" case


Wondering about this name change - once the guest info is copied to the primary namespace, it is static, but "watch" implies that it is changing.

yeah, i did go back and forth on it. I went with watch b/c I go through job-info.eventlog-watch instead of calling kvs.lookup (similarly guest_namespace_watch also goes through job-info.eventlog-watch).

I did consider going with stream instead, but that sort of describes internal knowledge of what job-info.eventlog-watch is doing.

chu11 force-pushed the issue6516_job_info_stream_appends branch from 215ba81 to f6d916b Compare December 17, 2024 18:03

chu11 force-pushed the issue6516_job_info_stream_appends branch from f6d916b to 9d02627 Compare December 18, 2024 18:15

chu11 changed the title ~~WIP: job-info: stream events even if job is inactive~~ job-info: stream events even if job is inactive Dec 18, 2024

chu11 changed the title ~~job-info: stream events even if job is inactive~~ WIP: job-info: stream events even if job is inactive Dec 18, 2024

chu11 added 2 commits December 20, 2024 10:36

job-info: rename main_namespace_lookup

0118856

Problem: Internally a "job-info.eventlog-watch" is done where it used to do a "job-info.lookup". Some older variables, functions, etc. now seem to be misnamed. Rename "main_namespace_lookup" variables, functions, etc. to "main_namespace_watch".

chu11 force-pushed the issue6516_job_info_stream_appends branch from 9d02627 to 0118856 Compare December 20, 2024 18:39

chu11 changed the title ~~WIP: job-info: stream events even if job is inactive~~ job-info: stream events even if job is inactive Dec 20, 2024

garlick reviewed Dec 20, 2024

View reviewed changes

chu11 mentioned this pull request Dec 21, 2024

WIP kvs-watch: do not load all KVS blobs at once #6532

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

job-info: stream events even if job is inactive #6518

job-info: stream events even if job is inactive #6518

chu11 commented Dec 16, 2024

garlick commented Dec 16, 2024

chu11 commented Dec 16, 2024 •

edited

Loading

chu11 commented Dec 16, 2024 •

edited

Loading

garlick commented Dec 16, 2024

chu11 commented Dec 16, 2024

chu11 commented Dec 17, 2024

garlick commented Dec 18, 2024

chu11 commented Dec 18, 2024 •

edited

Loading

chu11 commented Dec 20, 2024

codecov bot commented Dec 20, 2024

garlick Dec 20, 2024

chu11 Dec 20, 2024

		* main namespace (main_namespace_lookup()). This is the "easy" case
		* main namespace (main_namespace_watch()). This is the "easy" case

job-info: stream events even if job is inactive #6518

Are you sure you want to change the base?

job-info: stream events even if job is inactive #6518

Conversation

chu11 commented Dec 16, 2024

garlick commented Dec 16, 2024

chu11 commented Dec 16, 2024 • edited Loading

chu11 commented Dec 16, 2024 • edited Loading

garlick commented Dec 16, 2024

chu11 commented Dec 16, 2024

chu11 commented Dec 17, 2024

garlick commented Dec 18, 2024

chu11 commented Dec 18, 2024 • edited Loading

chu11 commented Dec 20, 2024

codecov bot commented Dec 20, 2024

Codecov Report

garlick Dec 20, 2024

Choose a reason for hiding this comment

chu11 Dec 20, 2024

Choose a reason for hiding this comment

chu11 commented Dec 16, 2024 •

edited

Loading

chu11 commented Dec 16, 2024 •

edited

Loading

chu11 commented Dec 18, 2024 •

edited

Loading