Update Progress bar to use task/summarize state-api endpoint. #31577
This removes our dependency on Prometheus for the progress bar, but there is now a limit of 10,000 tasks per job. It also updates the task summarize endpoint to accept a job ID as a filter.
Signed-off-by: Alan Guo <aguo@anyscale.com>
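Because the progress bar is now built on a capped list call, a job with more than 10,000 tasks would presumably be only partially reflected in it. Here is a minimal sketch of that effect in plain Python (not Ray code). `summarize_task_states` and the task-dict shape are hypothetical, and `MAX_LIMIT` only stands in for `RAY_MAX_LIMIT_FROM_API_SERVER`, assumed here to equal the 10,000 stated in the description.

```python
from collections import Counter
from typing import Dict, List, Tuple

# Stand-in for RAY_MAX_LIMIT_FROM_API_SERVER (assumed 10,000, per the PR text).
MAX_LIMIT = 10_000


def summarize_task_states(
    tasks: List[Dict[str, str]], job_id: str, limit: int = MAX_LIMIT
) -> Tuple[Dict[str, int], bool]:
    """Hypothetical helper: count task states for one job.

    Only the first `limit` matching tasks are counted, mirroring a list call
    capped at the API server's maximum. The returned bool reports whether the
    job had more tasks than were counted.
    """
    matching = [t for t in tasks if t["job_id"] == job_id]
    counted = matching[:limit]
    counts = Counter(t["state"] for t in counted)
    return dict(counts), len(matching) > limit


# A job with 12,000 tasks: only 10,000 of them show up in the counts.
tasks = [{"job_id": "01", "state": "FINISHED"} for _ in range(12_000)]
counts, truncated = summarize_task_states(tasks, "01")
print(counts, truncated)  # {'FINISHED': 10000} True
```

Returning a truncation flag alongside the counts is one way a UI could signal that the bar undercounts; whether the dashboard does anything like this is not stated here.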
```diff
-                timeout=option.timeout, limit=RAY_MAX_LIMIT_FROM_API_SERVER, filters=[]
+                timeout=option.timeout,
+                limit=RAY_MAX_LIMIT_FROM_API_SERVER,
+                filters=option.filters,
             )
         )
         summary = StateSummary(
```
I think we can just use the same output that's returned from list_tasks?
One comment regarding a test! LGTM in general, but should we delete the metrics-based progress bar now? Or maybe we can keep both for now and remove the old one later?
```diff
@@ -648,7 +648,9 @@ async def summarize_tasks(self, option: SummaryApiOptions) -> SummaryApiResponse
         # For summary, try getting as many entries as possible to minimize data loss.
         result = await self.list_tasks(
             option=ListApiOptions(
-                timeout=option.timeout, limit=RAY_MAX_LIMIT_FROM_API_SERVER, filters=[]
```
Can we have a unit test in test_state_api? Maybe just a very simple test to verify that the job ID filter works for `ray summary tasks`.
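A test along those lines could take roughly the following shape. This sketch is self-contained and does not start a Ray cluster: `list_tasks` and `summarize_tasks` are hypothetical in-memory stand-ins for the state manager methods in the diff, and filters are assumed to be `(key, predicate, value)` tuples, with only `"="` supported. A real test in test_state_api would instead submit tasks under two jobs and query the actual endpoint.

```python
from collections import Counter
from typing import Dict, List, Tuple

Filter = Tuple[str, str, str]  # assumed shape: (key, predicate, value)


def list_tasks(
    tasks: List[Dict[str, str]], filters: List[Filter], limit: int
) -> List[Dict[str, str]]:
    """Hypothetical in-memory stand-in for the state manager's list_tasks."""

    def matches(task: Dict[str, str]) -> bool:
        for key, predicate, value in filters:
            if predicate != "=":
                raise ValueError(f"unsupported predicate: {predicate}")
            if task.get(key) != value:
                return False
        return True

    # Filter first, then apply the limit.
    return [t for t in tasks if matches(t)][:limit]


def summarize_tasks(
    tasks: List[Dict[str, str]], filters: List[Filter], limit: int = 10_000
) -> Dict[str, int]:
    """Forward the caller's filters to list_tasks instead of a hard-coded []."""
    return dict(Counter(t["state"] for t in list_tasks(tasks, filters, limit)))


def test_summarize_tasks_job_id_filter() -> None:
    tasks = [
        {"job_id": "01000000", "state": "FINISHED"},
        {"job_id": "01000000", "state": "RUNNING"},
        {"job_id": "02000000", "state": "FINISHED"},
    ]
    # With a job ID filter, only that job's tasks are counted.
    assert summarize_tasks(tasks, [("job_id", "=", "01000000")]) == {
        "FINISHED": 1,
        "RUNNING": 1,
    }
    # Without a filter, tasks from every job are counted.
    assert summarize_tasks(tasks, []) == {"FINISHED": 2, "RUNNING": 1}


if __name__ == "__main__":
    test_summarize_tasks_job_id_filter()
```

Comparing the filtered and unfiltered summaries in the same test guards against the original bug, where the filter was silently dropped and both calls would return the same counts.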
```diff
   const driverExists = !jobId ? false : true;
   return {
-    progress,
+    progress: summed,
```
Can you tell me what this syntax is called? Just so I understand this part of the code.
JavaScript has a shorthand for defining objects (dictionaries), called shorthand property names. If the key name and the value variable have the same name, you can write the name once instead of writing both.
So this:
```javascript
const foo_value = 3;
const bar = 5;
const dict = {
  foo: foo_value,
  bar: bar,
};
```
is the same as:
```javascript
const foo_value = 3;
const bar = 5;
const dict = {
  foo: foo_value,
  bar,
};
```
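As for what `summed` might hold: assuming the summary endpoint groups tasks by function or class name, each with its own per-state counts, a job-wide progress bar needs those counts added together. The sketch below shows that aggregation in Python; the input shape and the name `sum_state_counts` are hypothetical and not taken from the dashboard code.

```python
from collections import Counter
from typing import Dict


def sum_state_counts(summary: Dict[str, Dict[str, int]]) -> Dict[str, int]:
    """Hypothetical helper: add up per-state task counts across all
    function/class names in a task summary to get job-wide totals."""
    total: Counter = Counter()
    for state_counts in summary.values():
        # Counter.update with a mapping adds counts rather than replacing them.
        total.update(state_counts)
    return dict(total)


# Assumed input shape: {func_or_class_name: {state: count}}.
summary = {
    "f": {"FINISHED": 3, "RUNNING": 1},
    "Actor.method": {"FINISHED": 2, "PENDING_NODE_ASSIGNMENT": 4},
}
print(sum_state_counts(summary))
# {'FINISHED': 5, 'RUNNING': 1, 'PENDING_NODE_ASSIGNMENT': 4}
```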
I'll remove the metrics progress bar in a follow-up PR.
Signed-off-by: Alan Guo <aguo@anyscale.com>
Looks like there's some lint failure.
LGTM. After merging it, can we test it against a large cluster that has 10K+ tasks? (Just to see how slow it is.)
test_state_api seems to fail on Windows.
Tested with a job with 30k tasks. It took under 4 seconds to load, which seems good enough.
This is no longer necessary after #31577
Signed-off-by: Alan Guo <aguo@anyscale.com>
The progress bar now updates much more frequently: previously it relied on Prometheus, which scrapes metrics only every 15 seconds.
data:image/s3,"s3://crabby-images/c0d57/c0d57fec0997672451b3a73abe30593a6947272b" alt="Screenshot 2023-01-10 at 3 27 17 PM"
Otherwise, the UI is unchanged: