Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Progress bar to use task/summarize state-api endpoint. #31577

Merged
merged 9 commits into from
Jan 17, 2023

Conversation

alanwguo
Copy link
Contributor

This removes our dependency on prometheus for the progress bar but there is now a 10000 task limit per job. Updates the task summarize endpoint to accept job id as a filter

Frequency of progress bar updates has increased greatly because previously, prometheus scrapes every 15 seconds.
Otherwise, the UI is unchanged:
Screenshot 2023-01-10 at 3 27 17 PM

Signed-off-by: Alan Guo aguo@anyscale.com

Why are these changes needed?

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

This removes our dependency on prometheus for the progress bar but there is now a 10000 task limit per job.
Updates the task summarize endpoint to accept job id as a filter

Signed-off-by: Alan Guo <aguo@anyscale.com>
timeout=option.timeout, limit=RAY_MAX_LIMIT_FROM_API_SERVER, filters=[]
timeout=option.timeout,
limit=RAY_MAX_LIMIT_FROM_API_SERVER,
filters=option.filters,
)
)
summary = StateSummary(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rkooo567 @rickyyx , should I update the SummaryApiResponse num_filtered value?

It's not clear what this is supposed to be.. Should it show the number of results from the summary api (1 per task name after filtering) or the number of tasks after filtering?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just use the same output that's returned from list_tasks?

Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

one comment regarding a test! LGTM in general, but should we delete the metrics-based progress bar now? Maybe we can have both for now and remove it later?

@@ -648,7 +648,9 @@ async def summarize_tasks(self, option: SummaryApiOptions) -> SummaryApiResponse
# For summary, try getting as many entries as possible to minimze data loss.
result = await self.list_tasks(
option=ListApiOptions(
timeout=option.timeout, limit=RAY_MAX_LIMIT_FROM_API_SERVER, filters=[]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we have a unit test in test_state_api? maybe we can just have a very simple test to verify job id filter works for ray summary tasks.

timeout=option.timeout, limit=RAY_MAX_LIMIT_FROM_API_SERVER, filters=[]
timeout=option.timeout,
limit=RAY_MAX_LIMIT_FROM_API_SERVER,
filters=option.filters,
)
)
summary = StateSummary(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just use the same output that's returned from list_tasks?

const driverExists = !jobId ? false : true;
return {
progress,
progress: summed,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you tell me what this syntax is called? Just to understand this part of code.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

javascript has a short-hand for defining dictionaries. If the key name and value variable has the same name, you can write only instead of the other.

So

foo_value = 3
bar = 5

dict = {
  foo: foo_value,
  bar: bar
}

is the same as:

foo_value = 3
bar = 5

dict = {
  foo: foo_value,
  bar,
}

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 11, 2023
@alanwguo
Copy link
Contributor Author

one comment regarding a test! LGTM in general, but should we delete the metrics-based progress bar now? Maybe we can have both for now and remove it later?

I'll remove the metrics progress bar in a follow-up PR

Signed-off-by: Alan Guo <aguo@anyscale.com>
Copy link
Contributor

@rickyyx rickyyx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@alanwguo alanwguo removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 11, 2023
@rickyyx
Copy link
Contributor

rickyyx commented Jan 11, 2023

Looks like there's some lint failure.

Signed-off-by: Alan Guo <aguo@anyscale.com>
Copy link
Contributor

@rkooo567 rkooo567 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. After mering it, can we test it against large cluster that has 10K+ tasks? (just see how slow it is)

@rkooo567
Copy link
Contributor

test_state_api seems to fail on windows

@rkooo567 rkooo567 added the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 12, 2023
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
Signed-off-by: Alan Guo <aguo@anyscale.com>
@alanwguo alanwguo removed the @author-action-required The PR author is responsible for the next step. Remove tag to send back to the reviewer. label Jan 17, 2023
@rkooo567 rkooo567 merged commit ef78f2c into ray-project:master Jan 17, 2023
@alanwguo
Copy link
Contributor Author

LGTM. After mering it, can we test it against large cluster that has 10K+ tasks? (just see how slow it is)

Tested with a job with 30k tasks. Took under 4 seconds to load. Seems good enough

andreapiso pushed a commit to andreapiso/ray that referenced this pull request Jan 22, 2023
…oject#31577)

This removes our dependency on prometheus for the progress bar but there is now a 10000 task limit per job. Updates the task summarize endpoint to accept job id as a filter

Frequency of progress bar updates has increased greatly because previously, prometheus scrapes every 15 seconds.
Otherwise, the UI is unchanged:

Signed-off-by: Andrea Pisoni <andreapiso@gmail.com>
rkooo567 pushed a commit that referenced this pull request Feb 7, 2023
Signed-off-by: Alan Guo <aguo@anyscale.com>

This is no longer necessary after #31577
edoakes pushed a commit to edoakes/ray that referenced this pull request Mar 22, 2023
Signed-off-by: Alan Guo <aguo@anyscale.com>

This is no longer necessary after ray-project#31577

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants