
[data] Don't reset iteration counter stats #48618

Merged
merged 4 commits into ray-project:master on Nov 14, 2024

Conversation

@rickyyx (Contributor) commented Nov 7, 2024:

Why are these changes needed?

We currently report iter_total_blocked_seconds and iter_user_seconds as Gauge metrics while we track them as counters, i.e.:

  • For each iteration, a timer accumulates the per-iteration time into an aggregated value (the running sum of total blocked seconds).
  • When the iteration ends or the iterator is GCed, the gauge value is currently set back to 0.
  • This confuses users, because a counter value (total time blocked on a dataset) should never drop back to 0; it produces charts like the one below (a minimal sketch of the pattern follows the chart).
[chart: the cumulative blocked-time value dropping back to 0 after each iteration]
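The following is a minimal, self-contained sketch of this gauge-as-counter pattern using the public ray.util.metrics API; the metric name, the "dataset" tag, and the sleep-based wait are illustrative assumptions, not the actual Ray Data internals.

```python
# Sketch only: a gauge that is set to a running sum, so it behaves like a counter.
import time

import ray
from ray.util.metrics import Gauge

ray.init()

iter_total_blocked_s = Gauge(
    "iter_total_blocked_seconds",
    description="Cumulative seconds spent blocked waiting for the next block.",
    tag_keys=("dataset",),
)

total_blocked = 0.0
for _ in range(5):
    start = time.perf_counter()
    time.sleep(0.1)  # stands in for blocking on the next block of data
    total_blocked += time.perf_counter() - start
    # The gauge is set to the running sum, so it effectively tracks a counter.
    iter_total_blocked_s.set(total_blocked, tags={"dataset": "train"})

# Before this PR, the value was reset when iteration finished or the iterator
# was GCed (conceptually iter_total_blocked_s.set(0, ...)), which made the
# "counter" appear to drop back to zero on dashboards. The fix removes that reset.
```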

With the fix, we will not set the gauge value to 0.

[chart: the same metric after the fix, with the cumulative value no longer dropping back to 0]

Related issue number

Checks
  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 410 to 412
# NOTE(rickyx): We should not be clearing the iter_total_blocked_s and
# iter_user_s metrics because they are technically counters we tracked, and
# should not be reset by each iteration.
Member:

@rickyyx could you help me understand why this value keeps going up across iterations? Is it because we reset the value here, but not in some other place?

Contributor:

During my investigation of #44635, where the Rows Outputted value shows up as zero, I found that disabling clear_execution_metrics() here is the fix.

I was thinking it would make sense to completely remove the clear_execution_metrics() and clear_iteration_metrics() calls that currently run after Dataset execution/iteration completes -- @rickyyx do you agree? For more context, I believe the reason we had this in the first place was to do a hacky "reset" of the metrics, to prevent values from persisting at their last value. But I don't think that's the behavior we want anymore, since we now also show rates on the Grafana dashboard by default -- so we can simply remove the metrics reset.

Contributor:

+1 to just removing it.

Why not also convert these to counters?

@rickyyx (Contributor Author) Nov 12, 2024:

Sure, I think removing it makes sense.

Why not also convert these to counters?

Yeah, I think that was an alternative, but if I remember correctly we are hesitant to change the actual timeseries definition, since there might be customers depending on it (or what's the policy here for backward compatibility on metrics?).

I am open to just changing the metric type to counters too.
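For reference, a hedged sketch of the counter alternative discussed here (not what this PR implements), using ray.util.metrics.Counter; the metric name and the "dataset" tag are assumptions:

```python
# Sketch of the counter alternative: increment a Counter by the per-iteration
# delta instead of setting a Gauge to a running sum. Note this would change the
# metric type that existing dashboards and alerts consume.
import ray
from ray.util.metrics import Counter

ray.init()

iter_total_blocked_s = Counter(
    "iter_total_blocked_seconds",
    description="Seconds spent blocked waiting for data, accumulated over time.",
    tag_keys=("dataset",),
)

def record_blocked_time(delta_s: float) -> None:
    # Only the increment is reported; the cumulative sum lives in the metrics backend.
    iter_total_blocked_s.inc(delta_s, tags={"dataset": "train"})

record_blocked_time(0.25)
```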

Contributor Author:

could you help me understand why this value keeps going up across iterations

I think it's because we reuse the stats field. Let me see if that could be fixed.

@rickyyx (Contributor Author) commented Nov 13, 2024:

Updates

  • Don't reset the iteration stats (since they are essentially cumulative counters).

@rickyyx (Contributor Author) commented Nov 13, 2024:

On the front end, I also wonder if we should update the charts to use rate, so that they show something like the average iteration block time over the last 5 minutes rather than the raw counter value, which can be less intuitive.

People are reading the charts as "metric per iteration", so it seems there's a gap in the readability of the charts.
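As a rough illustration of that suggestion (not Grafana/PromQL; the sampling format is an assumption), a windowed rate can be derived from samples of the cumulative counter like this:

```python
# Sketch: compute a windowed rate from (timestamp, cumulative_value) samples of
# a counter such as iter_total_blocked_seconds. This is roughly what a
# rate()-style Grafana panel over a 5-minute window would show.
from typing import List, Tuple

def rate_over_window(samples: List[Tuple[float, float]], window_s: float) -> float:
    """samples: (unix_timestamp, cumulative_value) pairs, oldest first."""
    cutoff = samples[-1][0] - window_s
    in_window = [(t, v) for t, v in samples if t >= cutoff]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0) if t1 > t0 else 0.0

# Cumulative blocked-seconds sampled every 60s over a 5-minute window:
samples = [(0, 0.0), (60, 1.2), (120, 2.0), (180, 4.5), (240, 4.9), (300, 6.0)]
print(rate_over_window(samples, window_s=300))  # 0.02 s blocked per wall-clock second
```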

@@ -624,14 +618,6 @@ def clear_iteration_metrics(self, dataset_tag: str):
    if dataset_tag in self._last_iteration_stats:
        del self._last_iteration_stats[dataset_tag]
Contributor Author:

We still want to clear the iteration stats on the StatsManager so that the async update thread can exit.

But we no longer clear the iteration metrics on the StatsActor.
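A rough sketch of the distinction being described, with assumed names and structure (not the actual StatsManager/StatsActor implementation): the background update thread keeps running while any dataset still has local iteration stats registered, so the local entry must still be removed even though the remote gauges are no longer reset to 0.

```python
# Sketch only: why the local bookkeeping is still cleared (so the update thread
# can exit) while the reported metric values are no longer zeroed out.
import threading
import time

class StatsManagerSketch:
    def __init__(self):
        self._last_iteration_stats = {}
        self._lock = threading.Lock()
        self._thread = None

    def _update_loop(self):
        while True:
            with self._lock:
                if not self._last_iteration_stats:
                    return  # nothing left to report, so the thread exits
                # ... push the current stats to the (remote) stats actor here ...
            time.sleep(1)

    def register_iteration_stats(self, dataset_tag, stats):
        with self._lock:
            self._last_iteration_stats[dataset_tag] = stats
        if self._thread is None or not self._thread.is_alive():
            self._thread = threading.Thread(target=self._update_loop, daemon=True)
            self._thread.start()

    def clear_iteration_metrics(self, dataset_tag):
        # Drops only the local entry so the update thread can stop; it no
        # longer asks the stats actor to reset the gauges to 0.
        with self._lock:
            self._last_iteration_stats.pop(dataset_tag, None)

manager = StatsManagerSketch()
manager.register_iteration_stats("train", {"iter_total_blocked_s": 1.2})
manager.clear_iteration_metrics("train")  # update thread exits on its next pass
```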

Contributor:

I believe there is a usage of StatsManager.clear_iteration_metrics in test_stats.py which should be removed: https://github.com/scottjlee/ray/blob/d7f7e0f58248b9de145949883989ab597f97a2da/python/ray/data/tests/test_stats.py#L1695

Contributor:

Please capture this context as a comment

Contributor Author:

Oh, I think we still need to call that? Without calling StatsManager.clear_iteration_metrics, the update thread will not exit, since we will always have the _last_iteration_metrics here: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L540-L542

Contributor:

Ah sorry, I confused myself with the method of the same name in StatsActor. I think you are right, Ricky.


@scottjlee (Contributor) left a comment:

LGTM. I will remove clear_execution_metrics() usage in another PR.
#48745

@rickyyx (Contributor Author) commented Nov 14, 2024:

LGTM. I will remove clear_execution_metrics() usage in another PR. #48745

Ah, thanks, I missed that part of your previous comments.

@rickyyx rickyyx enabled auto-merge (squash) November 14, 2024 21:14
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Nov 14, 2024
@rickyyx rickyyx removed the "go" label (add ONLY when ready to merge, run all tests) Nov 14, 2024
@rickyyx rickyyx merged commit a1d4cb1 into ray-project:master Nov 14, 2024
6 of 7 checks passed
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024