
[data] Don't reset iteration counter stats #48618

Merged
merged 4 commits into ray-project:master on Nov 14, 2024

Conversation

@rickyyx (Contributor) commented Nov 7, 2024:

Why are these changes needed?

We currently report iter_total_blocked_seconds and iter_user_seconds as Gauge metrics while we track them as counters, i.e.:

  • For each iteration, a timer accumulates the per-iteration time into an aggregated value (the running sum of total blocked seconds).
  • When the iteration ends or the iterator is GCed, the gauge value is currently set back to 0.
  • This confuses users, because a counter value (total time blocked on a dataset) should never drop back to 0; it produces charts like the one below (a minimal sketch of the pattern follows the chart).
[chart: the cumulative blocked-time value dropping back to 0 after each iteration]
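The following is a minimal, self-contained sketch of this gauge-as-counter pattern using the public ray.util.metrics API; the metric name, the "dataset" tag, and the sleep-based wait are illustrative assumptions, not the actual Ray Data internals.

```python
# Sketch only: a gauge that is set to a running sum, so it behaves like a counter.
import time

import ray
from ray.util.metrics import Gauge

ray.init()

iter_total_blocked_s = Gauge(
    "iter_total_blocked_seconds",
    description="Cumulative seconds spent blocked waiting for the next block.",
    tag_keys=("dataset",),
)

total_blocked = 0.0
for _ in range(5):
    start = time.perf_counter()
    time.sleep(0.1)  # stands in for blocking on the next block of data
    total_blocked += time.perf_counter() - start
    # The gauge is set to the running sum, so it effectively tracks a counter.
    iter_total_blocked_s.set(total_blocked, tags={"dataset": "train"})

# Before this PR, the value was reset when iteration finished or the iterator
# was GCed (conceptually iter_total_blocked_s.set(0, ...)), which made the
# "counter" appear to drop back to zero on dashboards. The fix removes that reset.
```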

With the fix, we will not set the gauge value to 0.

[chart: the same metric after the fix, with the cumulative value no longer dropping back to 0]

Related issue number

Checks
  • I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

Comment on lines 410 to 412
# NOTE(rickyx): We should not be clearing the iter_total_blocked_s and
# iter_user_s metrics because they are technically counters we tracked, and
# should not be reset by each iteration.
Member:

@rickyyx could you help me understand why this value keeps going up across iterations? Is it because we reset the value here, but not in some other place?

Contributor:

During my investigation of #44635, where the Rows Outputted value shows up as zero, I found that disabling clear_execution_metrics() here is the fix.

I was thinking it would make sense to completely remove the clear_execution_metrics() and clear_iteration_metrics() calls that currently run after Dataset execution/iteration completes -- @rickyyx do you agree? For more context, I believe the reason we had this in the first place was to do a hacky "reset" of the metrics, to prevent values from persisting at their last value. But I don't think that's the behavior we want anymore, since we now also show rates on the Grafana dashboard by default -- so we can simply remove the metrics reset.

Contributor:

+1 to just removing it.

Why not also convert these to counters?

@rickyyx (Contributor Author) Nov 12, 2024:

Sure, I think removing it makes sense.

Why not also convert these to counters?

Yeah, I think that was an alternative, but if I remember correctly we are hesitant to change the actual timeseries definition, since there might be customers depending on it (or what's the policy here for backward compatibility on metrics?).

I am open to just changing the metric type to counters too.
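For reference, a hedged sketch of the counter alternative discussed here (not what this PR implements), using ray.util.metrics.Counter; the metric name and the "dataset" tag are assumptions:

```python
# Sketch of the counter alternative: increment a Counter by the per-iteration
# delta instead of setting a Gauge to a running sum. Note this would change the
# metric type that existing dashboards and alerts consume.
import ray
from ray.util.metrics import Counter

ray.init()

iter_total_blocked_s = Counter(
    "iter_total_blocked_seconds",
    description="Seconds spent blocked waiting for data, accumulated over time.",
    tag_keys=("dataset",),
)

def record_blocked_time(delta_s: float) -> None:
    # Only the increment is reported; the cumulative sum lives in the metrics backend.
    iter_total_blocked_s.inc(delta_s, tags={"dataset": "train"})

record_blocked_time(0.25)
```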

Contributor Author:

could you help me understand why this value keeps going up across iterations

I think it's because we reuse the stats field. Let me see if that could be fixed.

@rickyyx (Contributor Author) commented Nov 13, 2024:

Updates

  • Don't reset the iteration stats (since they are essentially cumulative counters).

@rickyyx (Contributor Author) commented Nov 13, 2024:

On the front end, I also wonder if we should update the charts to use rate, so that they show something like the average iteration block time over the last 5 minutes rather than the raw counter value, which can be less intuitive.

People are reading the charts as "metric per iteration", so it seems there's a gap in the readability of the charts.
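As a rough illustration of that suggestion (not Grafana/PromQL; the sampling format is an assumption), a windowed rate can be derived from samples of the cumulative counter like this:

```python
# Sketch: compute a windowed rate from (timestamp, cumulative_value) samples of
# a counter such as iter_total_blocked_seconds. This is roughly what a
# rate()-style Grafana panel over a 5-minute window would show.
from typing import List, Tuple

def rate_over_window(samples: List[Tuple[float, float]], window_s: float) -> float:
    """samples: (unix_timestamp, cumulative_value) pairs, oldest first."""
    cutoff = samples[-1][0] - window_s
    in_window = [(t, v) for t, v in samples if t >= cutoff]
    if len(in_window) < 2:
        return 0.0
    (t0, v0), (t1, v1) = in_window[0], in_window[-1]
    return (v1 - v0) / (t1 - t0) if t1 > t0 else 0.0

# Cumulative blocked-seconds sampled every 60s over a 5-minute window:
samples = [(0, 0.0), (60, 1.2), (120, 2.0), (180, 4.5), (240, 4.9), (300, 6.0)]
print(rate_over_window(samples, window_s=300))  # 0.02 s blocked per wall-clock second
```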

@@ -624,14 +618,6 @@ def clear_iteration_metrics(self, dataset_tag: str):
    if dataset_tag in self._last_iteration_stats:
        del self._last_iteration_stats[dataset_tag]
Contributor Author:

We still want to clear the iteration stats on the StatsManager so that the async update thread can exit.

But we no longer clear the iteration metrics on the StatsActor.
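A rough sketch of the distinction being described, with assumed names and structure (not the actual StatsManager/StatsActor implementation): the background update thread keeps running while any dataset still has local iteration stats registered, so the local entry must still be removed even though the remote gauges are no longer reset to 0.

```python
# Sketch only: why the local bookkeeping is still cleared (so the update thread
# can exit) while the reported metric values are no longer zeroed out.
import threading
import time

class StatsManagerSketch:
    def __init__(self):
        self._last_iteration_stats = {}
        self._lock = threading.Lock()
        self._thread = None

    def _update_loop(self):
        while True:
            with self._lock:
                if not self._last_iteration_stats:
                    return  # nothing left to report, so the thread exits
                # ... push the current stats to the (remote) stats actor here ...
            time.sleep(1)

    def register_iteration_stats(self, dataset_tag, stats):
        with self._lock:
            self._last_iteration_stats[dataset_tag] = stats
        if self._thread is None or not self._thread.is_alive():
            self._thread = threading.Thread(target=self._update_loop, daemon=True)
            self._thread.start()

    def clear_iteration_metrics(self, dataset_tag):
        # Drops only the local entry so the update thread can stop; it no
        # longer asks the stats actor to reset the gauges to 0.
        with self._lock:
            self._last_iteration_stats.pop(dataset_tag, None)

manager = StatsManagerSketch()
manager.register_iteration_stats("train", {"iter_total_blocked_s": 1.2})
manager.clear_iteration_metrics("train")  # update thread exits on its next pass
```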

Contributor:

I believe there is a usage of StatsManager.clear_iteration_metrics in test_stats.py which should be removed: https://github.com/scottjlee/ray/blob/d7f7e0f58248b9de145949883989ab597f97a2da/python/ray/data/tests/test_stats.py#L1695

Contributor:

Please capture this context as a comment

Contributor Author:

Oh, I think we still need to call that? Without calling StatsManager.clear_iteration_metrics, the update thread will not exit, since we will always have the _last_iteration_metrics here: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L540-L542

Contributor:

Ah sorry, I confused myself with the method of the same name in StatsActor. I think you are right, Ricky.


@scottjlee (Contributor) left a comment:

LGTM. I will remove clear_execution_metrics() usage in another PR.
#48745

@rickyyx (Contributor Author) commented Nov 14, 2024:

LGTM. I will remove clear_execution_metrics() usage in another PR. #48745

Ah, thanks, I missed that part of your previous comments.

@rickyyx rickyyx enabled auto-merge (squash) November 14, 2024 21:14
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Nov 14, 2024
@rickyyx rickyyx removed the "go" label (add ONLY when ready to merge, run all tests) Nov 14, 2024
@rickyyx rickyyx merged commit a1d4cb1 into ray-project:master Nov 14, 2024
6 of 7 checks passed
dentiny pushed a commit to dentiny/ray that referenced this pull request Dec 7, 2024