[data] Don't reset iteration counter stats #48618
Conversation
python/ray/data/_internal/stats.py (Outdated)
# NOTE(rickyx): We should not be clearing the iter_total_blocked_s and
# iter_user_s metrics because they are technically counters we tracked, and
# should not be reset by each iteration.
@rickyyx could you help me understand why this value keeps going up across iterations? Is it because we reset the value here, but not in some other place?
During my investigation for #44635, where the `Rows Outputted` value is seen as zero, I found that disabling `clear_execution_metrics()` here is the fix.

I was thinking it would make sense to just completely remove the `clear_execution_metrics()` and `clear_iteration_metrics()` calls currently made after Dataset execution/iteration completes -- @rickyyx do you agree? For more context, I believe the reason we had this in the first place was to do a hacky "reset" of the metrics, to prevent values from persisting at their last value. But I think this is no longer the behavior we want, since we now also show rates on the Grafana dashboard by default -- so we can simply remove the metrics reset.
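For illustration, a minimal, hypothetical sketch (not the actual Ray code) of what that "reset at completion" looks like to anything reading the metric afterwards -- this is the behavior behind `Rows Outputted` showing zero in #44635:

```python
# Hypothetical illustration (not Ray's actual code): why resetting a
# cumulative metric to 0 at the end of execution surprises anything that
# reads it after the dataset finishes.

class CumulativeMetric:
    """Tracks a monotonically increasing total, like rows outputted."""

    def __init__(self) -> None:
        self.value = 0.0

    def add(self, amount: float) -> None:
        self.value += amount

    def reset(self) -> None:
        # The old "hacky reset": callers scraping the metric after the
        # dataset finishes now see 0 instead of the final total.
        self.value = 0.0


rows_outputted = CumulativeMetric()
for batch_size in (100, 250, 150):
    rows_outputted.add(batch_size)

print(rows_outputted.value)  # 500.0 -- the final total a dashboard should show

# With the reset, anything that reads the metric after completion
# (e.g. a "Rows Outputted" panel) observes 0 instead:
rows_outputted.reset()
print(rows_outputted.value)  # 0.0
```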
+1 to just removing it.

Why not also convert these to counters?
Sure - i think removing it makes sense.

> Why not also convert these to counters?

Yeah, I think this was an alternative, but if I remember correctly, we are less prone to changing the actual timeseries definition since there might be customers depending on it (or what's the policy here for backward compatibility on metrics)? I am open to just changing the metric type to counters too.
> could you help me understand why this value keeps going up across iterations

I think it's because we reuse the stats field. Let me see if that could be fixed.

Updates:

On the front end, I also wonder if we should update the charts -- people are reading them as "metric per iteration", so it seems there's a gap in the readability of the charts.
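To illustrate the "we reuse the stats field" point, here is a hypothetical sketch (the class names and structure are illustrative, not Ray's actual implementation) of how a stats object shared across iterations makes the blocked-time value keep growing:

```python
# Hypothetical sketch: the same stats object is reused by every iteration
# over the dataset, so each epoch adds onto the totals from previous
# epochs -- which is exactly counter-like behavior.

import time


class IterStats:
    def __init__(self) -> None:
        self.iter_total_blocked_s = 0.0  # accumulated, never reset


class DatasetLike:
    def __init__(self) -> None:
        # The stats object lives on the dataset and is shared across
        # all iterations over it.
        self._iter_stats = IterStats()

    def iter_batches(self):
        for _ in range(3):
            start = time.perf_counter()
            time.sleep(0.01)  # stand-in for waiting on the next block
            self._iter_stats.iter_total_blocked_s += time.perf_counter() - start
            yield object()


ds = DatasetLike()
for epoch in range(2):
    for _ in ds.iter_batches():
        pass
    # The reported value includes blocked time from *all* epochs so far.
    print(epoch, round(ds._iter_stats.iter_total_blocked_s, 3))
```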
@@ -624,14 +618,6 @@ def clear_iteration_metrics(self, dataset_tag: str):
        if dataset_tag in self._last_iteration_stats:
            del self._last_iteration_stats[dataset_tag]
We still wanna clear the iteration stats on the `StatsManager` so that the async update thread can exit. But we don't clear the iteration metrics with the `StatsActor`.
i believe there is a usage of `StatsManager.clear_iteration_metrics` in `test_stats.py` which should be removed: https://github.com/scottjlee/ray/blob/d7f7e0f58248b9de145949883989ab597f97a2da/python/ray/data/tests/test_stats.py#L1695
Please capture this context as a comment
Oh we still need to call that I think? Without calling `StatsManager.clear_iteration_metrics`, the update thread will not exit, since we will always have the `_last_iteration_metrics` here: https://github.com/ray-project/ray/blob/master/python/ray/data/_internal/stats.py#L540-L542
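For context, a rough sketch of the assumed pattern (not a copy of `stats.py`; method and attribute names here are illustrative): the background update thread only exits once no per-dataset state remains registered, which is why `StatsManager.clear_iteration_metrics` still needs to be called even though the `StatsActor`-side metrics are no longer zeroed.

```python
# Assumed structure, for illustration only: the update thread loops while
# any per-dataset iteration state is registered, so popping the entry is
# what allows the thread to exit.

import threading
import time


class StatsManagerSketch:
    UPDATE_INTERVAL_S = 0.1

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._last_iteration_stats = {}  # dataset_tag -> latest stats
        self._update_thread = None

    def register_iteration(self, dataset_tag, stats) -> None:
        with self._lock:
            self._last_iteration_stats[dataset_tag] = stats
            if self._update_thread is None or not self._update_thread.is_alive():
                self._update_thread = threading.Thread(
                    target=self._update_loop, daemon=True
                )
                self._update_thread.start()

    def _update_loop(self) -> None:
        while True:
            with self._lock:
                if not self._last_iteration_stats:
                    # Nothing left to report -- the thread exits here.
                    return
                # ... push self._last_iteration_stats to the stats actor ...
            time.sleep(self.UPDATE_INTERVAL_S)

    def clear_iteration_metrics(self, dataset_tag) -> None:
        # Only clears local bookkeeping so the thread can stop; it does not
        # zero out the gauges exported on the actor side.
        with self._lock:
            self._last_iteration_stats.pop(dataset_tag, None)
```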
ah sorry, i confused myself with the method of the same name in `StatsActor` -- i think you are right, Ricky.
@@ -624,14 +618,6 @@ def clear_iteration_metrics(self, dataset_tag: str):
        if dataset_tag in self._last_iteration_stats:
            del self._last_iteration_stats[dataset_tag]
Please capture this context as a comment
LGTM. I will remove `clear_execution_metrics()` usage in another PR: #48745

Ah, thanks, i missed that part of your previous comments.
## Why are these changes needed?

We currently report `iter_total_blocked_seconds` and `iter_user_seconds` as **Gauge** while we track them as counters, i.e.:

- For each iteration, we have a timer that sums locally into an aggregated value (the total seconds blocked for that iteration).
- When the iteration ends or the iterator is GCed, the gauge metric value is currently set to 0.
- This creates confusion for users, as a counter value (total time blocked on a dataset) should not go back to 0, generating charts like the one below.

With the fix, we will not set the gauge value to 0.

Signed-off-by: rickyx <rickyx@anyscale.com>
Signed-off-by: hjiang <dentinyhao@gmail.com>
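As a minimal sketch of the gauge-vs-counter distinction described above, using `ray.util.metrics` (assumes a running Ray driver; the metric names and values are illustrative, not the real Ray Data metrics touched by this PR):

```python
# Gauge vs. counter semantics with ray.util.metrics. Metric names here are
# examples only.

import ray
from ray.util.metrics import Counter, Gauge

ray.init()

# Gauge semantics: whatever value was last set() is what gets scraped.
# Setting it back to 0 when an iteration finishes is what produced the
# confusing "drops to zero" charts.
blocked_gauge = Gauge(
    "example_iter_total_blocked_s",
    description="Seconds spent blocked waiting for data (gauge version).",
)
blocked_gauge.set(12.5)  # value observed during iteration
blocked_gauge.set(0)     # old behavior: reset at iteration end -> chart drops to 0

# Counter semantics: monotonically increasing and never reset, so dashboards
# can take a rate() over it and the running total stays meaningful.
blocked_counter = Counter(
    "example_iter_total_blocked_s_total",
    description="Seconds spent blocked waiting for data (counter version).",
)
blocked_counter.inc(12.5)
blocked_counter.inc(3.0)  # totals 15.5 and only ever grows
```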