[RLlib] MetricsLogger + Stats overhaul #51639
Conversation
- logger.log_value("some_items", value="d")
+ logger.log_value("some_items", value="b", reduce=None, clear_on_reduce=True)
+ logger.log_value("some_items", value="c", reduce=None, clear_on_reduce=True)
+ logger.log_value("some_items", value="d", reduce=None, clear_on_reduce=True)
The idea behind the change that makes this necessary is that...
a) There should be very few spots in the code where users log the same value, so it is not much work to write out all arguments each time we log to a given name.
b) It enforces that the MetricsLogger does exactly what is expected: there should be no race conditions where users pass different call arguments when logging values to the same Stats object, etc.
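To make this concrete, here is a minimal, hedged sketch of the convention argued for above: every call site that logs to the same key repeats the full set of reduce settings. The import path and the exact `peek()` return value follow recent Ray releases and are assumptions, not part of this diff.

```python
# Hedged sketch (not part of this diff): every call that logs to "some_items"
# spells out the same reduce settings, so the underlying Stats object is
# configured identically regardless of which call site runs first.
from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

logger = MetricsLogger()

logger.log_value("some_items", value="a", reduce=None, clear_on_reduce=True)
logger.log_value("some_items", value="b", reduce=None, clear_on_reduce=True)

# With reduce=None, peeking the key is expected to yield the list of logged
# values, e.g. ["a", "b"].
print(logger.peek("some_items"))
```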
if not eval_results:
    logger.warning(
        "No evaluation results found for this iteration. This can happen if the evaluation worker(s) is/are not healthy."
    )
This is a branch of code that we hit in eval worker failure tests, where eval results will not be part of the metrics.
rllib/algorithms/dqn/dqn.py
- convert_to_numpy(
-     module_results.pop(TD_ERROR_KEY).peek()
- )
+ convert_to_numpy(module_results.pop(TD_ERROR_KEY))
Today: in user-defined functions like training_step(), we assume that results are a ResultDict. We break this assumption here because we call MetricsLogger.reduce() at the end of Algorithm._run_one_training_iteration.
With this PR, we instead return MetricsLogger.compile() in Algorithm._run_one_training_iteration.
In the future, we should probably standardize on using the MetricsLogger throughout everything under Algorithm.step() instead of ResultDict, and call MetricsLogger.compile() at the end of Algorithm.step() to return a dict.
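As a hedged illustration of that direction (class names and structure below are assumptions for illustration, not code from this PR), a custom training_step() would only log into the algorithm's MetricsLogger, and the numeric results dict would be produced once per iteration via MetricsLogger.compile():

```python
# Hedged sketch of the proposed direction; not the actual Algorithm code.
from ray.rllib.algorithms.ppo import PPO


class MyAlgo(PPO):
    def training_step(self) -> None:
        # ... sampling and learner updates would go here ...
        # Log into the algorithm's MetricsLogger instead of building a
        # ResultDict by hand.
        self.metrics.log_value("my_custom_metric", 1.0, reduce="mean")


# Conceptually, at the end of one training iteration, the logger is compiled
# into a plain dict of numeric values (per this PR, inside
# Algorithm._run_one_training_iteration):
# results = algo.metrics.compile()
```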
Hi Artur, if you add further test cases for the MetricsLogger, this recent finding may also be relevant: #50294
Why are these changes needed?
Today, the MetricsLogger.reduce() and Stats.reduce() methods calculate the throughput.
Also, calling Stats.push() does not change the throughput. This can lead to a situation where we push values but Stats.peek() is not able to pick up the metric.
This PR mainly introduces ... (see the sketch after this list)
- Stats._throughput_stats, which keeps track of a moving average of throughput stats for Stats that have throughput tracking enabled. This decouples throughput calculation from Stats.reduce() calls.
- MetricsLogger.compile(), which encapsulates the logic to reduce all metrics AND their throughputs to a single dictionary of numeric values. This makes the abstraction deeper and removes the respective logic from Algorithm.
- Stats.peek() with the new signature peek(self, *, previous: Optional[int] = None, throughput: bool = False). previous is only needed once in RLlib (in the MetricsLogger), so we put it into its own method, Stats.get_reduce_history().
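Here is a hedged usage sketch of the new surface. Only compile() and the throughput-aware peek() are named by this PR; the with_throughput keyword, the MetricsLogger-level peek() forwarding, and the import path are assumptions made for illustration.

```python
# Hedged sketch; keyword names for enabling throughput tracking are assumed.
import time

from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

logger = MetricsLogger()

# Log a counter-style metric a few times; the flag enabling the decoupled
# throughput Stats (Stats._throughput_stats) is assumed to be with_throughput.
for _ in range(3):
    logger.log_value("num_env_steps", 100, reduce="sum", with_throughput=True)
    time.sleep(0.01)

# Throughput can now be peeked independently of any reduce() call.
steps_per_sec = logger.peek("num_env_steps", throughput=True)

# compile() reduces all metrics AND their throughputs into a single dict of
# numeric values; per this PR, this is what
# Algorithm._run_one_training_iteration returns.
results = logger.compile()
```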
Furthermore, we also introduce...