[RLlib] MetricsLogger + Stats overhaul #51639

Open · wants to merge 36 commits into master
Conversation

@ArturNiederfahrenhorst (Contributor) commented Mar 24, 2025

Why are these changes needed?

Today, the throughput is calculated inside the MetricsLogger.reduce() and Stats.reduce() methods, and calling Stats.push() does not update it. This can lead to a situation where we push values but Stats.peek() cannot pick up the metric's current throughput until the next reduce() call.
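A minimal sketch of the problematic flow being described. This assumes a `throughput=True` constructor flag on Stats for enabling throughput tracking; the actual argument name and peek() signature may differ:

```python
from ray.rllib.utils.metrics.stats import Stats

# Assumed constructor flag `throughput=True` enables throughput tracking.
stats = Stats(reduce="sum", throughput=True)

stats.push(100)
stats.push(100)

# Pre-PR behavior: throughput is only (re)computed inside reduce(), so this
# peek can return a stale throughput even though values were just pushed.
print(stats.peek(throughput=True))

stats.reduce()  # only now (pre-PR) would the throughput be updated
```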

This PR mainly introduces ...

  • Stats._throughput_stats, which keeps track of a moving average of the throughput for Stats objects that have throughput tracking enabled. This decouples the throughput calculation from Stats.reduce() calls.
  • MetricsLogger.compile(), which encapsulates the logic to reduce all metrics AND their throughputs to a single dictionary of numeric values. This makes the abstraction deeper and removes the respective logic from Algorithm (see the usage sketch after this list).
  • a reduced signature for Stats.peek(), formerly peek(self, *, previous: Optional[int] = None, throughput: bool = False). previous is only needed once in RLlib (in the MetricsLogger), so we move it into its own method, Stats.get_reduce_history().
  • a safeguard that ensures calling reduce() multiple times without logging additional values does not alter the reduction history (for example, if a Stats object is passed down some code path and reduced multiple times, possibly involuntarily).
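A usage sketch of the surface described above, as I read it from this PR. The flag spelling with_throughput is an assumption, not a confirmed signature:

```python
from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

logger = MetricsLogger()

# Log a counter-style metric; `with_throughput=True` is an assumed spelling
# of the flag that enables throughput tracking on the underlying Stats.
logger.log_value("num_env_steps_sampled", 1000, reduce="sum", with_throughput=True)

# peek() returns the current reduced value; per this PR it no longer takes
# a `previous` argument (that moved to Stats.get_reduce_history()).
current = logger.peek("num_env_steps_sampled")

# compile() reduces ALL metrics and their throughputs into a single
# dictionary of numeric values (logic that previously lived in Algorithm).
results = logger.compile()

# Safeguard: reducing repeatedly without logging new values must not alter
# the reduction history.
r1 = logger.reduce()
r2 = logger.reduce()
```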

Furthermore, we also introduce ...

  • extensive testing for Stats and MetricsLogger
  • some safeguards so that stats are always logged with the correct expectations around throughput tracking, etc.

@ArturNiederfahrenhorst added the rllib (RLlib related issues) label Mar 24, 2025
logger.log_value("some_items", value="d")
logger.log_value("some_items", value="b", reduce=None, clear_on_reduce=True)
logger.log_value("some_items", value="c", reduce=None, clear_on_reduce=True)
logger.log_value("some_items", value="d", reduce=None, clear_on_reduce=True)
@ArturNiederfahrenhorst (Contributor, Author):
The idea behind the change that makes this necessary is that:
a) there should be very few spots in the code where users log the same value, so it's not much work to write out all arguments each time we log to a given name; and
b) it enforces that the MetricsLogger does exactly what is expected: there should be no race conditions where users use different call arguments to log values to the same Stats object. A short sketch of the intended behavior follows below.
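A sketch of that intent. The exact error type is an assumption; the comment above only states that mismatched settings should not be silently accepted:

```python
from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

logger = MetricsLogger()

# All calls that log to the same key spell out the same settings explicitly.
logger.log_value("some_items", value="b", reduce=None, clear_on_reduce=True)
logger.log_value("some_items", value="c", reduce=None, clear_on_reduce=True)

# Logging to the same key with different settings should be rejected rather
# than silently reusing the first call's settings (assumed error type).
try:
    logger.log_value("some_items", value="d", reduce="mean")
except ValueError:
    print("mismatched settings for key 'some_items'")
```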

```python
if not eval_results:
    logger.warning(
        "No evaluation results found for this iteration. This can happen "
        "if the evaluation worker(s) is/are not healthy."
    )
```
@ArturNiederfahrenhorst (Contributor, Author):

This is a branch of code that we hit in eval worker failure tests, where eval results will not be part of the metrics.

```diff
- convert_to_numpy(
-     module_results.pop(TD_ERROR_KEY).peek()
- )
+ convert_to_numpy(module_results.pop(TD_ERROR_KEY))
```
@ArturNiederfahrenhorst (Contributor, Author):

Today: in user-defined functions like training_step(), we assume that results are a ResultDict.
We break this assumption here because we call MetricsLogger.reduce() at the end of Algorithm._run_one_training_iteration.

With this PR, we instead return MetricsLogger.compile() from Algorithm._run_one_training_iteration.
In the future, we should probably standardize on using the MetricsLogger throughout everything under Algorithm.step() instead of a ResultDict, and call MetricsLogger.compile() at the end of Algorithm.step() to return a dict. A simplified sketch of this flow follows below.
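A simplified sketch of the control flow being described. This is not the actual Algorithm code; the function shape and timer key are illustrative:

```python
from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

def run_one_training_iteration(metrics: MetricsLogger, training_step) -> dict:
    """Illustrative stand-in for Algorithm._run_one_training_iteration."""
    with metrics.log_time(("timers", "training_iteration")):
        training_step()
    # Before this PR: `return metrics.reduce()` produced a ResultDict that
    # could still contain Stats objects (hence the .peek() in the old code).
    # With this PR: compile() returns a plain dict of numeric values,
    # including throughputs.
    return metrics.compile()
```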

@ArturNiederfahrenhorst marked this pull request as ready for review Mar 25, 2025 14:18
@ArturNiederfahrenhorst changed the title from "[RLlib] Adjust MetricsLogger and Stats to calculate throughputs with moving average" to "[RLlib] Adjust MetricsLogger and Stats to calculate throughputs with moving average and other improvements" Mar 25, 2025
@ArturNiederfahrenhorst changed the title to "[RLlib] MetricsLogger Stats overhaul" Mar 27, 2025
@ArturNiederfahrenhorst changed the title to "[RLlib] MetricsLogger + Stats overhaul" Mar 27, 2025
@PhilippWillms commented:
Hi Artur, if you add further test cases for the metrics logger, this recent finding may also be relevant: #50294
