[RLlib] MetricsLogger + Stats overhaul #51639
Conversation
- logger.log_value("some_items", value="d")
+ logger.log_value("some_items", value="b", reduce=None, clear_on_reduce=True)
+ logger.log_value("some_items", value="c", reduce=None, clear_on_reduce=True)
+ logger.log_value("some_items", value="d", reduce=None, clear_on_reduce=True)
The idea behind the change that makes this necessary is that...
a) There should be very few spots in the code where users log the same value, so it is not much work to write out all arguments each time we log to a given name.
b) It enforces that the MetricsLogger does exactly what is expected: there should be no race conditions where users pass different call arguments when logging values to the same Stats object, etc.
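To make this concrete, here is a minimal, hedged sketch of the convention argued for above: every call site that logs to the same key repeats the full set of reduce settings. The import path and the exact `peek()` return value follow recent Ray releases and are assumptions, not part of this diff.

```python
# Hedged sketch (not part of this diff): every call that logs to "some_items"
# spells out the same reduce settings, so the underlying Stats object is
# configured identically regardless of which call site runs first.
from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

logger = MetricsLogger()

logger.log_value("some_items", value="a", reduce=None, clear_on_reduce=True)
logger.log_value("some_items", value="b", reduce=None, clear_on_reduce=True)

# With reduce=None, peeking the key is expected to yield the list of logged
# values, e.g. ["a", "b"].
print(logger.peek("some_items"))
```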
if not eval_results:
    logger.warning(
        "No evaluation results found for this iteration. This can happen if the evaluation worker(s) is/are not healthy."
    )
This is a branch of code that we hit in eval worker failure tests, where eval results will not be part of the metrics.
rllib/algorithms/dqn/dqn.py
- convert_to_numpy(
-     module_results.pop(TD_ERROR_KEY).peek()
- )
+ convert_to_numpy(module_results.pop(TD_ERROR_KEY))
Today: in user-defined functions like training_step(), we assume that results are a ResultDict. We break this assumption here because we call MetricsLogger.reduce() at the end of Algorithm._run_one_training_iteration.
With this PR, we instead return MetricsLogger.compile() in Algorithm._run_one_training_iteration.
In the future, we should probably standardize on using the MetricsLogger throughout everything under Algorithm.step() instead of ResultDict, and call MetricsLogger.compile() at the end of Algorithm.step() to return a dict.
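As a hedged illustration of that direction (class names and structure below are assumptions for illustration, not code from this PR), a custom training_step() would only log into the algorithm's MetricsLogger, and the numeric results dict would be produced once per iteration via MetricsLogger.compile():

```python
# Hedged sketch of the proposed direction; not the actual Algorithm code.
from ray.rllib.algorithms.ppo import PPO


class MyAlgo(PPO):
    def training_step(self) -> None:
        # ... sampling and learner updates would go here ...
        # Log into the algorithm's MetricsLogger instead of building a
        # ResultDict by hand.
        self.metrics.log_value("my_custom_metric", 1.0, reduce="mean")


# Conceptually, at the end of one training iteration, the logger is compiled
# into a plain dict of numeric values (per this PR, inside
# Algorithm._run_one_training_iteration):
# results = algo.metrics.compile()
```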
Hi Artur, if you add further test cases for the MetricsLogger, this recent finding may also be relevant: #50294
Why are these changes needed?
Today, the MetricsLogger.reduce() and Stats.reduce() methods calculate the throughput.
Also, calling Stats.push() does not change the throughput. This can lead to a situation where we push values but Stats.peek() is not able to pick up the metric.
This PR mainly introduces ... (see the sketch after this list)
- Stats._throughput_stats, which keeps track of a moving average of throughput stats for Stats that have throughput tracking enabled. This decouples throughput calculation from Stats.reduce() calls.
- MetricsLogger.compile(), which encapsulates the logic to reduce all metrics AND their throughputs to a single dictionary of numeric values. This makes the abstraction deeper and removes the respective logic from Algorithm.
- Stats.peek() with the new signature peek(self, *, previous: Optional[int] = None, throughput: bool = False). previous is only needed once in RLlib (in the MetricsLogger), so we put it into its own method, Stats.get_reduce_history().
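Here is a hedged usage sketch of the new surface. Only compile() and the throughput-aware peek() are named by this PR; the with_throughput keyword, the MetricsLogger-level peek() forwarding, and the import path are assumptions made for illustration.

```python
# Hedged sketch; keyword names for enabling throughput tracking are assumed.
import time

from ray.rllib.utils.metrics.metrics_logger import MetricsLogger

logger = MetricsLogger()

# Log a counter-style metric a few times; the flag enabling the decoupled
# throughput Stats (Stats._throughput_stats) is assumed to be with_throughput.
for _ in range(3):
    logger.log_value("num_env_steps", 100, reduce="sum", with_throughput=True)
    time.sleep(0.01)

# Throughput can now be peeked independently of any reduce() call.
steps_per_sec = logger.peek("num_env_steps", throughput=True)

# compile() reduces all metrics AND their throughputs into a single dict of
# numeric values; per this PR, this is what
# Algorithm._run_one_training_iteration returns.
results = logger.compile()
```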
Furthermore, we also introduce...