Skip to content

Commit

Permalink
fix(metrics): Prevent NaNs in sampling from occuring
Browse files Browse the repository at this point in the history
A code path in the metric readout code could lead to the generation of
a metric value of "NaN" (see counter_buffer.hpp explanatory
comment for more details).

As metrics are recorded in an accumulated fashion, a NaN metric value
will break the metric for the rest of the recording.

This commit deletes that code path, as there is no record of the bug it
tries to fix ever existing.

This commit further improves the resilence of the metric readout code by
introducing further checks for values of diff_running/enabled/value that
can generate a NaN.
  • Loading branch information
cvonelm committed May 5, 2023
1 parent 30677e2 commit b265e55
Showing 1 changed file with 28 additions and 8 deletions.
36 changes: 28 additions & 8 deletions include/lo2s/perf/counter/counter_buffer.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -65,24 +65,44 @@ class CounterBuffer

for (std::size_t i = 0; i < accumulated_.size(); ++i)
{
// Only a limited amount of hardware counters can be recorded per-CPU.
// If a user opens more counters than the processor can handle, perf multiplexes the
// hardware counters by running each of them for just a part of the overall
// runtime. For users of perf to be able to account for this multiplexing, perf keeps
// track of two statistics per counter:
//
// diff_enabled - time the counter was enabled and _could_ have run
// diff_running - time the counter _actually_ ran
//
// Using diff_enabled / diff_running as a scaling factor, one can estimate the value for
// the counter if it ran for the whole diff_enabled duration.
//
// So for example, if the counter was only active for half of the time, a scaling factor
// of 2 would be applied.
//
// Due to floating point edge cases, several combinations of diff_enabled, diff_running
// and diff_value where some of those are 0 can result in NaNs, which due to the
// accumulated nature of counters will break them.
//
// A counter reading of 0 is assumed in all of these cases, as diff_enabled or
// diff_running being 0 indicates that the counter did not run during the last sampling
// interval

auto diff_enabled = crtp_this->diff_enabled(i);
auto diff_running = crtp_this->diff_running(i);
auto diff_value = crtp_this->diff_value(i);

if (diff_enabled == 0 || diff_running == diff_enabled)
if (diff_enabled == 0 || diff_running == 0 || diff_value == 0)
{
accumulated_[i] += diff_value;
continue;
}
// Due to a perf bug diff_enabled and diff_running may be swapped
// diff_enabled is always smaller than diff_running so swap them if diff_enabled >
// diff_running
else if (diff_enabled > diff_running)
else if (diff_enabled == diff_running)
{
accumulated_[i] += (static_cast<double>(diff_enabled) / diff_running) * diff_value;
accumulated_[i] += diff_value;
}
else
{
accumulated_[i] += (static_cast<double>(diff_running) / diff_enabled) * diff_value;
accumulated_[i] += (static_cast<double>(diff_enabled) / diff_running) * diff_value;
}
}
}
Expand Down

0 comments on commit b265e55

Please sign in to comment.