Event loop lag should be a summary, not a gauge #309

dhet · 2019-12-28T08:59:13Z

Hi,
as stated in the title, I believe the event loop lag metric should be a summary, not a gauge. The reason is that event loop lag can fluctuate substantially within a very short time. Think of a web server: if we take a lag sample at exactly the moment the server is handling a request, the lag value is probably high. If we take a sample while the server is idle, on the other hand, lag should be close to zero. When we plot the gauge metric in a line chart, we see seemingly random lag spikes: a spike appears or not depending on whether the sample was taken during a request or during idle time. A summary would produce much more meaningful results because it reflects not just a single point in time but all samples recorded so far.

Have a look at the official docs: monitorEventLoopDelay also uses a histogram.
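For readers who haven't used it, here is a minimal sketch of reading lag statistics from `perf_hooks.monitorEventLoopDelay`. The 10 ms resolution, the timer durations, and the `busyWait` and `snapshot` helpers are assumptions made up for illustration; they are not part of this library. The histogram reports nanoseconds, so the sketch converts to seconds.

```javascript
const { monitorEventLoopDelay } = require('perf_hooks');

// Sample the event loop delay every 10 ms (arbitrary choice for this sketch).
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();

// Hypothetical helper: block the event loop for roughly `ms` milliseconds.
function busyWait(ms) {
  const end = Date.now() + ms;
  while (Date.now() < end) { /* spin */ }
}

// Hypothetical helper: read a few statistics and convert nanoseconds to seconds.
function snapshot(h) {
  const toSeconds = (ns) => ns / 1e9;
  return {
    min: toSeconds(h.min),
    mean: toSeconds(h.mean),
    max: toSeconds(h.max),
    p50: toSeconds(h.percentile(50)),
    p99: toSeconds(h.percentile(99)),
  };
}

// Let the loop idle, block it once, idle again, then print the statistics.
setTimeout(() => {
  busyWait(50);
  setTimeout(() => {
    histogram.disable();
    console.log(snapshot(histogram));
  }, 100);
}, 100);
```

Unlike a single gauge sample, the percentiles cover every delay recorded since the histogram was enabled (or last reset), so one blocked tick shows up in `max` and the high percentiles without dominating `p50`.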

I can open a PR if you want, otherwise #278 would be a good opportunity to make this change.

dhet · 2019-12-28T20:21:23Z

This is what it looks like currently:

Quite random, I would say.

Here's what I think it should look like (Ignore the .999 percentile. Those are outliers):

sam-github · 2020-01-28T22:08:41Z

Shouldn't it be a histogram?

siimon · 2020-02-03T13:22:02Z

Yeah, histogram is the way to go

dhet · 2020-02-03T14:49:49Z

Care to explain why? Isn't a summary just a histogram with percentiles as buckets?

sam-github · 2020-02-03T15:23:49Z

Histograms have fixed buckets with counts in them, so it's possible to sum the histograms across all instances (pods, for example). With summaries, the data is analyzed on the fly: each pod has different values for its percentiles, and they can't be summed. If you google around you'll find a bunch of info on this. Summaries look great until someone tries to aggregate them into a dashboard, then it goes badly.
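To make the aggregation point concrete, here is a small sketch with invented numbers. The bucket bounds, the two pods, and the `mergeBuckets` and `quantileUpperBound` helpers are all hypothetical. It only reports which bucket a quantile falls into, without the interpolation PromQL's `histogram_quantile` performs.

```javascript
// Cumulative bucket counts per upper bound (seconds). All numbers are invented.
const bounds = [0.01, 0.05, 0.1, 0.5, Infinity];
const podA = [995, 998, 1000, 1000, 1000]; // busy pod, almost always low lag
const podB = [10, 20, 95, 100, 100];       // quiet pod with a few slow ticks

// Bucket counts are plain counters, so they add up across instances.
function mergeBuckets(a, b) {
  return a.map((count, i) => count + b[i]);
}

// Upper bound of the first bucket whose cumulative count reaches the q-quantile rank.
function quantileUpperBound(cumulative, q) {
  const total = cumulative[cumulative.length - 1];
  const rank = q * total;
  return bounds[cumulative.findIndex((count) => count >= rank)];
}

const merged = mergeBuckets(podA, podB);
const p99A = quantileUpperBound(podA, 0.99);
const p99B = quantileUpperBound(podB, 0.99);

console.log('p99 bucket from merged buckets:', quantileUpperBound(merged, 0.99));
console.log('average of per-pod p99 buckets:', (p99A + p99B) / 2);
```

With these numbers the merged buckets place the fleet-wide p99 at or below 0.1 s, while averaging the two per-pod p99 values lands well above that, because the average ignores that pod A contributed ten times as many observations. Precomputed summary quantiles carry no counts that would let a dashboard correct for this.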

There is a case to be made that the user should be able to provide either a histogram or a summary object to be used for observing the values (Go does this with some of the metrics it collects), but I haven't seen the Node.js client do this, so if there is only one, it should be a histogram.

zbjornson · 2021-09-19T01:52:32Z

> The reason is that event loop lag can fluctuate quite substantially in a very short time. Think of a web server: when a web server handles a request, and we take a lag sample at exactly that moment then the lag value is probably high. If we take a sample in idle state on the other hand, lag should be close to zero.

The advanced event loop monitoring added in v12.0.0 (#278) directly exposes the underlying histogram from Node.js v11.10.0+, which fixes this specific problem.

As for the metric type, I don't think we can convert this into a Prometheus Histogram, because the data we get from Node.js is the "processed histogram", i.e. "lag at Nth percentile", whereas we'd need "counts in bin LE=0.05". I think @dhet is correct to suggest a Summary for that reason, but @sam-github is correct about Summaries being a pain to work with and aggregate.
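A rough sketch of the "counts in bin" shape, for comparison. It assumes access to raw lag samples, which per the comment above is exactly what we don't get from Node.js; the sample values, the bounds, and the `toCumulativeBuckets` helper are made up.

```javascript
// Hypothetical raw lag samples in seconds (not available from monitorEventLoopDelay).
const samples = [0.001, 0.002, 0.004, 0.03, 0.2];

// Hypothetical bucket upper bounds in seconds, i.e. the "le" label values.
const upperBounds = [0.005, 0.05, 0.5];

// For each bound, count how many samples are less than or equal to it (cumulative).
function toCumulativeBuckets(values, les) {
  return les.map((le) => ({ le, count: values.filter((v) => v <= le).length }));
}

console.log(toCumulativeBuckets(samples, upperBounds));
```

Going the other way, from a handful of percentile values back to counts per bound, loses information, which is the reason a direct conversion isn't possible.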

I'm inclined to leave this as-is unless someone suggests otherwise.

anthonyalayo · 2023-05-10T07:53:45Z

Maybe once native histograms are on the main branch, that would be the best solution?

https://github.com/prometheus/prometheus/milestone/10

ChristianBoehlke mentioned this issue May 4, 2020

Improve Event Loop Lag Metric #370

Open

yarsky-tgz mentioned this issue Aug 21, 2021

Reset internal histogram of monitorEventLoopDelay after each collect() invocation #459

Merged

zbjornson changed the title from `Event loop lag should be a summary, not a gauge` to `Event loop lag should be a ~~summary~~ histogram, not a gauge` Sep 19, 2021

zbjornson changed the title from `Event loop lag should be a ~~summary~~ histogram, not a gauge` to `Event loop lag should be a <s>summary</s> histogram, not a gauge` Sep 19, 2021

zbjornson changed the title from `Event loop lag should be a <s>summary</s> histogram, not a gauge` to `Event loop lag should be a ~~summary~~ histogram, not a gauge` Sep 19, 2021

zbjornson changed the title from `Event loop lag should be a ~~summary~~ histogram, not a gauge` to `Event loop lag should be a summary, not a gauge` Sep 19, 2021
