
(perf_hooks) monitorEventLoopDelay stats are skewed towards small measurements over time #34661

Open
jomi-se opened this issue Aug 7, 2020 · 7 comments

jomi-se commented Aug 7, 2020

I'm not sure if this is more of a feature request or a bug report.

It's a feature request in the sense that I'm requesting more info in the data returned by node's event loop monitor, but it's a bug report in the sense that the current stats are biased towards small measurements.

In its current state, monitorEventLoopDelay is not very practical for reporting on sporadic event loop delays due to excessive synchronous work.

This is because the stats get skewed towards measurements close to the sampling resolution: no samples are taken while a big delay is in progress, and those missing samples are not taken into account when computing the stats.

  • Version: v12.18.1
  • Platform: OS X

What steps will reproduce the bug?

Take this example, where we do some sync work for 30s, report the event loop delay stats every 60s, and use a resolution of 50ms:

import { monitorEventLoopDelay } from 'perf_hooks';

const resolution = 50;
const eventLoopDelayReportFreqInMs = 60000;

const doSomeSyncWorkFor30s = () => {
    setTimeout(() => {
        const s = Date.now();
        while (true) {
            const arr = new Array(1000);
            arr.fill('hello');
            if (Date.now() - s > 30000) {
                break;
            }
        }
    }, 150);
};

export const startEventLoopDelayMonitor = () => {
    const monitorHistogram = monitorEventLoopDelay({ resolution });
    monitorHistogram.enable();

    setInterval(() => {
        // Values are returned in nanoseconds. We transform them to milliseconds.
        // We subtract the resolution from the values to get a more meaningful metric.
        // However, sometimes the internal timer gets triggered in less time than the
        // "resolution" value. This is why during idle times we may get negative
        // "min" values
        const percentile25 = monitorHistogram.percentile(25) / 1000000 - resolution;
        const percentile50 = monitorHistogram.percentile(50) / 1000000 - resolution;
        const percentile75 = monitorHistogram.percentile(75) / 1000000 - resolution;
        const percentile95 = monitorHistogram.percentile(95) / 1000000 - resolution;
        const percentile99 = monitorHistogram.percentile(99) / 1000000 - resolution;
        const min = monitorHistogram.min / 1000000 - resolution;
        const max = monitorHistogram.max / 1000000 - resolution;
        const mean = monitorHistogram.mean / 1000000 - resolution;
        const stddev = monitorHistogram.stddev / 1000000;

        console.log({
            percentile25,
            percentile50,
            percentile75,
            percentile95,
            percentile99,
            min,
            max,
            mean,
            stddev
        });
    }, eventLoopDelayReportFreqInMs);
};

startEventLoopDelayMonitor();
doSomeSyncWorkFor30s();

If you run this, the first log stats would be something like:

{
  percentile25: 0.3971829999999983,
  percentile50: 1.8717430000000022,
  percentile75: 3.477375000000002,
  percentile95: 5.050238999999998,
  percentile99: 5.115774999999999,
  min: 0.003968000000000416,
  max: 30014.771071,
  mean: 54.257005300869565,
  stddev: 1250.17883463233
}

Besides the max value, these stats fail to tell the story of how the event loop was blocked for 50% of the time between the start of the script and the first log 60s later.

What is the expected behavior?

The main point of monitoring nodejs's event loop delay is to see if async work is being delayed because of excessive synchronous work.

If this were running on a server, I would have liked these stats to give me some information about what percentage of requests were delayed because of heavy sync work over the past minute.

In this scenario, if I had a server that handled 1 request/second, 50% of requests over 1 minute would have been delayed because of sync work. However, the event loop monitor stats give the impression that everything is "mostly fine" except for 1 sample.

What do you see instead?

It seems to me that the percentiles are based on the number of samples but fail to take into account the varying number of samples over fixed periods of time.

IIUC, there is a timer running at the libuv level which is set to trigger every resolution ms. However, if this timer gets delayed by resolution * N ms, then we will have around N fewer samples over the same period of time than if there hadn't been any sync work. This means that the bigger the delay, the fewer samples we take of big delays. This biases the stats towards small measurements.

In numbers:
30s is 50% of 60s, so I would have expected the percentiles to show something that could have been interpreted as "the event loop was blocked around 50% of the time in the last sampled period".

However:
During the 30s of sync work we only took 1 sample.
During the other 30s, without any delay and with a resolution of 50ms, we took ~600 samples (which are usually around 0 - 5 ms).

1 / 601 = 0.0017

So from this point of view, yeah, the 1 sample is less than 1% of all samples, but that doesn't seem really helpful.

I understand that from a technical standpoint, if the event loop is blocked, you can't take any more samples. However, as a thought experiment, imagine that even while it is blocked you could still start a timer every resolution ms. The reported delay values would be something like:

timer 1: 30000ms
timer 2: 29950ms
timer 3: 29900ms
.
.
.
timer 600: 50ms

If it had these samples + 600 samples of 1.8ms delays, the stats would look much different:

{
  percentile25: 1.8,
  percentile50: 25.9,
  percentile75: 15012.5,
  percentile95: 27002.5,
  percentile99: 29400.5,
  min: 1.8,
  max: 30000,
  mean: 7513.4,
  stddev: 9691.441089607537
}

I came up with these numbers by doing this in Python:

import numpy as np
smol = [1.8] * 600
big = list(range(50, 30001, 50))  # the hypothetical samples taken while blocked
samples = np.array(smol + big)
np.percentile(samples, 25)
np.percentile(samples, 50)
# ... etc

Additional information

Given that it's not actually feasible to add extra samples during periods where the event loop is blocked, I think that, from the user's perspective, we're missing the total number of samples taken. I think it would be possible to approximately "un-skew" these values based on the number of samples, the resolution value and the total time during which the event loop monitor has been running. Although, that would only give very rough approximations 🤔
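
For illustration, here is a very rough sketch of the kind of correction I have in mind. The sampleCount argument is hypothetical (the current histogram doesn't expose how many samples it actually recorded); the rest uses only the resolution and the elapsed monitoring time:

const estimateBlockedFraction = (sampleCount, resolutionMs, elapsedMs) => {
    // How many samples we would have seen if the loop had never been blocked
    const expectedSamples = elapsedMs / resolutionMs;
    const missingSamples = Math.max(expectedSamples - sampleCount, 0);
    // Each missing sample stands for roughly one resolution-sized slice during
    // which the loop never got back to the timer, so this is only a rough estimate
    return (missingSamples * resolutionMs) / elapsedMs;
};

// With the numbers from the example above: ~601 samples over 60s at 50ms resolution
estimateBlockedFraction(601, 50, 60000); // ≈ 0.5, i.e. blocked ~50% of the interval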

jasnell commented Aug 7, 2020

It seems to me that the percentiles are based on the number of samples but fail to take into account the varying number of samples over fixed periods of time.

The irony here, of course, is that the event loop delay histogram is actually recording the variation in the sampling rate that occurs over time, and using that as only an approximation of accumulated event loop delay could be very misleading.

For instance, on one turn of the event loop we may have only 10 async requests to handle, all of which are handled efficiently and quickly, with the event loop turning over in roughly resolution time. On a future event loop turn, we might have 100,000 async requests to handle, all of which are handled just as efficiently on their own, but because there's more for the event loop to do in that turn, the sampled event loop delay will be much larger because it'll take longer to get back around to triggering the sample timer, giving us the false impression that the event loop has been blocked whereas instead it has been progressing but has just been highly utilized. Event loop delay is always going to be an imperfect measurement here because Node.js' event loop does not operate in terms of a determined frame rate, and "delay" becomes highly relative.

That said, you're right, we can do a bit better here. We can either (a) record the number of "skipped" samples at each sample point (e.g. if the resolution is 10 ms and the current sample took 100 ms to complete, we record 9 missing samples) or (b) calculate what the total number of samples should have been and adjust when we query the histogram. The former would allow us to deskew the results more incrementally, I believe.
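
As a purely illustrative user-land sketch of what option (a) amounts to (the real change would live in the C++/libuv sampler, and the names here are made up):

const resolution = 50; // ms
const samples = [];

let last = process.hrtime.bigint();
setInterval(() => {
    const now = process.hrtime.bigint();
    // Delay beyond the expected resolution interval, in milliseconds
    const delay = Number(now - last) / 1e6 - resolution;
    last = now;
    samples.push(Math.max(delay, 0));
    // Backfill one synthetic sample per timer tick that could not fire while
    // the loop was blocked, each one resolution shorter than the last
    for (let d = delay - resolution; d > 0; d -= resolution) {
        samples.push(d);
    }
}, resolution);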

Ultimately, however, I think that event loop delay (in general, not specific to the histogram implementation we have in core) is simply the wrong metric given how it is currently defined. A far more useful metric will be the event loop utilization that @trevnorris has been working on (see libuv/libuv#2725). Once that lands here, we'll have a much better tool available.

jomi-se commented Aug 8, 2020

Hello @jasnell, thanks for the reply!

I took a look at the event loop utilization PR, but I fail to see how this metric can be used as a replacement for an event loop delay metric. Maybe I'm missing something, but it seems more like a replacement for CPU usage metrics, to be able to monitor when a Node process is close to full capacity? I'm not sure I understand how it could be used to signal high synchronous workloads (even if approximately). By correlating high CPU + low ELU periods?

jasnell commented Aug 8, 2020

It does not replace the event loop delay. It augments it, and gives what is likely a far more valuable metric when combined with the existing delay measurement.

ELD (delay) tells you how quickly or slowly the event loop is turning, while ELU will tell you how much work it's actually doing during those delays.
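
For example, reading the two together might look something like this sketch. It assumes a Node.js version that ships performance.eventLoopUtilization(), which the v12.18.1 used above does not:

import { monitorEventLoopDelay, performance } from 'perf_hooks';

const histogram = monitorEventLoopDelay({ resolution: 50 });
histogram.enable();
let lastELU = performance.eventLoopUtilization();

setInterval(() => {
    // Passing the previous reading gives the utilization delta for this interval
    const elu = performance.eventLoopUtilization(lastELU);
    lastELU = performance.eventLoopUtilization();

    console.log({
        delayP99Ms: histogram.percentile(99) / 1e6, // how slowly the loop is turning
        utilization: elu.utilization,               // how busy it was while turning
    });
    histogram.reset(); // start each reporting window fresh
}, 60000);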

jasnell commented Aug 8, 2020

I'm thinking that for the delay monitor, we should add a new point-in-time ratio metric that is actual samples over expected samples.
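
With the reproduction above, that ratio would come out to roughly:

// ~601 actual samples over an expected 60000 / 50 = 1200
const sampleRatio = 601 / (60000 / 50); // ≈ 0.5, i.e. about half the expected samples are missing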

jomi-se commented Aug 10, 2020

Ah, I see what you mean now. Indeed, both metrics would give a clearer picture of what's going on.

I kept on thinking about your previous example:

... giving us the false impression that the event loop has been blocked whereas instead it has been progressing but has just been highly utilized

On one hand, it makes sense and I agree: the event loop can be blocked/delayed by more than just synchronous work on a single async request. A large spike of async requests happening simultaneously would have a similar effect.

On the other hand, I still kinda feel like the event loop delay is a good metric by itself to signal whether Node is "internally overworking itself". In the case where I'm using Node.js as a server, if I'm logging all incoming requests somewhere, I can quickly correlate whether high ELD is a result of a high incoming workload, or whether it happened while incoming traffic hadn't changed. I guess one can't assume that it's necessarily sync work going on (a request could have triggered an abnormal amount of async work), but it's still something worth looking at given that the end result is similar: response times for subsequent requests are increased regardless of their own workload.

So I'm still trying to figure out a way to make abnormal delays visible using this module. It seems that to avoid this issue, either:

  • I decrease the logging interval so it is close to the (small) sampling resolution, so that the sample size per report is kept small. The downside is that I end up with a lot more logs to store.
  • I increase the sampling resolution so it is close to the logging interval, but in that case delays shorter than the resolution start to go unnoticed.

🤔

@targos added the perf_hooks label on Dec 27, 2020

matej-prokop commented Jun 18, 2021

A far more useful metric will be the event loop utilization that @trevnorris has been working on (see libuv/libuv#2725). Once that lands here, we'll have a much better tool available.

@jasnell As libuv/libuv#2725 was merged some time ago, does Node provide any API to measure event loop utilization?

Flarna commented Jun 18, 2021

Yes, there is one for the current JS thread and one to get data from a worker.
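
For reference, a minimal sketch of both (the exact Node.js versions these landed in are worth checking in the docs, and ./my-worker.js is just a placeholder for your own worker script):

import { performance } from 'perf_hooks';
import { Worker } from 'worker_threads';

// ELU of the current JS thread
console.log(performance.eventLoopUtilization());

// ELU of a worker thread, queried from the parent
const worker = new Worker('./my-worker.js');
worker.on('online', () => {
    console.log(worker.performance.eventLoopUtilization());
});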
