(perf_hooks) monitorEventLoopDelay stats are skewed towards small measurements over time #34661
The irony here, of course, is that the event loop delay histogram is actually recording the variation in the sampling rate that occurs over time, and using that only as an approximation of accumulated event loop delay, which could be very misleading. For instance, on one turn of the event loop, we may have only 10 async requests to handle, all of which are handled efficiently and quickly, with the event loop turning over in roughly […]

That said, you're right, we can do a bit better here. We can either (a) record the number of "skipped" samples at each sample point (e.g. if the resolution is 10 ms and the current sample took 100 ms to complete, we record 9 missing samples) or (b) calculate what the total number of samples should have been and adjust when we query the histogram. The former would allow us to deskew the results more incrementally, I believe.

Ultimately, however, I think that event loop delay (in general, not specific to the histogram implementation we have in core) is simply the wrong metric given how it is currently defined. A far more useful metric will be the event loop utilization that @trevnorris has been working on (see libuv/libuv#2725). Once that lands here, we'll have a much better tool available.
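Option (a) could be sketched roughly like this (illustrative only, not the core histogram implementation; `record` and the plain array are hypothetical stand-ins):

```javascript
// Illustrative sketch of option (a): alongside each recorded delay, count
// how many samples were skipped because the sample itself took too long.
let skippedTotal = 0;
const delays = [];

function record(delayMs, resolutionMs) {
  delays.push(delayMs);
  // e.g. resolution 10 ms, sample took 100 ms -> 9 missing samples
  skippedTotal += Math.max(0, Math.floor(delayMs / resolutionMs) - 1);
}

record(100, 10);
console.log(skippedTotal); // → 9
```

Recording the skipped count per sample (rather than adjusting at query time) is what would allow the deskewing to happen incrementally.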
Hello @jasnell, thanks for the reply! I took a look at the event loop utilization PR, but I fail to see how this metric can be used as a replacement for an event loop delay metric. Maybe I'm missing something, but it seems more like a replacement for CPU usage metrics, useful for monitoring when a node process is close to full capacity. I'm not sure I understand how it could be used to signal high synchronous workloads (even approximately). By correlating high CPU + low ELU periods?
It does not replace the event loop delay. It augments it, and gives what is likely a far more valuable metric when combined with the existing delay measurement. ELD (delay) tells you how quickly or slowly the event loop is turning, while ELU will tell you how much work it's actually doing during those delays.
I'm thinking that for the delay monitor, we should add a new point-in-time ratio metric: actual samples over expected samples.
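Such a "sample coverage" ratio might look like this (a sketch only; no such API exists today, and the sample counts below are the ones from this issue's example):

```javascript
// Hypothetical "sample coverage" ratio: actual samples recorded vs.
// samples expected at the configured resolution over the monitored window.
const resolutionMs = 50;
const monitoredMs = 60_000;                         // 60 s window
const actualSamples = 601;                          // from the issue's numbers
const expectedSamples = monitoredMs / resolutionMs; // 1200

const coverage = actualSamples / expectedSamples;
console.log(coverage.toFixed(2)); // → "0.50"
```

A ratio well below 1 would indicate that the histogram's percentiles are missing a large share of the window, which is exactly the skew described in this issue.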
Ah, I see what you mean now. Indeed, both metrics would give a clearer picture of what's going on. I kept on thinking about your previous example:
On one hand, it makes sense and I agree: the event loop can be blocked/delayed by something more than just synchronous work on a single async request. A large spike of async requests happening simultaneously would have a similar effect.

On the other hand, I still feel like the event loop delay is a good metric by itself to signal whether node is "internally overworking itself". In the case where I'm using nodejs as a server, if I'm logging all incoming requests somewhere, I can quickly correlate whether high ELD is a result of a high incoming workload, or whether it happened while incoming traffic hasn't changed. I guess one can't assume that it's necessarily sync work going on (a request could have triggered an abnormal amount of async work), but it is still something worth looking at given that the end result is similar: response times for subsequent requests are increased regardless of their own workload.

So I'm still trying to figure out a way to make abnormal delays visible using this module. It seems that to avoid this issue, either:
🤔
@jasnell As libuv/libuv#2725 was merged some time ago, does node provide any API to measure event loop utilization?
Yes, there is one for the current JS thread and one to get data from a worker.
I'm not sure if this is more of a feature request or a bug report.
It's a feature request in the sense that I'm requesting more info in the data returned by node's event loop monitor, but it's a bug report in the sense that the current stats are biased towards small measurements.
In its current state, `monitorEventLoopDelay` is not very practical for reporting on sporadic event loop delays due to excessive synchronous work. This is because the stats are skewed towards measurements close to the `resolution` sampling rate: no samples are taken while big delays occur, and these missing samples are not taken into account when computing the stats.

What steps will reproduce the bug?
Take this example where we do some sync work for 30s, report the event loop delay stats every 60s and use a resolution of 50ms
If you run this, the first log stats would be something like:
Besides the `max` value, they fail to tell the story of how the event loop was blocked for 50% of the time between the start of the script and the first log 60s later.

What is the expected behavior?
The main point of monitoring nodejs's event loop delay is to see if async work is being delayed because of excessive synchronous work.
If this were running on a server, I would have liked these stats to give me some information on what percentage of requests would have been delayed because of heavy sync work over one minute.
In this scenario, if I had a server handling 1 request/second, 50% of requests over 1 minute would have been delayed because of sync work. However, the event loop monitor stats paint a picture that everything is "mostly fine" except for 1 sample.
What do you see instead?
It seems to me that the percentiles are based on the number of samples but fail to take into account the varying number of samples over fixed periods of time.
IIUC, there is a timer running at the libuv level which is set to trigger every `resolution` ms. However, if this timer gets delayed by `resolution * N` ms, then we will have around `N` fewer samples over the same period of time than if there hadn't been any sync work. This means that the bigger the delay, the fewer samples we take of big delays. This biases the stats towards small measurements.

In numbers:
30s is 50% of 60s, so I would have expected the percentiles to show something that could have been interpreted as "the event loop was blocked around 50% of the time in the last sampled period".
However:
- During 30s we only took 1 sample.
- During the other 30s, without any delay and with a resolution of 50ms, we took ~600 samples (which are usually around `0 - 5 ms`).
- `1 / 601 = 0.0017`

So from this point of view, yeah, the 1 sample is less than 1% of all samples, but that doesn't seem really helpful.
I understand that from a technical standpoint, if the event loop is blocked, you can't take any more samples. However, as a thought experiment, imagine that even if it is blocked you could still start a timer every `resolution` ms. Their reported delay values would be something like […]. If we had these samples plus 600 samples of 1.8ms delays, the stats would look much different:
I came up with these numbers by doing this in Python:
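The original script isn't shown above; this is a minimal reconstruction of the computation, assuming the hypothetical blocked-period samples are one every 50 ms, each reporting how long the loop had been blocked (50 ms, 100 ms, ..., 30 000 ms):

```python
import statistics

# ~600 "normal" samples of 1.8 ms each (typical idle-loop delay in the issue)
normal = [1.8] * 600

# Hypothetical samples during the 30 s block: one every 50 ms, each
# reporting its accumulated delay: 50 ms, 100 ms, ..., 30 000 ms.
blocked = [50.0 * i for i in range(1, 601)]

data = sorted(normal + blocked)

# quantiles(n=100) returns 99 cut points; pct[49] is the median, pct[98] p99.
pct = statistics.quantiles(data, n=100)
print(f"p50={pct[49]:.1f} ms  p90={pct[89]:.1f} ms  p99={pct[98]:.1f} ms")
```

With the missing samples included, the median jumps from ~1.8 ms to tens of milliseconds, and the upper percentiles land in the tens of seconds, which matches the "blocked ~50% of the time" intuition far better.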
Additional information
Given that it's not actually feasible to add extra samples during periods where the event loop is blocked, I think that from the user perspective, we're missing the total number of samples taken. With it, it would be possible to approximately "un-skew" these values based on the number of samples, the resolution value, and the total time during which the event loop monitor has been running. Although that would only give very rough approximations 🤔
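A rough client-side sketch of that un-skewing idea (assumption-laden, not an existing Node API; `unskewedPercentile` is a hypothetical helper, and treating every missing sample as at least one resolution's worth of delay is a crude lower bound):

```javascript
// Estimate missing samples from elapsed time and resolution, pad the
// observed samples with them, then recompute a percentile over the result.
function unskewedPercentile(samplesMs, resolutionMs, monitoredMs, p) {
  const expected = Math.floor(monitoredMs / resolutionMs);
  const missing = Math.max(0, expected - samplesMs.length);
  // Crude assumption: each missing sample saw >= one resolution of delay.
  const padded = samplesMs
    .concat(Array(missing).fill(resolutionMs))
    .sort((a, b) => a - b);
  const idx = Math.min(padded.length - 1, Math.floor((p / 100) * padded.length));
  return padded[idx];
}

// The issue's numbers: ~600 small samples plus one huge 30 s sample.
const observed = Array(601).fill(2);
observed[600] = 30_000;

console.log(unskewedPercentile(observed, 50, 60_000, 75)); // → 50
```

Even this crude padding moves the p75 from "a couple of ms" to the resolution floor, making the blocked period visible in the percentiles rather than only in `max`.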