Conversation
This makes a few changes to fix issues with the APM agent under heavy load and with a slow or non-responsive APM server.

1. A new `maxQueueSize` config option is added (default 1024 for now) to control how many events (transactions, spans, errors, metricsets) will be queued before being dropped if events are incoming faster than they can be sent to the APM server.

   Dev Note: This is a FIFO, so we will tend to drop new events and keep older ones, whose relevance fades. That isn't ideal. The current architecture of using node.js streams to buffer doesn't make it easy to drop older events. Dropping newer events under overload is deemed good enough for now.

2. JSON encoding of events (when uncorking) is done in limited-size batches to control the amount of single-chunk CPU eventloop blocking time. (See MAX_WRITE_BATCH_SIZE in Client._writev.) Internal stats are collected to watch for long(est) batch processing times.

   Dev Note: Eventually an investigation into putting JSON encoding into a separate worker thread might be interesting.

3. The handling of individual requests to the APM Server intake API has been rewritten to handle some error cases -- especially from a non-responsive APM server -- and to ensure that only one intake request is being performed at a time.

4. Support for backoff on intake API requests has been implemented per https://github.com/elastic/apm/blob/master/specs/agents/transport.md#transport-errors

Fixes: #136

* * *

This commit is a first carryover from https://github.com/trentm/apm-nodejs-http-client/tree/play -- there is still work to be done before this is complete and reviewable.
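For illustration, here is a minimal sketch of passing the new `maxQueueSize` option when constructing the client. Only `maxQueueSize` is part of this change; the other constructor options and values shown are assumptions based on typical usage of this client, not something introduced here.

```js
// Hedged example: only `maxQueueSize` is new in this change; the other
// options/values are assumed, typical client constructor options.
const Client = require('elastic-apm-http-client')

const client = new Client({
  serviceName: 'my-service',
  agentName: 'my-agent',
  agentVersion: '1.0.0',
  userAgent: 'my-agent/1.0.0',
  serverUrl: 'http://localhost:8200',
  maxQueueSize: 1024 // max events buffered before newer events are dropped
})
```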
💔 Build Failed
This now does *not* unzip the payload. Is that a compat issue?
…0s: unref it *and* clean it up for that particular case in the conclusion handling
Also, allow any 2xx from the APM server's intake API to be a "success". Really it should be only 202, but the previous behaviour was any 2xx, which was relied upon by some tests and some mock APM servers.
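A hedged sketch of that success check, for illustration only (not necessarily the client's exact code):

```js
// Treat any 2xx response from the intake API as success, even though the
// intake API itself responds with 202 on success.
function isIntakeSuccess (statusCode) {
  return statusCode >= 200 && statusCode < 300
}
```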
…tly the same for the case 0)
… emitting the close event in one case
… events/shutdown/etc.
… docs on intake req conclusion cases
…t cases for these timeouts
…s in corrupted process.title that breaks tests
Some perf-ish results from testing example-blocking-behavior.js with and without APM and talking to an APM server with different behaviours. Typically (unless otherwise noted) we are hitting it with load via:
which is: make requests as fast as possible for the duration of the run. I also observed the memory usage ("RES") using htop while running each load test. tl;dr: Some observations from the data below:
baseline, no APM: htop observed RES ~80MB, CPU ~104%
with APM, local mockapmserver.js (accidentally with payloadLogFile)
with APM, local mockapmserver.js that is accepting, reading, but not responding (accidentally with payloadLogFile): htop RES around 130-140MB
with APM agent, but no APM server (accidentally with payloadLogFile): htop RES around 130-140MB
We see backoff delays, which is good: 4s, 9s, 16s, 25s
with APM agent, against APM server v7.12.0 in Google Cloud (us-west1): htop RES ~150MB
comparing to client without these fixes
htop: RES up to 1600MB with APM talking to cloud APM server, with default autocannon load
add some notes for reviewers.
@@ -12,58 +12,6 @@ const assertMetadata = utils.assertMetadata
const assertEvent = utils.assertEvent
const validOpts = utils.validOpts

test('Event: close - if ndjson stream ends', function (t) {
note to reviewer: These two tests were identical to the following two. I'm guessing they date from before stream-chopper was in use, but I'm not sure.
this._chopper.destroy()
this._agent.destroy()
cb(err)
}

function onStream (client, onerror) {
note to reviewer: A total re-write of onStream -- now called getChoppedStreamHandler and makeIntakeRequest -- is below (after the _getBackoffDelay() function). There is a large block comment that should mostly explain it.
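The `_getBackoffDelay()` implementation itself isn't quoted in this conversation; the following is a rough sketch assuming it follows the transport spec's grace-period algorithm (min(reconnectCount, 6)² seconds, with ±10% jitter). The name and signature here are illustrative, not the client's actual API.

```js
// Hedged sketch of the backoff calculation per the agents transport spec.
// reconnectCount is the number of consecutive failed intake requests, starting at 0.
function getBackoffDelayMs (reconnectCount) {
  const seconds = Math.pow(Math.min(reconnectCount, 6), 2)
  const jitter = seconds * 0.1 * (2 * Math.random() - 1) // between -10% and +10%
  return (seconds + jitter) * 1000
}

// Successive failures give roughly 0s, 1s, 4s, 9s, 16s, 25s, then cap near
// 36s -- consistent with the 4s/9s/16s/25s delays observed in the perf notes above.
```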
First review pass done -- nits and questions below.
I'm going to take a second pass after I've given the client a try with a few toy applications, to make sure I understand the flow of things and to look for any unexpected/surprising changes in behavior we might need to address/document.
It's worth saying out loud that overall the new handler returned by getChoppedStreamHandler is more locally complex than the previous onStream handler. However, all we've really done here is surface the complexity that was hidden away under the pump library and given ourselves the necessary control to address the performance issues and edge cases. It's complex code, but it's also no more complex than it needs to be. 👍
const intakeResTimeoutOnEnd = client._conf.intakeResTimeoutOnEnd

// `_active` is used to coordinate the callback to `client.flush(db)`.
client._active = true
Proforma question -- are we sure there will only be one intake request active at a time? (If not, is the use of client._active introducing the possibility of a race-ish condition if there's any overlap between intake requests?)
Yup, the code is designed to only have one intake request active at a time. Otherwise there is a bug. The completePart() function is written to handle all the ways the 3 streams can conclude (documented in its top comment), and only calls StreamChopper's next() when those 3 streams are all cleanly finished or destroyed.
Note that this client._active handling was in the earlier onStream code. However, because the earlier code didn't wait on the completion of the HTTP response from the APM server, it was possible to have multiple intake requests going at the same time.
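To make the single-active-request invariant concrete, here is a hedged sketch of the "wait for all parts to conclude" pattern described above. The names and structure are illustrative (the real completePart() also handles the documented error cases), and the three parts are assumed here to be the gzip stream, the intake HTTP request, and the intake HTTP response.

```js
// Illustrative sketch: the intake request only concludes (and the next one
// is allowed to start) once every part has cleanly finished or been destroyed.
function makeCompletionTracker (client, next) {
  const parts = { gzipStream: false, intakeReq: false, intakeRes: false }

  return function completePart (partName) {
    parts[partName] = true
    if (Object.values(parts).every(Boolean)) {
      client._active = false
      next() // let StreamChopper start the next intake request
    }
  }
}
```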
Testing notes:
One last question before stamping this -- do you have any thoughts on whether we want to log a periodic message if/when the client ends up dropping spans?
I suspect there are going to be some span-heavy applications/services that, with the previous version of the agent, didn't run into significant performance issues but that will end up hitting the span limit and dropping spans. Do you think we need logging in the agent for this, or will the changelog(s) and documentation for maxQueueSize be enough?
👍 thanks for talking those through. GHA tests are passing, and the jenkins failures (per previous comments) appear to be related to the code coverage tooling. Approving.