Conversation
This makes a few changes to fix issues with the APM agent under heavy load and with a slow or non-responsive APM server.

1. A new `maxQueueSize` config option is added (default 1024 for now) to control how many events (transactions, spans, errors, metricsets) will be queued before being dropped if events are incoming faster than they can be sent to the APM server.

   Dev Note: This is a FIFO, so we will tend to drop new events and keep older ones, whose relevance fades. That isn't ideal. The current architecture of using node.js streams to buffer doesn't make it easy to drop older events. Dropping newer events under overload is deemed good enough for now.

2. JSON encoding of events (when uncorking) is done in limited-size batches to control the amount of single-chunk CPU eventloop blocking time. (See MAX_WRITE_BATCH_SIZE in Client._writev.) Internal stats are collected to watch for long(est) batch processing times.

   Dev Note: Eventually an investigation into putting JSON encoding into a separate worker thread might be interesting.

3. The handling of individual requests to the APM Server intake API has been rewritten to handle some error cases -- especially from a non-responsive APM server -- and to ensure that only one intake request is being performed at a time.

4. Support for backoff on intake API requests has been implemented per https://github.com/elastic/apm/blob/master/specs/agents/transport.md#transport-errors

Fixes: #136

* * *

This commit is a first carryover from https://github.com/trentm/apm-nodejs-http-client/tree/play -- there is still work to be done before this is complete and reviewable.
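For illustration, here is a minimal sketch of passing the new `maxQueueSize` option when constructing the client. Only `maxQueueSize` is part of this change; the other constructor options and values shown are assumptions based on typical usage of this client, not something introduced here.

```js
// Hedged example: only `maxQueueSize` is new in this change; the other
// options/values are assumed, typical client constructor options.
const Client = require('elastic-apm-http-client')

const client = new Client({
  serviceName: 'my-service',
  agentName: 'my-agent',
  agentVersion: '1.0.0',
  userAgent: 'my-agent/1.0.0',
  serverUrl: 'http://localhost:8200',
  maxQueueSize: 1024 // max events buffered before newer events are dropped
})
```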
💔 Build Failed
This now does *not* unzip the payload. Is that a compat issue?
…0s: unref it *and* clean it up for that particular case in the conclusion handling
Also, allow any 2xx from the APM server's intake API to be a "success". Really it should be only 202, but the previous behaviour was any 2xx, which was relied upon by some tests and some mock APM servers.
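A hedged sketch of that success check, for illustration only (not necessarily the client's exact code):

```js
// Treat any 2xx response from the intake API as success, even though the
// intake API itself responds with 202 on success.
function isIntakeSuccess (statusCode) {
  return statusCode >= 200 && statusCode < 300
}
```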
…tly the same for the case 0)
… emitting the close event in one case
… events/shutdown/etc.
… docs on intake req conclusion cases
…t cases for these timeouts
…s in corrupted process.title that breaks tests
Some perf-ish results from testing example-blocking-behavior.js with and without APM and talking to an APM server with different behaviours. Typically (unless otherwise noted) we are hitting it with load via:
which is: make requests as fast as possible for the duration of the run. I also observed the memory usage ("RES") using htop while running each load test. tl;dr: Some observations from the data below:
baseline, no APM: htop observed RES ~80MB, CPU ~104%
with APM, local mockapmserver.js (accidentally with payloadLogFile)
with APM, local mockapmserver.js that is accepting, reading, but not responding (accidentally with payloadLogFile): htop RES around 130-140MB
with APM agent, but no APM server (accidentally with payloadLogFile): htop RES around 130-140MB
We see backoff delays, which is good: 4s, 9s, 16s, 25s
with APM agent, against APM server v7.12.0 in Google Cloud (us-west1): htop RES ~150MB
comparing to client without these fixes
htop: RES up to 1600MB with APM talking to cloud APM server, with default autocannon load
add some notes for reviewers.
@@ -12,58 +12,6 @@ const assertMetadata = utils.assertMetadata
const assertEvent = utils.assertEvent
const validOpts = utils.validOpts

test('Event: close - if ndjson stream ends', function (t) {
note to reviewer: These two tests were identical to the following two. I'm guessing they date from before stream-chopper was in use, but I'm not sure.
this._chopper.destroy()
this._agent.destroy()
cb(err)
}

function onStream (client, onerror) {
note to reviewer: A total re-write of onStream -- now called getChoppedStreamHandler and makeIntakeRequest -- is below (after the _getBackoffDelay() function). There is a large block comment that should mostly explain it.
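The `_getBackoffDelay()` implementation itself isn't quoted in this conversation; the following is a rough sketch assuming it follows the transport spec's grace-period algorithm (min(reconnectCount, 6)² seconds, with ±10% jitter). The name and signature here are illustrative, not the client's actual API.

```js
// Hedged sketch of the backoff calculation per the agents transport spec.
// reconnectCount is the number of consecutive failed intake requests, starting at 0.
function getBackoffDelayMs (reconnectCount) {
  const seconds = Math.pow(Math.min(reconnectCount, 6), 2)
  const jitter = seconds * 0.1 * (2 * Math.random() - 1) // between -10% and +10%
  return (seconds + jitter) * 1000
}

// Successive failures give roughly 0s, 1s, 4s, 9s, 16s, 25s, then cap near
// 36s -- consistent with the 4s/9s/16s/25s delays observed in the perf notes above.
```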
First review pass done -- nits and questions below.
I'm going to take a second pass after I've given the client a try with a few toy applications, to make sure I understand the flow of things and to look for any unexpected/surprising changes in behavior we might need to address/document.
It's worth saying out loud that overall the new handler returned by getChoppedStreamHandler is more locally complex than the previous onStream handler. However, all we've really done here is surface the complexity that was hidden away under the pump library and given ourselves the necessary control to address the performance issues and edge cases. It's complex code, but it's also no more complex than it needs to be. 👍
const intakeResTimeoutOnEnd = client._conf.intakeResTimeoutOnEnd

// `_active` is used to coordinate the callback to `client.flush(db)`.
client._active = true
Proforma question -- are we sure there will only be one intake request active at a time? (If not, is the use of client._active introducing the possibility of a race-ish condition if there's any overlap between intake requests?)
Yup, the code is designed to only have one intake request active at a time. Otherwise there is a bug. The completePart() function is written to handle all the ways the 3 streams can conclude (documented in its top comment), and only calls StreamChopper's next() when those 3 streams are all cleanly finished or destroyed.
Note that this client._active handling was in the earlier onStream code. However, because the earlier code didn't wait on the completion of the HTTP response from the APM server, it was possible to have multiple intake requests going at the same time.
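To make the single-active-request invariant concrete, here is a hedged sketch of the "wait for all parts to conclude" pattern described above. The names and structure are illustrative (the real completePart() also handles the documented error cases), and the three parts are assumed here to be the gzip stream, the intake HTTP request, and the intake HTTP response.

```js
// Illustrative sketch: the intake request only concludes (and the next one
// is allowed to start) once every part has cleanly finished or been destroyed.
function makeCompletionTracker (client, next) {
  const parts = { gzipStream: false, intakeReq: false, intakeRes: false }

  return function completePart (partName) {
    parts[partName] = true
    if (Object.values(parts).every(Boolean)) {
      client._active = false
      next() // let StreamChopper start the next intake request
    }
  }
}
```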
Testing notes:
One last question before stamping this -- do you have any thoughts on whether we want to log a periodic message if/when the client ends up dropping spans?
I suspect there are going to be some span-heavy applications/services that, with the previous version of the agent, didn't run into significant performance issues but that will end up hitting the span limit and dropping spans. Do you think we need logging in the agent for this, or will the changelog(s) and documentation for maxQueueSize be enough?
👍 thanks for talking those through. GHA tests are passing, and the jenkins failures (per previous comments) appear to be related to the code coverage tooling. Approving.