
tracking bug for agent performance impact on Kibana #2028

Closed
trentm opened this issue Apr 1, 2021 · 5 comments
Assignees: trentm
Labels: agent-nodejs, kibana, performance
Milestone: 7.13

trentm commented Apr 1, 2021

This is a tracking bug for a timeboxed effort to look at improving the Node.js APM agent's perf impact in Kibana.
See elastic/kibana#78792 for most of the details. This ticket is for tracking the agent-side work.

trentm added this to the 7.13 milestone Apr 1, 2021
trentm self-assigned this Apr 1, 2021
github-actions bot added the agent-nodejs label Apr 1, 2021
trentm commented Apr 23, 2021

Yesterday my thoughts on a plan for this issue were:

  1. Investigate each of the top items over ~2% top-of-stack from flamegraphs
    of the loaded demo "pathosapp.js"
    (truncation/encoding, tracecontext, Transaction._captureBreakdown)
  2. Implement the obvious "captureSpanStackTraces=false by default", and
    a new setting to only capture span stack traces for slow spans (a
    hedged config sketch follows this list).
  3. Profile and possibly improve stack trace capturing (used for
    captured errors and ^^^ slow spans).
  4. Run Kibana DemoJourney with APM agent stripped to barebones core
    async-hook that is the foundation of tracing async context as a
    guide for "how low we can go" on overhead.
  5. Learn (and possibly integrate) agent ability to get profiles of
    part of an app run using pprof-nodejs (building on Dario's PoC
    https://github.com/dgieselaar/apm-agent-nodejs/tree/pprof-poc/).
    This could be used for debugging customer perf issues (get a
    profile from prod to analyze offline) and is the likely basis
    for a profiling feature in the Node.js APM agent.
  6. Add one or more benchmarks to the agent's current benchmark system
    that covers encoding of spans/transactions/errors for ongoing
    tracking.
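
As a rough illustration of item 2, here is a hedged sketch of what the configuration could look like. The agent's start() call and the captureSpanStackTraces option are existing API; the duration-threshold idea is shown only as a comment because the eventual option name and semantics were still open at this point.

// Illustrative sketch only: the duration-threshold option named in the comment below
// is an assumption, not a setting that existed at the time of this plan.
require('elastic-apm-node').start({
  serviceName: 'my-service', // example name, not from this issue
  // Proposed new default: do not capture a stack trace for every span...
  captureSpanStackTraces: false
  // ...or, alternatively, only capture stack traces for spans slower than some
  // threshold (e.g. a hypothetical option like spanStackTraceMinDuration: '50ms').
})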

Timebox to another 2 weeks max, then time to move on.

Future work:

trentm commented Apr 23, 2021

1. Investigate each of the top items over ~2% top-of-stack from flamegraphs
   of the loaded demo "pathosapp.js"
   (truncation/encoding, tracecontext, Transaction._captureBreakdown)

That flamegraph is here: https://gist.github.com/trentm/c70154cc6674a000f6b4c0827e3f0183
Screenshot: [flamegraph screenshot omitted; see the gist linked above]
Top "top-of-stack" items there are:

  • breadthFilter (top of stack: 4.8% of samples in this profile) and Client._encode (top of stack: 3%) in apm-nodejs-http-client. This is the so-called "truncation" work (a rough illustration of the idea follows this list). My thoughts/comments on this are in this comment below.
  • "lib/tracecontext/..." (top of stack: 2.7%) spread over a number of stacks. Alan looks at this in Perf Work: Investigate TraceContext/TraceState #2056
  • Transaction._captureBreakdown (top of stack: 2.5%)
  • ... next is 1.3% top-of-stack or lower.
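
For context on the "truncation" work, here is a simplified, illustrative sketch of a breadth-first walk that truncates long string values before encoding. This is not the actual breadthFilter implementation in apm-nodejs-http-client, just a hedged approximation of the idea:

// Illustrative sketch only -- not the real breadthFilter from apm-nodejs-http-client.
// Walk a (non-cyclic) object breadth-first and truncate strings longer than `max` chars.
function truncateStrings (obj, max = 1024) {
  const out = Array.isArray(obj) ? [] : {}
  const queue = [[obj, out]]
  while (queue.length > 0) {
    const [src, dst] = queue.shift()
    for (const key of Object.keys(src)) {
      const val = src[key]
      if (typeof val === 'string') {
        dst[key] = val.length > max ? val.slice(0, max) : val
      } else if (val !== null && typeof val === 'object') {
        dst[key] = Array.isArray(val) ? [] : {}
        queue.push([val, dst[key]])
      } else {
        dst[key] = val
      }
    }
  }
  return out
}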

trentm commented Apr 23, 2021

a look at Client._encode et al perf

Here I show data from a number of DemoJourney kibana-load-testing runs against a dev mode Kibana v7.12.0 on my laptop (2019 MBP, 2.4GHz Core i9, 64G RAM) with tweaks to the APM agent.

Base APM config is:

{
  active: true,
  serverUrl: 'http://127.0.0.1:8200',
  centralConfig: false,
  breakdownMetrics: false,
  transactionSampleRate: 0.1,
  metricsInterval: '120s',
  captureHeaders: false,
  captureBody: 'off',
  logUncaughtExceptions: true,
  captureSpanStackTraces: false
}
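
For reference, in a standalone Node.js app (Kibana wires the agent up through its own plumbing, so this is only an illustration) a config object like the above would typically be passed to the agent's start() call before anything else is loaded:

// Illustration: starting the agent with the config above in a plain Node.js app.
// This must run before other modules are require'd so instrumentation can hook them.
require('elastic-apm-node').start({
  active: true,
  serverUrl: 'http://127.0.0.1:8200',
  centralConfig: false,
  breakdownMetrics: false,
  transactionSampleRate: 0.1,
  metricsInterval: '120s',
  captureHeaders: false,
  captureBody: 'off',
  logUncaughtExceptions: true,
  captureSpanStackTraces: false
})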

The data is the "Global Information" from the DemoJourney report for each run. The values are in milliseconds.
Here is the data in a spreadsheet.

                                min     p50     p75     p95     p99       max    mean   stddev
baseline (no APM)
  demojourney-20210422210946639  11      82     195     578      877     1720     152     188
  demojourney-20210422211752793  11      84     213     602      924     1624     159     195
  demojourney-20210422212632139  11      85     248     609      970     1453     168     209
  demojourney-20210422214929820  11      89     216     593      967     1423     163     202

with APM (yarn add link:../apm-agent-nodejs; i.e. v3.14.0 of the agent)
  demojourney-20210422191919074  12      90     214     696     1231     2072     176     241
  demojourney-20210422192943594  12     104     244     666      975     1705     179     221
  demojourney-20210422194110834  12      96     230     618      955     1536     168     206

with APM (back to agent v7.10.0, as a sanity check)
  demojourney-20210422220144348  12      84     208     592      911     1497     158     196
  demojourney-20210422221044599  12      83     200     586      961     1466     157     198

with APM (yarn add elastic-apm-node@3.14.0)
  demojourney-20210422222345996  12      91     216     551     1003     1559     161     198
  demojourney-20210422223249288  12      87     219     612      962     1575     163     205
  demojourney-20210422225030336  12      82     203     590      980     1693     158     208
  demojourney-20210422225827807  12      86     197     685      960     1292     165     211
  demojourney-20210422230632355  12      85     222     616      991     1455     163     208

baseline (no APM, *again* as sanity check)
  demojourney-20210422232018753  11      74     182     522      871     1237     142     177
  demojourney-20210422232815319  11      85     211     602      908     1308     160     195
  demojourney-20210422233613017  11      71     187     543     1000     1337     144     196

with APM (noop redact, noop Client._encode)
  demojourney-20210422202202398  12      97     238     655      969     1579     176     213
  demojourney-20210422203146572  12      96     230     594      933     1400     166     199
  demojourney-20210422210103812  12     112     262     673     1012     2019     193     239
... the same runs again later in the day (trying harder to keep the machine otherwise idle):
  demojourney-20210422235152556  12      83     191     535      933     1444     150     190
  demojourney-20210422235950181  12      96     235     597      928     1189     166     196
  demojourney-20210423000747805  12      84     198     562      940     1297     150     186

The "noop redact" and "noop Client._encode" are patches against the Node.js APM agent to mostly eliminate that processing.

Earlier runs (e.g. those with "demojourney-2021042220" and "demojourney-2021042221" prefix) were a bit noisier as I wasn't completely hands-off my laptop during those runs -- or at least that's the hypothesis for why those showed slightly higher p50 values.

Conclusions

Looking (mostly) at the p50, my conclusion is that there is too much noise in these results to clearly separate out the impact of Client._encode and of the "redact" processing in the agent.

  • The range of p50 values (the spreadsheet shows the calculated stddev) "with APM" overlaps with the range with "no APM".
  • The percentage increase in p50 "with APM" (over the baseline) is smaller than the increase with the Client._encode work noop'd out, which obviously doesn't make sense.

With many runs of the DemoJourney we can eke out a general trend of "yes, there is some APM agent impact", but the results are not precise enough to measure small (single-digit percentage point) improvements in agent perf.

I think we should move on to steps 2 and later in the "plan" from the earlier comment. Small perf improvements in the agent should still be looked at, but not in this block of work for Kibana+APM. Subsequent DemoJourney comparisons should be done against cloud deployments (as described at https://github.com/elastic/kibana-load-testing#readme) to be somewhat closer to the real thing.

@trentm
Copy link
Member Author

trentm commented May 5, 2021

Some updates on my perf look at async_hooks usage in the agent:

  • If an app is very async-op heavy -- i.e. the number of promises, nextTicks, setImmediates, and setTimeouts per transaction is large -- then the impact of node's async_hooks can be significant. (Using `node pathosapp.js deep_callstack` can simulate an extreme version of this: with a 1000-deep nextTick stack per request and no other processing, loaded to 1000 req/s, CPU stays below 40% without the agent; with the agent, the CPU is pinned and it can only sustain ~600 req/s. A rough standalone illustration of this kind of overhead is sketched after this list.)
  • Our agent's async-hook.js does more bookkeeping than it should, but I doubt this has a significant impact on Kibana. Profiling actual Kibana is the best next way to prove that. I've looked into "Investigation about async-hooks" #1697 somewhat, but wasn't able to reproduce the extreme perf issue hinted at by the two user examples given there.
  • Using asyncHooks=false does significantly reduce overhead for the extreme simulated cases. However, I'm not clear if we want to recommend that config for Kibana -- mainly because I'm not sure what we are trading off by using the "patch-async.js" context manager. TODO: some more DemoJourney runs (I've started some) should show whether the p50 can be improved with asyncHooks=false.
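
As a rough standalone illustration of the first point (this is not pathosapp.js and not the agent's async-hook.js; the shape and numbers are assumptions), the following sketch registers a minimal async_hooks hook and drives a deep nextTick chain; comparing wall time with and without the hook gives a feel for the per-async-op cost:

// Illustrative micro-benchmark sketch. Even no-op init/before/after/destroy callbacks
// force Node to emit async_hooks events for every nextTick, which has a cost.
const asyncHooks = require('async_hooks')

const hook = asyncHooks.createHook({
  init () {}, before () {}, after () {}, destroy () {}
})

// Build a `depth`-deep chain of nextTicks, then call `done`.
function deepNextTick (depth, done) {
  if (depth === 0) return done()
  process.nextTick(deepNextTick, depth - 1, done)
}

// Run `requests` sequential "requests", each a 1000-deep nextTick chain, and time them.
function run (label, requests, depth, cb) {
  const start = process.hrtime.bigint()
  let remaining = requests
  function one () {
    deepNextTick(depth, () => {
      if (--remaining === 0) {
        const ms = Number(process.hrtime.bigint() - start) / 1e6
        console.log(`${label}: ${ms.toFixed(1)} ms`)
        cb()
      } else {
        one()
      }
    })
  }
  one()
}

run('without hook', 1000, 1000, () => {
  hook.enable()
  run('with no-op hook', 1000, 1000, () => hook.disable())
})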

trentm added a commit that referenced this issue May 31, 2021
Refactor agent.captureError(...) and stacktrace collection for better
performance. This moves the stackman dep into lib/stacktraces.js and
lib/errors.js to allow breaking its compatibility for perf improvements.

Refs: #2028
trentm added a commit that referenced this issue Jun 9, 2021
Refactor agent.captureError(...) and stacktrace collection for better
performance. This moves the stackman dep into lib/stacktraces.js and
lib/errors.js to allow breaking its compatibility for perf improvements.

Perf issues:
- `{error,span}.stacktrace` logic was largely in the "stackman" module.
  It had a few perf issues:
    1. It monkey-patched (or wrapped?) v8 native CallSite objects with
       a bunch of properties for convenience/sugar. Using these in
       stacktrace generation was very slow (as identified by flame graphs).
    2. It had a bug in its caching of sourcemap file loading for each
       stackframe: when a file had no separate sourcemap file (the common
       case), this wasn't cached. That means we would be stat'ing for
       a sourcemap file for almost every frame.
    3. A stack frame's `filename` is a relative path to cwd. The code ended
       up calling `process.cwd()` for *every* frame, which is expensive.
  Dealing with the first issue can't be done without either completely breaking
  "stackman" backwards compat, or just no longer using "stackman" and vendoring
  some of its code. I've done the latter.
- This adds caching to the conversion of a CallSite object to an APM error
  stacktrace frame (frameFromCallSite), which had a significant impact on
  CPU usage in the perf test mentioned above (it was about 6-7% of the diff).

Other issues:
- Fix sourcemap "sourcesContent" usage.
  Test case: test/stacktraces/stacktraces.test.js

Refs: #2028
Fixes: #2095
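
To make the caching point in that commit message concrete, here is a hedged sketch of the idea (not the agent's actual lib/stacktraces.js code): memoize the converted frame by a key derived from the CallSite's file/line/column, and compute the cwd once instead of per frame.

// Illustrative sketch only -- not the agent's frameFromCallSite implementation.
// Cache converted frames so repeated CallSites (same file/line/column) are cheap.
const frameCache = new Map()
const cwd = process.cwd() // compute once, not for every frame

function frameFromCallSiteSketch (callSite) {
  const fileName = callSite.getFileName() || '<anonymous>'
  const lineNumber = callSite.getLineNumber()
  const columnNumber = callSite.getColumnNumber()
  const key = `${fileName}:${lineNumber}:${columnNumber}`

  let frame = frameCache.get(key)
  if (!frame) {
    frame = {
      filename: fileName.startsWith(cwd)
        ? fileName.slice(cwd.length + 1) // cwd-relative path without calling process.cwd() per frame
        : fileName,
      lineno: lineNumber,
      colno: columnNumber,
      function: callSite.getFunctionName() || '<anonymous>'
    }
    frameCache.set(key, frame)
  }
  return frame
}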
trentm commented Jun 14, 2021

This issue was intended to track a timeboxed perf effort on the Node.js APM agent -- mainly motivated by usage of the Node.js agent by the Kibana team. I'm closing this issue with a summary of what improvements we've made lately and some thoughts for future perf work.

Improvements:

A before vs. after run of a simulated workload (80% transaction, 17% transaction + some spans, 3% errors) of an Express.js app loaded to 500 req/s shows a significant reduction in CPU usage (a rough sketch of this kind of simulated workload follows the latency table below):

[CPU usage before/after chart omitted]

and latency percentiles:

Percentile    Before      After
10%           0.0064s     0.0053s
25%           0.0124s     0.0103s
50%           0.0217s     0.0201s
75%           0.0305s     0.0287s
90%           0.0363s     0.0341s
95%           0.0405s     0.0373s
99%           0.0846s     0.0479s
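
For context, a simulated workload like the one described above could look roughly like the following. This is a hypothetical app and route mix, not the actual benchmark harness used for these numbers, and it assumes elastic-apm-node and express are installed:

// Hypothetical workload app approximating "80% plain transaction,
// 17% transaction + spans, 3% errors".
const apm = require('elastic-apm-node').start({ serviceName: 'perf-workload' })
const express = require('express')

const app = express()

app.get('/plain', (req, res) => {
  // ~80% of traffic: a transaction with no custom spans.
  res.json({ ok: true })
})

app.get('/with-spans', (req, res) => {
  // ~17% of traffic: a transaction with a custom span.
  const span = apm.startSpan('some-work')
  setTimeout(() => {
    if (span) span.end()
    res.json({ ok: true })
  }, 5)
})

app.get('/error', (req, res) => {
  // ~3% of traffic: capture an error on this transaction.
  apm.captureError(new Error('simulated failure'))
  res.status(500).json({ ok: false })
})

app.listen(3000)
// A load generator (e.g. autocannon or ab) would then drive ~500 req/s
// against these routes in roughly an 80/17/3 ratio.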

Future work

  • We'll be adding some scripts/tooling to the repo to help with subsequent performance analysis.
  • Flame graphs for simulated load runs (provided by this tooling) show more top stacks that should be considered, in particular the breakdown metrics calculation.
  • The performance label marks a number of issues for future work.
  • Work on continuous profiling as an APM feature and for use in one-off in-production performance analysis should continue.
  • The Node.js APM agent should grow Async Local Storage-based context tracking.
  • Investigate cachegrind technique for stable benchmarking (https://github.com/pythonspeed/cachegrind-benchmarking).

trentm closed this as completed Jun 14, 2021
dgieselaar pushed a commit to dgieselaar/apm-agent-nodejs that referenced this issue Sep 10, 2021 (same "Refactor agent.captureError(...) and stacktrace collection" commit message as the Jun 9 entry above; Refs: elastic#2028, Fixes: elastic#2095)