Skip to content
This repository has been archived by the owner on May 3, 2024. It is now read-only.

feat(tracing): add OpenTelemetry tracing #1214

Merged
merged 77 commits into from
Mar 12, 2024
Merged

Conversation

10xLaCroixDrinker
Copy link
Member

@10xLaCroixDrinker 10xLaCroixDrinker commented Dec 21, 2023

Description

Adds OTel tracing to v6.

The tracer initialization needs to be required via node's --require flag for instrumentation to work properly. In order to facilitate this with the least burden to the user a new startup script was added to use this flag when the OTEL_EXPORTER_OTLP_TRACES_ENDPOINT env var is set.

The fastify automatic instrumentation was very noisy, adding a span to every plugin. If anyone thinks this should still be included, I'm open to a discussion, but it didn't seem to provide enough value to outweigh the cost/DX of sheer number of spans added.

Automatic HTTP and DNS tracing is included.

The manually created spans match those in the v5 tracer. I added phase attributes to the spans to match the spans in the v5 tracer. Need to have a brief discussion on whether these attributes should be left in.

Fixed a bug where ending a request timer more than once would cause a crash.

Motivation and Context

Improves DX, observability & monitoring. Once this is in we will be able to begin work on a new API allowing modules to add their own spans.

How Has This Been Tested?

  • Running one-app locally, sent logs and traces to both local and remote OTel collectors (separate tests for just logs, just traces, neither & both)
  • Reviewed traces in UI
  • Validated span and trace IDs are added to OTel log records
  • Unit tests
  • Enabled tracing in integration tests

Performance results

High Volume simple SSR

Test configuration
    { duration: '1m', target: 100 }, // normal load
    { duration: '2m', target: 100 },
    { duration: '1m', target: 300 }, // high load
    { duration: '2m', target: 300 },
    { duration: '1m', target: 400 }, // breaking point before circuit trips
    { duration: '2m', target: 400 },
    { duration: '1m', target: 100 },
    { duration: '2m', target: 100 },
    { duration: '1m', target: 1 }, // scale down. Recovery stage.
    { duration: '2m', target: 0 },
K6 results with tracing
  scenarios: (100.00%) 1 scenario, 400 max VUs, 15m30s max duration (incl. graceful stop):
           * default: Up to 400 looping VUs for 15m0s over 10 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ status is 200

     checks.........................: 100.00% ✓ 133313     ✗ 0
     data_received..................: 1.2 GB  1.4 MB/s
     data_sent......................: 17 MB   18 kB/s
     http_req_blocked...............: avg=431.81µs min=1.26µs  med=3.71µs  max=1.09s   p(90)=5.69µs   p(95)=6.8µs
     http_req_connecting............: avg=912ns    min=0s      med=0s      max=13.61ms p(90)=0s       p(95)=0s
   ✓ http_req_duration..............: avg=216.97ms min=1.68ms  med=36.52ms max=2.28s   p(90)=917.9ms  p(95)=1.33s
       { expected_response:true }...: avg=216.97ms min=1.68ms  med=36.52ms max=2.28s   p(90)=917.9ms  p(95)=1.33s
     http_req_failed................: 0.00%   ✓ 0          ✗ 133313
     http_req_receiving.............: avg=93.49µs  min=26.55µs med=81.03µs max=13.7ms  p(90)=138.51µs p(95)=167.9µs
     http_req_sending...............: avg=17.95µs  min=5.42µs  med=12.92µs max=14.23ms p(90)=34.56µs  p(95)=43.77µs
     http_req_tls_handshaking.......: avg=425.93µs min=0s      med=0s      max=1.09s   p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=216.86ms min=1.6ms   med=36.41ms max=2.28s   p(90)=917.82ms p(95)=1.33s
     http_reqs......................: 133313  147.963568/s
     iteration_duration.............: avg=1.21s    min=1s      med=1.03s   max=3.29s   p(90)=1.92s    p(95)=2.34s
     iterations.....................: 133313  147.963568/s
     vus............................: 1       min=1        max=400
     vus_max........................: 400     min=400      max=400


running (15m01.0s), 000/400 VUs, 133313 complete and 0 interrupted iterations
K6 results without tracing
  scenarios: (100.00%) 1 scenario, 400 max VUs, 15m30s max duration (incl. graceful stop):
           * default: Up to 400 looping VUs for 15m0s over 10 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ status is 200

     checks.........................: 100.00% ✓ 136942     ✗ 0
     data_received..................: 1.3 GB  1.4 MB/s
     data_sent......................: 17 MB   19 kB/s
     http_req_blocked...............: avg=239.62µs min=1.22µs med=3.66µs  max=931.06ms p(90)=5.7µs    p(95)=6.86µs
     http_req_connecting............: avg=1.06µs   min=0s     med=0s      max=21.04ms  p(90)=0s       p(95)=0s
   ✓ http_req_duration..............: avg=184.86ms min=1.28ms med=21.19ms max=2.13s    p(90)=797.61ms p(95)=1.24s
       { expected_response:true }...: avg=184.86ms min=1.28ms med=21.19ms max=2.13s    p(90)=797.61ms p(95)=1.24s
     http_req_failed................: 0.00%   ✓ 0          ✗ 136942
     http_req_receiving.............: avg=93.38µs  min=24µs   med=80.29µs max=14.78ms  p(90)=139.96µs p(95)=170.62µs
     http_req_sending...............: avg=18.58µs  min=5.32µs med=12.83µs max=6.89ms   p(90)=36.96µs  p(95)=46.15µs
     http_req_tls_handshaking.......: avg=233.64µs min=0s     med=0s      max=930.78ms p(90)=0s       p(95)=0s
     http_req_waiting...............: avg=184.75ms min=1.23ms med=21.07ms max=2.13s    p(90)=797.53ms p(95)=1.24s
     http_reqs......................: 136942  152.127241/s
     iteration_duration.............: avg=1.18s    min=1s     med=1.02s   max=3.13s    p(90)=1.79s    p(95)=2.24s
     iterations.....................: 136942  152.127241/s
     vus............................: 1       min=1        max=400
     vus_max........................: 400     min=400      max=400


running (15m00.2s), 000/400 VUs, 136942 complete and 0 interrupted iterations
Grafana visualization without tracing on the left, with tracing on the right

localhost_3030_d_tDGvrq7Mz_one-app-performance_orgId=1 from=1709713706951 to=1709715669389

Low Volume complex SSR

Test configuration
    { duration: '30s', target: 10 }, // normal load
    { duration: '1m', target: 10 },
    { duration: '30s', target: 30 }, // high load
    { duration: '1m', target: 30 },
    { duration: '30s', target: 10 },
    { duration: '1m', target: 10 },
    { duration: '30s', target: 1 }, // scale down. Recovery stage.
    { duration: '1m', target: 0 },
K6 results with tracing
  scenarios: (100.00%) 1 scenario, 30 max VUs, 6m30s max duration (incl. graceful stop):
           * default: Up to 30 looping VUs for 6m0s over 8 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ status is 200

     checks.........................: 100.00% ✓ 990     ✗ 0
     data_received..................: 3.0 GB  8.3 MB/s
     data_sent......................: 107 kB  297 B/s
     http_req_blocked...............: avg=53.47µs min=2.81µs  med=5.24µs  max=5.26ms   p(90)=8.31µs  p(95)=27.68µs
     http_req_connecting............: avg=40.93µs min=0s      med=0s      max=5.03ms   p(90)=0s      p(95)=0s
   ✗ http_req_duration..............: avg=3.7s    min=1.89s   med=3.63s   max=7.83s    p(90)=4.81s   p(95)=5.09s
       { expected_response:true }...: avg=3.7s    min=1.89s   med=3.63s   max=7.83s    p(90)=4.81s   p(95)=5.09s
     http_req_failed................: 0.00%   ✓ 0       ✗ 990
     http_req_receiving.............: avg=25.75ms min=14.46ms med=20.14ms max=460.59ms p(90)=33.57ms p(95)=44.07ms
     http_req_sending...............: avg=30.23µs min=9.46µs  med=19.89µs max=960.88µs p(90)=53.35µs p(95)=72.93µs
     http_req_tls_handshaking.......: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=3.68s   min=1.86s   med=3.6s    max=7.81s    p(90)=4.79s   p(95)=5.07s
     http_reqs......................: 990     2.74577/s
     iteration_duration.............: avg=4.71s   min=2.89s   med=4.64s   max=8.83s    p(90)=5.81s   p(95)=6.09s
     iterations.....................: 990     2.74577/s
     vus............................: 1       min=1     max=30
     vus_max........................: 30      min=30    max=30


running (6m00.6s), 00/30 VUs, 990 complete and 0 interrupted iterations
default ✓ [======================================] 00/30 VUs  6m0s
K6 results without tracing
  scenarios: (100.00%) 1 scenario, 30 max VUs, 6m30s max duration (incl. graceful stop):
           * default: Up to 30 looping VUs for 6m0s over 8 stages (gracefulRampDown: 30s, gracefulStop: 30s)


     ✓ status is 200

     checks.........................: 100.00% ✓ 988      ✗ 0
     data_received..................: 3.0 GB  8.3 MB/s
     data_sent......................: 107 kB  296 B/s
     http_req_blocked...............: avg=48.18µs min=3.14µs  med=5.27µs  max=3.58ms   p(90)=7.67µs  p(95)=29.09µs
     http_req_connecting............: avg=35.76µs min=0s      med=0s      max=2.99ms   p(90)=0s      p(95)=0s
   ✗ http_req_duration..............: avg=3.73s   min=1.64s   med=3.68s   max=7.41s    p(90)=5.06s   p(95)=5.37s
       { expected_response:true }...: avg=3.73s   min=1.64s   med=3.68s   max=7.41s    p(90)=5.06s   p(95)=5.37s
     http_req_failed................: 0.00%   ✓ 0        ✗ 988
     http_req_receiving.............: avg=23.22ms min=14.56ms med=19.12ms max=266.61ms p(90)=29.51ms p(95)=39.67ms
     http_req_sending...............: avg=27.88µs min=11.52µs med=19.56µs max=374.83µs p(90)=47.88µs p(95)=67.86µs
     http_req_tls_handshaking.......: avg=0s      min=0s      med=0s      max=0s       p(90)=0s      p(95)=0s
     http_req_waiting...............: avg=3.7s    min=1.62s   med=3.66s   max=7.4s     p(90)=5.04s   p(95)=5.35s
     http_reqs......................: 988     2.740983/s
     iteration_duration.............: avg=4.73s   min=2.65s   med=4.69s   max=8.41s    p(90)=6.06s   p(95)=6.37s
     iterations.....................: 988     2.740983/s
     vus............................: 1       min=1      max=30
     vus_max........................: 30      min=30     max=30


running (6m00.5s), 00/30 VUs, 988 complete and 0 interrupted iterations
default ✓ [======================================] 00/30 VUs  6m0s
Grafana visualization with tracing on the left, without tracing on the right

localhost_3030_d_tDGvrq7Mz_one-app-performance_orgId=1 from=1709708037933 to=1709710030429 (5)

Types of Changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation (adding or updating documentation)
  • Dependency update
  • Security update

Checklist:

  • My change requires a change to the documentation and I have updated the documentation accordingly.
  • These changes should be applied to a maintenance branch.
  • This change requires cross browser checks.
  • Performance tests should be ran against the server prior to merging.
  • This change impacts caching for client browsers.
  • This change impacts HTTP headers.
  • This change adds additional environment variable requirements for One App users.
  • I have added the Apache 2.0 license header to any new files created.

Copy link
Contributor

github-actions bot commented Dec 21, 2023

Size Change: +353 B (0%)

Total Size: 719 kB

Filename Size Change
./build/app/app~vendors.js 411 kB +353 B (0%)
ℹ️ View Unchanged
Filename Size
./build/app/app.js 171 kB
./build/app/runtime.js 7.07 kB
./build/app/service-worker-client.js 7.25 kB
./build/app/vendors.js 123 kB

compressed-size-action

@10xLaCroixDrinker 10xLaCroixDrinker marked this pull request as ready for review March 6, 2024 18:10
@10xLaCroixDrinker 10xLaCroixDrinker requested review from a team as code owners March 6, 2024 18:10
Co-authored-by: Jonny Adshead <JAdshead@users.noreply.github.com>
JAdshead
JAdshead previously approved these changes Mar 7, 2024
@JAdshead JAdshead requested a review from a team March 7, 2024 22:28
JAdshead
JAdshead previously approved these changes Mar 8, 2024
__tests__/server/metricsServer.spec.js Outdated Show resolved Hide resolved
src/server/utils/tracer.js Show resolved Hide resolved
get setAttributes() { return () => noopSpan; },
get setStatus() { return () => noopSpan; },
get spanContext() { return () => {}; },
get updateName() { return () => noopSpan; },
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why the use of get if set isn't use? I don't see the value of it

Copy link
Member Author

@10xLaCroixDrinker 10xLaCroixDrinker Mar 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is the point. Using a getter only is a common pattern. This object is reused over and over again, and this helps prevent changes.

src/server/plugins/noopTracer.js Show resolved Hide resolved
src/server/plugins/requestLogging.js Outdated Show resolved Hide resolved
src/server/utils/logging/monkeyPatchConsole.js Outdated Show resolved Hide resolved
src/server/utils/tracer.js Outdated Show resolved Hide resolved
@10xLaCroixDrinker 10xLaCroixDrinker merged commit aad2538 into main Mar 12, 2024
9 checks passed
@10xLaCroixDrinker 10xLaCroixDrinker deleted the feature/tracing branch March 12, 2024 16:40
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants