
Performance benchmarking for general SDK overhead #3940

Closed · 1 task done
martinkuba opened this issue Jun 22, 2023 · 7 comments

martinkuba commented Jun 22, 2023

While every application is different and there are many factors to consider when measuring performance, it would be useful to give users some idea of the SDK's performance characteristics. This issue is intended as a discussion of what types of benchmark tests would be useful, how often to run them, etc.

This spec describes performance benchmark testing to measure the overhead of OTel SDKs. Specifically, it describes measuring:

  • throughput - how many spans can be created and exported in 1s
  • instrumentation cost - CPU overhead of generating and exporting X number of spans per second

The second is particularly important because it translates directly into the scaling and compute costs of running a service in the cloud.
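
For illustration, here is a minimal sketch of the kind of micro-benchmark the first measurement (throughput) implies, assuming the `@opentelemetry/sdk-trace-base` API and an in-memory exporter; the instrumentation-cost measurement would instead hold the span rate fixed and sample CPU usage. This is only an illustration, not the spec's tooling, and numbers with a real network exporter will differ.

```js
// Hedged sketch: count how many spans can be created and ended in one second.
// Uses an in-memory exporter so no network cost is included.
const {
  BasicTracerProvider,
  SimpleSpanProcessor,
  InMemorySpanExporter,
} = require('@opentelemetry/sdk-trace-base');

const exporter = new InMemorySpanExporter();
const provider = new BasicTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
const tracer = provider.getTracer('throughput-benchmark');

const deadline = Date.now() + 1000;
let spanCount = 0;
while (Date.now() < deadline) {
  const span = tracer.startSpan('bench-span');
  span.setAttribute('iteration', spanCount);
  span.end();
  spanCount++;
}
console.log(`spans created and ended in 1s: ${spanCount}`);
```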

I am planning to do some testing based on the spec and will provide the numbers here. Other outcomes that I think might be useful:

  • add a tool that allows anyone to run these tests
  • add a GitHub Action that runs perf tests automatically (e.g. per main commit or release)
  • provide guidelines to instrumentation authors for quantifying the overhead of their instrumentation

Looking back at the history of the JS SDK, I see that there used to be a basic benchmarking tool (#390); I am curious why it was removed.

  • This may affect other libraries, but I would like to get opinions here first

adcharre commented Jul 7, 2023

Adding our experience of moving our stack from logging to OpenTelemetry: we saw a large increase in the number of pods required to service requests and a significant rise in CPU usage. Using profiling and flame charts, we tracked this increase down to the number of times the garbage collector was being called once we had instrumented with OpenTelemetry.

Our assumption is that, due to the additional objects being created for spans and attributes, we had saturated the short-lived object memory pools in Node.js, resulting in objects being copied between the memory spaces frequently. We managed to bring the CPU increase down by changing the amount of memory allocated to these young pools using the --max-semi-space-size CLI option.

For the majority of our components an increase from 16 MiB to 64 MiB of max-semi-space-size resolved the issues with increased GC'ing. Some of our more highly utilised components required us to go further and increase this to 128 MiB, along with tuning some of the batch span processor options to increase the max export batch size (OTEL_BSP_MAX_EXPORT_BATCH_SIZE=1024) and the max queue size (OTEL_BSP_MAX_QUEUE_SIZE=4096). We also had to increase the pods' CPU requests by 20% to prevent overscaling.
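
For reference, here is a rough sketch of that batch-span-processor tuning done programmatically, assuming the standard JS SDK packages (the semi-space size itself can only be set via the V8 command-line flag):

```js
// Sketch of the tuning described above (package names assumed from the
// standard JS SDK). The young-generation size is a V8 flag, e.g.:
//   node --max-semi-space-size=64 server.js
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(new OTLPTraceExporter(), {
    maxExportBatchSize: 1024, // same effect as OTEL_BSP_MAX_EXPORT_BATCH_SIZE=1024
    maxQueueSize: 4096,       // same effect as OTEL_BSP_MAX_QUEUE_SIZE=4096
  })
);
provider.register();
```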

Advice on tuning performance would, I believe, be an extremely useful addition to the OpenTelemetry JavaScript documentation.

Finally, we had initially chosen to use the gRPC exporter, expecting it to provide the best performance; however, we are now re-evaluating this decision and testing the HTTP/JSON exporter rather than making assumptions!

@martinkuba (Contributor, Author)

@adcharre I have run some benchmark tests here and have observed that the gRPC exporter seems to have the highest overhead (in both micro-benchmark tests and a long-running app).

@adcharre

@martinkuba I've finally found some time to test the difference that the http/json exporter might have over grpc. In our test system I changed the single highest-scaling component to use http/json, and the result was: no difference.
Changing to http/json seemed to make no difference to the CPU usage or number of pods. I was a bit surprised, so I double-checked that we were actually running the http/json code path, and we were.
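
For anyone reproducing the comparison, the change under test is roughly the following (package names assumed to be the standard JS SDK OTLP exporters; endpoints left at their defaults):

```js
// The only change between the two setups is which exporter package is used.

// gRPC (original setup):
// const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

// HTTP/JSON (the variant tested here):
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const exporter = new OTLPTraceExporter(); // then wire into the BatchSpanProcessor as before
```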


HosseinAgha commented Oct 8, 2023

Why doesn't this library offload the transformation and log shipping to another thread, like Pino does?
Wouldn't that unblock the main Node.js thread and decrease memory usage?
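
To make the question concrete, the idea would be something like the sketch below. This is not an existing SDK feature; the worker file and message shape are hypothetical.

```js
// Hypothetical sketch of offloading export work to a worker thread.
// './span-export-worker.js' would deserialize the message and do the
// serialization and network export; it is not part of the SDK.
const { Worker } = require('worker_threads');

const exportWorker = new Worker('./span-export-worker.js');

function handOffSpan(serializableSpan) {
  // postMessage copies plain data into the worker, keeping transformation
  // and I/O off the main event loop (similar to Pino transports).
  exportWorker.postMessage(serializableSpan);
}

handOffSpan({ name: 'example-span', startTime: Date.now() });
```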


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions bot added the stale label on Dec 18, 2023

github-actions bot commented Jan 8, 2024

This issue was closed because it has been stale for 14 days with no activity.

github-actions bot closed this as not planned on Jan 8, 2024
@HosseinAgha

This is not stale.
