
Performance benchmarking for general SDK overhead #3940

Closed · 1 task done
martinkuba opened this issue Jun 22, 2023 · 7 comments

martinkuba commented Jun 22, 2023

While every application is different and there are many factors to consider when measuring performance, it would be useful to give users some idea of the SDK's performance characteristics. This issue is intended as a discussion of what types of benchmark tests would be useful, how often to run them, etc.

This spec describes performance benchmark testing to measure the overhead of OTel SDKs. Specifically, it describes measuring:

  • throughput - how many spans can be created and exported in 1s
  • instrumentation cost - CPU overhead of generating and exporting X number of spans per second

The second is particularly important because it translates directly into the scaling and compute costs of running a service in the cloud.
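
For illustration, here is a minimal sketch of the kind of micro-benchmark the first measurement (throughput) implies, assuming the `@opentelemetry/sdk-trace-base` API and an in-memory exporter; the instrumentation-cost measurement would instead hold the span rate fixed and sample CPU usage. This is only an illustration, not the spec's tooling, and numbers with a real network exporter will differ.

```js
// Hedged sketch: count how many spans can be created and ended in one second.
// Uses an in-memory exporter so no network cost is included.
const {
  BasicTracerProvider,
  SimpleSpanProcessor,
  InMemorySpanExporter,
} = require('@opentelemetry/sdk-trace-base');

const exporter = new InMemorySpanExporter();
const provider = new BasicTracerProvider();
provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
const tracer = provider.getTracer('throughput-benchmark');

const deadline = Date.now() + 1000;
let spanCount = 0;
while (Date.now() < deadline) {
  const span = tracer.startSpan('bench-span');
  span.setAttribute('iteration', spanCount);
  span.end();
  spanCount++;
}
console.log(`spans created and ended in 1s: ${spanCount}`);
```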

I am planning to do some testing based on the spec and will provide the numbers here. Other outcomes that I think might be useful:

  • add a tool that allows anyone to run these tests
  • add a GitHub Action that runs perf tests automatically (e.g. per main commit or release)
  • provide guidelines to instrumentation authors for quantifying the overhead of their instrumentation

Looking back at the history of the JS SDK, I see that there used to be a basic benchmarking tool (#390); I am curious why it was removed.

  • This may affect other libraries, but I would like to get opinions here first

adcharre commented Jul 7, 2023

Adding our experience of moving our stack from logging to OpenTelemetry: we saw a large increase in the number of pods required to service requests and a significant rise in CPU usage. Using profiling and flame charts, we tracked this increase down to the number of times the garbage collector was being called once we had instrumented with OpenTelemetry.

Our assumption is that, due to the additional objects being created for spans and attributes, we had saturated the short-lived object memory pools in Node.js, resulting in objects being copied between the memory spaces frequently. We managed to bring the CPU increase down by changing the amount of memory allocated to these young pools using the --max-semi-space-size CLI option.

For the majority of our components an increase from 16 MiB to 64 MiB of max-semi-space-size resolved the issues with increased GC'ing. Some of our more highly utilised components required us to go further and increase this to 128 MiB, along with tuning some of the batch span processor options to increase the max export batch size (OTEL_BSP_MAX_EXPORT_BATCH_SIZE=1024) and the max queue size (OTEL_BSP_MAX_QUEUE_SIZE=4096). We also had to increase the pods' CPU requests by 20% to prevent overscaling.
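
For reference, here is a rough sketch of that batch-span-processor tuning done programmatically, assuming the standard JS SDK packages (the semi-space size itself can only be set via the V8 command-line flag):

```js
// Sketch of the tuning described above (package names assumed from the
// standard JS SDK). The young-generation size is a V8 flag, e.g.:
//   node --max-semi-space-size=64 server.js
const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { BatchSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

const provider = new NodeTracerProvider();
provider.addSpanProcessor(
  new BatchSpanProcessor(new OTLPTraceExporter(), {
    maxExportBatchSize: 1024, // same effect as OTEL_BSP_MAX_EXPORT_BATCH_SIZE=1024
    maxQueueSize: 4096,       // same effect as OTEL_BSP_MAX_QUEUE_SIZE=4096
  })
);
provider.register();
```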

Advice on tuning performance would, I believe, be an extremely useful addition to the OpenTelemetry JavaScript documentation.

Finally, we had initially chosen to use the gRPC exporter, expecting it to provide the best performance; however, we are now re-evaluating this decision and testing the HTTP/JSON exporter rather than making assumptions!

@martinkuba (Contributor, Author)

@adcharre I have run some benchmark tests here and have observed that the gRPC exporter seems to have the highest overhead (in both micro-benchmark tests and a long-running app).

@adcharre

@martinkuba I've finally found some time to test the difference that the http/json exporter might have over grpc. In our test system I changed the single highest-scaling component to use http/json, and the result was: no difference.
Changing to http/json seemed to make no difference to the CPU usage or number of pods. I was a bit surprised, so I double-checked that we were actually running the http/json code path, and we were.
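
For anyone reproducing the comparison, the change under test is roughly the following (package names assumed to be the standard JS SDK OTLP exporters; endpoints left at their defaults):

```js
// The only change between the two setups is which exporter package is used.

// gRPC (original setup):
// const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-grpc');

// HTTP/JSON (the variant tested here):
const { OTLPTraceExporter } = require('@opentelemetry/exporter-trace-otlp-http');

const exporter = new OTLPTraceExporter(); // then wire into the BatchSpanProcessor as before
```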


HosseinAgha commented Oct 8, 2023

Why doesn't this library offload the transformation and log shipping to another thread, like Pino does?
Wouldn't that unblock the main Node.js thread and decrease memory usage?
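
To make the question concrete, the idea would be something like the sketch below. This is not an existing SDK feature; the worker file and message shape are hypothetical.

```js
// Hypothetical sketch of offloading export work to a worker thread.
// './span-export-worker.js' would deserialize the message and do the
// serialization and network export; it is not part of the SDK.
const { Worker } = require('worker_threads');

const exportWorker = new Worker('./span-export-worker.js');

function handOffSpan(serializableSpan) {
  // postMessage copies plain data into the worker, keeping transformation
  // and I/O off the main event loop (similar to Pino transports).
  exportWorker.postMessage(serializableSpan);
}

handOffSpan({ name: 'example-span', startTime: Date.now() });
```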


This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days.

github-actions bot added the stale label on Dec 18, 2023

github-actions bot commented Jan 8, 2024

This issue was closed because it has been stale for 14 days with no activity.

github-actions bot closed this as not planned on Jan 8, 2024
@HosseinAgha

This is not stale.
