apmbench: Define benchmark scenarios and topologies #7858

Closed · Tracked by #7540
marclop opened this issue Apr 12, 2022 · 7 comments

marclop (Contributor) commented Apr 12, 2022

Description

Benchmark scenarios

We currently have 4 benchmarks that leverage the new event handler piece in apmbench to load pre-recorded APM Agent events and replay them against a target APM Server. The current benchmarks are split by language agent, but that isn't necessarily the best way to benchmark the APM Server. The currently generated data has been gathered from the existing opbeans applications, but those aren't necessarily the best type of application to use for our benchmarks either.

We should discuss what kind of benchmark scenarios we'd like to include to be run on a daily basis and the purpose they serve.
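For context, the replay loop these benchmarks implement is roughly shaped like the sketch below. This is not the actual benchtest API: the events file path, server URL, and benchmark name are placeholders, and a real run would also handle authentication, compression, and result collection.

```go
// In a real suite this would live in a *_test.go file.
package benchsketch

import (
	"bytes"
	"net/http"
	"os"
	"testing"
)

// replayEvents POSTs a pre-recorded ND-JSON payload to the APM Server intake
// endpoint, the same way an agent would.
func replayEvents(b *testing.B, serverURL, eventsFile string) {
	payload, err := os.ReadFile(eventsFile)
	if err != nil {
		b.Fatal(err)
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		resp, err := http.Post(serverURL+"/intake/v2/events",
			"application/x-ndjson", bytes.NewReader(payload))
		if err != nil {
			b.Fatal(err)
		}
		resp.Body.Close()
	}
}

func BenchmarkAgentGo(b *testing.B) {
	// Placeholder server URL and events file.
	replayEvents(b, "http://localhost:8200", "events/go.ndjson")
}
```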

Benchmark topologies

Additionally, we should look into the benchmark size matrix we'd like to support, for example:

Objectives / Outcomes:

  1. Max throughput: APM Server performance with a high number of APM Agents
  2. Undersized ES: the APM Server should not OOM

| APM Server size | Elasticsearch size | Agent #      | Objective / Outcome |
|-----------------|--------------------|--------------|---------------------|
| 1GB x 1 zone    | 16GB x 2 zones     | Medium (600) | 1                   |
| 8GB x 1 zone    | 16GB x 2 zones     | High (2400)  | 1                   |
| 8GB x 1 zone    | 1GB x 2 zones      | High (2400+) | 2                   |
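One way the matrix above could be consumed by the automation is as plain data driving parameterized benchmark jobs. The sketch below is purely illustrative; the type and field names are hypothetical and not part of any existing tooling.

```go
package main

import "fmt"

// topology describes one row of the benchmark size matrix.
// The field names here are illustrative only.
type topology struct {
	APMServerSize string // e.g. "1GB x 1 zone"
	ESSize        string // e.g. "16GB x 2 zones"
	Agents        int    // number of simulated agents
	Objective     int    // 1 = max throughput, 2 = undersized ES must not OOM
}

var matrix = []topology{
	{"1GB x 1 zone", "16GB x 2 zones", 600, 1},
	{"8GB x 1 zone", "16GB x 2 zones", 2400, 1},
	{"8GB x 1 zone", "1GB x 2 zones", 2400, 2},
}

func main() {
	for _, t := range matrix {
		fmt.Printf("provision %s APM Server + %s ES, run %d agents (objective %d)\n",
			t.APMServerSize, t.ESSize, t.Agents, t.Objective)
	}
}
```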
simitt (Contributor) commented Apr 19, 2022

Building on top of what you already suggested, I'd appreciate having a few more scenarios covered:

  • Test the APM Server's max throughput by sending more events/second than it can process.
    One goal of these tests is also to ensure linear scalability (a small scaling-efficiency check is sketched at the end of this comment).

| APM Server size | Elasticsearch size | Agent #        |
|-----------------|--------------------|----------------|
| 1GB x 1 zone    | 16GB x 2 zones     | Medium (1000)  |
| 4GB x 1 zone    | 16GB x 2 zones     | Large (2500)   |
| 8GB x 1 zone    | 16GB x 3 zones     | Large (2500)   |
| 8GB x 2 zones   | 32GB x 2 zones     | X-Large (5000) |
  • Test the APM Server's behavior at a specific throughput, observing resource usage.

| APM Server size | Elasticsearch size | # events/second |
|-----------------|--------------------|-----------------|
| 1GB x 1 zone    | 16GB x 2 zones     | TBD             |
| 4GB x 1 zone    | 16GB x 2 zones     | TBD             |
| 8GB x 1 zone    | 16GB x 3 zones     | TBD             |

The concrete events/second targets can only be set once the current per-size limits of the APM Server are known; each should be close to the maximum load that size can process.

  • Test the APM Server's behavior when Elasticsearch is undersized, expecting sensible resource usage and response patterns.

| APM Server size | Elasticsearch size | Agent #       |
|-----------------|--------------------|---------------|
| 1GB x 1 zone    | 1GB x 1 zone       | Medium (1000) |
| 8GB x 1 zone    | 1GB x 1 zone       | Large (2500)  |

We might need to tweak these concrete numbers eventually.
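As a side note on the linear-scalability goal above, the check itself is simple arithmetic: events/second per GB of APM Server memory should stay roughly constant across sizes. A minimal sketch, assuming made-up throughput numbers purely as placeholders:

```go
package main

import "fmt"

// result holds the measured peak throughput for one topology.
// The numbers below are placeholders, not real measurements.
type result struct {
	name         string
	serverGB     float64 // total APM Server memory across all zones
	eventsPerSec float64
}

func main() {
	results := []result{
		{"1GB x 1 zone", 1, 8000},
		{"4GB x 1 zone", 4, 30000},
		{"8GB x 1 zone", 8, 58000},
		{"8GB x 2 zones", 16, 110000},
	}
	base := results[0].eventsPerSec / results[0].serverGB
	for _, r := range results {
		perGB := r.eventsPerSec / r.serverGB
		// Efficiency 1.0 means perfectly linear scaling relative to the smallest size.
		fmt.Printf("%-14s %8.0f events/s  %8.0f events/s/GB  efficiency %.2f\n",
			r.name, r.eventsPerSec, perGB, perGB/base)
	}
}
```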

simitt added this to the 8.3 milestone Apr 19, 2022
simitt modified the milestone: 8.3 May 24, 2022
simitt added this to the 8.3 milestone May 24, 2022
simitt (Contributor) commented May 30, 2022

@marclop and @lahsivjar for finishing up the 8.3 work, let's set up the initial benchmarks with the numbers I suggested in #7858 (comment). For now, let's use the events defined in https://github.com/elastic/apm-server/tree/main/systemtest/benchtest/events for ruby, go, python, and nodejs (e.g. 1000 agents overall -> 250 sending ruby events, 250 sending go events, and so on).

Is there anything else that is required for finishing this task? I expect that these concrete cases need to be incorporated into the automation tooling that the engineering productivity team is currently building. Please note it here if any more effort is required on the APM Server team's side.

pazone (Contributor) commented May 31, 2022

For convenience, we could perhaps split the performance and hardware profiles into two .properties files.

simitt modified the milestones: 8.3, 8.4 Jun 3, 2022
simitt mentioned this issue Jun 3, 2022
marclop (Contributor, Author) commented Jun 6, 2022

#8275 added a new benchmark BenchmarkAgentAll which will simulate the scenario defined in #7858 (comment).

marclop (Contributor, Author) commented Jun 9, 2022

@simitt After my benchmarking efforts testing the gomaxprocs changes (#8278 (comment)), I think the number of agents we initially proposed in this issue may be too high. Using more than 64 agents against a 1GB APM Server may be more than it can handle and thus skew the benchmark results, particularly if no -max-rate is used (all agents would be sending at max throughput).

A good rule of thumb seemed to be doubling the agent count for each server size increment, starting at ~64 agents for the smallest size; this yielded close to optimal results with the default server settings. We could also try other base increments, such as 96 or 128. The table could look like:

| Size         | Agents | ES topology |
|--------------|--------|-------------|
| 1g           | 64     | 2x 16gb     |
| 2g           | 128    | 2x 16gb     |
| 4g           | 256    | 2x 32gb     |
| 8g           | 512    | 2x 32gb     |
| 16g (2x 8gb) | 1024   | 2x 64gb     |

Perhaps we can separate the objectives that different numbers of agents serve into different jobs, or at least analyze them differently, since many factors affect the server's throughput, not only the server size.

As we are all aware in the team, the ultimate bottleneck to APM Server performance isn't the APM Server itself, since it can only process events as fast as they flow out of it (with some capacity to absorb peaks in its modelindexer buffer); the real limit is the rate at which Elasticsearch can index the documents we send in our _bulk requests.

For that reason, expecting linear scalability out of the APM Server wouldn't be reasonable without tuning index.number_of_shards as we scale up and out. It is still acceptable to benchmark the performance out of the box with default settings, since many users will run with those, but it may be good to start documenting more openly how users should increase the number of primary shards for some data streams as they start to scale up and out.
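As a hedged illustration of that last point: in recent stack versions the primary shard count for a data stream's new backing indices can typically be raised through a `@custom` component template. Everything concrete below is an assumption made for this example (the traces-apm@custom template name, the shard count of 4, and an unauthenticated Elasticsearch on localhost:9200), not a documented recommendation from this issue.

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"net/http"
)

func main() {
	// Assumption: the traces data stream composes its settings from a
	// "traces-apm@custom" component template; 4 primary shards is an example value.
	body := []byte(`{
	  "template": {
	    "settings": {
	      "index": { "number_of_shards": 4 }
	    }
	  }
	}`)

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/_component_template/traces-apm@custom",
		bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	// The new shard count only applies to backing indices created after this
	// change, e.g. on the next rollover.
	fmt.Println("status:", resp.Status)
}
```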

marclop (Contributor, Author) commented Jun 15, 2022

Quick update on this. We've moved forward with the different topologies defined in #7858 (comment) and have defined them in https://github.com/elastic/apm-server/tree/main/testing/benchmark/system-profiles. Different configurations outside of the topologies, such as index.number_of_shards, are not something we're looking to explore at the moment. Would it be better to open a new issue with the contents of the comment above and close this one?

The remaining automation work is tracked in #7846.

simitt (Contributor) commented Jun 15, 2022

@marclop your proposal makes sense; we can iterate on and fine-tune the scenarios over time. Please create a follow-up issue with the relevant config options that are currently out of scope of the tests, such as index.number_of_shards. It will be good to revisit these in the future.
This issue can be closed then.
