[Fleet] Improve performance of Fleet setup API / package installation #110500
Pinging @elastic/fleet (Team:Fleet)
I spent some time investigating this yesterday and, unfortunately, I've come to a preliminary conclusion that there aren't any major opportunities for optimization from the Kibana side here. The longest-running operations during setup (and the general package install process) are creating the ingest assets in Elasticsearch: ingest pipelines, index templates, component templates, and data streams.

We're already parallelizing these Elasticsearch API calls as much as we realistically can, but it seems that these requests are getting queued on the Elasticsearch side. I discovered this by looking at the APM spans for the Elasticsearch calls and noticing that many of these operations take 5-7s to return a response. However, if I time a single one of these API calls, it returns very quickly (in the 200-500 ms range). This leads me to believe that Elasticsearch is queueing these requests, I suspect because each of these assets is stored in the cluster state, which needs to be consistent across the cluster, requiring a write lock to be acquired and replication across nodes to complete before the asset is confirmed as committed.

I found additional evidence of this behavior in Elasticsearch by changing these API calls to be executed serially rather than in parallel from the Kibana side. In this scenario, each API call completed quite quickly (again in the 200-500 ms range), while the overall setup process took the same amount of time (39s locally). I also confirmed that these requests are not being queued on the Kibana side by increasing the

I believe the next steps here would be connecting with the Elasticsearch team to:
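For reference, here is a minimal sketch of the kind of parallel asset installation described above. This is not the actual Fleet code; the pipeline IDs and bodies are made up, and it assumes the 8.x `@elastic/elasticsearch` client pointed at a local node.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Illustrative pipeline definitions; a real package install creates many more
// assets (index templates, component templates, data streams) the same way.
const pipelines = [
  { id: 'logs-example.access-1.0.0', processors: [] },
  { id: 'logs-example.error-1.0.0', processors: [] },
];

// Issue the cluster-state-updating requests in parallel. Each request is fast
// in isolation, but when sent concurrently they appear to queue behind each
// other's cluster state updates on the Elasticsearch side.
async function installPipelinesInParallel(): Promise<void> {
  await Promise.all(
    pipelines.map((p) =>
      client.ingest.putPipeline({ id: p.id, processors: p.processors })
    )
  );
}
```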
Here's an example of what I'm seeing. In the first APM screenshot, creating each ingest pipeline takes ~9s to receive a response from ES when we're running many things in parallel. In the second screenshot, where I changed the assets to install serially, each ingest pipeline takes ~500ms to receive a response. In both cases, the overall setup process takes 39s. This behavior is very consistent across runs as well.
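A rough sketch of the comparison behind these screenshots (the helper is hypothetical and it reuses the `client` and `pipelines` declarations from the sketch above):

```ts
// Time the same batch of pipeline PUTs serially vs. in parallel.
async function timeInstall(mode: 'serial' | 'parallel'): Promise<void> {
  const start = Date.now();
  if (mode === 'parallel') {
    await Promise.all(
      pipelines.map((p) =>
        client.ingest.putPipeline({ id: p.id, processors: p.processors })
      )
    );
  } else {
    for (const p of pipelines) {
      const t = Date.now();
      await client.ingest.putPipeline({ id: p.id, processors: p.processors });
      console.log(`${p.id}: ${Date.now() - t}ms`); // each ~200-500ms when serial
    }
  }
  console.log(`${mode} total: ${Date.now() - start}ms`); // roughly the same total either way
}
```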
@elastic/es-distributed @colings86 I believe you would be the best people to confirm whether my suspicions here are correct about these requests being blocked on cluster state updates. Is this analysis accurate, and if so, do you have any ideas for how we could improve this? I'm also happy to open a new issue in the ES repo to discuss this further.
@henningandersen Could you or someone on the distributed team help @joshdover investigate the behaviour he is seeing here? Something feels off: unless the cluster is under extremely heavy load, adding 5 ingest pipelines should not take ~9 seconds, even with the requests sent in parallel. @joshdover can you confirm the ES version you are using and the specs of the instance? Also, was this test done on an otherwise idle cluster?
This was tested against an 8.0.0-SNAPSHOT from last week on a local single-node cluster on my laptop with no other load at all. We're seeing similar performance in production instances of 7.14.x.

It's worth noting that there are other operations happening at the same time, outside this screenshot, that also likely require cluster state updates (PUTs on component templates and index templates). These exhibit the same behavior: each request is quite fast when run individually but very slow when we try to parallelize them. The total runtime is about the same whether we run the requests in parallel or serially.
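One way to gather more evidence for the cluster-state theory would be to poll Elasticsearch's pending cluster tasks API while setup runs. A sketch, again assuming the 8.x `@elastic/elasticsearch` client and a local node, with arbitrary polling parameters:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Poll the master node's queue of pending cluster state update tasks.
// A backlog of put-pipeline / put-template tasks while Fleet setup runs would
// support the theory that the latency comes from serialized cluster state
// updates rather than from Kibana-side queueing.
async function watchPendingTasks(intervalMs = 500, iterations = 60): Promise<void> {
  for (let i = 0; i < iterations; i++) {
    const { tasks } = await client.cluster.pendingTasks();
    for (const task of tasks) {
      console.log(`${task.source} (in queue for ${task.time_in_queue_millis}ms)`);
    }
    await new Promise((resolve) => setTimeout(resolve, intervalMs));
  }
}

watchPendingTasks().catch(console.error);
```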
We need to improve and continue to monitor the performance of the /api/fleet/setup endpoint. This API can take upwards of 40s, even in a local environment. More details TBD.

Related to #109072
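A quick way to measure this locally is to time the endpoint directly. The sketch below assumes a default local Kibana reachable with basic auth (the URL and credentials are placeholders) and Node 18+'s built-in fetch:

```ts
// Time a POST to the Fleet setup endpoint against a local Kibana.
// Adjust the URL and credentials for your environment; Kibana requires a
// kbn-xsrf header on write requests.
async function timeFleetSetup(): Promise<void> {
  const start = Date.now();
  const res = await fetch('http://localhost:5601/api/fleet/setup', {
    method: 'POST',
    headers: {
      'kbn-xsrf': 'true',
      Authorization: 'Basic ' + Buffer.from('elastic:changeme').toString('base64'),
    },
  });
  console.log(`status ${res.status}, took ${((Date.now() - start) / 1000).toFixed(1)}s`);
}

timeFleetSetup().catch(console.error);
```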