
[Index Management] TESTING Added logger to fetch indices route #126169

Closed

Conversation

yuliacech
Contributor

@yuliacech yuliacech commented Feb 22, 2022

This PR adds a logger to the Index Management "list/reload indices" route to see where the loading time is spent when a large list of indices is retrieved. The code is not intended to be merged; the main goal of this PR is to test indices list performance on Cloud (see #126242).

The logger is added at the following "checkpoints" (a rough sketch is shown after the list):

  1. Before and after the Get All Indices request to ES completes
  2. Before and after the Get Indices Stats request to ES completes
  3. Before and after each index data enricher:
  • ILM Explain lifecycle request
  • Rollup job capabilities request
  • CCR follower indices request
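
A rough sketch of what these checkpoints look like (the function name, log messages, and exact calls here are illustrative, not the actual route code):

import type { ElasticsearchClient, Logger } from 'src/core/server';

// Illustrative only: time each ES call and log the elapsed milliseconds.
async function fetchIndicesWithTimings(client: ElasticsearchClient, logger: Logger) {
  let start = Date.now();
  logger.info('Fetching all indices');
  const indices = await client.indices.get({ index: '*' });
  logger.info(`Fetched all indices in ${Date.now() - start}ms`);

  start = Date.now();
  logger.info('Fetching indices stats');
  const stats = await client.indices.stats({ index: '*' });
  logger.info(`Fetched indices stats in ${Date.now() - start}ms`);

  // The same before/after pattern wraps each index data enricher
  // (ILM explain, rollup capabilities, CCR follower info).
  return { indices, stats };
}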

@yuliacech yuliacech changed the title [Index Management] Added logger for Indices list requests [Index Management] Added logger for Indices list requests TEST Feb 23, 2022
@yuliacech yuliacech changed the title [Index Management] Added logger for Indices list requests TEST [Index Management] TESTING Added logger to fetch indices route Feb 23, 2022
@yuliacech yuliacech marked this pull request as ready for review February 23, 2022 14:21
@yuliacech yuliacech requested a review from a team as a code owner February 23, 2022 14:21
@sebelga
Contributor

sebelga commented Feb 23, 2022

Not sure how precise the Logger is; what I had in mind was to use console.time with labels (https://www.geeksforgeeks.org/node-js-console-time-method/).
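
For illustration, something like the following (the client call here is just a stand-in for whatever the route already does):

console.time('getIndices');
const indices = await client.indices.get({ index: '*' });
console.timeEnd('getIndices'); // prints e.g. "getIndices: 1234.567ms"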

@tylersmalley
Contributor

Not sure if it's related to the changes here - but the Kibana instance in the cloud deployment ran out of memory and was restarted.

@tylersmalley
Contributor

I am actually thinking it's most likely related to this change, as it's happened three more times since and we have yet to experience it on any other deployments.

@yuliacech
Contributor Author

Thanks a lot for checking on this deployment, @tylersmalley!
I added the logger to log some info only when the indices list is loaded in Kibana, so I'm wondering if Kibana running out of memory could be related to something else, for example monitoring being enabled on the deployment 5 days ago.
Also, I added 1000 small indices (only 1 doc each) to test indices list performance; could this be related as well?

@yuliacech
Contributor Author

@elasticmachine merge upstream

@jbudz
Member

jbudz commented Feb 28, 2022

Logs are mostly filled with Elasticsearch GC. Mind if we scale the cluster up? I'm hoping we can get stack monitoring working again to keep a closer eye on memory.

@yuliacech
Contributor Author

Sure, that would be great, @jbudz! I currently have 1000 indices, but I'm planning to run my test with a larger number of indices: 2000, 3000, and so on, maybe up to 5000-10000. Do you think the deployment is having problems because of that?

@jbudz
Member

jbudz commented Feb 28, 2022

It's hard to say with the current logs. I see random spikes that look pretty consistent with user access from a browser (4 hours ago, for example) and browser refreshes, so I'm wondering if that's possible. Adding more shards could definitely help us narrow that down.

Bumping ES to 4GB - will report back once it's back up.

@jbudz
Member

jbudz commented Feb 28, 2022

Okay it's back up - I'll keep a tab open to monitor.

@kibana-ci
Collaborator

kibana-ci commented Feb 28, 2022

💔 Build Failed

Failed CI Steps

Test Failures

  • [job] [logs] Jest Tests #6 / [Index management API Routes] fetch indices lib function data stream index
  • [job] [logs] Jest Tests #6 / [Index management API Routes] fetch indices lib function frozen index
  • [job] [logs] Jest Tests #6 / [Index management API Routes] fetch indices lib function hidden index
  • [job] [logs] Jest Tests #6 / [Index management API Routes] fetch indices lib function index missing in stats call
  • [job] [logs] Jest Tests #6 / [Index management API Routes] fetch indices lib function index with aliases
  • [job] [logs] Jest Tests #6 / [Index management API Routes] fetch indices lib function regular index

Metrics [docs]

Unknown metric groups

ESLint disabled in files

id before after diff
apm 15 14 -1

ESLint disabled line counts

id before after diff
apm 85 82 -3

Total ESLint disabled count

id before after diff
apm 100 96 -4

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@yuliacech
Contributor Author

FYI, I'm currently adding another 1000 indices to the deployment to test performance with 2000 indices.

@yuliacech
Contributor Author

@jbudz I'm at about 5000 small indices in the deployment and would like to get to 10,000 to complete my testing, but I've started getting this error when creating new indices:

{"error":{"root_cause":[{"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [1920271724/1.7gb], which is larger than the limit of [1860802969/1.7gb], real usage: [1920271544/1.7gb], new bytes reserved: [180/180b], usages [fielddata=914/914b, request=24019000/22.9mb, inflight_requests=180/180b, model_inference=0/0b, eql_sequence=0/0b]","bytes_wanted":1920271724,"bytes_limit":1860802969,"durability":"TRANSIENT"}],"type":"circuit_breaking_exception","reason":"[parent] Data too large, data for [<http_request>] would be [1920271724/1.7gb], which is larger than the limit of [1860802969/1.7gb], real usage: [1920271544/1.7gb], new bytes reserved: [180/180b], usages [fielddata=914/914b, request=24019000/22.9mb, inflight_requests=180/180b, model_inference=0/0b, eql_sequence=0/0b]","bytes_wanted":1920271724,"bytes_limit":1860802969,"durability":"TRANSIENT"},"status":429}

Do you know what that might be related to, and whether it's possible to reconfigure the deployment to handle this?

@jbudz
Member

jbudz commented Mar 3, 2022

I just bumped the cluster to 8 GB of RAM. It's definitely the number of shards/indices that's causing things to slow down. Given we're the only client at the moment, I expect Kibana isn't very friendly to heavily sharded deployments.

It could be a lot of things: alerting, monitoring loading a list of all indices (1 MB+ per XHR request, auto-reloading every 10 seconds), and so on.

This is probably something that should be added to our performance working group - cc @tylersmalley @danielmitterdorfer. Recap: a 4 GB cluster with 1000-5000 one-document indices is going OOM with ~1 active Kibana user.

@yuliacech
Contributor Author

Thank you @jbudz!
Here is also the script that I use to add the indices:

#!/bin/bash

# Creates single-doc test indices numbered from START to COUNT against HOST.
USERNAME=${USERNAME:-elastic}
PASSWORD=${PASSWORD:-password}
COUNT=${COUNT:-1}
START=${START:-1}
HOST=${HOST:-"https://kibana-pr-126169.es.us-west2.gcp.elastic-cloud.com:9243"}

# Raise the shard limit so the cluster accepts thousands of single-shard indices.
curl -X PUT -u "$USERNAME:$PASSWORD" "$HOST/_cluster/settings" -H "Content-Type: application/json" -d '{ "persistent": { "cluster.max_shards_per_node": "6000" } }'

for i in $(seq "$START" "$COUNT")
do
  echo
  echo "test_index_$i - create index"
  # Create the index with no replicas to keep the shard count down.
  curl -X PUT -u "$USERNAME:$PASSWORD" "$HOST/test_index_$i" -H "Content-Type: application/json" -d'
  {
    "settings": {
      "index": {
        "number_of_replicas": 0
      }
    }
  }
  '
  echo
  echo "test_index_$i - add doc"
  # Index a single document so the index is not empty.
  curl -X PUT -u "$USERNAME:$PASSWORD" "$HOST/test_index_$i/_doc/1" -H "Content-Type: application/json" -d'
  {
    "timestamp": 21347237412
  }
  '
done
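
For reference, the script is driven by the env vars it reads, so it can be invoked as, for example, COUNT=2000 START=1001 PASSWORD=<password> ./create_test_indices.sh (the file name is arbitrary), which creates test_index_1001 through test_index_2000.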

@danielmitterdorfer
Member

This is probably something that should be added to our performance working group - cc @tylersmalley @danielmitterdorfer. Recap: a 4 GB cluster with 1000-5000 one-document indices is going OOM with ~1 active Kibana user.

Improving the efficiency of Elasticsearch with many indices/shards is being actively tackled by the Elasticsearch distributed team at the moment. See, for example, the blog post Three ways we've improved Elasticsearch scalability for recent improvements in 7.16. Note that these improvements do not change our recommended number of shards per GB of RAM yet, though.

For this targeted one-off test, I could see two options:

  1. We increase the cluster size
  2. Instead of testing against a real Elasticsearch cluster, we mock the responses (a rough sketch follows below)
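
For option 2, a rough sketch of what mocking could look like in a Jest test (fakeClient and the simplified response shapes are made up for illustration, not the actual test code):

// Fabricate a large number of indices instead of calling a real cluster.
const fakeIndices = Object.fromEntries(
  Array.from({ length: 10000 }, (_, i) => [`test_index_${i}`, { aliases: {}, settings: {} }])
);

const fakeClient = {
  indices: {
    get: jest.fn().mockResolvedValue(fakeIndices),
    stats: jest.fn().mockResolvedValue({ indices: {} }),
  },
};

// The fetch-indices handler would then be exercised with fakeClient in place
// of the real scoped Elasticsearch client.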

@yuliacech
Contributor Author

I have now tested with about 8000 indices and I think the results give us some good insights into what we can work on to improve indices list performance. I will add all the findings to #126242.
This PR can be closed and the deployment deleted. Thanks a lot for the support, @jbudz!
@danielmitterdorfer Yes, the scenario I was testing is not currently recommended for deployments; this PR was to research the limitations of Index Management in the context of the "many shards" project. I hope we can improve performance and handle many indices in Kibana in the future.

@yuliacech yuliacech closed this Mar 4, 2022
@tylersmalley tylersmalley added ci:cloud-deploy Create or update a Cloud deployment and removed ci:deploy-cloud labels Aug 17, 2022
@yuliacech yuliacech deleted the indices_list_performance_logger branch February 15, 2024 12:03