Elasticsearch client's maxSockets defaults to 256 #112756
Pinging @elastic/kibana-core (Team:Core)
cc @delvedor this should be as easy as changing the `maxSockets` default.
Correct.
You should configure a higher socket number. Be aware that if you configure … Some info about how the agent works: …
@kobelb I guess the best option is, as you suggested, to allow users to configure this value via the config file. Would the current …?
Let me be the devil's advocate. I'd suggest increasing …
The agent is configured to use the available sockets in a LIFO fashion, which means that if there is a peak of requests and many sockets get opened, most of those will be closed quickly, so it's not a big issue to have a higher number of `maxSockets`.
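For illustration, a minimal sketch (not the client's actual internals) of a Node.js `http.Agent` set up the way this comment describes: keep-alive sockets handed out in LIFO order, so sockets opened only during a burst go idle and get closed while a small hot set keeps being reused. The numbers here are hypothetical.

```ts
import { Agent, get } from 'http';

// A keep-alive agent that hands out the most recently used free socket
// first (LIFO), so sockets opened only for a burst go idle and are closed.
const agent = new Agent({
  keepAlive: true,      // reuse sockets across requests
  scheduling: 'lifo',   // prefer the most recently released socket
  maxSockets: 1000,     // hypothetical cap, for illustration only
  maxFreeSockets: 256,  // idle sockets kept around after a burst
});

// Example request through the agent (Elasticsearch's root endpoint).
get({ host: 'localhost', port: 9200, path: '/', agent }, (res) => {
  res.resume(); // drain the response so the socket returns to the pool
});
```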
That's great information, @delvedor. It's revealed another difference in the defaults between the modern and the legacy clients. The legacy Elasticsearch client defaults to an unlimited socket pool (`maxSockets: Infinity`). As a result, the legacy client will allocate as many sockets as needed when there's a burst of HTTP requests and reuse the sockets for subsequent requests. I'd prefer we stick with this behavior for Kibana's use of the modern Elasticsearch client as well. Elasticsearch supports a massive number of open sockets for incoming requests, and I don't think we should be setting a low `maxSockets` limit on Kibana's side.
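A minimal sketch (not Kibana's actual wiring) of overriding the modern client's agent options to get the legacy behavior described above; the node URL is a placeholder:

```ts
import { Client } from '@elastic/elasticsearch';

// The `agent` option accepts http.Agent options that override the
// client's built-in defaults (which capped maxSockets at 256 at the time).
const client = new Client({
  node: 'http://localhost:9200', // placeholder
  agent: {
    keepAlive: true,      // reuse sockets, as the legacy client did
    maxSockets: Infinity, // no artificial cap on concurrent sockets
  },
});
```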
@mshustov if we default to using `maxSockets: Infinity` … IMO, there's benefit in allowing users to configure it. There are situations where it could benefit Kibana's performance by setting `maxSockets` to a lower value.
I couldn't find a precise number either in the ES docs https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-network.html#tcp-settings or in the ES codebase. Do you know if there is any limit at all? Maybe it's an environment-specific configuration value?
I'm not opposed to it, considering that it was the default value in the legacy ES client.
The ES client falls back to …:

kibana/src/core/server/elasticsearch/client/client_config.ts, lines 76 to 80 in cc1598c
Kibana plugins can create a custom client instance with `keepAlive: true`, but none of them use this option:
elasticsearch.createClient('my-client', { keepAlive: true });
I'm not a big fan of increasing API surface until it's proven to be necessary.
Do you have examples in mind? I can think of higher memory pressure. If we see a real benefit in providing this config option, okay, let's add it. However, I don't recall us ever having had any requests to support adjusting `maxSockets`.

@kobelb do you want the Core team to take care of the enhancement, or do you have the capacity to work on the task yourself?
As far as I'm aware, it's environment-specific. Warning: I am not an expert on the following. For most network interfaces, there's a maximum limit of 65,535 sockets for each application, but to even get near this limit you have to first increase the maximum number of file descriptors that a process can use. I've had to further adjust my kernel settings to even get near this limit; the Gatling docs have some good guidance on this: https://gatling.io/docs/gatling/reference/current/general/operations/
@delvedor can you confirm this? Your prior comment made it sound like this wasn't the case.
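Relatedly, a hedged sketch of checking the per-process open-file limit mentioned above via Node's diagnostic report (POSIX platforms only; the report layout can vary by Node version):

```ts
// Inspect the process's open-file descriptor limit, which ultimately
// bounds how many sockets Node can keep open regardless of maxSockets.
const report = process.report?.getReport() as any;
const openFiles = report?.userLimits?.open_files;
console.log('open files (soft / hard):', openFiles?.soft, openFiles?.hard);
```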
I suggested that we add a configuration setting … If we want to treat the addition of the setting …
I put up #113644 to set the `maxSockets` default.
The client uses …
After a user reported their memory kept growing on 7.12 I did some investigation. 7.12 still had many uses of the legacy elasticsearch client, which uses … Until 7.16 (#113644) the new platform elasticsearch client used … So our current configuration of …

Testing on 7.16 with the following snippet:

const startTime = process.hrtime();
for (let i = 0; i < 10; i++) {
  const promises = new Array(10000).fill(null).map(() => client.indices.get({ index: '*' }));
  await Promise.allSettled(promises);
}
console.log('10*10k spike lasted: ', process.hrtime(startTime));

Resulted in the following for each configuration:
- keepAlive: false, maxSockets: 1000: …
- keepAlive: true, maxSockets: Infinity: …
- keepAlive: false, maxSockets: Infinity: …

This is an artificial scenario, so real-world conditions might differ, but it seems like a smaller number of maxSockets generally improves performance in terms of total time to complete and the event loop delay. So I think it's worthwhile prioritizing defaulting to 1000 and making this configurable.

It would also be very helpful to have visibility into the number of Elasticsearch connections from Kibana in e.g. the status API and monitoring UI.
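For completeness, a sketch of how the `client` used in the snippet above might be constructed for each configuration compared; the node URL and helper name are hypothetical, not part of the original comment:

```ts
import { Client } from '@elastic/elasticsearch';

// Build a client whose underlying HTTP agent uses the given settings.
function makeClient(keepAlive: boolean, maxSockets: number) {
  return new Client({
    node: 'http://localhost:9200', // placeholder
    agent: { keepAlive, maxSockets },
  });
}

const client = makeClient(false, 1000);        // keepAlive: false, maxSockets: 1000
// const client = makeClient(true, Infinity);  // keepAlive: true,  maxSockets: Infinity
// const client = makeClient(false, Infinity); // keepAlive: false, maxSockets: Infinity
```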
This is a great discovery, @rudolf. It makes me think that we need to add a configurable `maxSockets` setting sooner rather than later so we can specify this when it makes sense. Is increased memory usage the only indicator that we have that this is an issue?
Yeah, it's probably causing spikes in event loop delay, but we don't have a way to confirm and/or trace this under real-world conditions.
@rudolf do we log anything when the outbound request is initiated, before the socket is created? If so, we should be able to see a large number of outbound requests, right?
With `elasticsearch.logQueries` we only log when the request completes, but it can give us some idea of the volume. I didn't think about it previously, but we could also use cloud proxy logs: if we aggregate by cluster_id and Kibana IP into small buckets like 5 minutes, we could get an idea of which Kibanas had the most completing requests in a short period. If we start with the top results, we could see if they have corresponding event loop delays and also get an idea of just how many outgoing requests could be created in such a burst.
The ES client falls back to … #113644 made this logic explicit, but at the same time it changed …
I'm not sure yet how to prove 1000 is a good default; we need more real-world data to define a reasonable value.
👍
By querying our logs for high concurrent connections I came across several clusters affected by #96123. My current assumption is that this bug is the biggest reason for the high concurrent connection count and associated memory growth we've seen. I'm waiting to hear back from users reporting the high memory consumption to confirm that the workaround solves their problems.
This change came up again when we were looking into some SDHs with socket hang up errors.
I just double-checked the options that are passed when the legacy and new Elasticsearch clients create an instance of the HTTP agent:

legacy: …

new: …

@rudolf what am I missing? Where were you seeing …?
I think I might have made a mistake in my notes. From my artificial testing scenario the keepAlive value has a very small effect. The real problem is the `maxSockets` value.
So, do we believe setting this value to non-infinite will cause most/all of the socket hang up issues we see in deployments to go away? What value would be good to use here? 256?
Socket hang ups are most likely caused by large event loop delays. A spike in requests/new sockets from the Elasticsearch client is one cause of such large event loop delays, but it's hard to say how many real-world socket timeouts are due to this particular event loop hog.

Either way, it would be helpful to create a benchmark to try to tune this variable, but the best value will depend very much on the exact load on Kibana. If each outgoing request has a large CPU overhead (such as receiving a large response payload and processing it with a tight loop), then a lower maxSockets value will protect Kibana from event loop spikes and socket timeouts. But if each outgoing request has a low CPU overhead, then throughput could be increased by increasing maxSockets. (And this uncertainty is probably the main reason we haven't done anything to fix this yet.)

It might be useful to add logging for when we're reaching the limit of the HTTP agent socket pool and requests start being queued. That way we at least have a signal that the socket pool could be limiting performance, and if event loop delays are low we could recommend a customer increase this. This would also give us some real-world data to go by to try to tune it. If we can get the logging in place I would feel much more confident in shipping a lower value like 1000.
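A hedged sketch of the kind of logging suggested here, using the standard `http.Agent` bookkeeping (`sockets` for in-use sockets, `requests` for those queued waiting on a socket); the interval and message format are illustrative only:

```ts
import { Agent } from 'http';

// Periodically sample the agent's pools and warn when requests are
// queueing because maxSockets has been reached.
function monitorAgent(agent: Agent, intervalMs = 10_000) {
  setInterval(() => {
    const inUse = Object.values(agent.sockets).reduce((n, s) => n + (s?.length ?? 0), 0);
    const queued = Object.values(agent.requests).reduce((n, r) => n + (r?.length ?? 0), 0);
    if (queued > 0) {
      console.warn(`socket pool saturated: ${inUse} in use, ${queued} queued`);
    }
  }, intervalMs).unref(); // don't keep the process alive just for monitoring
}
```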
I had a very busy deployment running that was generating ~10 socket hang up (SHU) related errors per hour. I decided to try turning down maxSockets to see if it would help. It doesn't appear to have helped: I'm getting about the same number of SHU/hr so far, and the performance of the rules has degraded (execution duration has gone up).

The SHUs in this case may be due to how busy I made this deployment. I have 100 rules running at a 1s interval, and task manager set to 50 workers (default: 10) and a 1s polling interval (default: 3s). There's just too much stuff running concurrently in this deployment. It may be too artificial ...
The modern Elasticsearch client's `http.Agent` defaults to allowing a maximum of 256 sockets. However, the legacy Elasticsearch client allowed there to be an infinite number of sockets. Was this accidentally changed with the introduction of the modern Elasticsearch client? If so, we should probably revert the accidental change. I do see benefit in allowing this to be configured via the `kibana.yml` so we or our users can alter this behavior.

/cc @pmuellr
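To make the contrast concrete, a tiny sketch (not Kibana code) showing that a plain Node `http.Agent` has no socket cap of its own, so the 256 limit comes from the options the client passes to its agent:

```ts
import { Agent } from 'http';

// Node's own default: no cap on concurrent sockets per host.
console.log(new Agent().maxSockets); // Infinity

// The behavior described in this issue corresponds to an agent created
// with an explicit cap, e.g.:
const capped = new Agent({ keepAlive: true, maxSockets: 256 });
console.log(capped.maxSockets); // 256
```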