
Number of open connections spikes when retryable errors occur #2087

Closed
@dtuck9

Description

🐛 Bug Report

The retry logic appears to open connections without explicitly closing them, and consequently exhausts resources.

Example

We had an incident in which an ESS customer cluster reached its high disk watermark. The cluster began returning 408s and 429s, and then 5XX errors as it became overloaded and unable to respond. We have also seen this occur with auth errors (API key not found) via Beats agents.

The number of distinct clients hovered around 2,000 on average both before and during the incident, so the customer does not appear to have added clients that would explain the sudden increase in the number of connections.

With roughly 2,000 clients, connections reached the Proxy-imposed limit of 5,000 per Proxy on all 62 Proxies in the region, which means roughly 310,000 open connections from about 2,000 clients.
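As a quick check of that estimate, the arithmetic can be written out as below. The variable names are illustrative only; the figures are the ones quoted in this report, and the per-client number is an average, not a measured distribution.

```typescript
// Figures quoted in this report.
const connectionLimitPerProxy = 5000;
const proxiesInRegion = 62;
const distinctClients = 2000; // approximate

// Every Proxy sitting at its limit gives the total number of open connections.
const totalOpenConnections = connectionLimitPerProxy * proxiesInRegion;

// Average number of open connections held per client.
const connectionsPerClient = totalOpenConnections / distinctClients;

console.log(totalOpenConnections); // 310000
console.log(connectionsPerClient); // 155
```

So each client was holding on the order of 155 open connections on average.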

The graph below shows the distinct number of clients (green), concurrent connections per Proxy (blue), and the request errors as defined by status_code >= 400 (pink, at 1/1000 scale):

[Image: Screenshot 2023-11-29 at 1 10 37 PM]

https://platform-logging.kb.eastus.azure.elastic-cloud.com/app/r/s/e46ep

The top 2 user agents with the most open connections are:

[Image: Screenshot 2023-11-29 at 1 35 37 PM]

To Reproduce

Steps to reproduce the behavior:

  1. Create a small cluster in azure-eastus in ESS (I name a specific region only because our connection limits are configured per region based on the hardware type, and this region has a limit of 5000 concurrent connections)
  2. Create a few thousand clients
  3. Force a retryable error on the cluster

Expected behavior

Connections should be closed to avoid a buildup of open connections over time.
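To illustrate the expectation, here is a minimal sketch of a retry loop that releases each connection before the next attempt. This is not the client's actual retry implementation: `PooledConnection`, `isRetryable`, `requestWithRetry`, and `openFake` are hypothetical names, and the retryable set simply mirrors the status codes seen in the incident (408, 429, 5XX).

```typescript
// Hypothetical connection abstraction; NOT the @elastic/elasticsearch API.
interface PooledConnection {
  send(): Promise<number>; // resolves to an HTTP status code
  close(): void;
}

// Status codes observed during the incident: 408, 429, and 5XX (assumption).
function isRetryable(statusCode: number): boolean {
  return statusCode === 408 || statusCode === 429 ||
    (statusCode >= 500 && statusCode <= 599);
}

// Makes one attempt plus up to maxRetries retries, closing each
// connection before the next attempt is started.
async function requestWithRetry(
  openConnection: () => PooledConnection,
  maxRetries: number
): Promise<number> {
  let lastStatus = 0;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    const conn = openConnection();
    try {
      lastStatus = await conn.send();
    } finally {
      conn.close(); // always release, whether send() resolved or threw
    }
    if (!isRetryable(lastStatus)) break;
  }
  return lastStatus;
}

// Fake connection that always answers 429 and tracks how many are open.
let openCount = 0;
let peakOpen = 0;
function openFake(): PooledConnection {
  openCount++;
  peakOpen = Math.max(peakOpen, openCount);
  return {
    send: async () => 429,
    close: () => { openCount--; },
  };
}

requestWithRetry(openFake, 3).then((finalStatus) => {
  console.log(finalStatus, openCount, peakOpen);
});
```

With this shape, the number of connections open at once per request stays at one regardless of how many retries occur. In a real client, "close" might instead mean returning the socket to a keep-alive pool; the point is only that a failed attempt should not leave its connection open.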

Your Environment

  • node version: 6, 8, 10
  • @elastic/elasticsearch version: >=7.0.0
  • os: Mac, Windows, Linux
