
Number of open connections spikes when retryable errors occur #2087

Closed · dtuck9 opened this issue Nov 29, 2023 · 15 comments


dtuck9 commented Nov 29, 2023

🐛 Bug Report

The retry logic appears to open connections without explicitly closing them, and consequently exhausts resources.

Example

We had an incident in which an ESS customer cluster reached its high disk watermark, and the cluster began returning 408s and 429s, as well as 5XXs once the cluster became overloaded and unable to respond. We have also seen this occur with auth errors (API key not found) coming from Beats agents.

The distinct number of clients hovered around 2,000 on average before and during the incident, so the customer does not appear to have added clients that would explain the sudden increase in the number of connections.

With those ~2,000 clients, requests hit the Proxy-imposed limit of 5,000 concurrent connections per Proxy across all 62 Proxies in the region, or roughly 310,000 (62 × 5,000) open connections from 2,000 clients.

The graph below shows the distinct number of clients (green), concurrent connections per Proxy (blue), and the request errors as defined by status_code >= 400 (pink, at 1/1000 scale):

[screenshot of the graph described above]

https://platform-logging.kb.eastus.azure.elastic-cloud.com/app/r/s/e46ep

The top 2 user agents with the most open connections are:

[screenshot of the top two user agents by open connections]

To Reproduce

Steps to reproduce the behavior:

  1. Create a small cluster in azure-eastus in ESS (I name a specific region only because our connection limits are configured per region based on hardware type, and this one has a 5,000 concurrent connection limit)
  2. Create a few thousand clients
  3. Force a retryable error on the cluster (a minimal sketch of steps 2 and 3 follows this list)
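
A minimal sketch of steps 2 and 3, assuming a hypothetical ESS endpoint and API key, and that the cluster from step 1 is already returning retryable errors (408/429/5xx). Each Client keeps its own connection pool, so a few thousand of them issuing requests that fail with retryable errors should reproduce the connection build-up:

import { Client } from '@elastic/elasticsearch'

async function main () {
  // Hypothetical endpoint and credentials; substitute the azure-eastus ESS cluster from step 1.
  const clients = Array.from({ length: 2000 }, () => new Client({
    node: 'https://<cluster-id>.eastus.azure.elastic-cloud.com',
    auth: { apiKey: '<api-key>' }
  }))

  // While the cluster returns retryable errors, each request is retried
  // and open connections accumulate on the Proxy layer.
  await Promise.all(clients.map(client =>
    client.search({ index: 'test-index' }).catch(() => { /* errors are expected here */ })
  ))
}

main().catch(console.error)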

Expected behavior

Connections should be closed to avoid a build-up of open connections over time.

Your Environment

  • node version: 6, 8, 10
  • @elastic/elasticsearch version: >=7.0.0
  • os: Mac, Windows, Linux

dtuck9 commented Dec 5, 2023

I talked to Tomas about this, and I think I understand the current functionality. One potential improvement would be to add a backoff to the retries.


JoshMock commented Dec 6, 2023

Thanks for opening this @dtuck9. Did you see it happening in any 8.x versions of the JS client, even if it was less often?


dtuck9 commented Dec 6, 2023

Unfortunately given the scale of ESS, I am unable to identify/diagnose this issue on a widespread basis unless there's an incident or customer ticket.

I would assume that it's happening in 8.x, and based on having seen a similar issue originating from beats user agents, I'd also assume similar logic in the go client.


dtuck9 commented Feb 29, 2024

Original Beats SDH


JoshMock commented May 2, 2024

Going to pick this up next week for investigation/remediation.

From a Slack convo with @dgieselaar and some Kibana folks:

I’m seeing retries again after timeouts on Serverless. The request to Elasticsearch is canceled (by something, but I don’t know what) after a period of time (~30s), returns a 400 bad request, and the Kibana Node.js client ends up retrying the request four times, before eventually giving up (after ~120s).

the important thing to fix (in my eyes) is the retry-on-timeout issue, not necessarily that too many connections are opened when retrying at all

One thing that is good to mention here is that these requests are canceled. I'm not fully sure whether that is by the Elasticsearch client, the proxy, or Elasticsearch itself. I assume the former.

I believe that last detail, that the request is canceled, may be the critical point, especially if it's always happening at 30 seconds, since that is the default request timeout set by the client.

Both 7.x and 8.x versions of the client only retry by default for:

  • 502 status code
  • 503 status code
  • 504 status code
  • TimeoutError
  • ConnectionError

Those last two are the primary suspects. I need to dig into what conditions cause UndiciConnection or HttpConnection to throw those errors, and how the transport handles request cancellation. My leading theory is that when the timeout duration is reached, the transport cancels the request, but somehow that does not kill the retry loop.
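
A simplified paraphrase of that default retry condition, for illustration only (not the actual transport source), using the error classes the client exports:

import { errors } from '@elastic/elasticsearch'

// Retry only on 502/503/504 responses, TimeoutError, and ConnectionError.
function isRetryable (error: unknown, statusCode?: number): boolean {
  if (statusCode != null && [502, 503, 504].includes(statusCode)) return true
  if (error instanceof errors.TimeoutError) return true
  if (error instanceof errors.ConnectionError) return true
  return false
}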

JoshMock commented:

My leading theory is that when the timeout duration is reached, the transport cancels the request, but somehow that does not kill the retry loop.

This is true, but not by accident: all TimeoutErrors are retried intentionally, and that seems to be the reason.

It wouldn't be too hard to add a retryOnTimeout option to the transport to change this behavior as needed. The question I'm struggling with is whether changing its default behavior would be considered a breaking change. So the default value of that option may be up for debate.

What I can also do, as suggested above, is add an incremental backoff to the retries. Right now it retries immediately, so adding an exponential backoff with a little jitter could ease some pain. I'd feel comfortable enough introducing that as the default behavior without considering it a breaking change.

I'll start implementing the backoff, and discuss the default "retry on timeout" behavior with internal stakeholders (Kibana core, Cloud SRE, etc.).
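
An illustrative sketch of exponential backoff with full jitter, as described above; not the exact implementation that later landed in the transport:

// Delay grows exponentially with the attempt number, capped, then jittered.
function backoffMs (attempt: number, baseMs = 100, capMs = 30_000): number {
  const exp = Math.min(capMs, baseMs * 2 ** attempt)
  return Math.random() * exp // pick a random delay in [0, exp)
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))
// e.g. before retry attempt `n`: await sleep(backoffMs(n))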

JoshMock commented:

The general consensus from Kibana core and SRE folks who were involved in the conversation is that it's not a breaking change to turn off retry-on-timeout by default.

elastic/elastic-transport-js#100 turns it off, with a new option retryOnTimeout for those who wish to use the old behavior.

I also need to add a retryOnTimeout option in the client, and I'd like to introduce retry backoff as well, before closing this.


JoshMock commented Jun 6, 2024

Added retry backoff in elastic/elastic-transport-js#101. With that merged, I'll publish transport 8.6.0 shortly, and the 8.14 client will pick it up when it ships in the next few days.

Keeping this open until I can backport the retry changes to the 7.x client.


JoshMock commented Jun 6, 2024

Also, ended up not adding a retryOnTimeout option to the client. Living by YAGNI; those who need it can bring their own Transport instance to do that.

yukha-dw commented:

Also, ended up not adding a retryOnTimeout option to the client. Living by YAGNI; those who need it can bring their own Transport instance to do that.

Is it possible to enable retryOnTimeout by using this snippet?

import { Client } from '@elastic/elasticsearch';
import { Transport, TransportOptions } from '@elastic/transport';

// Re-enable retries on TimeoutError (the previous default behavior)
class CustomTransport extends Transport {
  constructor(opts: TransportOptions) {
    opts.retryOnTimeout = true;
    super(opts);
  }
}

new Client({
  node: 'http://localhost:9200', // plus whatever other client options you use
  Transport: CustomTransport,
});

JoshMock commented:

Yes, that's the strategy I was thinking about @yukha-dw.


JoshMock commented Jul 2, 2024

7.17.14 was just published to npm with retry on timeout disabled by default. See #2293.

Would love to see some validation from @dgieselaar or @dtuck9 that this fixes the problem, but I'm assuming a new 7.x release of Kibana will have to go out before that can happen.

Are y'all comfortable with me closing this in the meantime or should we keep it open until you can confirm things are better?


dtuck9 commented Jul 2, 2024

I'm not entirely sure how to validate it from our end unfortunately. I'm fine closing it and reopening if we see any issues for newer versions, so I'll defer to @dgieselaar's preference.

dgieselaar commented:

The same applies to me, @JoshMock: I'm fine with closing it, and I have no means right now to investigate whether the issue is resolved. It will take a while until this fix lands on the environments (serverless, IIRC) that were exhibiting the issue. @lukeelmers, is this perhaps something you can take note of? I don't mean to give you more homework, but you are probably more aware of changes there than I am.


JoshMock commented Jul 3, 2024

That's what I figured. Given that we are relatively confident this was due to retry-on-timeout being enabled by default, and that it has now been disabled by default, I'm comfortable keeping this closed unless, once the fix is rolled out broadly, we see clear proof that it did not solve the problem.

JoshMock closed this as completed on Jul 8, 2024.