S3 snapshots with timeout failures after upgrade to 5.5.2 #26576
Comments
@davekonopka Also, do you have any stack traces in the logs that would help diagnose this issue? Thanks
@tlrx cannot really think of any significant changes that would lead to something like this. Could it be a coincidence? How are the nodes doing in terms of CPU and memory during snapshots?
I wonder if it could have to do with #23952, where we now rely on the S3 client's retry mechanism instead of our own. Searching for the error on the internet turns up a few interesting GitHub issues, indicating some problems with the S3 client and its retry mechanism.
I've seen the following failure a few times in snapshots since originally reporting this:
It has happened a few times, but usually only for one index. Most snapshots show no failures.
We are seeing this randomly too. We have pretty large indexes on pretty busy nodes. Shouldn't it retry the shard if the upload fails?
The same is noticed on ES 5.5.0 as well. Has anyone identified a workaround, or S3 client settings that prevent these failures?
Hi, we are experiencing the same problem in our cluster (ES 5.5.2, JVM 1.8.0_131). We have seen this behavior quite often since the upgrade to 5.5.2 (we were on 2.4.4 before), and it basically follows the same pattern every time.

My guess is that the client connects to S3 successfully but, due to GC pauses, cannot upload (or finish uploading) all the data, and gets a request timeout from S3 (because of the perceived inactivity of the client during GC). I don't think the aws-sdk retries on that kind of error, and, as @ywelsch mentioned, since the ES snapshot plugin's own retry mechanism has been removed, the upload request is simply never retried.

Would it be possible, for example, to restore the pre-#23952 behavior as a configurable option of the snapshot repository?

Also, any idea why the snapshot takes so much memory? If we could address that, there would be no GC issue either, and this case would be solved as well.
I'm afraid there are two kinds of issues here, but they are heavily related.

The first one concerns the request timeout, and I think a first step towards resolution is to update the AWS SDK used by the repository-s3 plugin, as it is really old.

The second issue is memory consumption, and I think this is because the plugin initializes a 100MB byte array (if the node's RAM is > 2GB, otherwise 5% of the heap) for every single file to upload. This byte array was initialized with a fixed length of 5MB on 2.4. This is a bug and I'm testing a fix.

Finally, I think we could do even better and use the AWS SDK's utility class TransferManager to upload files to S3 (#26993). I expect this to be more resilient and efficient than the custom implementation we use, and it handles retries and multipart uploads. I'm also testing this and I'll update this issue as soon as I have more.
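For illustration only, here is a minimal sketch of what an upload through the SDK's TransferManager could look like. This is not the plugin's actual code; the bucket name, key, and file path are placeholders. TransferManager decides between a single PUT and a multipart upload based on file size and retries failed parts on its own.

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.transfer.TransferManager;
import com.amazonaws.services.s3.transfer.TransferManagerBuilder;
import com.amazonaws.services.s3.transfer.Upload;

import java.io.File;

public class TransferManagerSketch {
    public static void main(String[] args) throws InterruptedException {
        // Credentials and region are resolved from the default provider chains.
        AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
        TransferManager tm = TransferManagerBuilder.standard()
                .withS3Client(s3)
                .build();
        try {
            // Placeholder bucket, key and file; large files are split into
            // multipart uploads automatically.
            Upload upload = tm.upload("my-snapshot-bucket", "indices/0/segment_1",
                    new File("/tmp/segment_1"));
            upload.waitForCompletion(); // blocks until the upload succeeds or fails
        } finally {
            tm.shutdownNow(); // also shuts down the underlying S3 client
        }
    }
}
```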
The AWS SDK already handles retry logic, and the bit removed in #23952 was just multiplying the number of retries. You should be able to restore similar behavior by increasing the repositories.s3.max_retries setting.
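As a minimal sketch of the SDK-level knob this maps to (assuming the plugin passes max_retries through to the client's ClientConfiguration, which I haven't verified here; the region and values are placeholders):

```java
import com.amazonaws.ClientConfiguration;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

public class RetryConfigSketch {
    public static void main(String[] args) {
        // maxErrorRetry is the SDK-level retry count: throttling and 5xx
        // responses are retried up to this many times by the client itself.
        ClientConfiguration config = new ClientConfiguration()
                .withMaxErrorRetry(10)
                .withSocketTimeout(50_000); // milliseconds
        AmazonS3 s3 = AmazonS3ClientBuilder.standard()
                .withClientConfiguration(config)
                .withRegion(Regions.US_EAST_1) // placeholder region
                .build();
        System.out.println("Client configured with " + config.getMaxErrorRetry() + " retries");
    }
}
```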
Thanks for the reply @tlrx! Regarding the memory consumption, are you referring to the buffer_size setting?
@dsun811 To be transparent, there is no out-of-the-box right value for the buffer_size setting; you have to experiment yourself. A user reported that decreasing the buffer_size ...
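For anyone experimenting, this is a sketch of registering an S3 repository with an explicit buffer_size using the low-level Java REST client. The repository name, bucket, and values are placeholders, and whether a smaller buffer helps will depend on your heap and shard sizes.

```java
import org.apache.http.HttpHost;
import org.apache.http.entity.ContentType;
import org.apache.http.nio.entity.NStringEntity;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

import java.io.IOException;
import java.util.Collections;

public class RegisterS3RepositorySketch {
    public static void main(String[] args) throws IOException {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Placeholder repository settings; buffer_size caps the in-memory buffer
            // used per upload, so a smaller value trades memory for more upload parts.
            String body = "{"
                    + "\"type\": \"s3\","
                    + "\"settings\": {"
                    + "\"bucket\": \"my-snapshot-bucket\","
                    + "\"buffer_size\": \"5mb\","
                    + "\"chunk_size\": \"1gb\""
                    + "}}";
            Response response = client.performRequest(
                    "PUT", "/_snapshot/my_s3_repo",
                    Collections.<String, String>emptyMap(),
                    new NStringEntity(body, ContentType.APPLICATION_JSON));
            System.out.println(response.getStatusLine());
        }
    }
}
```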
@tlrx Just wanted to get back to you on this issue and give you some feedback. We tried decreasing the buffer_size ...
Same issue on Elasticsearch 6.2.
Elasticsearch version (bin/elasticsearch --version): 5.5.2
Plugins installed: discovery-ec2, repository-s3
JVM version (java -version):
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version (uname -a if on a Unix-like system): Amazon Linux on EC2 instances
Linux 4.9.43-17.38.amzn1.x86_64 #1 SMP Thu Aug 17 00:20:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Steps to reproduce:
I recently upgraded a few clusters in different environments from 5.2.2 to 5.5.2. Since doing so, one of the clusters has been running into timeout failures when creating snapshots to S3. I've had a few successful snapshots, and the other clusters show no failures, so I know it does work. However, most runs produce at least one failed shard with the same timeout error. Incidentally, this has been limited to our production cluster, which has the most (and largest) indices.
Provide logs (if relevant):
Some data redacted with ... below.