
S3 snapshots with timeout failures after upgrade to 5.5.2 #26576

Closed
davekonopka opened this issue Sep 11, 2017 · 15 comments
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@davekonopka

Elasticsearch version (bin/elasticsearch --version):
5.5.2

Plugins installed:
discovery-ec2
repository-s3

JVM version (java -version):
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (uname -a if on a Unix-like system):
Amazon Linux on EC2 instances
Linux 4.9.43-17.38.amzn1.x86_64 #1 SMP Thu Aug 17 00:20:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Steps to reproduce:

I recently upgraded a few clusters in different environments from 5.2.2 -> 5.5.2. Since doing so, one of the clusters has been running into timeout failures when creating snapshots to S3. I've had a few successful snapshots, and the other clusters have no failures, so I know it does work. However, most runs produce at least one failed shard with the same timeout error. Incidentally, this has been limited to our production cluster, which has the most/largest indices.

Provide logs (if relevant):

Some data redacted with ... below.

     "failures": [
        {
          "index": "...-2017.09.11",
          "index_uuid": "...-2017.09.11",
          "shard_id": 4,
          "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/.../4/__e]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: BB3062E801AD4513)]; ",
          "node_id": "...",
          "status": "INTERNAL_SERVER_ERROR"
        }
      ],
     "failures": [
        {
          "index": "...-2017.09.10",
          "index_uuid": "...-2017.09.10",
          "shard_id": 0,
          "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/.../0/__1]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: 4C030EBC4EB49F51)]; ",
          "node_id": "...",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "...-2017.08.30",
          "index_uuid": "...-2017.08.30",
          "shard_id": 0,
          "reason": "IndexShardSnapshotFailedException[Failed to write file list]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/.../0/pending-index-11]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: 1562D61EBA5696BD)]; ",
          "node_id": "...",
          "status": "INTERNAL_SERVER_ERROR"
        }
      ],
@tlrx
Member

tlrx commented Sep 12, 2017

The Snapshot/Restore feature had many changes between 5.2.2 and 5.5.2, but I don't see anything "obvious" that would explain the timeouts (#24403 comes to mind, but it was fixed in 5.5.2). Maybe @ywelsch or @imotov have an idea?

@tlrx tlrx added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Sep 12, 2017
@tlrx
Member

tlrx commented Sep 12, 2017

@davekonopka Also, do you have any stack traces in the logs that would help diagnose this issue? Thanks

@imotov
Contributor

imotov commented Sep 13, 2017

@tlrx I cannot really think of any significant changes that would lead to something like this. Could it be a coincidence? How are the nodes doing in terms of CPU and memory during snapshots?

@ywelsch
Contributor

ywelsch commented Sep 13, 2017

I wonder if it could have to do with #23952, where we now rely on the S3 client's retry mechanism instead of our own.

Searching for the error on the internet turns up a few interesting GH issues, indicating some issues with the S3 client and the retry mechanism:

aws/aws-sdk-java#1101
aws/aws-sdk-js#281

@davekonopka
Author

davekonopka commented Sep 22, 2017

I've seen the following failure a few times in snapshots since originally reporting this:

{
  "index": "logs-xxxxxx-2017.09.18",
  "index_uuid": "logs-xxxxx-2017.09.18",
  "shard_id": 2,
  "reason": "IndexShardSnapshotFailedException[Failed to write file list]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/gud2dTkJRzqbZTif1Ve7Qg/2/pending-index-1]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: A117F2F8595389C3)]; ",
  "node_id": "mGyXndzGQ-uuvfrqGGj0lw",
  "status": "INTERNAL_SERVER_ERROR"
}

It has happened a few times but usually only for one index. Most snapshots show no failures.

@jdoss

jdoss commented Oct 2, 2017

We are seeing this randomly on 5.4.1 too.

      "snapshot" : "redacted",
      "uuid" : "redacted",
      "version_id" : 5040199,
      "version" : "5.4.1",
      "indices" : [
        redacted
      ],
      "state" : "PARTIAL",
      "start_time" : "2017-09-30T07:00:03.316Z",
      "start_time_in_millis" : 1506754803316,
      "end_time" : "2017-09-30T15:11:11.651Z",
      "end_time_in_millis" : 1506784271651,
      "duration_in_millis" : 29468335,
      "failures" : [
        {
          "index" : "redacted",
          "index_uuid" : "redacted",
          "shard_id" : 21,
          "reason" : "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object site//indices/DwB4qsgrRMeIhVeZ2axpqA/21/__h2]; nested: AmazonS3Exception[MD5 local [twETo0o8O+HG8x5xyg2AaQ==], remote [4Zm2BGZ8PW+saqe+W0uw1g==] are not equal... (Service: null; Status Code: 0; Error Code: null; Request ID: null)]; ",
          "node_id" : "_7DaMJafSEO77nTozfkS4w",
          "status" : "INTERNAL_SERVER_ERROR"
        }
      ],
      "shards" : {
        "total" : 406,
        "failed" : 1,
        "successful" : 405
      }
    }
  ]

We have pretty large indexes on pretty busy nodes. Shouldn't it retry this shard if the upload fails?

@subbu-sw

The same is seen on ES 5.5.0 as well. Has anyone identified a workaround, or S3 client settings that will prevent these failures?
{ "snapshots": [ { "snapshot": "curator-20171023215205", "uuid": "k5FKxN4USoiYS-c6Uj7vBg", "version_id": 5050099, "version": "5.5.0", "indices": [ ..., ], "state": "PARTIAL", "start_time": "2017-10-23T21:52:06.913Z", "start_time_in_millis": 1508795526913, "end_time": "2017-10-23T23:02:56.303Z", "end_time_in_millis": 1508799776303, "duration_in_millis": 4249390, "failures": [ { "index": "merch_best_sellers_2017_10_23", "index_uuid": "merch_best_sellers_2017_10_23", "shard_id": 0, "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object indices/saal_rkoTCelrgfhSOcpzw/0/__i]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: C80CED59E934C14E)]; ", "node_id": "BAARTEh6ReKhiv7-CCmXwA", "status": "INTERNAL_SERVER_ERROR" ...

@dsun811

dsun811 commented Nov 2, 2017

Hi,

We are experiencing the same problem in our cluster (ES 5.5.2, JVM 1.8.0_131). We have seen this behavior quite often since the upgrade to 5.5.2 (we were on 2.4.4 before).

The behavior we observe basically follows this pattern:
(1) When the snapshot starts, there is a sharp increase in heap usage (we saw up to +4GB)
(2) An old-gen GC (that can last up to 30 seconds) happens on some node(s) due to the increased heap usage
(3) The node then reports a snapshot failure: "Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout ...)"
(4) The resulting snapshot ends in the PARTIAL state

My guess is that the client connects to S3 successfully but, because of the GC, cannot upload (or finish uploading) all the data and gets a request timeout from S3 (due to the perceived client inactivity during the GC). I think the aws-sdk does not retry that kind of error, and, just as @ywelsch mentioned, since the ES snapshot plugin's retry mechanism was removed, the upload request is simply not retried.
I am also wondering why there is such a large increase in heap usage during snapshots (as far as I remember, I didn't observe such behavior in ES 2.4).

Would it be possible, for example, to restore the behavior before #23952 as a configurable option of the snapshot repository?

Any idea why the snapshot would take so much memory? (If we could address this problem, there would be no GC issue either, and that case would be solved as well.)

@tlrx
Member

tlrx commented Nov 2, 2017

I'm afraid there are two kinds of issues here, but they are heavily related. The first one concerns the request timeout, and I think a first step toward resolution is to update the AWS SDK used by the repository-s3 plugin, as it is really old.

The second issue I see is the memory consumption, and I think this is because the plugin allocates a 100MB byte array (if the node's RAM is > 2GB; otherwise 5% of the heap) for every single file to upload. This byte array had a fixed length of 5MB in 2.4. This is a bug and I'm testing a fix.

Finally, I think that we could do even better and use the AWS SDK's TransferManager utility class to upload files to S3 (#26993). I expect this to be more resilient and efficient than the custom implementation we use, and it handles retries and multipart uploads. I'm also testing this and I'll update this issue as soon as I have more.

Would it be possible, for example, to restore the behavior before #23952 as a configurable option of the snapshot repository?

The AWS SDK already handles retry logic, and the bit removed in #23952 was just multiplying the number of retries. You should be able to restore similar behavior by increasing the repositories.s3.max_retries setting.
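
For illustration, a minimal sketch of what bumping that node-level setting could look like in elasticsearch.yml; the value shown is arbitrary, not a recommendation, and as a static node setting it should need a node restart to take effect:

    # elasticsearch.yml -- node-level default applied to S3 repositories
    # 10 is an arbitrary example; tune it to how long your GC pauses / uploads run
    repositories.s3.max_retries: 10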

@dsun811

dsun811 commented Nov 2, 2017

Thanks for the reply @tlrx! Regarding the memory consumption, are you referring to the buffer_size repository setting? (https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/repository-s3-repository.html) Would it be better if I set it to a lower value, e.g. 5mb?

@tlrx tlrx added >bug and removed feedback_needed labels Nov 2, 2017
@tlrx
Member

tlrx commented Nov 2, 2017

@dsun811 To be transparent, there's no right out-of-the-box value for the buffer_size setting; you have to experiment yourself. A user reported that decreasing the buffer_size value gives better results on nodes with 4GB of RAM.
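
As a concrete sketch, re-registering the repository with a smaller buffer could look like the request below; the repository name, bucket, and the 5mb value (the fixed buffer size 2.4 used, as noted above) are illustrative, not recommendations:

    PUT _snapshot/my_s3_repository
    {
      "type": "s3",
      "settings": {
        "bucket": "elasticsearch-snapshots",
        "buffer_size": "5mb"
      }
    }

Note that, as far as I understand, the PUT replaces the whole repository definition, so any other existing settings need to be repeated in the request.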

@dsun811

dsun811 commented Nov 7, 2017

@tlrx Just wanted to get back to you on that issue and give you some feedback. We tried decreasing the buffer_size setting; we are not experiencing the heavy GC anymore, and all our snapshots have been successful since then. Thanks a lot for your time and the different links, it helped a lot!

@tlrx
Member

tlrx commented Nov 9, 2017

@dsun811 Thanks for your feedback too :) I created #27280, which removes most of the memory allocations; only the AWS SDK client will buffer up to 16MB when uploading files. That should help too.

@tlrx
Member

tlrx commented Jan 8, 2018

#27278 and #27280 have been merged into v5.6.5, v6.0.1, and v6.1.0. They should reduce the timeout issues reported here.

I'm closing this issue; if it appears again, feel free to open a new issue and link back to this one.

@tlrx tlrx closed this as completed Jan 8, 2018
@iahmad-khan

Same issue on Elasticsearch 6.2:

"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out]; nested: SdkClientException[Unable to execute HTTP request: Read timed out]; nested: SocketTimeoutException[Read timed out]; ",
          "node_id": "oDyxks_7Suu9rNVqzZM6-Q",
          "status": "INTERNAL_SERVER_ERROR"
