
S3 snapshots with timeout failures after upgrade to 5.5.2 #26576

Closed
davekonopka opened this issue Sep 11, 2017 · 15 comments
Labels
>bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments

@davekonopka

Elasticsearch version (bin/elasticsearch --version):
5.5.2

Plugins installed:
discovery-ec2
repository-s3

JVM version (java -version):
java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (uname -a if on a Unix-like system):
Amazon Linux on EC2 instances
Linux 4.9.43-17.38.amzn1.x86_64 #1 SMP Thu Aug 17 00:20:39 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Steps to reproduce:

I recently upgraded a few clusters in different environments from 5.2.2 -> 5.5.2. Since doing so, one of the clusters has been running into timeout failures when creating snapshots to S3. I've had a few successful snapshots, and the other clusters have no failures, so I know it does work. However, most runs produce at least one failed shard with the same timeout error. Incidentally, this has been limited to our production cluster, which has the most/largest indices.

Provide logs (if relevant):

Some data redacted with ... below.

     "failures": [
        {
          "index": "...-2017.09.11",
          "index_uuid": "...-2017.09.11",
          "shard_id": 4,
          "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/.../4/__e]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: BB3062E801AD4513)]; ",
          "node_id": "...",
          "status": "INTERNAL_SERVER_ERROR"
        }
      ],
     "failures": [
        {
          "index": "...-2017.09.10",
          "index_uuid": "...-2017.09.10",
          "shard_id": 0,
          "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/.../0/__1]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: 4C030EBC4EB49F51)]; ",
          "node_id": "...",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "...-2017.08.30",
          "index_uuid": "...-2017.08.30",
          "shard_id": 0,
          "reason": "IndexShardSnapshotFailedException[Failed to write file list]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/.../0/pending-index-11]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: 1562D61EBA5696BD)]; ",
          "node_id": "...",
          "status": "INTERNAL_SERVER_ERROR"
        }
      ],
@tlrx
Member

tlrx commented Sep 12, 2017

The Snapshot/Restore feature had many changes between 5.2.2 and 5.5.2, but I don't see anything "obvious" that would explain the timeouts (#24403 comes to mind, but it was fixed in 5.5.2). Maybe @ywelsch or @imotov have an idea?

@tlrx tlrx added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Sep 12, 2017
@tlrx
Member

tlrx commented Sep 12, 2017

@davekonopka Also, do you have any stack traces in the logs that would help diagnose this issue? Thanks

@imotov
Contributor

imotov commented Sep 13, 2017

@tlrx I cannot really think of any significant changes that would lead to something like this. Could it be a coincidence? How are the nodes doing in terms of CPU and memory during snapshots?

@ywelsch
Contributor

ywelsch commented Sep 13, 2017

I wonder if it could have to do with #23952, where we now rely on the S3 client's retry mechanism instead of our own.

Searching for the error on the internet turns up a few interesting GH issues, indicating some issues with the S3 client and the retry mechanism:

aws/aws-sdk-java#1101
aws/aws-sdk-js#281

@davekonopka
Author

davekonopka commented Sep 22, 2017

I've seen the following failure a few times in snapshots since originally reporting this:

{
  "index": "logs-xxxxxx-2017.09.18",
  "index_uuid": "logs-xxxxx-2017.09.18",
  "shard_id": 2,
  "reason": "IndexShardSnapshotFailedException[Failed to write file list]; nested: IOException[Unable to upload object elasticsearch-snapshots/indices/gud2dTkJRzqbZTif1Ve7Qg/2/pending-index-1]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: A117F2F8595389C3)]; ",
  "node_id": "mGyXndzGQ-uuvfrqGGj0lw",
  "status": "INTERNAL_SERVER_ERROR"
}

It has happened a few times but usually only for one index. Most snapshots show no failures.

@jdoss

jdoss commented Oct 2, 2017

We are seeing this randomly on 5.4.1 too.

      "snapshot" : "redacted",
      "uuid" : "redacted",
      "version_id" : 5040199,
      "version" : "5.4.1",
      "indices" : [
        redacted
      ],
      "state" : "PARTIAL",
      "start_time" : "2017-09-30T07:00:03.316Z",
      "start_time_in_millis" : 1506754803316,
      "end_time" : "2017-09-30T15:11:11.651Z",
      "end_time_in_millis" : 1506784271651,
      "duration_in_millis" : 29468335,
      "failures" : [
        {
          "index" : "redacted",
          "index_uuid" : "redacted",
          "shard_id" : 21,
          "reason" : "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object site//indices/DwB4qsgrRMeIhVeZ2axpqA/21/__h2]; nested: AmazonS3Exception[MD5 local [twETo0o8O+HG8x5xyg2AaQ==], remote [4Zm2BGZ8PW+saqe+W0uw1g==] are not equal... (Service: null; Status Code: 0; Error Code: null; Request ID: null)]; ",
          "node_id" : "_7DaMJafSEO77nTozfkS4w",
          "status" : "INTERNAL_SERVER_ERROR"
        }
      ],
      "shards" : {
        "total" : 406,
        "failed" : 1,
        "successful" : 405
      }
    }
  ]

We have pretty large indexes on pretty busy nodes. Shouldn't it retry this shard if the upload fails?

@subbu-sw

The same is seen on ES 5.5.0 as well. Has anyone identified a workaround, or S3 client settings that will prevent these failures?
{ "snapshots": [ { "snapshot": "curator-20171023215205", "uuid": "k5FKxN4USoiYS-c6Uj7vBg", "version_id": 5050099, "version": "5.5.0", "indices": [ ..., ], "state": "PARTIAL", "start_time": "2017-10-23T21:52:06.913Z", "start_time_in_millis": 1508795526913, "end_time": "2017-10-23T23:02:56.303Z", "end_time_in_millis": 1508799776303, "duration_in_millis": 4249390, "failures": [ { "index": "merch_best_sellers_2017_10_23", "index_uuid": "merch_best_sellers_2017_10_23", "shard_id": 0, "reason": "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: IOException[Unable to upload object indices/saal_rkoTCelrgfhSOcpzw/0/__i]; nested: AmazonS3Exception[Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout; Request ID: C80CED59E934C14E)]; ", "node_id": "BAARTEh6ReKhiv7-CCmXwA", "status": "INTERNAL_SERVER_ERROR" ...

@dsun811

dsun811 commented Nov 2, 2017

Hi,

We are experiencing the same problem in our cluster (ES 5.5.2, JVM 1.8.0_131). We have seen this behavior quite often since the upgrade to 5.5.2 (we were on 2.4.4 before).

The behavior we observe basically follows this pattern:
(1) When the snapshot starts, there is a sharp increase in heap usage (we saw up to +4GB)
(2) An old-gen GC (that can last up to 30 seconds) happens on some node(s) due to the increased heap usage
(3) The node then reports a snapshot failure: "Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed. (Service: Amazon S3; Status Code: 400; Error Code: RequestTimeout ...)"
(4) The resulting snapshot ends in the PARTIAL state

My guess is that the client connects to S3 successfully but, because of the GC, cannot upload (or finish uploading) all the data and gets a request timeout from S3 (due to the perceived client inactivity during the GC). I think the aws-sdk does not retry that kind of error, and, just as @ywelsch mentioned, since the ES snapshot plugin's retry mechanism was removed, the upload request is simply not retried.
I am also wondering why there is such a large increase in heap usage during snapshots (as far as I remember, I didn't observe such behavior in ES 2.4).

Would it be possible, for example, to restore the behavior before #23952 as a configurable option of the snapshot repository?

Any idea why the snapshot would take so much memory? (If we could address this problem, there would be no GC issue either, and that case would be solved as well.)

@tlrx
Member

tlrx commented Nov 2, 2017

I'm afraid there are two kinds of issues here, but they are heavily related. The first one concerns the request timeout, and I think a first step toward resolution is to update the AWS SDK used by the repository-s3 plugin, as it is really old.

The second issue I see is the memory consumption, and I think this is because the plugin allocates a 100MB byte array (if the node's RAM is > 2GB; otherwise 5% of the heap) for every single file to upload. This byte array had a fixed length of 5MB in 2.4. This is a bug and I'm testing a fix.

Finally, I think that we could do even better and use the AWS SDK's TransferManager utility class to upload files to S3 (#26993). I expect this to be more resilient and efficient than the custom implementation we use, and it handles retries and multipart uploads. I'm also testing this and I'll update this issue as soon as I have more.

Would it be possible, for example, to restore the behavior before #23952 as a configurable option of the snapshot repository?

The AWS SDK already handles retry logic, and the bit removed in #23952 was just multiplying the number of retries. You should be able to restore similar behavior by increasing the repositories.s3.max_retries setting.
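
For illustration, a minimal sketch of what bumping that node-level setting could look like in elasticsearch.yml; the value shown is arbitrary, not a recommendation, and as a static node setting it should need a node restart to take effect:

    # elasticsearch.yml -- node-level default applied to S3 repositories
    # 10 is an arbitrary example; tune it to how long your GC pauses / uploads run
    repositories.s3.max_retries: 10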

@dsun811

dsun811 commented Nov 2, 2017

Thanks for the reply @tlrx! Regarding the memory consumption, are you referring to the buffer_size repository setting? (https://www.elastic.co/guide/en/elasticsearch/plugins/5.5/repository-s3-repository.html) Would it be better if I set it to a lower value, e.g. 5mb?

@tlrx tlrx added >bug and removed feedback_needed labels Nov 2, 2017
@tlrx
Member

tlrx commented Nov 2, 2017

@dsun811 To be transparent, there's no right out-of-the-box value for the buffer_size setting; you have to experiment yourself. A user reported that decreasing the buffer_size value gives better results on nodes with 4GB of RAM.
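
As a concrete sketch, re-registering the repository with a smaller buffer could look like the request below; the repository name, bucket, and the 5mb value (the fixed buffer size 2.4 used, as noted above) are illustrative, not recommendations:

    PUT _snapshot/my_s3_repository
    {
      "type": "s3",
      "settings": {
        "bucket": "elasticsearch-snapshots",
        "buffer_size": "5mb"
      }
    }

Note that, as far as I understand, the PUT replaces the whole repository definition, so any other existing settings need to be repeated in the request.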

@dsun811

dsun811 commented Nov 7, 2017

@tlrx Just wanted to get back to you on that issue and give you some feedback. We tried decreasing the buffer_size setting; we are not experiencing the heavy GC anymore, and all our snapshots have been successful since then. Thanks a lot for your time and the different links, it helped a lot!

@tlrx
Member

tlrx commented Nov 9, 2017

@dsun811 Thanks for your feedback too :) I created #27280, which removes most of the memory allocations; only the AWS SDK client will buffer up to 16MB when uploading files. That should help too.

@tlrx
Member

tlrx commented Jan 8, 2018

#27278 and #27280 have been merged into v5.6.5, v6.0.1, and v6.1.0. They should reduce the timeout issues reported here.

I'm closing this issue; if it appears again, feel free to open a new issue and link back to this one.

@tlrx tlrx closed this as completed Jan 8, 2018
@iahmad-khan

Same issue on Elasticsearch 6.2:

"reason": "IndexShardSnapshotFailedException[com.amazonaws.SdkClientException: Unable to execute HTTP request: Read timed out]; nested: SdkClientException[Unable to execute HTTP request: Read timed out]; nested: SocketTimeoutException[Read timed out]; ",
          "node_id": "oDyxks_7Suu9rNVqzZM6-Q",
          "status": "INTERNAL_SERVER_ERROR"
