
Read timeout errors when restoring a snapshot from an S3 repository leave Elasticsearch hanging #8280

Closed
JoeZ99 opened this issue Oct 29, 2014 · 13 comments
Labels
>bug discuss :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs

Comments


JoeZ99 commented Oct 29, 2014

During a restore of a snapshot stored in an S3 repository, chances are that a read timeout error occurs while communicating with S3. We don't know whether S3 or Elasticsearch is to blame.

The issue is that this timeout should be translated into some kind of error response from Elasticsearch, but Elasticsearch just keeps waiting for an answer that never comes. As a consequence, besides leaving the application code hanging, the cluster remains stuck in a fake recovery process, which prevents it from running other recoveries.

The endpoint being called is:

http://elasticsearch.server.com/_snapshot/repository_name/snapshot_name/_restore?wait_for_completion=true
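
For reference, a minimal sketch of how such a restore call can be issued (the host, repository, and snapshot names are the placeholders from the URL above, not real values):

```python
import requests

# Placeholder host and names, taken from the URL above.
resp = requests.post(
    "http://elasticsearch.server.com/_snapshot/repository_name/snapshot_name/_restore",
    params={"wait_for_completion": "true"},
)
print(resp.status_code, resp.text)
```

This is the read timeout error, as seen in the Elasticsearch logs: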

[2014-10-23 16:37:50,005][INFO ][snapshots ] [Whiplash] snapshot [backups-6:4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730] is done
[2014-10-23 16:37:50,218][INFO ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632] deleting index
[2014-10-23 16:37:50,236][INFO ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_phonetic_ts20141023163747520269] deleting index
[2014-10-23 16:44:28,557][WARN ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632] re-syncing mappings with cluster state for types [[product]]
[2014-10-23 16:44:28,558][WARN ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_phonetic_ts20141023163747520269] re-syncing mappings with cluster state for types [[product]]
[2014-10-23 16:46:24,886][WARN ][indices.cluster ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed recovery
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] restore failed
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to restore snapshot [4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730]
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:158)
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] Failed to recover index
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:741)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:155)
... 5 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at java.security.DigestInputStream.read(DigestInputStream.java:161)
at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.elasticsearch.cloud.aws.blobstore.AbstractS3BlobContainer$1.run(AbstractS3BlobContainer.java:99)
... 3 more
[2014-10-23 16:46:24,887][WARN ][cluster.action.shard ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] sending failed shard for [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0], node[4dZyhkKySw-HpxhLjDYp4A], [P], restoring[backups-6:4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730], s[INITIALIZING], indexUUID [P6kfnE_UTt6Caae5KAIXHg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed recovery]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] restore failed]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to restore snapshot [4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730]]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]
[2014-10-23 16:46:24,887][WARN ][cluster.action.shard ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] received shard failed for [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0], node[4dZyhkKySw-HpxhLjDYp4A], [P], restoring[backups-6:4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730], s[INITIALIZING], indexUUID [P6kfnE_UTt6Caae5KAIXHg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed recovery]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] restore failed]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to restore snapshot [4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730]]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]


@clintongormley clintongormley added >bug :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Oct 30, 2014
imotov (Contributor) commented Oct 31, 2014

When a restore operation is interrupted by an error, the shard that was being restored is most likely in a broken state. So Elasticsearch automatically tries to create this shard on another node, in the hope that it will be successful next time. This process doesn't stop until 1) the shard is successfully restored or 2) the index is explicitly deleted by the user. So, if there is a communication issue with S3, Elasticsearch will retry for as long as needed until S3 recovers. In other words, it's not a "fake" recovery process; it's a recovery process from the S3 repository, and it follows the same principles as recovery from a local gateway.
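
To make the two exit conditions concrete, here is a minimal sketch (the host name is hypothetical, the index name is taken from the logs above) of how to watch the retry loop and, if necessary, stop it by deleting the index:

```python
import requests

ES = "http://localhost:9200"  # hypothetical host; adjust for your cluster

# Watch the shard cycling through INITIALIZING/failed states while the
# restore keeps being retried.
print(requests.get(ES + "/_cat/recovery?v").text)

# The loop only ends when the restore finally succeeds, or when the index
# being restored is explicitly deleted (condition 2 above).
index = "4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632"  # index from the logs
requests.delete(ES + "/" + index)
```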

JoeZ99 (Author) commented Oct 31, 2014

So, if I understood you right, when there is an error during a restore, Elasticsearch will retry the restore using another node as the shard's destination, I guess because Elasticsearch assumes the error during the first attempt was somehow related to the node it first tried to restore the shard to.

  • What if you only have one node? (That's precisely our case.) Does Elasticsearch keep trying on the same node?
  • Even if that's the case (Elasticsearch keeps trying on the same node, since there is only one), it's not working in our scenario. The "read timeout" S3 error is not a permanent condition; it just happens from time to time, probably when too many requests are being made to S3 (we haven't nailed it down yet). This means that immediately after this error, S3 is accessible again, and if Elasticsearch "retried" the restore it would succeed, but that's not what is happening. Once this error occurs, no recovery can be made unless we explicitly delete the index.

imotov (Contributor) commented Nov 2, 2014

Yes, in the case of a single node it might not retry restoring the shard to the same node. That's the common recovery/restore logic, and in most cases I think it makes sense. Whether we should add some sort of detection of retryable errors that would be handled differently, and where that should reside (core or the S3 plugin), is a good question. We need to discuss that.
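
Purely as an illustration of the idea being discussed, not the plugin's actual implementation: a minimal sketch of treating read timeouts as retryable and backing off exponentially between attempts (the function name and limits are hypothetical):

```python
import socket
import time

def read_blob_with_retry(read_blob, max_retries=3, base_delay=0.5):
    """Call read_blob(), retrying transient read timeouts with exponential backoff."""
    for attempt in range(max_retries + 1):
        try:
            return read_blob()
        except socket.timeout:
            if attempt == max_retries:
                raise  # retries exhausted; surface the error to the caller
            time.sleep(base_delay * (2 ** attempt))
```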

@imotov imotov added the discuss label Nov 2, 2014
imotov (Contributor) commented Nov 3, 2014

After a brief discussion with @tlrx, we came to the conclusion that it might make sense to address this as part of elastic/elasticsearch-cloud-aws#125. We can add a retry check in the S3 layer, possibly with common retry logic at the BlobStore level. Reassigning it to @tlrx.

@imotov imotov assigned tlrx and unassigned imotov Nov 3, 2014
tlrx added a commit to tlrx/elasticsearch-cloud-aws that referenced this issue Nov 4, 2014
This commit adds a retry logic when reading blobs from S3. It also adds a retry logic when initializing a multipart upload and sets the internal "max retries" parameter of the Amazon S3 client with the same value as the "max_retry" parameter set for the snapshot repository (so in worst cases with the default value set to 3, 3x3=9 attempts will be made). The internal S3 client uses an exponential back off strategy between each connection exception (mainly IOException).

Closes elastic/elasticsearch#8280
tlrx added a commit to tlrx/elasticsearch-cloud-aws that referenced this issue Nov 10, 2014
This commit adds a retry logic when reading blobs from S3. It also adds a retry logic when initializing a multipart upload and sets the internal "max retries" parameter of the Amazon S3 client with the same value as the "max_retry" parameter set for the snapshot repository (so in worst cases with the default value set to 3, 3x3=9 attempts will be made). The internal S3 client uses an exponential back off strategy between each connection exception (mainly IOException).

Closes elastic/elasticsearch#8280
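
For context, a minimal sketch of how the repository-level retry setting described in the commit message might be configured when registering an S3 repository. The host and bucket are placeholders, the repository name is taken from the logs above, and the exact setting key (the commit message calls it "max_retry") should be checked against the cloud-aws plugin documentation:

```python
import requests

ES = "http://localhost:9200"  # hypothetical host

# Register an S3 snapshot repository with a repository-level retry count.
requests.put(
    ES + "/_snapshot/backups-6",
    json={
        "type": "s3",
        "settings": {
            "bucket": "my-snapshots-bucket",  # placeholder bucket name
            "max_retries": 3,                 # retry count; 3 is the default mentioned above
        },
    },
)
```
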
tlrx (Member) commented Nov 21, 2014

Commit elastic/elasticsearch-cloud-aws@ea91adf should help resolve this issue.

When elasticsearch-cloud-aws is released for 1.4, could you please test this again and reopen the issue if needed?

Thanks

tlrx added a commit to elastic/elasticsearch-cloud-aws that referenced this issue Nov 24, 2014
This commit adds a retry logic when reading blobs from S3. It also adds a retry logic when initializing a multipart upload and sets the internal "max retries" parameter of the Amazon S3 client with the same value as the "max_retry" parameter set for the snapshot repository (so in worst cases with the default value set to 3, 3x3=9 attempts will be made). The internal S3 client uses an exponential back off strategy between each connection exception (mainly IOException).

Closes elastic/elasticsearch#8280
Closes #140.
tlrx added a commit to elastic/elasticsearch-cloud-aws that referenced this issue Nov 24, 2014
This commit adds a retry logic when reading blobs from S3. It also adds a retry logic when initializing a multipart upload and sets the internal "max retries" parameter of the Amazon S3 client with the same value as the "max_retry" parameter set for the snapshot repository (so in worst cases with the default value set to 3, 3x3=9 attempts will be made). The internal S3 client uses an exponential back off strategy between each connection exception (mainly IOException).

Closes elastic/elasticsearch#8280
Closes #140.

cregev commented Dec 29, 2014

When is this issue going to be fixed?

JoeZ99 (Author) commented Jan 2, 2015

@tlrx, I've tested with the AWS plugin 2.4.1 and Elasticsearch 1.4.1 and see the same behavior, so I'll reopen as instructed.

JoeZ99 (Author) commented Jan 2, 2015

@tlrx, I don't know how to reopen this issue.

@dadoonet dadoonet reopened this Jan 2, 2015
dadoonet (Member) commented Jan 2, 2015

@JoeZ99 reopened.


cregev commented Jan 5, 2015

Is there a workaround?

JoeZ99 (Author) commented Jan 5, 2015

The only"workaround" for us has been to make the API call with the
"wait_for_completion" parameter, and oh it takes more than 5 minutes, issue
a delete for the indices being restored to unlock the cluster and assume a
timeout read error
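
For what it's worth, a minimal sketch of that workaround (the host, repository, snapshot, and index names are placeholders like those in the endpoint above, and the 5-minute cutoff is our own choice):

```python
import requests

ES = "http://elasticsearch.server.com"  # placeholder host from the endpoint above

try:
    # Block on the restore, but give up after roughly five minutes.
    requests.post(
        ES + "/_snapshot/repository_name/snapshot_name/_restore",
        params={"wait_for_completion": "true"},
        timeout=300,
    )
except requests.exceptions.Timeout:
    # Assume an S3 read timeout hit: delete the half-restored indices so the
    # cluster is unlocked and other recoveries can proceed.
    requests.delete(ES + "/index_being_restored")  # placeholder index name
```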

tlrx (Member) commented Apr 3, 2015

clintongormley (Contributor) commented:

Closing in favour of elastic/elasticsearch-cloud-aws#149 (comment)
