Read timeout errors when restoring a snapshot from an S3 repository leave elasticsearch hanging #8280
Comments
When a restore operation is interrupted because of an error, the shard that was being restored is most likely in a broken state. So elasticsearch automatically tries to create this shard on another node, in the hope that it will be successful next time. This process doesn't stop until 1) the shard is successfully restored or 2) the index is explicitly deleted by the user. So, if there are communication issues with S3, elasticsearch will retry for as long as needed until S3 recovers. In other words, it's not a "fake" recovery process; it's a recovery process from the S3 repository, and it follows the same principles as the recovery process from a local gateway.
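For reference, a minimal sketch of how to break out of that retry loop by deleting the index explicitly, assuming a node on localhost:9200 and a hypothetical index name:

```sh
# Delete the index being restored to stop the endless retry loop
# (host and index name are placeholders)
curl -XDELETE 'http://localhost:9200/my_restored_index'

# Confirm that no shards are left initializing afterwards
curl -XGET 'http://localhost:9200/_cluster/health?pretty'
```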
So, if I understood you right, when there is an error during a restore, elasticsearch will retry the restore using another node as the shard's destination, I guess because elasticsearch assumes the error during the first attempt was somehow related to the node it tried to restore the shard to first.
Yes. In the case of a single node, it might not retry restoring the shard to the same node. That's the common recovery/restore logic, and in most cases I think it makes sense. Should we add some sort of detection of retry-able errors that would be processed differently, and where should it reside, in core or in the S3 plugin? That's a good question. We need to discuss that.
After a brief discussion with @tlrx, we came to the conclusion that it might make sense to address this as part of elastic/elasticsearch-cloud-aws#125. We can add a retry check in the S3 layer, with possible common retry logic at the BlobStore level. Reassigning to @tlrx.
This commit adds retry logic when reading blobs from S3. It also adds retry logic when initializing a multipart upload, and sets the internal "max retries" parameter of the Amazon S3 client to the same value as the "max_retry" parameter set for the snapshot repository (so in the worst case, with the default value of 3, 3x3=9 attempts will be made). The internal S3 client uses an exponential back-off strategy between connection exceptions (mainly IOExceptions). Closes elastic/elasticsearch#8280
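For context, a minimal sketch of registering the S3 repository with an explicit retry count (host, repository name and bucket are placeholders, and the exact setting name, shown here as max_retries, may differ between plugin versions):

```sh
# Register an S3 snapshot repository with an explicit retry count
# ("max_retries" is assumed here; check the elasticsearch-cloud-aws docs for your version)
curl -XPUT 'http://localhost:9200/_snapshot/backups-6' -d '{
  "type": "s3",
  "settings": {
    "bucket": "my-snapshot-bucket",
    "max_retries": 3
  }
}'
```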
Commit elastic/elasticsearch-cloud-aws@ea91adf should help to resolve this issue. When elasticsearch-cloud-aws is released for 1.4, could you please test this again and reopen if needed? Thanks
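For anyone retesting once a 1.4-compatible plugin is out, the upgrade would look roughly like this; a sketch only, since the exact plugin-script syntax and version number depend on the elasticsearch release:

```sh
# Replace the existing cloud-aws plugin with a 1.4-compatible release
# (2.4.1 is an example version, not necessarily the one containing the fix)
bin/plugin --remove cloud-aws
bin/plugin --install elasticsearch/elasticsearch-cloud-aws/2.4.1
```

The node needs a restart before the new plugin version takes effect.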
This commit adds retry logic when reading blobs from S3. It also adds retry logic when initializing a multipart upload, and sets the internal "max retries" parameter of the Amazon S3 client to the same value as the "max_retry" parameter set for the snapshot repository (so in the worst case, with the default value of 3, 3x3=9 attempts will be made). The internal S3 client uses an exponential back-off strategy between connection exceptions (mainly IOExceptions). Closes elastic/elasticsearch#8280 Closes #140.
When is this issue going to be fixed?
@tlrx, I've tested with the AWS plugin 2.4.1 and elasticsearch 1.4.1 and the behavior is the same, so I'll reopen as instructed.
@tlrx, I don't know how to reopen this issue.
@JoeZ99 reopened.
Is there a workaround?
The only "workaround" for us has been to make the API call with the
Follow up on elastic/elasticsearch-cloud-aws#149 (comment)
Closing in favour of elastic/elasticsearch-cloud-aws#149 (comment)
During a restore of a snapshot stored in an S3 repository, chances are that a Read Timeout error occurs when communicating with S3. I don't know whether S3 or Elasticsearch is to blame.
The issue is that the timeout error should be translated into some kind of error response from elasticsearch, but elasticsearch just keeps waiting for an answer that never comes. As a consequence, besides leaving the application code blocked, the cluster remains in a "fake" recovery process, preventing it from allowing other recoveries.
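For illustration, a minimal sketch of how the stuck state can be observed with the cat APIs, assuming a node on localhost:9200 (Elasticsearch 1.x):

```sh
# Shards whose restore keeps failing stay in INITIALIZING and show up here
curl -XGET 'http://localhost:9200/_cat/shards?v'

# Ongoing (and endlessly retried) recoveries from the snapshot repository
curl -XGET 'http://localhost:9200/_cat/recovery?v'
```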
This is the timeout error, as seen in the elasticsearch logs:
the endpoint is
[2014-10-23 16:37:50,005][INFO ][snapshots ] [Whiplash] snapshot [backups-6:4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730] is done
[2014-10-23 16:37:50,218][INFO ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632] deleting index
[2014-10-23 16:37:50,236][INFO ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_phonetic_ts20141023163747520269] deleting index
[2014-10-23 16:44:28,557][WARN ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632] re-syncing mappings with cluster state for types [[product]]
[2014-10-23 16:44:28,558][WARN ][cluster.metadata ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_phonetic_ts20141023163747520269] re-syncing mappings with cluster state for types [[product]]
[2014-10-23 16:46:24,886][WARN ][indices.cluster ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to start shard
org.elasticsearch.index.gateway.IndexShardGatewayRecoveryException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed recovery
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:185)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] restore failed
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:130)
at org.elasticsearch.index.gateway.IndexShardGatewayService$1.run(IndexShardGatewayService.java:127)
... 3 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to restore snapshot [4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730]
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:158)
at org.elasticsearch.index.snapshots.IndexShardSnapshotAndRestoreService.restore(IndexShardSnapshotAndRestoreService.java:124)
... 4 more
Caused by: org.elasticsearch.index.snapshots.IndexShardRestoreFailedException: [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] Failed to recover index
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository$RestoreContext.restore(BlobStoreIndexShardRepository.java:741)
at org.elasticsearch.index.snapshots.blobstore.BlobStoreIndexShardRepository.restore(BlobStoreIndexShardRepository.java:155)
... 5 more
Caused by: java.net.SocketTimeoutException: Read timed out
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:204)
at org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:182)
at org.apache.http.conn.EofSensorInputStream.read(EofSensorInputStream.java:138)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at java.security.DigestInputStream.read(DigestInputStream.java:161)
at com.amazonaws.services.s3.internal.DigestValidationInputStream.read(DigestValidationInputStream.java:59)
at com.amazonaws.internal.SdkFilterInputStream.read(SdkFilterInputStream.java:71)
at java.io.FilterInputStream.read(FilterInputStream.java:107)
at org.elasticsearch.cloud.aws.blobstore.AbstractS3BlobContainer$1.run(AbstractS3BlobContainer.java:99)
... 3 more
[2014-10-23 16:46:24,887][WARN ][cluster.action.shard ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] sending failed shard for [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0], node[4dZyhkKySw-HpxhLjDYp4A], [P], restoring[backups-6:4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730], s[INITIALIZING], indexUUID [P6kfnE_UTt6Caae5KAIXHg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed recovery]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] restore failed]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to restore snapshot [4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730]]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]
[2014-10-23 16:46:24,887][WARN ][cluster.action.shard ] [Whiplash] [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] received shard failed for [4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0], node[4dZyhkKySw-HpxhLjDYp4A], [P], restoring[backups-6:4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730], s[INITIALIZING], indexUUID [P6kfnE_UTt6Caae5KAIXHg], reason [Failed to start shard, message [IndexShardGatewayRecoveryException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed recovery]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] restore failed]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] failed to restore snapshot [4af7ac9316f168daccf8af5f42b6271e_ts20141023163749251730]]; nested: IndexShardRestoreFailedException[[4af7ac9316f168daccf8af5f42b6271e_ts20141023163747416632][0] Failed to recover index]; nested: SocketTimeoutException[Read timed out]; ]]