Report more details of unobtainable ShardLock #61255

DaveCTurner · 2020-08-18T09:03:12Z

Today a common reason for a ShardLockObtainFailedException is when a
shard is removed from a node and then assigned straight back to it again
before the node has had a chance to shut the previous shard instance
down. For instance, this can happen if a node briefly leaves the cluster
holding a primary with no in-sync replicas.

The message in this case is typically as follows:

obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]

This is pretty hard to interpret, and doesn't raise the important
question: "why didn't the shard shut down sooner?"

With this change we reword the message a bit, report the age of the
shard lock, and adjust the details to report that the lock is held by a
closing shard:

obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [12345ms]
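
For illustration only, here is a minimal sketch of the idea: record the acquisition time alongside the lock details, then report both when a later attempt times out. The class `DetailedLockSketch`, the `Holder` pair, and the use of `IllegalStateException` are assumptions made for this example; the real change lives in `NodeEnvironment` and throws `ShardLockObtainFailedException`.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for a per-shard lock; not the actual NodeEnvironment code.
final class DetailedLockSketch {

    // Immutable pair so a timed-out thread reads the time and details consistently.
    private static final class Holder {
        final long acquiredAtNanos;
        final String details;

        Holder(long acquiredAtNanos, String details) {
            this.acquiredAtNanos = acquiredAtNanos;
            this.details = details;
        }
    }

    private final Semaphore mutex = new Semaphore(1);
    private volatile Holder lockDetails = new Holder(System.nanoTime(), "no lock taken yet");

    // Single place that records why the lock is held and when it was taken.
    private void setDetails(String details) {
        lockDetails = new Holder(System.nanoTime(), details);
    }

    void acquire(String reason, long timeoutInMillis) {
        final boolean acquired;
        try {
            acquired = mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while obtaining shard lock for [" + reason + "]", e);
        }
        if (acquired) {
            setDetails(reason);
            return;
        }
        // Report who holds the lock and for how long, so the reader can ask
        // why the previous holder has not released it yet.
        final Holder held = lockDetails;
        final long ageMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - held.acquiredAtNanos);
        throw new IllegalStateException("obtaining shard lock for [" + reason + "] timed out after ["
            + timeoutInMillis + "ms], lock already held for [" + held.details + "] with age [" + ageMillis + "ms]");
    }

    void release() {
        mutex.release();
    }
}
```

In this sketch, calling `acquire("closing shard", 5000)` and then `acquire("starting shard", 5000)` without an intervening `release()` fails with a message of the new shape, naming the closing shard as the holder and giving the age of its lock.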

Relates #38807

elasticmachine · 2020-08-18T09:03:14Z

Pinging @elastic/es-distributed (:Distributed/Store)

original-brownbear

LGTM

original-brownbear · 2020-08-18T09:55:00Z

server/src/main/java/org/elasticsearch/env/NodeEnvironment.java

            try {
                if (mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS)) {
-                    lockDetails = details;
+                    lockDetails = Tuple.tuple(System.nanoTime(), details);


NIT: setDetails(details);

dakrone

LGTM

DaveCTurner · 2020-08-19T05:36:02Z

Thanks both

DaveCTurner added the >enhancement, :Distributed Indexing/Store, v8.0.0, and v7.10.0 labels Aug 18, 2020

DaveCTurner requested review from dakrone and original-brownbear August 18, 2020 09:03

elasticmachine added the Team:Distributed (Obsolete) label Aug 18, 2020

original-brownbear approved these changes Aug 18, 2020

DaveCTurner added 2 commits August 18, 2020 10:59

Merge branch 'master' into 2020-08-18-log-shard-lock-age

ab94d42

CR

48bf874

dakrone approved these changes Aug 18, 2020

DaveCTurner merged commit 98213df into elastic:master Aug 19, 2020

DaveCTurner deleted the 2020-08-18-log-shard-lock-age branch August 19, 2020 05:36

Mpdreamz mentioned this pull request Nov 16, 2020

7.10.1 Meta Ticket elastic/elasticsearch-net#5096

Closed

61 tasks

stevejgordon mentioned this pull request Dec 17, 2020

7.11.0 Meta Ticket elastic/elasticsearch-net#5198

Closed

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021
