-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Report more details of unobtainable ShardLock #61255
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Report more details of unobtainable ShardLock #61255
Conversation
Today a common reason for a `ShardLockObtainFailedException` is when a
shard is removed from a node and then assigned straight back to it again
before the node has had a chance to shut the previous shard instance
down. For instance, this can happen if a node briefly leaves the cluster
holding a primary with no in-sync replicas.
The message in this case is typically as follows:
obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]
This is pretty hard to interpret, and doesn't raise the important
question: "why didn't the shard shut down sooner?"
With this change we reword the message a bit, report the age of the
shard lock, and adjust the details to report that the lock is held by a
closing shard:
obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [12345ms]
Relates elastic#38807
|
Pinging @elastic/es-distributed (:Distributed/Store) |
original-brownbear
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| try { | ||
| if (mutex.tryAcquire(timeoutInMillis, TimeUnit.MILLISECONDS)) { | ||
| lockDetails = details; | ||
| lockDetails = Tuple.tuple(System.nanoTime(), details); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
NIT: setDetails(details);
dakrone
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Thanks both |
Today a common reason for a `ShardLockObtainFailedException` is when a
shard is removed from a node and then assigned straight back to it again
before the node has had a chance to shut the previous shard instance
down. For instance, this can happen if a node briefly leaves the cluster
holding a primary with no in-sync replicas.
The message in this case is typically as follows:
obtaining shard lock timed out after 5000ms, previous lock details: [shard creation] trying to lock for [shard creation]
This is pretty hard to interpret, and doesn't raise the important
question: "why didn't the shard shut down sooner?"
With this change we reword the message a bit, report the age of the
shard lock, and adjust the details to report that the lock is held by a
closing shard:
obtaining shard lock for [starting shard] timed out after [5000ms], lock already held for [closing shard] with age [12345ms]
Relates #38807
Today a common reason for a
ShardLockObtainFailedExceptionis when ashard is removed from a node and then assigned straight back to it again
before the node has had a chance to shut the previous shard instance
down. For instance, this can happen if a node briefly leaves the cluster
holding a primary with no in-sync replicas.
The message in this case is typically as follows:
This is pretty hard to interpret, and doesn't raise the important
question: "why didn't the shard shut down sooner?"
With this change we reword the message a bit, report the age of the
shard lock, and adjust the details to report that the lock is held by a
closing shard:
Relates #38807