Simplify shard write operations #4955
Conversation
```go
// RangeID might have been renewed by the same host while this update was in flight
// Retry the operation if we still have the shard ownership
if currentRangeID != s.getRangeID() {
	continue Create_Loop
```
This path can never be hit, because the rangeID is protected by a lock, and the lock is held during the write operation
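For readers following along, here is a minimal sketch of why the branch is unreachable when the lock is held across the write. All names are hypothetical, loosely modeled on the identifiers in the diff, not the actual code:

```go
package main

import "sync"

// Hypothetical shard context: rangeID is guarded by the embedded mutex.
type shardContext struct {
	sync.Mutex
	rangeID int64
}

// getRangeID must be called with the lock held.
func (s *shardContext) getRangeID() int64 { return s.rangeID }

// renewRangeLocked is the only writer, and it also requires the lock.
func (s *shardContext) renewRangeLocked(newID int64) { s.rangeID = newID }

func (s *shardContext) write(doWrite func(rangeID int64) error) error {
	s.Lock()
	defer s.Unlock() // held across the whole write

	currentRangeID := s.getRangeID()
	if err := doWrite(currentRangeID); err != nil {
		// Dead branch: nothing can call renewRangeLocked while we hold the
		// lock, so the rangeID cannot have changed since the snapshot above.
		if currentRangeID != s.getRangeID() {
			return nil // unreachable "retry" path
		}
		return err
	}
	return nil
}

func main() {
	s := &shardContext{rangeID: 1}
	_ = s.write(func(int64) error { return nil })
}
```

Since `renewRangeLocked` needs the same lock the writer already holds, the snapshot and the re-read can never disagree.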
> // RangeID might have been renewed by the same host while this update was in flight
> // Retry the operation if we still have the shard ownership

Hmm. This comment seems to imply an async update in this process, and `s.rangeID` is an atomic var... which is never read from, only written to. `s.getRangeID()` reads a different (non-atomic) field.

I agree that this only changes in `renewRangeLocked`, and:
- before: this should never be hit, because it set `currentRangeID := s.getRangeID()` before entering this switch
- now: this can never be hit, because it doesn't loop at all, so it releases and re-acquires the lock before re-reading (no worse than before)

But this does make me wonder if the whole `contextImpl` may be flawed / using the wrong field or something. Or perhaps that atomic var is just another piece of dead code.
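To illustrate the "written but never read" suspicion, a hypothetical reconstruction; the field layout and names here are guesses for illustration, not the real `contextImpl`:

```go
package main

import "sync/atomic"

// Suspected dead code: an atomic rangeID that is only ever stored to,
// while getRangeID reads a separate field.
type contextImpl struct {
	rangeID   int64                   // atomic: written via StoreInt64, never loaded
	shardInfo struct{ RangeID int64 } // the (lock-guarded) field getRangeID reads
}

func (s *contextImpl) updateRangeLocked(newID int64) {
	atomic.StoreInt64(&s.rangeID, newID) // write-only: no matching LoadInt64 anywhere
	s.shardInfo.RangeID = newID          // the value readers actually observe
}

func (s *contextImpl) getRangeID() int64 {
	return s.shardInfo.RangeID // note: not atomic.LoadInt64(&s.rangeID)
}

func main() {
	s := &contextImpl{}
	s.updateRangeLocked(42)
	_ = s.getRangeID()
}
```

In that shape, the atomic store contributes nothing to correctness, since no reader ever observes it.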
OK, so the atomic `s.rangeID` was probably effectively removed in #2634 (compare bdcd65f to 0b2c576).
```go
if s.isClosed() {
	return nil, ErrShardClosed
}
```
I'm guessing `s.isClosed()` is no longer possible to hit now? E.g. it seems like `allocateTaskIDsLocked` or something would fail first, wouldn't it?

Not that it's inherently wrong, if it's worth being defensive in this code; in that case it's mostly curiosity.
I need to double-confirm. It's probably not needed anymore now.
To make it more visible: potentially still worth looking into, but I'll vote to keep it. This is an atomic check, not blocked by the lock, so it could catch something.

The time window may be small (I'm not sure, tbh), but it shouldn't be less correct to have this check. So unless you're confident in removing it, I'm game to leave it.
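A rough sketch of the shape being described; the types are hypothetical, and the point is only that the flag is atomic, so a close from another goroutine is visible without taking the lock:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

var ErrShardClosed = errors.New("shard closed")

// Hypothetical shard with an atomically-readable closed flag.
type shard struct {
	closed int32 // 0 = open, 1 = closed; accessed only via sync/atomic
}

func (s *shard) closeShard() { atomic.StoreInt32(&s.closed, 1) }

func (s *shard) isClosed() bool { return atomic.LoadInt32(&s.closed) == 1 }

func (s *shard) write() error {
	// Fast-path rejection: observes a concurrent closeShard without taking
	// the shard mutex, which is why the check can still catch something.
	if s.isClosed() {
		return ErrShardClosed
	}
	// ... perform the write under the lock ...
	return nil
}

func main() {
	s := &shard{}
	s.closeShard()         // in real life this can happen from any goroutine
	fmt.Println(s.write()) // shard closed
}
```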
```go
	tag.Error(err),
)
s.closeShard()
break Create_Loop
```
Oof. This means it was returning `errMaxAttemptsExceeded` rather than the actual error :\

I'm not sure if that'll change any semantics (seems like probably not), but this certainly seems like a likely improvement at least.
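To make the error-value change concrete, a toy before/after; the helper names are made up, and only `errMaxAttemptsExceeded` comes from the discussion above:

```go
package main

import (
	"errors"
	"fmt"
)

var errMaxAttemptsExceeded = errors.New("max attempts exceeded")

// Before: a labeled break out of the retry loop fell through to the generic
// error, discarding the error that actually caused the bailout.
func writeBefore(attempt func() error, maxAttempts int) error {
Create_Loop:
	for i := 0; i < maxAttempts; i++ {
		if err := attempt(); err != nil {
			break Create_Loop // actual err is dropped here
		}
		return nil
	}
	return errMaxAttemptsExceeded // caller sees this, not the real cause
}

// After: no loop, so the underlying error propagates as-is.
func writeAfter(attempt func() error) error {
	return attempt()
}

func main() {
	cause := errors.New("shard ownership lost")
	fmt.Println(writeBefore(func() error { return cause }, 3)) // max attempts exceeded
	fmt.Println(writeAfter(func() error { return cause }))     // shard ownership lost
}
```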
```go
case *persistence.ShardOwnershipLostError:
	{
		// RangeID might have been renewed by the same host while this update was in flight
		// Retry the operation if we still have the shard ownership
		if currentRangeID != s.getRangeID() {
			continue Update_Loop
		} else {
			// Shard is stolen, trigger shutdown of history engine
			s.logger.Warn(
				"Closing shard: UpdateWorkflowExecution failed due to stolen shard.",
				tag.Error(err),
			)
			s.closeShard()
			break Update_Loop
		}
	}
```
This `break Update_Loop` seems particularly bad :\ Glad to see it disappearing.

OK, yeah, I think I follow all this now.
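For contrast, a sketch of what the non-looping shape could look like; the names are hypothetical, and `ShardOwnershipLostError` here is a local stand-in for the `persistence` type:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical stand-in for persistence.ShardOwnershipLostError.
type ShardOwnershipLostError struct{ Msg string }

func (e *ShardOwnershipLostError) Error() string { return e.Msg }

type shard struct{ closed int32 }

func (s *shard) closeShard() { atomic.StoreInt32(&s.closed, 1) }

// With no retry loop, the ownership-lost case just closes the shard and
// returns the real error, with no labeled break/continue to reason about.
func (s *shard) handleWriteError(err error) error {
	var lost *ShardOwnershipLostError
	if errors.As(err, &lost) {
		fmt.Println("Closing shard: write failed due to stolen shard:", err)
		s.closeShard()
	}
	return err // the caller sees the actual cause either way
}

func main() {
	s := &shard{}
	err := s.handleWriteError(&ShardOwnershipLostError{Msg: "shard ownership lost"})
	fmt.Println(err, atomic.LoadInt32(&s.closed) == 1)
}
```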
Sounds good. Error values seem almost definitely improved, and agreed that these are non-reachable paths.
What changed?
Remove the retry loop for shard write operations.
Why?
To simplify the logic: the code currently handles a case that can never be hit.
How did you test it?
Potential risks
Release notes
Documentation Changes