
Simplify shard write operations #4955

Merged: 2 commits into uber:master on Aug 18, 2022
Conversation

@Shaddoll (Contributor) commented Aug 17, 2022

What changed?
Remove the retry logic for shard write operations.

Why?
To simplify the logic: the code currently handles a case that can never be hit.

How did you test it?

Potential risks

Release notes

Documentation Changes

```go
// RangeID might have been renewed by the same host while this update was in flight
// Retry the operation if we still have the shard ownership
if currentRangeID != s.getRangeID() {
	continue Create_Loop
```
@Shaddoll (Author) replied:

This path can never be hit: the rangeID is protected by a lock, and that lock is held for the duration of the write operation.
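For illustration, a minimal sketch of why the branch is unreachable (hypothetical names, not the actual contextImpl): the rangeID is read, and only ever renewed, under the same lock that is held for the entire write, so it cannot change between the read and the error check:

```go
package shard

import "sync"

// Illustrative stand-in for the real type in common/persistence.
type ShardOwnershipLostError struct{}

func (e *ShardOwnershipLostError) Error() string { return "shard ownership lost" }

type shardContext struct {
	sync.Mutex
	rangeID int64 // mutated only by renewRangeLocked, under the lock
}

func (s *shardContext) createWorkflow(write func(rangeID int64) error) error {
	s.Lock()
	defer s.Unlock() // the lock is held for the whole write

	currentRangeID := s.rangeID
	err := write(currentRangeID)
	if _, ok := err.(*ShardOwnershipLostError); ok {
		// currentRangeID == s.rangeID always holds here: renewRangeLocked
		// cannot run while we hold the lock, so the old "retry if the
		// rangeID changed" branch could never fire.
	}
	return err
}
```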

@Groxx (Contributor) commented Aug 18, 2022:

> // RangeID might have been renewed by the same host while this update was in flight
> // Retry the operation if we still have the shard ownership

Hmm, this seems to imply an async update in this process, and s.rangeID is an atomic var... which is never read from, only written to. s.getRangeID() reads a different (non-atomic) field.

I agree that this only changes in renewRangeLocked, and:

  • before, this could never be hit because currentRangeID := s.getRangeID() was set before entering this switch
  • now, it can never be hit because there is no loop at all, so the lock is released and re-acquired before re-reading (no worse than before)

But it does make me wonder whether the whole contextImpl may be flawed / using the wrong field or something, or whether that atomic var is just another piece of dead code.
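A sketch of the suspicious shape being described (field names are guesses at the shape, not the actual code): a store-only atomic is dead weight when every reader goes through the mutex-guarded field instead:

```go
package shard

import (
	"sync"
	"sync/atomic"
)

type contextSketch struct {
	sync.RWMutex
	rangeID       int64 // the field getRangeID() actually reads, under the lock
	atomicRangeID int64 // written on renew but never loaded: effectively dead
}

func (s *contextSketch) getRangeID() int64 {
	s.RLock()
	defer s.RUnlock()
	return s.rangeID // the atomic copy is never consulted
}

// renewRangeLocked assumes the caller already holds the write lock.
func (s *contextSketch) renewRangeLocked(newID int64) {
	s.rangeID = newID
	atomic.StoreInt64(&s.atomicRangeID, newID) // a store with no matching Load
}
```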

@Groxx (Contributor) commented Aug 18, 2022:

OK, so the atomic s.rangeID was probably effectively removed in #2634.

@coveralls commented Aug 17, 2022

Pull Request Test Coverage Report for Build 0182aea5-dc76-420c-ada6-0e1000962a8b

  • 57 of 173 (32.95%) changed or added relevant lines in 1 file are covered.
  • 126 unchanged lines in 17 files lost coverage.
  • Overall coverage decreased (-0.008%) to 57.688%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| service/history/shard/context.go | 57 | 173 | 32.95% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/queue/transfer_queue_processor_base.go | 1 | 78.05% |
| common/peerprovider/ringpopprovider/config.go | 2 | 81.58% |
| common/task/weightedRoundRobinTaskScheduler.go | 2 | 89.64% |
| service/history/execution/mutable_state_builder.go | 2 | 69.03% |
| common/persistence/nosql/nosqlplugin/cassandra/workflow.go | 3 | 57.44% |
| common/persistence/statsComputer.go | 3 | 94.64% |
| service/history/task/fetcher.go | 3 | 91.75% |
| service/matching/taskListManager.go | 3 | 75.87% |
| common/persistence/serialization/parser.go | 4 | 63.91% |
| common/persistence/serialization/thrift_decoder.go | 4 | 57.14% |

Totals Coverage Status
  • Change from base Build 0182aea2-9f9a-4948-9173-84645341c2de: -0.008%
  • Covered Lines: 84047
  • Relevant Lines: 145692

💛 - Coveralls

Comment on lines +622 to +624:

```go
if s.isClosed() {
	return nil, ErrShardClosed
}
```
@Groxx (Contributor) commented Aug 18, 2022:

I'm guessing s.isClosed() is no longer possible now? E.g. it seems like allocateTaskIDsLocked or something would fail first, wouldn't it?

Not that it's inherently wrong, if it's worth being defensive in this code; in that case it's mostly curiosity.

@Shaddoll (Author) replied:

I need to double-check. It's probably not needed anymore.

@Groxx (Contributor) commented Aug 18, 2022:

To make it more visible: this is potentially still worth looking into, but I'll vote to keep it. It's an atomic check, not gated by the lock, so it could catch something.

The time window may be small (I'm not sure, tbh), but having this check shouldn't be any less correct. So unless you're confident in removing it, I'm game to leave it.
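A minimal sketch of the argument for keeping the check (illustrative names): the closed flag is an atomic that any goroutine can set, so reading it before taking the lock can short-circuit a request that the lock-protected path would only reject later:

```go
package shard

import (
	"errors"
	"sync/atomic"
)

var ErrShardClosed = errors.New("shard closed")

type closable struct {
	closed int32 // flipped by closeShard from any goroutine, no lock needed
}

func (s *closable) isClosed() bool { return atomic.LoadInt32(&s.closed) != 0 }

func (s *closable) closeShard() { atomic.StoreInt32(&s.closed, 1) }

func (s *closable) createWorkflow() error {
	// Cheap fast path: a concurrent closeShard() is visible here without
	// waiting on the lock, potentially saving a doomed persistence request.
	if s.isClosed() {
		return ErrShardClosed
	}
	// ... take the lock and perform the write ...
	return nil
}
```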

```go
	tag.Error(err),
)
s.closeShard()
break Create_Loop
```
@Groxx (Contributor) commented:

Oof. This means it was returning errMaxAttemptsExceeded rather than the actual error :\

I'm not sure if that'll change any semantics (seems like probably not), but this certainly seems like a likely improvement at least.
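A simplified sketch of the error-swallowing shape being called out (hypothetical helper, not the actual diff): the labeled break exits the retry loop, and the code after the loop returns a generic error instead of the one that was actually observed:

```go
package shard

import "errors"

type ShardOwnershipLostError struct{}

func (e *ShardOwnershipLostError) Error() string { return "shard ownership lost" }

var errMaxAttemptsExceeded = errors.New("max attempts exceeded")

func createWithRetry(attempt func() error, maxAttempts int) error {
Create_Loop:
	for i := 0; i < maxAttempts; i++ {
		err := attempt()
		switch err.(type) {
		case nil:
			return nil
		case *ShardOwnershipLostError:
			// the real code logs and calls s.closeShard() here, then:
			break Create_Loop // err is discarded on the way out
		default:
			// treat anything else as retryable and loop again
		}
	}
	// Whatever actually failed above, the caller only ever sees this:
	return errMaxAttemptsExceeded
}
```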

Comment on lines -768 to -783:

```go
case *persistence.ShardOwnershipLostError:
	{
		// RangeID might have been renewed by the same host while this update was in flight
		// Retry the operation if we still have the shard ownership
		if currentRangeID != s.getRangeID() {
			continue Update_Loop
		} else {
			// Shard is stolen, trigger shutdown of history engine
			s.logger.Warn(
				"Closing shard: UpdateWorkflowExecution failed due to stolen shard.",
				tag.Error(err),
			)
			s.closeShard()
			break Update_Loop
		}
	}
```
@Groxx (Contributor) commented Aug 18, 2022:

This break Update_Loop seems particularly bad :\
Glad to see it disappearing.
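For contrast, the rough shape of the simplification this PR makes, as described above (a sketch, not the actual diff): one attempt, no loop or labels, and the real error propagates:

```go
package shard

type ShardOwnershipLostError struct{}

func (e *ShardOwnershipLostError) Error() string { return "shard ownership lost" }

// After the change: no retry loop. Make one attempt under the lock, close
// the shard on ownership loss, and return whatever error actually occurred.
func updateWorkflow(attempt func() error, closeShard func()) error {
	err := attempt()
	if _, ok := err.(*ShardOwnershipLostError); ok {
		closeShard()
	}
	return err // the caller sees the real failure, not errMaxAttemptsExceeded
}
```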

@Groxx (Contributor) commented Aug 18, 2022:

OK, yeah, I think I follow all this now.

Since s.isClosed() is an atomic check rather than something we can trust behind the lock, my bias is to keep all of those initial checks for now. Atomic = concurrently modified = it may save a request occasionally? At least until that's ruled out.

@Groxx (Contributor) left a review:

Sounds good. The error values seem almost definitely improved, and agreed that these are non-reachable paths.

@Shaddoll merged commit f2b2108 into uber:master on Aug 18, 2022.
ZackLK pushed a commit referencing this pull request on Aug 22, 2022.