
Simplify shard write operations #4955

Merged: 2 commits into uber:master on Aug 18, 2022
Conversation

@Shaddoll (Contributor) commented Aug 17, 2022

What changed?
Remove the retry logic for shard write operations.

Why?
To simplify the logic: the code currently handles a case that can never be hit.

How did you test it?

Potential risks

Release notes

Documentation Changes

```go
// RangeID might have been renewed by the same host while this update was in flight
// Retry the operation if we still have the shard ownership
if currentRangeID != s.getRangeID() {
	continue Create_Loop
```
@Shaddoll (Author) replied:

This path can never be hit: the rangeID is protected by a lock, and that lock is held for the duration of the write operation.
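For illustration, a minimal sketch of why the branch is unreachable (hypothetical names, not the actual contextImpl): the rangeID is read, and only ever renewed, under the same lock that is held for the entire write, so it cannot change between the read and the error check:

```go
package shard

import "sync"

// Illustrative stand-in for the real type in common/persistence.
type ShardOwnershipLostError struct{}

func (e *ShardOwnershipLostError) Error() string { return "shard ownership lost" }

type shardContext struct {
	sync.Mutex
	rangeID int64 // mutated only by renewRangeLocked, under the lock
}

func (s *shardContext) createWorkflow(write func(rangeID int64) error) error {
	s.Lock()
	defer s.Unlock() // the lock is held for the whole write

	currentRangeID := s.rangeID
	err := write(currentRangeID)
	if _, ok := err.(*ShardOwnershipLostError); ok {
		// currentRangeID == s.rangeID always holds here: renewRangeLocked
		// cannot run while we hold the lock, so the old "retry if the
		// rangeID changed" branch could never fire.
	}
	return err
}
```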

@Groxx (Contributor) commented Aug 18, 2022:

> // RangeID might have been renewed by the same host while this update was in flight
> // Retry the operation if we still have the shard ownership

Hmm, this seems to imply an async update in this process, and s.rangeID is an atomic var... which is never read from, only written to. s.getRangeID() reads a different (non-atomic) field.

I agree that this only changes in renewRangeLocked, and:

  • before, this could never be hit because currentRangeID := s.getRangeID() was set before entering this switch
  • now, it can never be hit because there is no loop at all, so the lock is released and re-acquired before re-reading (no worse than before)

But it does make me wonder whether the whole contextImpl may be flawed / using the wrong field or something, or whether that atomic var is just another piece of dead code.
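A sketch of the suspicious shape being described (field names are guesses at the shape, not the actual code): a store-only atomic is dead weight when every reader goes through the mutex-guarded field instead:

```go
package shard

import (
	"sync"
	"sync/atomic"
)

type contextSketch struct {
	sync.RWMutex
	rangeID       int64 // the field getRangeID() actually reads, under the lock
	atomicRangeID int64 // written on renew but never loaded: effectively dead
}

func (s *contextSketch) getRangeID() int64 {
	s.RLock()
	defer s.RUnlock()
	return s.rangeID // the atomic copy is never consulted
}

// renewRangeLocked assumes the caller already holds the write lock.
func (s *contextSketch) renewRangeLocked(newID int64) {
	s.rangeID = newID
	atomic.StoreInt64(&s.atomicRangeID, newID) // a store with no matching Load
}
```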

@Groxx (Contributor) commented Aug 18, 2022:

OK, so the atomic s.rangeID was probably effectively removed in #2634.

@coveralls commented Aug 17, 2022

Pull Request Test Coverage Report for Build 0182aea5-dc76-420c-ada6-0e1000962a8b

  • 57 of 173 (32.95%) changed or added relevant lines in 1 file are covered.
  • 126 unchanged lines in 17 files lost coverage.
  • Overall coverage decreased (-0.008%) to 57.688%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| service/history/shard/context.go | 57 | 173 | 32.95% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| service/history/queue/transfer_queue_processor_base.go | 1 | 78.05% |
| common/peerprovider/ringpopprovider/config.go | 2 | 81.58% |
| common/task/weightedRoundRobinTaskScheduler.go | 2 | 89.64% |
| service/history/execution/mutable_state_builder.go | 2 | 69.03% |
| common/persistence/nosql/nosqlplugin/cassandra/workflow.go | 3 | 57.44% |
| common/persistence/statsComputer.go | 3 | 94.64% |
| service/history/task/fetcher.go | 3 | 91.75% |
| service/matching/taskListManager.go | 3 | 75.87% |
| common/persistence/serialization/parser.go | 4 | 63.91% |
| common/persistence/serialization/thrift_decoder.go | 4 | 57.14% |

Totals Coverage Status
  • Change from base Build 0182aea2-9f9a-4948-9173-84645341c2de: -0.008%
  • Covered Lines: 84047
  • Relevant Lines: 145692

💛 - Coveralls

Comment on lines +622 to +624:

```go
if s.isClosed() {
	return nil, ErrShardClosed
}
```
@Groxx (Contributor) commented Aug 18, 2022:

I'm guessing s.isClosed() is no longer possible now? E.g. it seems like allocateTaskIDsLocked or something would fail first, wouldn't it?

Not that it's inherently wrong, if it's worth being defensive in this code; in that case it's mostly curiosity.

@Shaddoll (Author) replied:

I need to double-check. It's probably not needed anymore.

@Groxx (Contributor) commented Aug 18, 2022:

To make it more visible: this is potentially still worth looking into, but I'll vote to keep it. It's an atomic check, not gated by the lock, so it could catch something.

The time window may be small (I'm not sure, tbh), but having this check shouldn't be any less correct. So unless you're confident in removing it, I'm game to leave it.
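A minimal sketch of the argument for keeping the check (illustrative names): the closed flag is an atomic that any goroutine can set, so reading it before taking the lock can short-circuit a request that the lock-protected path would only reject later:

```go
package shard

import (
	"errors"
	"sync/atomic"
)

var ErrShardClosed = errors.New("shard closed")

type closable struct {
	closed int32 // flipped by closeShard from any goroutine, no lock needed
}

func (s *closable) isClosed() bool { return atomic.LoadInt32(&s.closed) != 0 }

func (s *closable) closeShard() { atomic.StoreInt32(&s.closed, 1) }

func (s *closable) createWorkflow() error {
	// Cheap fast path: a concurrent closeShard() is visible here without
	// waiting on the lock, potentially saving a doomed persistence request.
	if s.isClosed() {
		return ErrShardClosed
	}
	// ... take the lock and perform the write ...
	return nil
}
```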

```go
	tag.Error(err),
)
s.closeShard()
break Create_Loop
```
@Groxx (Contributor) commented:

Oof. This means it was returning errMaxAttemptsExceeded rather than the actual error :\

I'm not sure if that'll change any semantics (seems like probably not), but this certainly seems like a likely improvement at least.
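A simplified sketch of the error-swallowing shape being called out (hypothetical helper, not the actual diff): the labeled break exits the retry loop, and the code after the loop returns a generic error instead of the one that was actually observed:

```go
package shard

import "errors"

type ShardOwnershipLostError struct{}

func (e *ShardOwnershipLostError) Error() string { return "shard ownership lost" }

var errMaxAttemptsExceeded = errors.New("max attempts exceeded")

func createWithRetry(attempt func() error, maxAttempts int) error {
Create_Loop:
	for i := 0; i < maxAttempts; i++ {
		err := attempt()
		switch err.(type) {
		case nil:
			return nil
		case *ShardOwnershipLostError:
			// the real code logs and calls s.closeShard() here, then:
			break Create_Loop // err is discarded on the way out
		default:
			// treat anything else as retryable and loop again
		}
	}
	// Whatever actually failed above, the caller only ever sees this:
	return errMaxAttemptsExceeded
}
```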

Comment on lines -768 to -783:

```go
case *persistence.ShardOwnershipLostError:
	{
		// RangeID might have been renewed by the same host while this update was in flight
		// Retry the operation if we still have the shard ownership
		if currentRangeID != s.getRangeID() {
			continue Update_Loop
		} else {
			// Shard is stolen, trigger shutdown of history engine
			s.logger.Warn(
				"Closing shard: UpdateWorkflowExecution failed due to stolen shard.",
				tag.Error(err),
			)
			s.closeShard()
			break Update_Loop
		}
	}
```
@Groxx (Contributor) commented Aug 18, 2022:

This break Update_Loop seems particularly bad :\
Glad to see it disappearing.
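For contrast, the rough shape of the simplification this PR makes, as described above (a sketch, not the actual diff): one attempt, no loop or labels, and the real error propagates:

```go
package shard

type ShardOwnershipLostError struct{}

func (e *ShardOwnershipLostError) Error() string { return "shard ownership lost" }

// After the change: no retry loop. Make one attempt under the lock, close
// the shard on ownership loss, and return whatever error actually occurred.
func updateWorkflow(attempt func() error, closeShard func()) error {
	err := attempt()
	if _, ok := err.(*ShardOwnershipLostError); ok {
		closeShard()
	}
	return err // the caller sees the real failure, not errMaxAttemptsExceeded
}
```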

@Groxx (Contributor) commented Aug 18, 2022:

OK, yeah, I think I follow all this now.

Since s.isClosed() is an atomic check rather than something we can trust behind the lock, my bias is to keep all of those initial checks for now. Atomic = concurrently modified = it may save a request occasionally? At least until that's ruled out.

@Groxx (Contributor) left a review:

Sounds good. The error values seem almost definitely improved, and agreed that these are non-reachable paths.

@Shaddoll merged commit f2b2108 into uber:master on Aug 18, 2022.
ZackLK pushed a commit referencing this pull request on Aug 22, 2022.