Disconnect dangling pollers on membership lost #6272

dkrotx · 2024-09-09T12:01:01Z

When we Stop() TaskListManager we currently don't do anything with
pollers. That's why long pollers are still waiting for the tasks
and this could cause a significant delay (1m) on schedule-to-start.

When TaskListManager shuts down, we cancel the long-polling pollers' requests.
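A minimal sketch of the idea, assuming a stripped-down matcher: the `matcher` type, its fields, and `pollDuringShutdown` are hypothetical stand-ins for illustration, not the actual TaskMatcher code. Only `ErrNoTasks`, `cancelCtx`, and the `DisconnectBlockedPollers` name are taken from the diff and test shown in this PR.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// ErrNoTasks mirrors the sentinel the diff returns when a poll ends empty.
var ErrNoTasks = errors.New("no tasks")

// matcher is a hypothetical, stripped-down stand-in for TaskMatcher.
type matcher struct {
	taskC     chan string     // tasks offered to pollers
	cancelCtx context.Context // done once the matcher is shut down
	cancel    context.CancelFunc
}

func newMatcher() *matcher {
	ctx, cancel := context.WithCancel(context.Background())
	return &matcher{taskC: make(chan string), cancelCtx: ctx, cancel: cancel}
}

// Poll blocks until a task arrives, the caller's context ends,
// or the matcher is shut down.
func (m *matcher) Poll(ctx context.Context) (string, error) {
	select {
	case task := <-m.taskC:
		return task, nil
	case <-ctx.Done():
		return "", ErrNoTasks
	case <-m.cancelCtx.Done():
		// Shutdown: release the poller immediately instead of letting
		// it wait out the full long-poll timeout.
		return "", ErrNoTasks
	}
}

// DisconnectBlockedPollers releases every poller currently blocked in Poll.
func (m *matcher) DisconnectBlockedPollers() { m.cancel() }

// pollDuringShutdown starts a long poll, shuts the matcher down,
// and reports how the poll ended.
func pollDuringShutdown() error {
	m := newMatcher()
	ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
	defer cancel()

	errC := make(chan error, 1)
	go func() {
		_, err := m.Poll(ctx)
		errC <- err
	}()

	m.DisconnectBlockedPollers()
	return <-errC
}

func main() {
	fmt.Println(pollDuringShutdown()) // prints "no tasks"
}
```

In this sketch the poller returns as soon as the matcher is cancelled, even though its own context would stay alive for a full minute.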

To avoid a 1m spike on schedule-to-start when cadence-matching is restarted.

Unit-test

cadence-client will observe more empty tasks.

If you previously saw a 1m schedule-to-start delay every time you restarted cadence-matching, this should be fixed now.

Documentation Changes

codecov · 2024-09-09T12:33:52Z

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.11%. Comparing base (08dba4d) to head (d7e2688).
Report is 13 commits behind head on master.

Files with missing lines	Patch %	Lines
common/ctxutils/ctxutils.go	93.75%	0 Missing and 1 partial ⚠️

Additional details and impacted files

Files with missing lines	Coverage Δ
service/matching/tasklist/matcher.go	`63.16% <100.00%> (+0.53%)`	⬆️
service/matching/tasklist/task_list_manager.go	`68.33% <100.00%> (+0.07%)`	⬆️
common/ctxutils/ctxutils.go	`93.75% <93.75%> (ø)`

... and 14 files with indirect coverage changes

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

dkrotx · 2024-09-09T17:18:38Z

I'm going to add tests for the 2 other pollers; they'll be very similar, I just need to catch when we're forwarding the request.
At the same time, the diff already helps in 3 out of 4 cases, reducing schedule-to-start spikes from ~60s to <=15s.
I'm planning to explore and fix the remaining 25% in a separate diff.

taylanisikdemir · 2024-09-09T19:19:39Z

service/matching/tasklist/matcher.go

@@ -501,6 +528,9 @@ func (tm *TaskMatcher) pollOrForward(
 			EventName:    "Poll Timeout",
 		})
 		return nil, ErrNoTasks
+	case <-tm.cancelCtx.Done():


We can chain the contexts, including cancelCtx, before forwarding polls at the line
tm.fwdr.ForwardPoll(ctx)

Did the chaining trick with ctxutil.WithPropagatedContextCancel.
I think the semantics (we don't care whether the matcher closes or the client disconnects) are now more explicit.
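To illustrate what chaining buys at the forwarding call site, here is a hedged sketch. `withPropagatedCancel` and `forwardPoll` are hypothetical names, and the helper is built on the standard library's `context.AfterFunc` (Go 1.21 or newer), which is an assumption made for brevity and not necessarily how the PR's helper is written.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// withPropagatedCancel is a hypothetical stand-in for the PR's helper:
// the returned context is done when either ctx or cancelCtx is done.
// It uses context.AfterFunc, which requires Go 1.21 or newer.
func withPropagatedCancel(ctx, cancelCtx context.Context) (context.Context, context.CancelFunc) {
	chained, cancel := context.WithCancel(ctx)
	stop := context.AfterFunc(cancelCtx, cancel)
	return chained, func() {
		stop()   // detach from cancelCtx
		cancel() // release the chained context
	}
}

// forwardPoll is a hypothetical forwarder call that blocks until
// its context ends, like a long poll that never receives a task.
func forwardPoll(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

// forwardDuringShutdown forwards a poll, shuts the matcher down,
// and returns the error the forwarded poll ended with.
func forwardDuringShutdown() error {
	clientCtx, clientCancel := context.WithTimeout(context.Background(), time.Minute)
	defer clientCancel()
	matcherCtx, shutdown := context.WithCancel(context.Background())

	chained, release := withPropagatedCancel(clientCtx, matcherCtx)
	defer release()

	errC := make(chan error, 1)
	go func() { errC <- forwardPoll(chained) }()

	shutdown() // matcher shutdown cancels the forwarded poll as well
	return <-errC
}

func main() {
	fmt.Println(forwardDuringShutdown())
}
```

The forwarded call only sees one context, so it does not need to know whether the client went away or the matcher closed.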

taylanisikdemir · 2024-09-09T19:22:35Z

service/matching/tasklist/matcher_test.go

@@ -579,6 +563,18 @@ func (t *MatcherTestSuite) TestIsolationMustOfferRemoteMatch() {
 	t.Equal(t.taskList.Parent(20), req.GetTaskList().GetName())
 }

+func (t *MatcherTestSuite) TestPollersDisconnectedAfterDisconnectBlockedPollers() {


It would be great to also simulate the scenario where tasklist ownership changes and this change reduces task latencies by preventing hanging polls. The simulation framework currently doesn't support such an ownership change, but it should be straightforward to introduce.

Maybe. But it's also super easy to reproduce locally, and this is a case production falls into all the time because of polling clients immediately disconnecting from the exited instance.

Manual local testing is good if you know what you are doing but having it defined as another simulation framework would be preferred. We will run those simulation scenarios as part of CI and ensure features/improvements like this are not broken going forward.
Not a blocker for this PR. Let's add it when we have cycles.

service/matching/tasklist/task_list_manager.go

davidporter-id-au

As a draft, I can't fault anything, lgtm

After all, we want to get no-tasks from the matcher

taylanisikdemir · 2024-09-16T18:32:14Z

common/ctxutils/ctxutils.go

+	var wg sync.WaitGroup
+	wg.Add(1)
+
+	go func() {


I'd avoid this extra goroutine and put this logic inside returned func callback

I'm not sure I understand you: how would the chaining work then?
We need to make sure the dependent (parent) context is cancelled when cancelCtx is cancelled.

taylanisikdemir · 2024-09-16T19:30:44Z

After looking closely, I couldn't see a way to achieve this without the extra goroutine, which helps propagate cancelCtx.Done.
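For readers following this thread, a rough sketch of the goroutine-based propagation being discussed. It is a hypothetical reconstruction built around the `sync.WaitGroup` and `go func()` fragment visible in the diff, not the merged ctxutils code; the function name and everything outside that fragment are assumptions.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// withPropagatedCancelGoroutine is a hypothetical reconstruction: the
// returned context is cancelled when ctx ends or when cancelCtx is done.
func withPropagatedCancelGoroutine(ctx, cancelCtx context.Context) (context.Context, context.CancelFunc) {
	chained, cancel := context.WithCancel(ctx)

	var wg sync.WaitGroup
	wg.Add(1)

	go func() {
		defer wg.Done()
		select {
		case <-cancelCtx.Done():
			cancel() // propagate shutdown to the chained context
		case <-chained.Done():
			// ctx ended or the caller released first; nothing to propagate
		}
	}()

	return chained, func() {
		cancel()
		wg.Wait() // the watcher goroutine has exited once this returns
	}
}

// propagates reports whether cancelling cancelCtx cancels the chained context.
func propagates() bool {
	cancelCtx, shutdown := context.WithCancel(context.Background())
	chained, release := withPropagatedCancelGoroutine(context.Background(), cancelCtx)
	defer release()

	shutdown()
	<-chained.Done()
	return chained.Err() == context.Canceled
}

func main() {
	fmt.Println(propagates()) // prints "true"
}
```

Something has to wait on `cancelCtx.Done()` while the caller is blocked elsewhere, which is why the watcher goroutine is hard to avoid in this form; waiting on the WaitGroup in the returned func keeps that goroutine from outliving the call.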

Disconnect dangling pollers on membership lost

999b14d

When we Stop() TaskListManager we currently don't do anything with pollers. That's why long pollers are still waiting for the tasks and this could cause a significant delay (1m) on schedule-to-start.

taylanisikdemir reviewed Sep 9, 2024

davidporter-id-au reviewed Sep 10, 2024

davidporter-id-au reviewed Sep 10, 2024

Using context chaining for transparency

d7e2688

After all, we want to get no-tasks from the matcher

dkrotx marked this pull request as ready for review September 16, 2024 18:23

dkrotx requested review from Shaddoll, neil-xie, Groxx, shijiesheng, agautam478, jakobht, 3vilhamster, sankari165 and demirkayaender as code owners September 16, 2024 18:23

taylanisikdemir reviewed Sep 16, 2024

taylanisikdemir approved these changes Sep 16, 2024

dkrotx merged commit 957f6ef into cadence-workflow:master Sep 16, 2024
20 checks passed
