Disconnect dangling pollers on membership lost #6272

Merged
merged 2 commits into from
Sep 16, 2024


dkrotx
Member

@dkrotx dkrotx commented Sep 9, 2024

When we Stop() a TaskListManager, we currently do nothing with its
pollers. Long pollers therefore keep waiting for tasks, which can
cause a significant delay (1m) in schedule-to-start.

When TaskListManager shuts down, we now cancel the pollers' pending long-poll requests.

This avoids a 1m spike in schedule-to-start when cadence-matching is restarted.

Covered by a unit test.

cadence-client will observe more empty tasks.

If you previously saw a 1m schedule-to-start delay every time you restarted cadence-matching, this should now be fixed.

Documentation Changes


codecov bot commented Sep 9, 2024

Codecov Report

Attention: Patch coverage is 96.15385% with 1 line in your changes missing coverage. Please review.

Project coverage is 73.11%. Comparing base (08dba4d) to head (d7e2688).
Report is 13 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| common/ctxutils/ctxutils.go | 93.75% | 0 Missing and 1 partial ⚠️ |

Additional details and impacted files

| Files with missing lines | Coverage Δ |
|---|---|
| service/matching/tasklist/matcher.go | 63.16% <100.00%> (+0.53%) ⬆️ |
| service/matching/tasklist/task_list_manager.go | 68.33% <100.00%> (+0.07%) ⬆️ |
| common/ctxutils/ctxutils.go | 93.75% <93.75%> (ø) |

... and 14 files with indirect coverage changes


Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@dkrotx
Member Author

dkrotx commented Sep 9, 2024

I'm going to add tests for the 2 other pollers. They'll be very similar; we just need to catch the case where we're forwarding the request.
In the meantime, this diff helps in 3 out of 4 cases, reducing schedule-to-start spikes from ~60s to <=15s.
I'm planning to explore and fix the remaining 25% in a separate diff.

@@ -501,6 +528,9 @@ func (tm *TaskMatcher) pollOrForward(
EventName: "Poll Timeout",
})
return nil, ErrNoTasks
case <-tm.cancelCtx.Done():
Member

We can chain the contexts, including cancelCtx, before forwarding polls at the line
tm.fwdr.ForwardPoll(ctx)

Member Author

Done: I applied the chaining trick with ctxutil.WithPropagatedContextCancel.
I think the semantics (we don't care whether the matcher closes or the client disconnects) are now more explicit.

@@ -579,6 +563,18 @@ func (t *MatcherTestSuite) TestIsolationMustOfferRemoteMatch() {
t.Equal(t.taskList.Parent(20), req.GetTaskList().GetName())
}

func (t *MatcherTestSuite) TestPollersDisconnectedAfterDisconnectBlockedPollers() {
Member

It would be great to also simulate the scenario where tasklist ownership changes and show that this change reduces task latencies by preventing hanging polls. The simulation framework currently doesn't support such an ownership change, but it should be straightforward to introduce.

Member Author

Maybe. But it's also very easy to reproduce locally, and it's a case production hits all the time because of polling clients immediately disconnecting from an exited instance.

Member

Manual local testing is good if you know what you are doing but having it defined as another simulation framework would be preferred. We will run those simulation scenarios as part of CI and ensure features/improvements like this are not broken going forward.
Not a blocker for this PR. Let's add it when we have cycles.

Member

@davidporter-id-au davidporter-id-au left a comment

As a draft, I can't fault anything, lgtm

After all, we want to get no-tasks from the matcher
var wg sync.WaitGroup
wg.Add(1)

go func() {
Member

I'd avoid this extra goroutine and put this logic inside the returned func callback.

Member Author

I'm not sure I understand: how would the chaining work then?
We need to make sure the dependent (parent) context is cancelled when cancelCtx is cancelled.

Member

After looking closely, I couldn't see a way to achieve this without the extra goroutine, which helps propagate cancelCtx.Done.

@dkrotx dkrotx merged commit 957f6ef into cadence-workflow:master Sep 16, 2024
20 checks passed