Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tm: fix replmanager deadlock #6625

Merged
merged 3 commits into from
Aug 26, 2020
Merged

Conversation

sougou
Copy link
Contributor

@sougou sougou commented Aug 26, 2020

A deadlock was found during a PRS. The root cause was a fix where we changed the replmanager to take the action lock. Otherwise, it would potentially race and conflict with other actions. But this led to a deadlock because PromoteReplica also waits for the replmanager to finish its fix.

We could have spot-fixed this for the specific use case. But in the interest of preventing other corner cases, the better fix was to change replmanager to not wait if it couldn't obtain a lock.

However, the implementation of lock with context timeout was flawed, because it wouldn't really timeout if the context expired. So, I implemented a new AcquireContext in sync2.Semaphore to, which encouraged to fix the flaky tests there.

Using the semaphore allowed me to implement a real tryLock, and replManager could use it.

Since this was a race condition, I tested it manually. The test that failed previously now passes.

And fix flaky test

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Change actionMutex to a semaphore to implement a tryLock function
in tm, and use it in replManager.

Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
@sougou sougou requested review from deepthi, enisoc and rafael and removed request for deepthi August 26, 2020 03:17
go/sync2/semaphore.go Outdated Show resolved Hide resolved
Signed-off-by: Sugu Sougoumarane <ssougou@gmail.com>
Copy link
Member

@rafael rafael left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@sougou sougou merged commit 6cb8496 into vitessio:master Aug 26, 2020
@sougou sougou deleted the ss-tm3-repl-deadlock branch August 26, 2020 22:36
@askdba askdba added this to the v8.0 milestone Oct 6, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants