Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

storage: ensure non-expired context before each liveness update attempt #25631

Merged
merged 1 commit into from
May 17, 2018

Conversation

nvanbenschoten
Copy link
Member

Fixes #25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get errRetryLiveness errors due to
AmbiguousResultErrors created by DistSender.

Release note: None

Fixes cockroachdb#25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get `errRetryLiveness` errors due to
`AmbiguousResultErrors` created by `DistSender`.

Release note: None
@nvanbenschoten nvanbenschoten requested review from tbg, windchan7 and a team May 17, 2018 20:03
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@windchan7
Copy link
Contributor

Do you think it makes more sense to move the check down to updateLivenessAttempt or not? Right now updateLiveness is the only caller of that function so it doesn't really matter much.


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@nvanbenschoten
Copy link
Member Author

That's where I had it before, but I moved it up because updateLiveness is the function that's "aware" of the retry loop, so I think it makes more sense there. My reasoning was that we don't add these context cancellation checks at the beginning of every function, only in cases like this that need it.

@windchan7
Copy link
Contributor

Great! Thanks for the explanation. LGTM!

@windchan7
Copy link
Contributor

windchan7 commented May 17, 2018

I'm also wondering why we are not replacing this endless for loop with a retry.XXX. That will save the hassle of checking if context has been cancelled or not at every place like this.

@nvanbenschoten
Copy link
Member Author

There's always a debate between whether bounded loops help avoid problems or just mask issues (like this one) and make them harder to find. I think for now we should keep this as-is. The context cancellation should at-least bound the loop to some maximum duration.

@bdarnell
Copy link
Contributor

:lgtm:


Review status: 0 of 1 files reviewed at latest revision, all discussions resolved.


Comments from Reviewable

@bdarnell
Copy link
Contributor

Should we backport this to 2.0? Is this just a test flake or could it explain some user issues like the import failures?

@nvanbenschoten
Copy link
Member Author

This could be more serious than a test flake, so I think we should backport it. If the user issues you're referring to included failed node liveness heartbeat: context deadline exceeded loops then it could explain them.

@nvanbenschoten
Copy link
Member Author

bors r+

craig bot pushed a commit that referenced this pull request May 17, 2018
25542: changefeedccl: uri sinks, fill in record key, WITH envelope r=tschottdorf a=danhhz

See commit messages for details

25631: storage: ensure non-expired context before each liveness update attempt r=nvanbenschoten a=nvanbenschoten

Fixes #25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get `errRetryLiveness` errors due to
`AmbiguousResultErrors` created by `DistSender`.

Release note: None

Co-authored-by: Daniel Harrison <daniel.harrison@gmail.com>
Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
@bdarnell
Copy link
Contributor

If the user issues you're referring to included failed node liveness heartbeat: context deadline exceeded loops then it could explain them.

I don't have a specific issue in mind but I know we've seen some reports of liveness problems (especially in conjunction with IMPORT).

@craig
Copy link
Contributor

craig bot commented May 17, 2018

Build succeeded

@craig craig bot merged commit 7d7f31d into cockroachdb:master May 17, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/nlCtx branch May 17, 2018 20:56
@tbg
Copy link
Member

tbg commented May 18, 2018 via email

craig bot pushed a commit that referenced this pull request May 22, 2018
25634: backport-2.0: storage: ensure non-expired context before each liveness update attempt r=nvanbenschoten a=nvanbenschoten

Backport 1/1 commits from #25631.

/cc @cockroachdb/release

---

Fixes #25430.

Before this change, a liveness update could get stuck in an infinite
loop if its context expired. This is because it would continue to retry
and continue to get `errRetryLiveness` errors due to
`AmbiguousResultErrors` created by `DistSender`.

Release note: None


Co-authored-by: Nathan VanBenschoten <nvanbenschoten@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants