canary: Adds locking to prevent multiple concurrent invocations of confirmMissing from clobbering each other #5568

Merged: 3 commits merged on Mar 10, 2022

Contributor

@afayngelerindbx afayngelerindbx commented Mar 8, 2022

What this PR does / why we need it:
Nil pointer dereferences in this code are causing panics in our deployment. It seems that confirmMissing wasn't intended to be run concurrently. The locking in this PR prevents access to missingEntries while its entries are being set to nil (comparator.go:L482).
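The locking described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the `comparator` struct, the `missingMtx` field, the `newComparator` and `missingCount` helpers, and the simplified `confirmMissing` signature are all assumptions made for the example. It shows one mutex held for the whole filter-and-truncate pass, so a concurrent call never iterates over slots that another call has just set to nil.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// comparator is a hypothetical, stripped-down stand-in for the canary
// comparator; only what is needed to show the locking is included.
type comparator struct {
	missingMtx     sync.Mutex   // guards missingEntries (assumed name)
	missingEntries []*time.Time // timestamps not yet confirmed as received
}

// newComparator is an illustrative helper: it builds a comparator with n
// missing entries and also returns the underlying timestamps.
func newComparator(n int) (*comparator, []time.Time) {
	c := &comparator{}
	all := make([]time.Time, n)
	for i := range all {
		all[i] = time.Unix(int64(i), 0)
		c.missingEntries = append(c.missingEntries, &all[i])
	}
	return c, all
}

// confirmMissing removes every entry that appears in recvd, compacting the
// slice in place. The lock is held for the whole pass.
func (c *comparator) confirmMissing(recvd []time.Time) {
	c.missingMtx.Lock()
	defer c.missingMtx.Unlock()

	k := 0
	for _, m := range c.missingEntries {
		found := false
		for _, r := range recvd {
			if m.Equal(r) { // dereferences m; would panic if m were nil
				found = true
				break
			}
		}
		if !found {
			c.missingEntries[k] = m
			k++
		}
	}
	// Nil out the trailing slots, then truncate.
	for i := k; i < len(c.missingEntries); i++ {
		c.missingEntries[i] = nil
	}
	c.missingEntries = c.missingEntries[:k]
}

// missingCount reads the slice length under the same lock.
func (c *comparator) missingCount() int {
	c.missingMtx.Lock()
	defer c.missingMtx.Unlock()
	return len(c.missingEntries)
}

func main() {
	c, ts := newComparator(3)
	c.confirmMissing(ts[:2]) // two of the three entries were received
	fmt.Println("still missing:", c.missingCount())
}
```

Without the mutex, a second goroutine ranging over the slice could observe a slot after it was set to nil but before the slice was truncated, which is the kind of nil dereference described above.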

Which issue(s) this PR fixes:
Fixes #5128

@afayngelerindbx afayngelerindbx requested a review from a team as a code owner March 8, 2022 02:09

CLAassistant commented Mar 8, 2022

CLA assistant check
All committers have signed the CLA.

Contributor

@chaudum chaudum left a comment

Generally, lgtm. Could we get this covered with a test, e.g. one that calls confirmMissing concurrently with pruneEntries?

Contributor

@kavirajk kavirajk left a comment

LGTM 👍 Thanks @afayngelerindbx.

Agree with @chaudum, can you please add a test? I will merge after that.

@afayngelerindbx
Contributor Author

It's pretty tricky to actually ensure that these tests run with the right sequencing to cause all the panics. Let me know if the test needs to be something smarter. Also, do I need to add anything to the CHANGELOG? This seems like too small a change for that.

@kavirajk
Contributor

kavirajk commented Mar 9, 2022

@afayngelerindbx, it can be a simple test that locks in this behavior :)

As for the CHANGELOG, I don't think we need an entry there, as this is not a user-visible change :)

@afayngelerindbx
Contributor Author

I added a fairly simple test that accomplishes two things:

  1. With my comparator.go changes removed, it panics fairly consistently. Sometimes it even causes the exact panic in the attached issue. With the locking added back in, the test passes consistently.
  2. With locking removed, the race detector (go test -race) reports racy data access for the new test. Once locking is added, the races are resolved.

Please, let me know whether you think this is sufficient test coverage for this change.

}

for i := 0; i < 10; i++ {
go func() {
Contributor

Would it be better to add a sync.WaitGroup to make sure all goroutines have completed by the end of the test?

Contributor Author

good suggestion. adding

Contributor

Thanks for adding :)

  1. Can we change wg.Done() to defer wg.Done() at the beginning, before assert.NotPanics is called? That way wg.Done() runs even if that code panics.
  2. I'm not familiar with assert.Eventually, but shouldn't a simple wg.Wait() after the for loop be sufficient here? Not sure if I'm missing anything.
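The first suggestion can be illustrated with a small standard-library sketch. It is not the test from this PR: `runConcurrently` is a hypothetical helper, and a deferred `recover` stands in for testify's `assert.NotPanics`. The point is the ordering: `defer wg.Done()` is registered first, so it runs last and runs even when the goroutine body panics.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// runConcurrently is a hypothetical helper: it runs fn in n goroutines,
// waits for all of them, and returns how many panicked.
func runConcurrently(n int, fn func(i int)) int64 {
	var wg sync.WaitGroup
	var panics int64

	for i := 0; i < n; i++ {
		wg.Add(1)
		go func(i int) {
			// Registered first, so it runs last and always, even on panic.
			defer wg.Done()
			// Stand-in for assert.NotPanics: recover and record the panic.
			defer func() {
				if r := recover(); r != nil {
					atomic.AddInt64(&panics, 1)
				}
			}()
			fn(i)
		}(i)
	}

	wg.Wait()
	return atomic.LoadInt64(&panics)
}

func main() {
	got := runConcurrently(10, func(i int) {
		if i%2 == 0 {
			panic("simulated nil pointer dereference")
		}
	})
	fmt.Println("goroutines that panicked:", got)
}
```

If `wg.Done()` were instead called as a plain statement after `fn(i)`, a panic in `fn` would skip it and `wg.Wait()` would block forever.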

Contributor Author

I believe assert.NotPanics recovers from the panic and returns false. I am adding the defer anyway to make the code easier to reason about.

For 2, I would recommend keeping Eventually there. The expectation is that this test should return fairly quickly (within 1s). A bare wg.Wait() at the end would make the test suite hang until it times out (default 10m) if one of those goroutines hits a deadlock.
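The trade-off in the second point can be sketched with the standard library alone. This is an illustration under assumptions, not the PR's test: `waitTimeout`, `finishesWithin`, and the two deadline constants are hypothetical names, and the `select` with `time.After` stands in for polling with testify's `assert.Eventually`. It shows a wait on the WaitGroup that gives up after a short deadline instead of blocking until the test binary's own timeout.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Illustrative deadlines; the values are assumptions for this sketch.
const (
	deadline      = time.Second
	shortDeadline = 50 * time.Millisecond
)

// waitTimeout reports whether wg finished within d.
func waitTimeout(wg *sync.WaitGroup, d time.Duration) bool {
	done := make(chan struct{})
	go func() {
		wg.Wait()
		close(done)
	}()
	select {
	case <-done:
		return true
	case <-time.After(d):
		// The waiter goroutine is left behind; tolerable in a short-lived test.
		return false
	}
}

// finishesWithin starts n goroutines running work and reports whether all
// of them completed within d.
func finishesWithin(n int, d time.Duration, work func()) bool {
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			work()
		}()
	}
	return waitTimeout(&wg, d)
}

func main() {
	fmt.Println("no-op work finished:", finishesWithin(10, deadline, func() {}))

	stuck := make(chan struct{}) // never closed: simulates a deadlocked goroutine
	fmt.Println("stuck work finished:", finishesWithin(1, shortDeadline, func() { <-stuck }))
}
```

With a plain `wg.Wait()`, the second call would never return; with the bounded wait, the failure surfaces after the short deadline.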

@kavirajk
Contributor

Thanks @afayngelerindbx, looks good. I just added one minor suggestion to add a WaitGroup to the test.

Also, it looks like the license/cla check is pending. Can you sign it? Then I will merge :)

@afayngelerindbx
Contributor Author

> thanks @afayngelerindbx looks good. Just added one minor suggestion to add waitGroup to the test.
>
> Also looks like license/cla pending. Can you sign it? Then I will merge :)

I think I signed. The CLA bot comment says so as well: #5568 (comment). I'm not sure why the check isn't passing. Any suggestions?

@afayngelerindbx afayngelerindbx force-pushed the nil-ptr-fix branch 2 times, most recently from 970a2fc to eae3ca4 Compare March 10, 2022 11:32
@chaudum
Contributor

chaudum commented Mar 10, 2022

> And for CHANGELOG. I don't think we need entry there as this is not user-visible change :)

I think it makes sense to add a changelog entry, because this fixes a race condition that produces a nil pointer dereference. Other people may be affected as well.

Contributor

@chaudum chaudum left a comment

Minor thing regarding the usage of the pointer inside a range loop.
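The exact suggestion is not shown in this thread, so the following is only a general sketch of the usual concern with pointers in a range loop. Before Go 1.22, the range variable was a single variable reused across iterations, so taking its address in every iteration yielded the same pointer each time. Indexing into the slice avoids the issue on any Go version. The helper name `pointersByIndex` and the `int64` element type are assumptions for the example.

```go
package main

import "fmt"

// pointersByIndex returns pointers to the elements of ts themselves.
// Using &ts[i] instead of the address of the range variable gives one
// distinct pointer per element regardless of Go version.
func pointersByIndex(ts []int64) []*int64 {
	ptrs := make([]*int64, 0, len(ts))
	for i := range ts {
		ptrs = append(ptrs, &ts[i])
	}
	return ptrs
}

func main() {
	ts := []int64{10, 20, 30}
	for _, p := range pointersByIndex(ts) {
		fmt.Println(*p)
	}
}
```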

pkg/canary/comparator/comparator_test.go (outdated, resolved)
Co-authored-by: Christian Haudum <christian.haudum@gmail.com>
@kavirajk kavirajk merged commit e49c360 into grafana:main Mar 10, 2022
Successfully merging this pull request may close these issues.

Panic in canaries.
4 participants