Fix race between OnDemandHousekeeping and housekeepingTick #2755
Conversation
This fixes the data race between the two functions, which can be running in different goroutines. Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
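A minimal sketch of the locking pattern the change follows, assuming statsLastUpdatedTime is the shared field and lock is a sync.Mutex on containerData; the helper names below are hypothetical, not the actual cAdvisor API:

package main

import (
	"sync"
	"time"
)

// Illustrative only, not the exact cAdvisor code: both the periodic
// housekeeping goroutine and callers of OnDemandHousekeeping touch
// statsLastUpdatedTime, so every read and write goes through the same mutex.
type containerData struct {
	lock                 sync.Mutex
	statsLastUpdatedTime time.Time
}

// Hypothetical helper called from the housekeeping goroutine after stats are refreshed.
func (cd *containerData) setStatsUpdatedTime(t time.Time) {
	cd.lock.Lock()
	defer cd.lock.Unlock()
	cd.statsLastUpdatedTime = t
}

// Hypothetical helper called from OnDemandHousekeeping, possibly on another goroutine.
func (cd *containerData) statsAge(now time.Time) time.Duration {
	cd.lock.Lock()
	defer cd.lock.Unlock()
	return now.Sub(cd.statsLastUpdatedTime)
}

func main() {
	cd := &containerData{}
	cd.setStatsUpdatedTime(time.Now())
	_ = cd.statsAge(time.Now())
}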
Hi @pkagrawal. Thanks for your PR. I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Fixes #2754
Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
…datedTime) Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
/ok-to-test
@@ -555,6 +558,8 @@ func (cd *containerData) housekeepingTick(timer <-chan time.Time, longHousekeepi
			klog.V(3).Infof("[%s] Housekeeping took %s", cd.info.Name, duration)
		}
		cd.notifyOnDemand()
		cd.lock.Lock()
I believe that the call above this line is responsible for synchronization with containerData.OnDemandHousekeeping(). The lock that you are trying to introduce looks redundant to me. Are you able to provide a test case that helps the race manifest?
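A rough sketch of the notify/wait handshake referred to above; onDemandChan and the method bodies here are illustrative assumptions, not the exact cAdvisor implementation, and the handshake alone does not protect the read of statsLastUpdatedTime:

package main

import "time"

// Sketch only: the channel handshake makes an on-demand caller wait for the
// housekeeping goroutine, but by itself it does not synchronize access to
// statsLastUpdatedTime with the write in housekeepingTick.
type containerData struct {
	statsLastUpdatedTime time.Time
	onDemandChan         chan chan struct{}
}

func (cd *containerData) OnDemandHousekeeping(maxAge time.Duration) {
	// This read is one side of the race reported below.
	if time.Since(cd.statsLastUpdatedTime) > maxAge {
		done := make(chan struct{})
		cd.onDemandChan <- done // ask the housekeeping goroutine to refresh now
		<-done                  // wait until it signals completion
	}
}

func (cd *containerData) notifyOnDemand() {
	// Drain all pending requests and signal their callers.
	for {
		select {
		case done := <-cd.onDemandChan:
			close(done)
		default:
			return
		}
	}
}

func main() {
	cd := &containerData{onDemandChan: make(chan chan struct{}, 1)}
	// Simulate the housekeeping goroutine periodically signalling completion.
	go func() {
		for {
			time.Sleep(time.Millisecond)
			cd.notifyOnDemand()
		}
	}()
	cd.OnDemandHousekeeping(0)
}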
I don't have a test case, but we were using cadvisor and our race detector tripped up:
WARNING: DATA RACE
Read at 0x00c0001e7a50 by goroutine 30:
github.com/google/cadvisor/manager.(*containerData).OnDemandHousekeeping()
vendor/github.com/google/cadvisor/manager/container.go:141 +0x9b
github.com/google/cadvisor/manager.(*manager).getRequestedContainers.func1()
vendor/github.com/google/cadvisor/manager/manager.go:741 +0x64
Previous write at 0x00c0001e7a50 by goroutine 130:
github.com/google/cadvisor/manager.(*containerData).housekeepingTick()
vendor/github.com/google/cadvisor/manager/container.go:558 +0x34c
github.com/google/cadvisor/manager.(*containerData).housekeeping()
vendor/github.com/google/cadvisor/manager/container.go:494 +0x304
Thank you 🙇. I think we should use a single synchronisation mechanism in this case. I wonder whether moving the notifyOnDemand() call a bit lower and applying a similar change to OnDemandHousekeeping() would solve the issue.
Thanks for looking. I'm not quite sure I follow what you mean by a single synchronization mechanism. OnDemandHousekeeping is an external API, so anyone can call it, and it can potentially race with housekeepingTick when both try to access statsLastUpdatedTime. Access to this variable should be guarded by some mutual exclusion, right?
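As a concrete illustration (not cAdvisor code), a minimal program that go run -race flags in the same way as the report above: one goroutine writes a shared time.Time while another reads it with no lock.

package main

import (
	"fmt"
	"sync"
	"time"
)

// Deliberately racy read/write of a shared time.Time, analogous to the
// unsynchronized access to statsLastUpdatedTime described above.
type data struct {
	lastUpdated time.Time
}

func main() {
	d := &data{lastUpdated: time.Now()}
	var wg sync.WaitGroup
	wg.Add(1)

	go func() {
		defer wg.Done()
		for i := 0; i < 1000; i++ {
			d.lastUpdated = time.Now() // write, no lock
		}
	}()

	for i := 0; i < 1000; i++ {
		_ = time.Since(d.lastUpdated) // read, no lock: -race reports WARNING: DATA RACE
	}

	wg.Wait()
	fmt.Println("done")
}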
@iwankgb Can you take a look at this again? Thanks!
@pkagrawal I don't think I will be able to look at this issue again before coming weekend.
@pkagrawal I was trying to replicate your problem and I simply can't. I wrote the following simple test:

func TestOnDemandHousekeepingRace(t *testing.T) {
	statsList := itest.GenerateRandomStats(1, 4, 1*time.Second)
	stats := statsList[0]
	//cd, mockHandler, _, fakeClock := newTestContainerData(t)
	cd, mockHandler, _, _ := newTestContainerData(t)
	mockHandler.On("GetStats").Return(stats, nil)
	wg := sync.WaitGroup{}
	wg.Add(1)
	err := cd.Start()
	assert.NoError(t, err)
	go func() {
		t.Log("starting on demand goroutine")
		for i := 0; i < 100000; i++ {
			t.Logf("on demand iteration %d", i)
			cd.OnDemandHousekeeping(0 * time.Second)
		}
		wg.Done()
	}()
	wg.Wait()
}

and I can't get any race.
@pkagrawal I have just realized that I was trying to use the race detector on an architecture where it is not supported.
Let me take one more look at this problem.
I do confirm that this is a real issue, using the following test case:

func TestOnDemandHousekeepingRace(t *testing.T) {
	statsList := itest.GenerateRandomStats(1, 4, 1*time.Second)
	stats := statsList[0]
	//cd, mockHandler, _, fakeClock := newTestContainerData(t)
	cd, mockHandler, _, _ := newTestContainerData(t)
	mockHandler.On("GetStats").Return(stats, nil)
	wg := sync.WaitGroup{}
	wg.Add(1002)
	go func() {
		time.Sleep(10 * time.Millisecond)
		err := cd.Start()
		assert.NoError(t, err)
		wg.Done()
	}()
	go func() {
		t.Log("starting on demand goroutine")
		for i := 0; i < 1000; i++ {
			go func() {
				time.Sleep(1 * time.Microsecond)
				cd.OnDemandHousekeeping(0 * time.Millisecond)
				wg.Done()
			}()
		}
		wg.Done()
	}()
	wg.Wait()
}
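To reproduce this locally, the test only reports the race with the detector enabled on a supported platform, along these lines (the package path is assumed):

go test -race -run TestOnDemandHousekeepingRace ./manager/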
@pkagrawal, could you do the following:
I'm sorry for delaying your work, but I had no idea that the race detector works on amd64 only.
@bobbypage, I summon you, as this is a bug indeed.
Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
@iwankgb Thanks for taking a look again and pointing me to a test case. I've now added your test case as well.
LGTM
LGTM, thanks. |