Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix race between OnDemandHousekeeping and housekeepingTick #2755

Merged
merged 4 commits into from
Dec 30, 2020

Conversation

pkagrawal
Copy link
Contributor

@pkagrawal pkagrawal commented Dec 7, 2020

This fixes the data race between the functions which could be running
in different goroutines.

Signed-off-by: Pradyumna Agrawal pradyumnaa@vmware.com

Fixes #2754

This fixes the data race between the functions which could be running
in different goroutines.

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
@k8s-ci-robot
Copy link
Collaborator

Hi @pkagrawal. Thanks for your PR.

I'm waiting for a google member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@pkagrawal
Copy link
Contributor Author

Fixes #2754

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
…datedTime)

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
@iwankgb
Copy link
Collaborator

iwankgb commented Dec 7, 2020

/ok-to-test

@@ -555,6 +558,8 @@ func (cd *containerData) housekeepingTick(timer <-chan time.Time, longHousekeepi
klog.V(3).Infof("[%s] Housekeeping took %s", cd.info.Name, duration)
}
cd.notifyOnDemand()
cd.lock.Lock()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that call above this line is responsible for synchronization with containerData.OnDemandHousekeeping(). Lock that you are trying to introduce looks redundant to me. Are you able to provide a test case that help race to manifest?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a test case, but we were using cadvisor and our race detector tripped up:

WARNING: DATA RACE
Read at 0x00c0001e7a50 by goroutine 30:
  github.com/google/cadvisor/manager.(*containerData).OnDemandHousekeeping()
      vendor/github.com/google/cadvisor/manager/container.go:141 +0x9b
  github.com/google/cadvisor/manager.(*manager).getRequestedContainers.func1()
      vendor/github.com/google/cadvisor/manager/manager.go:741 +0x64

Previous write at 0x00c0001e7a50 by goroutine 130:
  github.com/google/cadvisor/manager.(*containerData).housekeepingTick()
      vendor/github.com/google/cadvisor/manager/container.go:558 +0x34c
  github.com/google/cadvisor/manager.(*containerData).housekeeping()
      vendor/github.com/google/cadvisor/manager/container.go:494 +0x304

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you 🙇. I think we should use single synchronisation mechanism in this case. I wonder if moving notifyOnDemand() call a bit lower and applying similar change to OnDemandHousekeeping() would not solve the issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for looking. I'm not quite sure that I follow what you mean by single synchronization. OnDemandHousekeeping is an external API so anyone can call it, and it can potentially race with housekeepingTick when both try and access statsLastUpdatedTime. Access to this variable should be guarded by some mutual exclusion - right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@iwankgb Can you take a look at this again? Thanks!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pkagrawal I don't think I will be able to look at this issue again before coming weekend.

@iwankgb
Copy link
Collaborator

iwankgb commented Dec 12, 2020

@pkagrawal I was trying to replicate your problem and I simply can do this. I wrote following simple test:

func TestOnDemandHousekeepingRace(t *testing.T) {
	statsList := itest.GenerateRandomStats(1, 4, 1*time.Second)
	stats := statsList[0]

	//cd, mockHandler, _, fakeClock := newTestContainerData(t)
	cd, mockHandler, _, _ := newTestContainerData(t)
	mockHandler.On("GetStats").Return(stats, nil)

	wg := sync.WaitGroup{}
	wg.Add(1)

	err := cd.Start()
	assert.NoError(t, err)

	go func() {
		t.Log("starting on demand goroutine")
		for i := 0; i < 100000; i++  {
			t.Logf("on demand iteration %d", i)
			cd.OnDemandHousekeeping(0 * time.Second)
		}
		wg.Done()
	}()
	wg.Wait()
}

and I can't get any race:

=== RUN   TestOnDemandHousekeepingRace
--- PASS: TestOnDemandHousekeepingRace (0.14s)
PASS
ok  	github.com/google/cadvisor/manager	1.260s

@iwankgb
Copy link
Collaborator

iwankgb commented Dec 29, 2020

@pkagrawal I have just realized that I was trying to use race detector on linux/arm64 and:

     -race
           enable data race detection.
           Supported only on linux/amd64, freebsd/amd64, darwin/amd64 and windows/amd64.

Let me take one more look at this problem.

@iwankgb
Copy link
Collaborator

iwankgb commented Dec 29, 2020

I do confirm that this is a real issue:

[root@fedora-s-4vcpu-8gb-fra1-01 cadvisor]# uname -m
x86_64
[root@fedora-s-4vcpu-8gb-fra1-01 cadvisor]# GOMAXPROCS=4 go test -v -race -run TestOnDemandHousekeepingRace ./manager/
=== RUN   TestOnDemandHousekeepingRace
    container_test.go:342: starting on demand goroutine
==================
WARNING: DATA RACE
Read at 0x00c0004c6610 by goroutine 108:
  github.com/google/cadvisor/manager.(*containerData).OnDemandHousekeeping()
      /root/go/src/github.com/google/cadvisor/manager/container.go:141 +0x9b
  github.com/google/cadvisor/manager.TestOnDemandHousekeepingRace.func2.1()
      /root/go/src/github.com/google/cadvisor/manager/container_test.go:346 +0x4e

Previous write at 0x00c0004c6610 by goroutine 100:
  github.com/google/cadvisor/manager.(*containerData).housekeepingTick()
      /root/go/src/github.com/google/cadvisor/manager/container.go:558 +0x34c
  github.com/google/cadvisor/manager.(*containerData).housekeeping()
      /root/go/src/github.com/google/cadvisor/manager/container.go:494 +0x304

Goroutine 108 (running) created at:
  github.com/google/cadvisor/manager.TestOnDemandHousekeepingRace.func2()
      /root/go/src/github.com/google/cadvisor/manager/container_test.go:344 +0xaf

Goroutine 100 (running) created at:
  github.com/google/cadvisor/manager.(*containerData).Start()
      /root/go/src/github.com/google/cadvisor/manager/container.go:114 +0x4c
  github.com/google/cadvisor/manager.TestOnDemandHousekeepingRace.func1()
      /root/go/src/github.com/google/cadvisor/manager/container_test.go:336 +0x49
==================
    testing.go:1038: race detected during execution of test
--- FAIL: TestOnDemandHousekeepingRace (0.06s)
=== CONT
    testing.go:1038: race detected during execution of test
FAIL
FAIL	github.com/google/cadvisor/manager	0.119s
FAIL

using following test case:

func TestOnDemandHousekeepingRace(t *testing.T) {
	statsList := itest.GenerateRandomStats(1, 4, 1*time.Second)
	stats := statsList[0]

	//cd, mockHandler, _, fakeClock := newTestContainerData(t)
	cd, mockHandler, _, _ := newTestContainerData(t)
	mockHandler.On("GetStats").Return(stats, nil)

	wg := sync.WaitGroup{}
	wg.Add(1002)

	go func() {
		time.Sleep(10 * time.Millisecond)
		err := cd.Start()
		assert.NoError(t, err)
		wg.Done()
	}()

	go func() {
		t.Log("starting on demand goroutine")
		for i := 0; i < 1000; i++  {
			go func() {
				time.Sleep(1 * time.Microsecond)
				cd.OnDemandHousekeeping(0 * time.Millisecond)
				wg.Done()
			}()
		}
		wg.Done()
	}()
	wg.Wait()
}

@iwankgb
Copy link
Collaborator

iwankgb commented Dec 29, 2020

@pkagrawal, could you do the following:

I'm sorry for delaying you work but I had no idea that race detector works on amd64 only.

@iwankgb
Copy link
Collaborator

iwankgb commented Dec 29, 2020

@bobbypage, I summon you as this is a bug indeed.

Signed-off-by: Pradyumna Agrawal <pradyumnaa@vmware.com>
@pkagrawal
Copy link
Contributor Author

pkagrawal commented Dec 29, 2020

@iwankgb Thanks for taking a look again and pointing me to a test case. I've now added your test case as well.

@iwankgb
Copy link
Collaborator

iwankgb commented Dec 30, 2020

LGTM

@bobbypage
Copy link
Collaborator

LGTM, thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Data race between OnDemandHousekeeping and housekeepingTick
4 participants