Stop can cause race detector errors in testing #59
Btw, when I say that […]

---

I've also intermittently seen another race detector error when running the tests. This one doesn't seem to involve channels, and I'm not quite sure what is going on with it: based on the stack trace, it seems like maybe it's trying to print an error message that references an […]. But also, this might indicate that the test is actually unstable, since I wouldn't expect it to be calling […].

---
I'm happy to try to fix these. I believe I've fixed the intermittent issue that you were seeing. One of the fundamental problems with ccache tests is that many ccache behaviors are asynchronous, so our test needs to sleep for a random period of time and hope the behavior under test has finished executing. Go's `select`, unlike Erlang's message queue, provides no ordering (FIFO) guarantees, so (as far as I can tell) we have no option but to sleep. The `LayeredCache`'s ResizeOnTheFly test was having problems because of this. And when the following assertion failed:

[…]

we'd have a race condition because the test runner was reading values that it shouldn't (via reflection). Making the test more robust removes this issue. I've added `defer c.Stop()` to a few tests, but haven't been able to reproduce this yet.

---
Yesterday I was able to reproduce the "Stop causes a race on the channel" behavior pretty consistently, both in my project's tests that use ccache and also in ccache's own tests. But today it's happening only in my project's tests, and the only one I can reproduce in the ccache tests is the ResizeOnTheFly problem. I expect this is very timing-dependent, so maybe overall execution speed on my machine is different today due to load from other processes.

---
This is a tangent, but about the "test runner reading values that it shouldn't" thing: you might be able to address that by defining a […].

---
About waiting for asynchronous things in the tests, like the changing value of `GetDropped()`: what I've found to be more reliable than adding arbitrary sleeps in cases like this is to poll the condition. It's not hard to write a helper method to generalize that kind of logic; you would just specify a boolean function, the maximum time to wait for the condition (this timeout can be pretty long, like multiple seconds, since it's just to ensure that the test won't hang forever if it fails, and if it does fail it's OK for the test run to be slow), how often to poll the condition, and a function to generate a nice failure message. Our tests are currently using a helper like this:

```go
// Fail comes from the test framework in use; time is from the stdlib.
func waitForCondition(timeout time.Duration, pollInterval time.Duration,
	condition func() bool, failMessage func() string) {
	timeoutCh := time.After(timeout)
	pollTicker := time.NewTicker(pollInterval)
	defer pollTicker.Stop()
	for {
		select {
		case <-timeoutCh:
			Fail(failMessage())
		case <-pollTicker.C:
			if condition() {
				return
			}
		}
	}
}
```

And then for the specific case of waiting for the dropped count to change (this version only works for `*Cache`):

```go
func waitForDroppedCount(cache *Cache, expectedCount int) {
	var lastCount int
	waitForCondition(time.Second*2, time.Millisecond*5,
		func() bool {
			lastCount = cache.GetDropped()
			return lastCount == expectedCount
		},
		func() string {
			return fmt.Sprintf("timed out waiting for dropped count to be %d - last count was %d",
				expectedCount, lastCount)
		})
}
```

I don't think that is quite right - I just tried it out and got a timeout I didn't expect, which is hard to diagnose currently because […].

---
I had a thought about reproducing the Stop problem, although I don't quite know how to apply the thought yet. The specific condition the race detector is complaining about seems to be the one that's documented here, where a […].

So I'm guessing that the failure depends on the following conditions: […]

If that's correct, then I would expect to be able to reproduce the issue better if […]. If I'm correctly reading the race detector docs I linked to (though I'm far from certain about that), one solution might be to add non-blocking receives on both […]. But of course it won't be possible to verify that that's a fix until it's possible to reproduce the failure.

---
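For concreteness, the non-blocking-receive idea might look something like this sketch. The channel name is illustrative, and (as noted above) this is unverified as a fix; in particular it only helps if the senders are already known to be finished, since it can't order itself against a send that is still in flight.

```go
package main

import "fmt"

// Sketch: before closing a channel that other goroutines may have sent on,
// drain any pending values with a select/default so that the drained sends
// are ordered before the close.
func drainAndClose(ch chan int) {
	for {
		select {
		case v, ok := <-ch:
			if !ok {
				return // already closed by someone else
			}
			_ = v // discard the buffered value
		default:
			close(ch) // nothing pending: ordered after the drained sends
			return
		}
	}
}

func main() {
	ch := make(chan int, 4)
	ch <- 1
	ch <- 2
	drainAndClose(ch)
	_, ok := <-ch
	fmt.Println(ok) // prints false: channel is closed and empty
}
```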
Sorry, looking at that trace again from my previous comment, I don't think it is relevant. That's a different problem, one that is specific to my project: it was trying to read from the cache on a different goroutine than the one that was closing the cache, which of course is no good, since there's no guarantee that the cache wouldn't have already been closed. That part is easy to fix in my own code, but I don't think it could have been what was failing yesterday in the ccache tests. I wish I had saved the stack traces from those.

---
And... I'm sorry to leave so many comments, but this has made me think more about the concurrency issues around Stop, and I'm realizing that ccache is not as convenient for my use cases as I'd hoped.

That is: ccache operations can panic if you've already called Stop. I know you've documented that fact in the comments on the Stop method, but it's a non-trivial usability issue in any situation, like library code, where the number of active goroutines that might be calling into this code isn't under my control. Even though the cache is thread-safe in terms of concurrent gets and sets, its overall state of being safe to use is not thread-safe: any goroutine that calls Stop will blow up any others that are currently in a code path that touches the cache.

I think this means I'll have to have a `RWMutex` and put a read lock around all of my calls to Get or Set, and a write lock around my call to Stop. That should prevent panics, but it's an extra layer of synchronization around calls that already contain synchronization themselves, which is kind of unfortunate (not just because of the slight overhead of an extra `RLock()`, but because it's nice to stick to "don't put a lock around anything that might itself need to wait on a lock" whenever possible, on general principle).

The underlying issue is the possibility of sending on a closed channel, and that's something people already have to worry about if they're writing any Go code that maintains a channel; there are well-defined patterns for that. It's just that one doesn't normally think of a Get on a thread-safe cache as involving a channel send, and it's common to use caches in ways that don't go well with those patterns.

So, I'm wondering if it might be possible to rearrange anything inside ccache to let it leverage its existing synchronization logic to make this more safe. If I can think of anything, I'll submit a PR, but this may just be a futile daydream.

---
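A sketch of that wrapper, with a `sync.Map` standing in as a hypothetical thread-safe cache (this is not ccache's API, just the locking shape being described):

```go
package main

import (
	"fmt"
	"sync"
)

// stoppableCache takes a read lock around every cache call and a write lock
// around Stop, so Stop waits for all in-flight calls to finish, and calls
// made after Stop see the stopped flag instead of panicking.
type stoppableCache struct {
	mu      sync.RWMutex
	stopped bool
	inner   sync.Map // stand-in for a real thread-safe cache
}

func (c *stoppableCache) Get(key string) (string, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	if c.stopped {
		return "", false // after Stop: report a miss instead of panicking
	}
	v, ok := c.inner.Load(key)
	if !ok {
		return "", false
	}
	return v.(string), true
}

func (c *stoppableCache) Set(key, value string) {
	c.mu.RLock() // read lock: many Sets may run concurrently; inner is thread-safe
	defer c.mu.RUnlock()
	if c.stopped {
		return
	}
	c.inner.Store(key, value)
}

func (c *stoppableCache) Stop() {
	c.mu.Lock() // blocks until every in-flight Get/Set releases its read lock
	defer c.mu.Unlock()
	c.stopped = true
}

func main() {
	c := &stoppableCache{}
	c.Set("k", "v")
	v, ok := c.Get("k")
	fmt.Println(v, ok) // prints v true
	c.Stop()
	_, ok = c.Get("k")
	fmt.Println(ok) // prints false
}
```

This is exactly the "lock around a lock" layering the comment calls unfortunate, but it does guarantee that no goroutine is inside a cache call when Stop proceeds.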
Regarding the test sync issues, I have a solution that might be better than the polling idea - I'll submit a PR.

---
Stop no longer works by closing `promotables`. Instead, it uses a control message (control messages are relatively new, which I guess is why I didn't use one in the first place). The major downside is that instead of panicking, concurrent operations could block forever (because the worker is no longer receiving). To compensate:

1. Stop is delayed (by a configurable time), to let concurrent work finish. […]

Still, I'm not sure this is acceptable. I'd rather have it panic than hang. Further, when […].

As an aside, the tests no longer cheat. There are now public GC and Size functions, which use the control channel. This helps eliminate test-specific race issues.

---
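A minimal sketch of the control-message approach described here (the names are illustrative, not ccache's actual internals): nothing is closed, so concurrent sends can't panic, but, as noted, they can now block forever once the worker stops receiving.

```go
package main

import "fmt"

type stopMsg struct{ done chan struct{} }

type cache struct {
	promotables chan string      // item keys waiting to be promoted
	control     chan interface{} // control messages, including stop
}

func (c *cache) worker() {
	for {
		select {
		case key := <-c.promotables:
			_ = key // promote the item
		case msg := <-c.control:
			if m, ok := msg.(stopMsg); ok {
				close(m.done) // acknowledge, then stop receiving
				return
			}
		}
	}
}

// Stop asks the worker to exit and waits for the acknowledgement.
func (c *cache) Stop() {
	done := make(chan struct{})
	c.control <- stopMsg{done: done}
	<-done
}

func main() {
	c := &cache{promotables: make(chan string, 8), control: make(chan interface{})}
	go c.worker()
	c.promotables <- "a"
	c.Stop()
	fmt.Println("stopped") // prints stopped
}
```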
Took another stab at this. Cleaned up control some. Don't rely on closing channels to stop.

---
I hadn't revisited this in a long time, so I tried updating my code to use the latest release of ccache (nice to have generics now), and I found that the problem is still there. I see that you've had a PR open since last year, which I hadn't seen at the time... unfortunately it's now out of date due to upstream changes.

---
I re-created the […]. Can you give them a spin?

---
The unit tests for our project are normally run with the `-race` option, because we do a lot of concurrency stuff and want to avoid subtle unsafe usages. When I integrated ccache into the project, I started getting race detector errors in a test scenario where the cache is shut down with `Stop()` at the end of the test.

It seems that this is due to what the race detector considers to be unsafe usage of the `promotables` channel, where there is the potential for a race between `close` and a previous channel send, as documented here. The race detector isn't saying that a race really did happen during the test run, but it can tell, based on the pattern of accesses to the channel, that one could happen, so it considers that to be an automatic fail.

I wondered why such an issue wouldn't have shown up in ccache's own unit tests, but that's because the tests never call `Stop` at all. Like, there's no `defer cache.Stop()` after creating a store (so I imagine there are a lot of orphaned goroutines being created during test runs), and also there doesn't seem to be any test coverage of `Stop` itself.

When I added a single `defer cache.Stop()` to a test, and then ran `go test -race ./...` instead of `go test ./...`, I immediately got the same kind of error. In any codebase where concurrency is very important, like this one, this is a bit concerning. Even if this particular kind of race condition might not have significant consequences in itself, the fact that it's not possible to run tests with race detection means we can't use that tool to detect other kinds of concurrency problems.
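For reference, the send/close ordering the race detector enforces can be shown in miniature. This is not ccache code, just the minimal form of the documented pattern: a close races with a send unless something orders them, and receiving the sent value first establishes that happens-before edge, so this version is clean under `go run -race`.

```go
package main

import "fmt"

func sendReceiveClose() int {
	ch := make(chan int)
	go func() {
		ch <- 1 // send from another goroutine
	}()
	v := <-ch // the receive is ordered after the send...
	close(ch) // ...so the close can no longer race with the send
	return v
}

func main() {
	fmt.Println(sendReceiveClose()) // prints 1
}
```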