proposal: testing/synctest: new package for testing concurrent code #67434
Comments
I really like how simple this API is.
How does time work when goroutines aren't idle? Does it stand still, or does it advance at the usual rate? If it stands still, it seems like that could break software that assumes time will advance during computation (though maybe that's rare in practice). If it advances at the usual rate, it seems like that reintroduces a source of flakiness. E.g., in your example, the 1 second sleep will advance time by 1 second, but then on a slow system the checking thread may still not execute for a long time. What are the bounds of the fake time implementation? Presumably if you're making direct system calls that interact with times or durations, we're not going to do anything about that. Are we going to make any attempt at faking time in the file system?
What if a goroutine is blocked on a channel that goes outside the group? This came to mind in the context of whether this could be used to coordinate a multi-process client/server test, though I think it would also come up if there's any sort of interaction with a background worker goroutine or pool.
What happens if multiple goroutines in a group call Wait? I think the options are to panic or to consider all of them idle, in which case they would all wake up when every other goroutine in the group is idle. What happens if you have nested groups, say group A contains group B, and a goroutine in B is blocked in Wait, and then a goroutine in A calls Wait? I think your options are to panic (though that feels wrong), wake up both if all of the goroutines in group A are idle, or wake up just B if all of the goroutines in B are idle (but this blocks waking up A until nothing is calling Wait in group B). |
Time stands still, except when all goroutines in a group are idle. (Same as the playground behaves, I believe.) This would break software that assumes time will advance. You'd need to use something else to test that case.
The time package: Faking time in the filesystem seems complicated and highly specialized, so I don't think we should try. Code which cares about file timestamps will need to use a test
As proposed, this would count as an idle goroutine. If you fail to isolate the system under test this will probably cause problems, so don't do that.
As proposed, none of them ever wake up and your test times out, or possibly panics if we can detect that all goroutines are blocked in that case. Having them all wake at the same time would also be reasonable.
Oh, I didn't think of that. Nested groups are too complicated, |
This is a very interesting proposal! I feel worried that the Assuming that's a valid concern (if it isn't then I'll retract this entire comment!), I could imagine mitigating it in two different ways:
(I apologize in advance if I misunderstood any part of the proposal or if I am missing something existing that's already similarly convenient to |
The fact that I think using idle-wait synchronization outside of tests is always going to be a mistake. It's fragile and fiddly, and you're better served by explicit synchronization. (This prompts the question: Isn't this fragile and fiddly inside tests as well? It is, but using a fake clock removes many of the sources of fragility, and tests often have requirements that make the fiddliness a more worthwhile tradeoff. In the expiring cache example, for instance, non-test code will never need to guarantee that a cache entry expires precisely at the nanosecond defined.) So while perhaps we could offer a standalone synchronization primitive outside of As for passing a |
Interesting proposal. I like that it allows for waiting for a group of goroutines, as opposed to all goroutines in my proposal (#65336), though I do have some concerns:
|
One of the goals of this proposal is to minimize the amount of unnatural code required to make a system testable. Mock time implementations require replacing calls to idiomatic time package functions with a testable interface. Putting fake time in the standard library would let us just write the idiomatic code without compromising testability. For timeouts, the Also, it would be pointless for |
I wanted to evaluate practical usage of the proposed API.

I wrote a version of Run and Wait based on parsing the output of runtime.Stack. Wait calls runtime.Gosched in a loop until all goroutines in the current group are idle. I also wrote a fake time implementation. Combined, these form a reasonable facsimile of the proposed synctest package, with some limitations: The code under test needs to be instrumented to call the fake time functions, and to call a marking function after creating new goroutines. Also, you need to call a synctest.Sleep function in tests to advance the fake clock.

I then added this instrumentation to net/http. The synctest package does not work with real network connections, so I added an in-memory net.Conn implementation to the net/http tests. I also added an additional helper to net/http's tests, which simplifies some of the experimentation below:

var errStillRunning = errors.New("async op still running")
// asyncResult is the result of an asynchronous operation.
type asyncResult[T any] struct {}
// runAsync runs f in a new goroutine,
// and returns an asyncResult which is populated with the result of f when it finishes.
// runAsync calls synctest.Wait after running f.
func runAsync[T any](f func() (T, error)) *asyncResult[T]
// done reports whether the asynchronous operation has finished.
func (r *asyncResult[T]) done() bool
// result returns the result of the asynchronous operation.
// It returns errStillRunning if the operation is still running.
func (r *asyncResult[T]) result() (T, error)

One of the longest-running tests in the net/http package is TestServerShutdownStateNew (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/serve_test.go#5611). This test creates a server, opens a connection to it, and calls Server.Shutdown. It asserts that the server, which is expected to wait 5 seconds for the idle connection to close, shuts down in no less than 2.5 seconds and no more than 7.5 seconds. This test generally takes about 5-6 seconds to run in both HTTP/1 and HTTP/2 modes.

The portion of this test which performs the shutdown is:

shutdownRes := make(chan error, 1)
go func() {
	shutdownRes <- ts.Config.Shutdown(context.Background())
}()
readRes := make(chan error, 1)
go func() {
	_, err := c.Read([]byte{0})
	readRes <- err
}()

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

t0 := time.Now()
select {
case got := <-shutdownRes:
	d := time.Since(t0)
	if got != nil {
		t.Fatalf("shutdown error after %v: %v", d, err)
	}
	if d < expectTimeout/2 {
		t.Errorf("shutdown too soon after %v", d)
	}
case <-time.After(expectTimeout * 3 / 2):
	t.Fatalf("timeout waiting for shutdown")
}

// Wait for c.Read to unblock; should be already done at this point,
// or within a few milliseconds.
if err := <-readRes; err == nil {
	t.Error("expected error from Read")
}

I wrapped the test in a synctest.Run call and changed it to use the in-memory connection. I then rewrote this section of the test:

shutdownRes := runAsync(func() (struct{}, error) {
	return struct{}{}, ts.Config.Shutdown(context.Background())
})
readRes := runAsync(func() (int, error) {
	return c.Read([]byte{0})
})

// TODO(#59037): This timeout is hard-coded in closeIdleConnections.
// It is undocumented, and some users may find it surprising.
// Either document it, or switch to a less surprising behavior.
const expectTimeout = 5 * time.Second

synctest.Sleep(expectTimeout - 1)
if shutdownRes.done() {
	t.Fatal("shutdown too soon")
}
synctest.Sleep(2 * time.Second)
if _, err := shutdownRes.result(); err != nil {
	t.Fatalf("Shutdown() = %v, want complete", err)
}
if n, err := readRes.result(); err == nil || err == errStillRunning {
	t.Fatalf("Read() = %v, %v; want error", n, err)
}

The test exercises the same behavior it did before, but it now runs instantaneously. (0.01 seconds on my laptop.)

I made an interesting discovery after converting the test: The server does not actually shut down in 5 seconds. In the initial version of this test, I checked for shutdown exactly 5 seconds after calling Shutdown. The test failed, reporting that the Shutdown call had not completed.

Examining the Shutdown function revealed that the server polls for closed connections during shutdown, with a maximum poll interval of 500ms, and therefore shutdown can be delayed slightly past the point where connections have shut down.

I changed the test to check for shutdown after 6 seconds. But once again, the test failed.

Further investigation revealed this code (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/server.go#3041):

st, unixSec := c.getState()
// Issue 22682: treat StateNew connections as if
// they're idle if we haven't read the first request's
// header in over 5 seconds.
if st == StateNew && unixSec < time.Now().Unix()-5 {
	st = StateIdle
}

The comment states that new connections are treated as idle after 5 seconds, but because the comparison is made on whole-second Unix timestamps, the condition requires a difference of at least 6 between the two truncated values. Depending on where within a second the connection was created, the server can start treating it as idle after as little as just over 5 or as much as 6 seconds. Combined with the 500ms poll interval (and ignoring any added scheduler delay), Shutdown may take up to 6.5 seconds to complete, not 5.

Using a fake clock rather than a real one not only speeds up this test dramatically, but it also allows us to more precisely test the behavior of the system under test.

Another slow test is TestTransportExpect100Continue (https://go.googlesource.com/go/+/refs/tags/go1.22.3/src/net/http/transport_test.go#1188). This test sends an HTTP request containing an "Expect: 100-continue" header, which indicates that the client is waiting for the server to indicate that it wants the request body before it sends it. In one variation, the server does not send a response; after a 2 second timeout, the client gives up waiting and sends the request.

This test takes 2 seconds to execute, thanks to this timeout. In addition, the test does not validate the timing of the client sending the request body; in particular, tests pass even if the client waits

The portion of the test which sends the request is:

resp, err := c.Do(req)

I changed this to:

rt := runAsync(func() (*Response, error) {
	return c.Do(req)
})
if v.timeout {
	synctest.Sleep(expectContinueTimeout - 1)
	if rt.done() {
		t.Fatalf("RoundTrip finished too soon")
	}
	synctest.Sleep(1)
}
resp, err := rt.result()
if err != nil {
	t.Fatal(err)
}

This test now executes instantaneously. It also verifies that the client does or does not wait for the ExpectContinueTimeout as expected.

I made one discovery while converting this test. The synctest.Run function blocks until all goroutines in the group have exited. (In the proposed synctest package, Run will panic if all goroutines become blocked (deadlock), but I have not implemented that feature in the test version of the package.) The test was hanging in Run, due to leaking a goroutine. I tracked this down to a missing net.Conn.Close call, which was leaving an HTTP client reading indefinitely from an idle and abandoned server connection.

In this case, Run's behavior caused me some confusion, but ultimately led to the discovery of a real (if fairly minor) bug in the test. (I'd probably have experienced less confusion, but I initially assumed this was a bug in the implementation of Run.)

At one point during this exercise, I accidentally called testing.T.Run from within a synctest.Run group. This results in, at the very best, quite confusing behavior. I think we would want to make it possible to detect when running within a group, and have testing.T.Run panic in this case.

My experimental implementation of the synctest package includes a synctest.Sleep function by necessity: It was much easier to implement with an explicit call to advance the fake clock. However, I found in writing these tests that I often want to sleep and then wait for any timers to finish executing before continuing. I think, therefore, that we should have one additional convenience function:

package synctest
// Sleep pauses the current goroutine for the duration d,
// and then blocks until every goroutine in the current group is idle.
// It is identical to calling time.Sleep(d) followed by Wait.
//
// The caller of Sleep must be in a goroutine created by Run,
// or a goroutine transitively started by Run.
// If it is not, Sleep panics.
func Sleep(d time.Duration) {
	time.Sleep(d)
	Wait()
}

The net/http package was not designed to support testing with a fake clock. This has served as an obstacle to improving the state of the package's tests, many of which are slow, flaky, or both.

Converting net/http to be testable with my experimental version of synctest required a small number of minor changes. A runtime-supported synctest would have required no changes at all to net/http itself.

Converting net/http tests to use synctest required adding an in-memory net.Conn. (I didn't attempt to use net.Pipe, because its fully-synchronous behavior tends to cause problems in tests.) Aside from this, the changes required were very small.

My experiment is in https://go.dev/cl/587657. |
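For readers who want to try the asyncResult helper declared in the comment above, here is a minimal sketch of one way it could be implemented. It is not the code from the CL. The struct fields and the wait method are hypothetical, and the sketch leaves out the synctest.Wait call that the original runAsync makes after starting f, so that it runs with nothing beyond the standard library.

```go
package main

import (
	"errors"
	"fmt"
)

var errStillRunning = errors.New("async op still running")

// asyncResult is the result of an asynchronous operation.
// The fields are hypothetical; the original declaration elides them.
type asyncResult[T any] struct {
	donec chan struct{} // closed when f has returned
	v     T
	err   error
}

// runAsync runs f in a new goroutine and returns an asyncResult
// that is populated with the result of f when it finishes.
// Unlike the original helper, it does not call synctest.Wait.
func runAsync[T any](f func() (T, error)) *asyncResult[T] {
	r := &asyncResult[T]{donec: make(chan struct{})}
	go func() {
		defer close(r.donec)
		r.v, r.err = f()
	}()
	return r
}

// done reports whether the asynchronous operation has finished.
func (r *asyncResult[T]) done() bool {
	select {
	case <-r.donec:
		return true
	default:
		return false
	}
}

// result returns the result of the asynchronous operation.
// It returns errStillRunning if the operation is still running.
func (r *asyncResult[T]) result() (T, error) {
	if !r.done() {
		var zero T
		return zero, errStillRunning
	}
	return r.v, r.err
}

// wait blocks until the operation finishes. It is a hypothetical
// stand-in for the synchronization that synctest.Wait provides.
func (r *asyncResult[T]) wait() { <-r.donec }

func main() {
	release := make(chan struct{})
	r := runAsync(func() (string, error) {
		<-release // stay blocked until the caller releases us
		return "finished", nil
	})
	_, err := r.result()
	fmt.Println(err) // prints "async op still running"

	close(release)
	r.wait()
	v, err := r.result()
	fmt.Println(v, err) // prints "finished <nil>"
}
```

Closing donec after the result fields are written gives readers a happens-before edge, so done and result can be called from the test goroutine without extra locking.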
This proposal has been added to the active column of the proposals project |
Commenting here due to @rsc's request: Relative to my proposal #65336, I have the following concerns:
|
Regarding overriding the The In contrast, we can test code which calls Time is fundamentally different in that there is no way to use real time in a test without making the test flaky and slow. Time is also different from an Since we can't use real time in tests, we can insert a testable wrapper around the In addition, if we define a standard testable wrapper around the clock, we are essentially declaring that all public packages which deal with time should provide a way to plumb in a clock. (Some packages do this already, of course; crypto/tls.Config.Time is an example in That's an option, of course. But it would be a very large change to the Go ecosystem as a whole. |
The pprof.SetGoroutineLabels disagrees.
It doesn't try to hide it, more like tries to restrict people from relying on numbers.
If I understood proposal correctly, it will wait for any goroutine (and recursively) that was started using |
Yes, if you call |
Given that there's more precedent for goroutine identity than I had previously thought, and seeing how However, I'm still a little ambivalent about goroutine groups affecting That being said, I agree that plumbing a time/clock interface through existing code is indeed tedious, and having |
Thanks for doing the experiment. I find the results pretty compelling.
I don't quite understand this function. Given the fake time implementation, if you sleep even a nanosecond past timer expiry, aren't you already guaranteed that those timers will have run because the fake time won't advance to your sleep deadline until everything is blocked again?
Partly I was wondering about nested groups because I've been scheming other things that the concept of a goroutine group could be used for. Though it's true that, even if we have groups for other purposes, it may make sense to say that synctest groups cannot be nested, even if in general groups can be nested. |
You're right that sleeping past the deadline of a timer is sufficient. The It's fairly natural to sleep to the exact instant of a timer, however. If a cache entry expires in some amount of time, it's easy to sleep for that exact amount of time, possibly using the same constant that the cache timeout was initialized with, rather than adding a nanosecond. Adding nanoseconds also adds a small but real amount of confusion to a test in various small ways: The time of logged events drifts off the integer second, rate calculations don't come out as cleanly, and so on. Plus, if you forget to add the necessary adjustment or otherwise accidentally sleep directly onto the instant of a timer's expiry, you get a race condition. Cleaner, I think, for the test code to always resynchronize after poking the system under test. This doesn't have to be a function in the synctest package, of course;
I'm very intrigued! I've just about convinced myself that there's a useful general purpose synchronization API hiding in here, but I'm not sure what it is or what it's useful for. |
For what it's worth, I think it's a good thing that virtual time is included in this, because it makes sure that this package isn't used in production settings. It makes it only suitable for tests (and very suitable). |
It sounds like the API is still:
Damien suggested adding also:
The difference between time.Sleep and synctest.Sleep seems subtle enough that it seems like you should have to spell out the Wait at the call sites where you need it. The only time you really need Wait is if you know someone else is waking up at that very moment. But then if they've both done the Sleep+Wait form then you still have a problem. You really only want some of the call sites (maybe just one) to use the Sleep+Wait form. I suppose that the production code will use time.Sleep since it's not importing synctest, so maybe it's clear that the test harness is the only one that will call Sleep+Wait. On the other hand, fixing a test failure by changing s/time.Sleep/synctest.Sleep/ will be a strange-looking bug fix. Better to have to add synctest.Wait instead. If we really need this, it could be synctest.SleepAndWait but that's what statements are for. Probably too subtle and should just limit the proposal to Run and Wait. |
Some additional suggestions for the description of the
Additionally, for "mutex operation", let's list out the exact operations considered for implementation/testing completeness:
|
The API looks simple and that is excellent. What I am worried about is the unexpected failure modes, leading to undetected regressions, which might need tight support in the testing package to detect. Imagine you unit test your code but are unable to mock out a dependency, maybe due to lack of experience or to the bad design of existing code you have to work with. That dependency suddenly starts calling a syscall (e.g. to lazily tune the library using a sync.Once with a timeout, instead of at init time). Without support in the testing package you will never detect that; your tests will just suddenly time out after an innocent minor dependency update. |
Orthogonally to the previous comment, may I suggest limiting this package to the standard library only, to gather more experience with that approach first? That would also allow us to sketch out integration with the testing package in addition to finding more pitfalls. |
Can you expand more on what you mean by undetected regressions? If the code under test (either directly, or through a dependency) unexpectedly calls a blocking syscall,
What kind of support are you thinking of? |
What does this do?
Does it succeed or panic? It's not clear to me from the API docs because:
This is obviously a degenerate case, but I think it also applies if a test wanted to get the fake time features when testing otherwise non-concurrent code. |
In this case, the goroutine calling |
The hazards of global variables strike again! Thanks Neil. |
As of https://go.dev/cl/645719, the error reporting for bubble deadlocks should be better. When a "deadlock: all goroutines in bubble are blocked" panic occurs, the runtime will now by default print the stack for the panicking goroutine (the one which called |
Once the goroutine started by synctest.Run exits, stop advancing the fake clock in its bubble. This avoids confusing situations where a bubble remains alive indefinitely while a background goroutine reads from a time.Ticker or otherwise advances the clock.

For #67434
Change-Id: Id608ffe3c7d7b07747b56a21f365787fb9a057d7
Reviewed-on: https://go-review.googlesource.com/c/go/+/662155
Reviewed-by: Michael Pratt <mpratt@google.com>
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Auto-Submit: Damien Neil <dneil@google.com>
Suggestion: It could be useful during development to have an env variable that would cause synctest to immediately abort and give the developer an error at test runtime if: a) there is a synctest.Wait() call in progress; and b) a select within a bubble is not durably blocking because it is receiving on a channel from outside the bubble. (I could have used such an env variable to have synctest help me catch my mistake, without needing expert advice). Obviously some "outside to inside the bubble" channel operations might be needed in general, and so should not be verboten, but I think this could be a big help during development. The system then teaches beginners to synctest where they are going wrong. It would allow the dev to convey intent to the synctest runtime, and would allow the synctest runtime in turn to provide helpful assistance in writing the intended test logic. (edit: Or a new test flag that did the same, equivalently. Just like we run with go test -race during testing, we could run with go test -durcheck ). |
Perhaps the stack traces for bubbled goroutines should include an explicit mention of whether the goroutine is durably blocked or not. The goroutine status actually already tells you this, but you need to know an unreasonable amount about synctest internals to know what statuses are durably blocking and which are not. So something like this, where the first goroutine is reading from a non-bubbled channel and the second is reading from a bubbled one:
Then if you have a test that's hanging unexpectedly, you can run it with -timeout=1s (or some other appropriately short interval) and observe which goroutines are keeping Wait from returning. |
Oooh. That would be awesome. I like it. I still think the -durcheck flag is also worth it, hopefully saving a lot of support tickets and confused newcomers. I mean, if you like hand holding dozens of beginners, it's your time... or just say "run with go test -durcheck first". :) edit: Also the problem might occur after running for several minutes; just like a data race condition caught by an extended run on go test -race. Since my simulations are "model checking light", I do expect to be running them overnight (16 hours+). I used my newfound intuition about outside channels preventing durable blocking to construct a "pausable simulation" proof-of-principle: |
I especially like the labeling of the durability in a stack trace because I'm still wondering if my conjecture above is correct, and at least I could spot check it. To extract the essential question from my previous long query: Does the goroutine that does time.Sleep() and then finishes a synctest.Wait() now know that it is the only runnable goroutine, and that all others are durably blocked? I think so, but I've assumed there is only one synctest.Wait() possible at a time. If I assume that, then the Sleep wakes only those goroutines that should be awake at this point in fake time (not blocked on a timer), and the Wait lets every other goroutine durably block before returning to its caller. Yep. There it is, in the docs. That must be why it is guaranteed:
Nevermind(!) I've answered my own question. :) |
Minor feedback: using It's easily worked around, so not sure it's worth polluting the API. Basic example:

package main

import (
	"testing"
	"testing/synctest"
	"time"
)

func TestSystemServer(t *testing.T) {
	synctest.Run(func() {
		ctx := t.Context()
		// This is the "fix"
		// ctx, cancel := context.WithCancel(t.Context())
		// defer cancel()
		ch := make(chan struct{})
		go func() {
			timer := time.NewTimer(time.Second)
			for {
				select {
				case <-ctx.Done():
					return
				case <-timer.C:
					ch <- struct{}{}
				}
			}
		}()
		synctest.Wait()
		<-ch
	})
}

Wrapping in a sub-test is one work-around:

package main

import (
	"testing"
	"testing/synctest"
	"time"
)

func TestSystemServer(t *testing.T) {
	synctest.Run(func() {
		t.Run("child", func(t *testing.T) {
			ctx := t.Context()
			ch := make(chan struct{})
			go func() {
				timer := time.NewTimer(time.Second)
				for {
					select {
					case <-ctx.Done():
						return
					case <-timer.C:
						ch <- struct{}{}
					}
				}
			}()
			synctest.Wait()
			<-ch
		})
	})
} |
Previous discussion on the subject: #67434 (comment) I find the It would be really nice if it were If you had a handle to Also, you could use |
Change https://go.dev/cl/670976 mentions this issue: |
Neil wrote:
Sweet. Thanks for posting the CL, Neil. I was stuck in some slow scenarios, and wondering if it was because of waiting for durable blocking. So I tried to decode the CL and put together this cheat sheet. If anyone else is blocked on decoding the durability hints in the goroutine status... I think it is right, but please double check it, as I could have misunderstood the CL...
Here's a mnemonic for remembering them: "SCONES are durable":
|
Change https://go.dev/cl/671961 mentions this issue: |
Change https://go.dev/cl/671960 mentions this issue: |
Overall, it seems like people are very happy with synctest and it solves a lot of problems. The current API has some rough interactions with the testing package, which we think will be resolved by the fairly small API change in #73567 (comment). We'd put this on hold to gain experience with it and I think that's been very successful. We're ready to take it off hold and consider this along with #73567. |
Based on the discussion above, this proposal seems like a likely accept. The proposal is to add a new package,

// Package synctest provides support for testing concurrent code.
package synctest
// Test executes f in a new goroutine.
//
// The new goroutine and any goroutines transitively started by it form
// an isolated "bubble".
// Test waits for all goroutines in the bubble to exit before returning.
//
// Goroutines in the bubble use a synthetic time implementation.
// The initial time is midnight UTC 2000-01-01.
//
// Time advances when every goroutine in the bubble is blocked.
// For example, a call to time.Sleep will block until all other
// goroutines are blocked and return after the bubble's clock has
// advanced. See [Wait] for the specific definition of blocked.
//
// If every goroutine is blocked and there are no timers scheduled,
// Test panics.
//
// Channels, time.Timers, and time.Tickers created within the bubble
// are associated with it. Operating on a bubbled channel, timer, or ticker
// from outside the bubble panics.
//
// If Test is called from within an existing bubble, it panics.
//
// The [*testing.T] provided to f has the following properties:
//
//   - Functions registered with T.Cleanup run inside the bubble,
//     immediately before Test returns.
//   - The [context.Context] returned by T.Context has a Done channel
//     associated with the bubble.
//   - T.Run may not be called within the bubble.
func Test(t *testing.T, f func(*testing.T))
// Wait blocks until every goroutine within the current bubble,
// other than the current goroutine, is durably blocked.
// It panics if called from a non-bubbled goroutine,
// or if two goroutines in the same bubble call Wait at the same time.
//
// A goroutine is durably blocked if it can only be unblocked by another
// goroutine in its bubble. The following operations durably block
// a goroutine:
//   - a send or receive on a channel from within the bubble
//   - a select statement where every case is a channel within the bubble
//   - sync.Cond.Wait
//   - time.Sleep
//
// A goroutine executing a system call or waiting for an external event
// such as a network operation is not durably blocked.
// For example, a goroutine blocked reading from a network connection
// is not durably blocked even if no data is currently available on the
// connection, because it may be unblocked by data written from outside
// the bubble or may be in the process of receiving data from a kernel
// network buffer.
//
// A goroutine is not durably blocked when blocked on a send or receive
// on a channel that was not created within its bubble, because it may
// be unblocked by a channel receive or send from outside its bubble.
func Wait()

This API comes from #73567 (comment), except that I changed The existing |
Comment: The proposal text above mixes "blocked" (in the Test doc) and "durably blocked" (in the Wait doc) which is slightly confusing. Is this distinction intended? If so, it is worth explaining why blocked but not durably blocked suffices in the Test case, as such care went into explaining it for the Wait case. Otherwise it would be clearer to have the adjective used consistently, either just "blocked", or just "durably blocked", in both cases. I like it with durable in both, but only if that is accurate, of course. |
I encountered unexpected and worrying flakiness in the stdlib when run under synctest. Here's a simple test that randomly panics:

package flaky_test

import (
	"encoding/json"
	"testing"
	"testing/synctest"
)

func TestFlaky(t *testing.T) {
	t.Parallel()
	for range 100 {
		t.Run("", func(t *testing.T) {
			t.Parallel()
			flakyScenario(t)
		})
	}
}

func flakyScenario(t *testing.T) {
	synctest.Run(func() {
		_, err := json.Marshal([]string{})
		if err != nil {
			t.Fatalf("failed to serialize: %v", err)
		}
	})
}

When run in a loop (ex: This is boiled down from one of our real test cases. It fails faster under load or when you increase the parallelism, and we reproduced it several times in our CI because of this. I haven't deeply analyzed the cause, but I see that the This issue was a total surprise to me, and took a while to debug: the actual test in our codebase also uses I'm not sure this is an actual bug of the There may be other seemingly benign algorithms in the stdlib that fail unexpectedly under synctest, which will create debugging headaches. I think those should be adapted to be stable under synctest (ideally) or documented as unsafe (heavily reducing the valid use cases for synctest). |
To be clear, I love synctest and I definitely want it to be accepted. This case is extracted from a heavily asynchronous codebase, which has several agents acting on tickers and communicating through channels. Our previous testing approach relied heavily on carefully designed mocks that yield values very precisely to lockstep the tests, and careful setup of timeouts.
It has also helped me catch some goroutine leaks on cleanup that were asymptomatic but still unexpected. It feels like strict-mode goroutines, which is perfect to ensure correctness. But it may be hard to use if there are such pitfalls in the stdlib 😕 |
I can't reproduce the issue when using |
Thanks for the report, @mgarstecki! I was able to reproduce the problem with your example. The problem is that When multiple synctest bubbles access the cache at the same time, a goroutine in one bubble can become blocked on a goroutine in the other bubble. Since we don't track whether the I'm not certain yet what the right fix is. |
I wouldn't be offended if t.Parallel() were forbidden somehow when using synctest. At best, it is unclear what parallel tests would mean for synctest's manipulation of time. |
Each bubble has its own independent clock. Barring some form of cross-bubble communication (which is what's happening here), parallel tests should just work. |
Change https://go.dev/cl/673335 mentions this issue: |
That would be sad, because parallel fits well with synctest: both features combine to speed up the tests on a large codebase, and it would be a pity to have to choose between them. But it modifies the behavior of goroutines, so I'm not surprised to find incompatible algorithms in existing code. I do wonder what should be done if further cases are found in the stdlib. We have a promising fix for If such cases are found in the stdlib, should they be documented somewhere as incompatible with synctest? |
That's fair. I have different reasons for liking synctest. To me, the appeal is test reliability. Where there is tension between test speed and reliability, I will choose reliability, but my test suites aren't huge. Perhaps using t.Parallel should mean accepting some reliability risk (because of global synchronization structures) with a promise of speed, and the Go team should work on issues as they are detected and reported; users can fall back to removing t.Parallel when they find a problem. That's not completely satisfying, but maybe it is pragmatic? |
Current proposal status: #67434 (comment)
This is a proposal for a new package to aid in testing concurrent code.
This package has two main features:
As an example, let us say we are testing an expiring concurrent cache:
A naive test for this cache might look something like this:
This test has a couple problems. It's slow, taking four seconds to execute. And it's flaky, because it assumes the cache entry will not have expired one second before its deadline and will have expired one second after. While computers are fast, it is not uncommon for an overloaded CI system to pause execution of a program for longer than a second.
We can make the test less flaky by making it slower, or we can make the test faster at the expense of making it flakier, but we can't make it fast and reliable using this approach.
We can design our Cache type to be more testable. We can inject a fake clock to give us control over time in tests. When advancing the fake clock, we will need some mechanism to ensure that any timers that fire have executed before progressing the test. These changes come at the expense of additional code complexity: We can no longer use time.Timer, but must use a testable wrapper. Background goroutines need additional synchronization points.
The synctest package simplifies all of this. Using synctest, we can write:
This is identical to the naive test above, wrapped in synctest.Run and with the addition of two calls to synctest.Wait. However:
A limitation of the synctest.Wait function is that it does not recognize goroutines blocked on network or other I/O operations as idle. While the scheduler can identify a goroutine blocked on I/O, it cannot distinguish between a goroutine that is genuinely blocked and one which is about to receive data from a kernel network buffer. For example, if a test creates a loopback TCP connection, starts a goroutine reading from one side of the connection, and then writes to the other, the read goroutine may remain in I/O wait for a brief time before the kernel indicates that the connection has become readable. If synctest.Wait considered a goroutine in I/O wait to be idle, this would cause nondeterminism in cases such as this.
Tests which use synctest with network connections or other external data sources should use a fake implementation with deterministic behavior. For net.Conn, net.Pipe can create a suitable in-memory connection.
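As a minimal sketch of what an in-memory connection looks like in practice, the following uses net.Pipe from the standard library with no real sockets involved. The upperOnce and roundTrip helpers are hypothetical names made up for this example. net.Pipe is unbuffered, so each Write blocks until the other end reads, which is why the fake server runs in its own goroutine. The sketch assumes a non-empty message.

```go
package main

import (
	"bytes"
	"fmt"
	"io"
	"net"
)

// upperOnce is a hypothetical stand-in for a server handler.
// It reads exactly n bytes from conn, writes them back in upper
// case, and closes the connection.
func upperOnce(conn net.Conn, n int) error {
	defer conn.Close()
	buf := make([]byte, n)
	if _, err := io.ReadFull(conn, buf); err != nil {
		return err
	}
	_, err := conn.Write(bytes.ToUpper(buf))
	return err
}

// roundTrip sends msg over an in-memory connection and returns
// the handler's reply. msg is assumed to be non-empty.
func roundTrip(msg string) (string, error) {
	client, server := net.Pipe()
	defer client.Close()

	errc := make(chan error, 1)
	go func() { errc <- upperOnce(server, len(msg)) }()

	// Write blocks until the handler has read the whole message.
	if _, err := client.Write([]byte(msg)); err != nil {
		return "", err
	}
	reply := make([]byte, len(msg))
	if _, err := io.ReadFull(client, reply); err != nil {
		return "", err
	}
	if err := <-errc; err != nil {
		return "", err
	}
	return string(reply), nil
}

func main() {
	reply, err := roundTrip("ping")
	fmt.Println(reply, err) // prints "PING <nil>"
}
```

Because both ends communicate through in-process synchronization rather than the kernel, there is no window in which data is "in flight" and invisible to the scheduler. The same unbuffered behavior can be awkward for code that expects writes to complete without a concurrent reader, so a buffered in-memory net.Conn may suit some tests better.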
This proposal is based in part on experience with tests in the golang.org/x/net/http2 package. Tests of an HTTP client or server often involve multiple interacting goroutines and timers. For example, a client request may involve goroutines writing to the server, reading from the server, and reading from the request body; as well as timers covering various stages of the request process. The combination of fake clocks and an operation which waits for all goroutines in the test to stabilize has proven effective.
@gabyhelp's overview of this issue: #67434 (comment)