Reinstate read-only lock on hooks access in dialHook to fix data race #3225
@ofekshenawa Would you be able to help with the review on this one? This is a follow up to #3088. Thanks.
@LINKIWI would you mind adding a test for that race condition?
It is actually already covered by https://github.com/redis/go-redis/blame/master/redis_test.go#L588C29-L588C29. I think this test did not regress in the original commit since the behavior is inherently racy. I had spent some time trying to come up with a more robust test, but was not able to find a reasonable solution; if you have any suggestions I'm happy to try them.
@LINKIWI I will see if I can come up with a better test to detect this. But overall, if the description of #3088 is correct, let's try to address this. Could we have a timeout for acquiring the lock? Since there is a context, we could use the context as well. I am not that familiar with the issue you were facing and will need some time to get familiar with it. Do you think there is a clear way to reproduce the issue described in #3088?
The test added in #3088 should be an effective regression test for the issue, and you can use that test as a reproduction example. A timeout on lock acquisition will still create lock contention conditions. Under load, one would observe the same symptoms as in #3088 with that approach. The fundamental issue is that we shouldn't be trying to acquire a write lock at all, since we only need read-only access in the dial hook; we can solve both the racy access and dial-time lock contention with the patch proposed in this PR.
* Add guidance on unstable RESP3 support for RediSearch commands to README (#3177)
* Add UnstableResp3 to docs
* Add RawVal and RawResult to wordlist
* Explain more about SetVal
* Add UnstableResp to wordlist
* Eliminate redundant dial mutex causing unbounded connection queue contention (#3088)
* Eliminate redundant dial mutex causing unbounded connection queue contention
* Dialer connection timeouts unit test
---------
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
* SortByWithCount FTSearchOptions fix (#3201)
* SortByWithCount FTSearchOptions fix
* FTSearch test fix
* Another FTSearch test fix
* Another FTSearch test fix
---------
Co-authored-by: Christopher Golling <Chris.Golling@aexp.com>
* Fix race condition in clusterNodes.Addrs() (#3219)
Resolve a race condition in the clusterNodes.Addrs() method. Previously, the method returned a reference to a string slice, creating the potential for concurrent reads by the caller while the slice was being modified by the garbage collection process.
Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>
* chore: fix some comments (#3226)
Signed-off-by: zhuhaicity <zhuhai@52it.net>
Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>
* fix(aggregate, search): ft.aggregate bugfixes (#3263)
* fix: rearange args for ft.aggregate apply should be before any groupby or sortby
* improve test
* wip: add scorer and addscores
* enable all tests
* fix ftsearch with count test
* make linter happy
* Addscores is available in later redisearch releases. For safety state it is available in redis ce 8
* load an apply seem to break scorer and addscores
* fix: add unstableresp3 to cluster client (#3266)
* fix: add unstableresp3 to cluster client
* propagate unstableresp3
* proper test that will ignore error, but fail if client panics
* add separate test for clusterclient constructor
* fix: flaky ClientKillByFilter test (#3268)
* Reinstate read-only lock on hooks access in dialHook (#3225)
* use limit when limitoffset is zero (#3275)
* remove redis 8 comments
* update package versions
* use latest golangci-lint
* fix(search&aggregate):fix error overwrite and typo #3220 (#3224)
* fix (#3220)
* LOAD has NO AS param(https://redis.io/docs/latest/commands/ft.aggregate/)
* fix typo: WITHCOUT -> WITHCOUNT
* fix (#3220):
* Compatible with known RediSearch issue in test
* fix (#3220)
* fixed the calculation bug of the count of load params
* test should not include special condition
* return errors when they occur
---------
Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
* Recognize byte slice for key argument in cluster client hash slot computation (#3049)
Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
---------
Signed-off-by: zhuhaicity <zhuhai@52it.net>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
Co-authored-by: LINKIWI <LINKIWI@users.noreply.github.com>
Co-authored-by: Cgol9 <chris.golling@verizon.net>
Co-authored-by: Christopher Golling <Chris.Golling@aexp.com>
Co-authored-by: Shawn Wang <62313353+shawnwgit@users.noreply.github.com>
Co-authored-by: ZhuHaiCheng <zhuhai@52it.net>
Co-authored-by: herodot <54836727+bitsark@users.noreply.github.com>
Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com>
Previously, in #3088, I removed the mutex guarding the implementation of `dialHook` in order to resolve an unbounded contention failure mode that had the potential to backpressure commands indefinitely during periods of server downtime. However, this reintroduced the data race that the lock was originally added to prevent in #2814.
A minimal reproduction is as follows:
This race is caused by concurrent access to `hs.current` when the connection pool executes `dialHook` in the background (when `MinIdleConns > 0`) while `AddHook` also mutates `hs.current`. However, within `dialHook`, only read access is required. This PR proposes fixing this by changing the mutex to a `sync.RWMutex` and guarding only the access to `hs.current` with the lock, which both solves the data race and does not regress the connection contention unit test introduced in #3088.

With this patch, the example test above passes with the race detector enabled:
    $ go test -v -race -count=1 ./cmd/...
    === RUN TestExec
    ping: PONG
    --- PASS: TestExec (0.00s)
    PASS
    ok github.com/redis/go-redis/v9/cmd 1.010s
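
For readers unfamiliar with the pattern, the locking change described above can be sketched in isolation. This is a minimal illustration, not the go-redis implementation: the `hooks` type, `hookFunc`, `newHooks`, and the `dial` method are hypothetical stand-ins, and only `sync.RWMutex` is taken from the description. The writer (`AddHook`) takes the exclusive lock, while the dial path takes the read lock just long enough to copy the current hook, so concurrent dials do not serialize behind each other.

```go
package main

import (
	"fmt"
	"sync"
)

// hookFunc is a hypothetical stand-in for a dial hook.
type hookFunc func(addr string) string

// hooks is a simplified, hypothetical model of a hook container.
// It is not the actual go-redis type.
type hooks struct {
	mu      sync.RWMutex
	current hookFunc
}

// newHooks returns a container whose initial hook just reports the address.
func newHooks() *hooks {
	return &hooks{current: func(addr string) string { return "dial " + addr }}
}

// AddHook mutates h.current, so it needs the exclusive (write) lock.
func (h *hooks) AddHook(wrap func(next hookFunc) hookFunc) {
	h.mu.Lock()
	defer h.mu.Unlock()
	h.current = wrap(h.current)
}

// dial only reads h.current. It holds the read lock while copying the
// function value and releases it before invoking the hook, so a slow
// dial never blocks other dials or a pending AddHook.
func (h *hooks) dial(addr string) string {
	h.mu.RLock()
	fn := h.current
	h.mu.RUnlock()
	return fn(addr)
}

func main() {
	h := newHooks()
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(2)
		// Background dials, as a pool might issue to keep idle connections.
		go func() {
			defer wg.Done()
			h.dial("localhost:6379")
		}()
		// Concurrent hook registration.
		go func() {
			defer wg.Done()
			h.AddHook(func(next hookFunc) hookFunc {
				return func(addr string) string { return "hooked(" + next(addr) + ")" }
			})
		}()
	}
	wg.Wait()
	fmt.Println(h.dial("localhost:6379"))
}
```

Running this sketch with `go run -race` should report no races; removing the `RLock`/`RUnlock` pair in `dial` is expected to make the race detector flag the unsynchronized read of `h.current`.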