Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reinstate read-only lock on hooks access in dialHook to fix data race #3225

Merged
merged 3 commits into from
Feb 11, 2025

Conversation

LINKIWI
Copy link
Contributor

@LINKIWI LINKIWI commented Jan 11, 2025

Previously, in #3088, I removed the mutex guarding the implementation of dialHook in order to resolve an unbounded contention failure mode, that had the potential to backpressure commands indefinitely during periods of server downtime.

However, this introduced a data race regression, which was the original motivation of introducing the lock, in #2814.

A minimal reproduction is as follows:

package main

import (
	"context"
	"fmt"

	"github.com/redis/go-redis/v9"
)

type h struct{}

func (h *h) DialHook(next redis.DialHook) redis.DialHook {
	return next
}

func (h *h) ProcessHook(next redis.ProcessHook) redis.ProcessHook {
	return next
}

func (h *h) ProcessPipelineHook(next redis.ProcessPipelineHook) redis.ProcessPipelineHook {
	return next
}

func exec() {
	ctx := context.Background()
	opts := &redis.Options{
		MinIdleConns: 5,
	}

	client := redis.NewClient(opts)
	client.AddHook(&h{})

	fmt.Println(client.Ping(ctx))
}

func main() {
	exec()
}
package main

import (
	"testing"
)

func TestExec(t *testing.T) {
	exec()
}
$ go test -v -race -count=1 ./cmd/...
=== RUN   TestExec
==================
WARNING: DATA RACE
Write at 0x00c0000f6130 by goroutine 8:
  github.com/redis/go-redis/v9.(*hooksMixin).chain()
      /home/kiwi/sync/code/external/go-redis/redis.go:126 +0x128
  github.com/redis/go-redis/v9.(*hooksMixin).AddHook()
      /home/kiwi/sync/code/external/go-redis/redis.go:117 +0x1b1
  github.com/redis/go-redis/v9/cmd.exec()
      /home/kiwi/sync/code/external/go-redis/cmd/main.go:31 +0xaa
  github.com/redis/go-redis/v9/cmd.TestExec()
      /home/kiwi/sync/code/external/go-redis/cmd/main_test.go:8 +0x1c
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/lib/go/src/testing/testing.go:1743 +0x44

Previous read at 0x00c0000f6130 by goroutine 9:
  github.com/redis/go-redis/v9.(*hooksMixin).dialHook()
      /home/kiwi/sync/code/external/go-redis/redis.go:183 +0x8c
  github.com/redis/go-redis/v9.(*hooksMixin).dialHook-fm()
      <autogenerated>:1 +0x8f
  github.com/redis/go-redis/v9.newConnPool.func1()
      /home/kiwi/sync/code/external/go-redis/options.go:516 +0x9a
  github.com/redis/go-redis/v9/internal/pool.(*ConnPool).dialConn()
      /home/kiwi/sync/code/external/go-redis/internal/pool/pool.go:213 +0x16c
  github.com/redis/go-redis/v9/internal/pool.(*ConnPool).addIdleConn()
      /home/kiwi/sync/code/external/go-redis/internal/pool/pool.go:143 +0x54
  github.com/redis/go-redis/v9/internal/pool.(*ConnPool).checkMinIdleConns.func1()
      /home/kiwi/sync/code/external/go-redis/internal/pool/pool.go:126 +0x2e

Goroutine 8 (running) created at:
  testing.(*T).Run()
      /usr/lib/go/src/testing/testing.go:1743 +0x825
  testing.runTests.func1()
      /usr/lib/go/src/testing/testing.go:2168 +0x85
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1690 +0x226
  testing.runTests()
      /usr/lib/go/src/testing/testing.go:2166 +0x8be
  testing.(*M).Run()
      /usr/lib/go/src/testing/testing.go:2034 +0xf17
  main.main()
      _testmain.go:45 +0x164

Goroutine 9 (running) created at:
  github.com/redis/go-redis/v9/internal/pool.(*ConnPool).checkMinIdleConns()
      /home/kiwi/sync/code/external/go-redis/internal/pool/pool.go:125 +0x6d
  github.com/redis/go-redis/v9/internal/pool.NewConnPool()
      /home/kiwi/sync/code/external/go-redis/internal/pool/pool.go:109 +0x25e
  github.com/redis/go-redis/v9.newConnPool()
      /home/kiwi/sync/code/external/go-redis/options.go:514 +0x35d
  github.com/redis/go-redis/v9.NewClient()
      /home/kiwi/sync/code/external/go-redis/redis.go:665 +0x208
  github.com/redis/go-redis/v9/cmd.exec()
      /home/kiwi/sync/code/external/go-redis/cmd/main.go:30 +0xa4
  github.com/redis/go-redis/v9/cmd.TestExec()
      /home/kiwi/sync/code/external/go-redis/cmd/main_test.go:8 +0x1c
  testing.tRunner()
      /usr/lib/go/src/testing/testing.go:1690 +0x226
  testing.(*T).Run.gowrap1()
      /usr/lib/go/src/testing/testing.go:1743 +0x44
==================
ping: PONG
    testing.go:1399: race detected during execution of test
--- FAIL: TestExec (0.00s)
FAIL
FAIL	github.com/redis/go-redis/v9/cmd	0.009s
FAIL

This race is caused by concurrent access to hs.current when the connection pool executes dialHook in the background (when MinIdleConns > 0) while AddHook also mutates hs.current. However, within dialHook, only read access is required. This PR proposes fixing this by changing the mutex to a sync.RWMutex and guarding only the access to hs.current with the lock, which both solves the data race and does not regress the connection contention unit test introduced in #3088.

With this patch, the example test above passes with the race detector enabled:

$ go test -v -race -count=1 ./cmd/...
=== RUN   TestExec
ping: PONG
--- PASS: TestExec (0.00s)
PASS
ok  	github.com/redis/go-redis/v9/cmd	1.010s

@LINKIWI
Copy link
Contributor Author

LINKIWI commented Jan 15, 2025

@ofekshenawa Would you be able to help with the review on this one? This is a follow up to #3088. Thanks.

@ndyakov
Copy link
Collaborator

ndyakov commented Feb 3, 2025

@LINKIWI would you mind adding a test for that race condition?

@LINKIWI
Copy link
Contributor Author

LINKIWI commented Feb 3, 2025

@LINKIWI would you mind adding a test for that race condition?

It is actually already covered by https://github.com/redis/go-redis/blame/master/redis_test.go#L588C29-L588C29. I think this test did not regress in the original commit since the behavior is inherently racy.

I had spent some time trying to come up with a more robust test, but was not able to find a reasonable solution; if you have any suggestions I'm happy to try them.

@ndyakov
Copy link
Collaborator

ndyakov commented Feb 10, 2025

@LINKIWI I will see if I can came up with a better test to detect this. But overall, if the description of #3088 is correct, let's try to address this. We can have a timeout for acquiring the lock? Overall since there is a context, we can use the context as well. I am not that familiar with the issue you were facing and will need some time to get myself familiar with it. Do you think there is a clear way to reproduce the issue described in #3088 ?

@LINKIWI
Copy link
Contributor Author

LINKIWI commented Feb 10, 2025

@LINKIWI I will see if I can came up with a better test to detect this. But overall, if the description of #3088 is correct, let's try to address this. We can have a timeout for acquiring the lock? Overall since there is a context, we can use the context as well. I am not that familiar with the issue you were facing and will need some time to get myself familiar with it. Do you think there is a clear way to reproduce the issue described in #3088 ?

The test added in #3088 should be an effective regression test for the issue, and you can use that test as a reproduction example.

A timeout on lock acquisition will still create lock contention conditions. Under load, one would observe the same symptoms as in #3088 with that approach. The fundamental issue is that we shouldn't be trying to acquire a write lock at all, since we only need read-only access in the dial hook; we can solve both the racy access and dial-time lock contention with the patch proposed in this PR.

@ndyakov ndyakov self-requested a review February 11, 2025 15:35
@ndyakov ndyakov merged commit 9db1286 into redis:master Feb 11, 2025
15 checks passed
@ndyakov ndyakov mentioned this pull request Feb 21, 2025
ndyakov added a commit that referenced this pull request Feb 21, 2025
* Add guidance on unstable RESP3 support for RediSearch commands to README (#3177)

* Add UnstableResp3 to docs

* Add RawVal and RawResult to wordlist

* Explain more about SetVal

* Add UnstableResp to wordlist

* Eliminate redundant dial mutex causing unbounded connection queue contention (#3088)

* Eliminate redundant dial mutex causing unbounded connection queue contention

* Dialer connection timeouts unit test

---------

Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>

* SortByWithCount FTSearchOptions fix (#3201)

* SortByWithCount FTSearchOptions fix

* FTSearch test fix

* Another FTSearch test fix

* Another FTSearch test fix

---------

Co-authored-by: Christopher Golling <Chris.Golling@aexp.com>

* Fix race condition in clusterNodes.Addrs() (#3219)

Resolve a race condition in the clusterNodes.Addrs() method.
Previously, the method returned a reference to a string slice, creating
the potential for concurrent reads by the caller while the slice was
being modified by the garbage collection process.

Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>

* chore: fix some comments (#3226)

Signed-off-by: zhuhaicity <zhuhai@52it.net>
Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>

* fix(aggregate, search): ft.aggregate bugfixes (#3263)

* fix: rearange args for ft.aggregate

apply should be before any groupby or sortby

* improve test

* wip: add scorer and addscores

* enable all tests

* fix ftsearch with count test

* make linter happy

* Addscores is available in later redisearch releases.

For safety state it is available in redis ce 8

* load an apply seem to break scorer and addscores

* fix: add unstableresp3 to cluster client (#3266)

* fix: add unstableresp3 to cluster client

* propagate unstableresp3

* proper test that will ignore error, but fail if client panics

* add separate test for clusterclient constructor

* fix: flaky ClientKillByFilter test (#3268)

* Reinstate read-only lock on hooks access in dialHook (#3225)

* use limit when limitoffset is zero (#3275)

* remove redis 8 comments

* update package versions

* use latest golangci-lint

* fix(search&aggregate):fix error overwrite and typo  #3220 (#3224)

* fix (#3220)

* LOAD has NO AS param(https://redis.io/docs/latest/commands/ft.aggregate/)

* fix typo: WITHCOUT -> WITHCOUNT

* fix (#3220):

    * Compatible with known RediSearch issue in test

* fix (#3220)

    * fixed the calculation bug of the count of load params

* test should not include special condition

* return errors when they occur

---------

Co-authored-by: Nedyalko Dyakov <nedyalko.dyakov@gmail.com>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>

* Recognize byte slice for key argument in cluster client hash slot computation (#3049)

Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>

---------

Signed-off-by: zhuhaicity <zhuhai@52it.net>
Co-authored-by: ofekshenawa <104765379+ofekshenawa@users.noreply.github.com>
Co-authored-by: LINKIWI <LINKIWI@users.noreply.github.com>
Co-authored-by: Cgol9 <chris.golling@verizon.net>
Co-authored-by: Christopher Golling <Chris.Golling@aexp.com>
Co-authored-by: Shawn Wang <62313353+shawnwgit@users.noreply.github.com>
Co-authored-by: ZhuHaiCheng <zhuhai@52it.net>
Co-authored-by: herodot <54836727+bitsark@users.noreply.github.com>
Co-authored-by: Vladyslav Vildanov <117659936+vladvildanov@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants