Data race in broker v1.36.0 #2320
Comments
v1.37.0 has the same problem.
Anyone? This is an upgrade blocker for me.
I have the very same issue with 1.37.2.
Another example in 1.37.2
I'm also interested in the status of this.
This was introduced in 5b04c98
A race condition was introduced in 5b04c98 (feat(metrics): track consumer-fetch-response-size) when passing the metric registry around to get additional metrics. Notably, `handleResponsePromise()` could access the registry after the broker has been closed and is tentatively being reopened. This triggers a data race because `b.metricRegistry` is set during `Open()` (as it is part of the configuration).

We fix this by reverting the addition of `handleResponsePromise()` as a method on `Broker`. Instead, we provide it with the metric registry as an argument. An alternative would have been to fetch the metric registry before the `select` call. However, removing it as a method makes it clearer that this function is not allowed to access the broker internals, as they are not protected by the lock and the broker may not be alive any more.

All the following calls to `b.metricRegistry` are done while the lock is held:

- inside `Open()`, the lock is held, including inside the goroutine
- inside `Close()`, the lock is held
- `AsyncProduce()` has a contract that it must be called while the broker is open; we keep a copy of the metric registry to use inside the callback
- `sendInternal()` has a contract that the lock should be held
- `authenticateViaSASLv1()` is called from `Open()` and `sendWithPromise()`, both of which hold the lock
- `sendAndReceiveSASLHandshake()` is called from `authenticateViaSASLv0/v1()`, which are called from `Open()` and `sendWithPromise()`

I am unsure about `responseReceiver()`; however, it also calls `b.readFull()`, which accesses `b.conn`, so I suppose it is safe.

This leaves `sendAndReceive()`, which calls `send()`, which calls `sendWithPromise()`, which takes the lock. We move the lock to `sendAndReceive()` instead. `send()` is only called from `sendAndReceive()`, and we take the lock for `sendWithPromise()`'s other caller.

The test was taken from #2393 by @samuelhewitt. #2393 is an alternative proposal using a RW lock to protect `b.metricRegistry`.

Fix #2320
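For readers not familiar with the patch, here is a minimal sketch of the pattern it describes, using simplified stand-in types rather than Sarama's actual ones: the registry is read once while the broker lock is held and then passed to a plain function, so the goroutine handling the response never touches broker fields that `Open()` may rewrite.

```go
// Sketch only: simplified stand-ins for Sarama's types, illustrating the
// "snapshot under the lock, then pass as an argument" pattern from the patch.
package main

import "sync"

type registry struct{} // stand-in for the metrics registry

type broker struct {
	lock           sync.Mutex
	metricRegistry *registry
}

// After the fix, response handling is a plain function: it receives the
// registry as an argument and cannot reach broker internals at all.
func handleResponsePromise(r *registry) {
	_ = r // safe: local copy, unaffected by a later Open() or Close()
}

func (b *broker) sendWithPromise(wg *sync.WaitGroup) {
	b.lock.Lock()
	r := b.metricRegistry // read the field only while the lock is held
	b.lock.Unlock()

	wg.Add(1)
	go func() {
		defer wg.Done()
		handleResponsePromise(r)
	}()
}

func main() {
	b := &broker{metricRegistry: &registry{}}
	var wg sync.WaitGroup
	b.sendWithPromise(&wg)
	wg.Wait()
}
```

Making it a free function instead of a method is what enforces the rule stated in the commit message: the handler simply has no way to read fields that are only valid while the lock is held.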
The issue still remains in 1.38.1. I still encounter:

WARNING: DATA RACE
Previous write at 0x00c0000c2070 by goroutine 146:
Goroutine 152 (running) created at:
Goroutine 146 (running) created at:

Should I file a separate issue for this?
On failure, `Broker.Open()` can be called several times while a producer is running. In #2409, it was assumed that `AsyncProduce()` can only be called with an open broker; however, it should be read as: the user should call it after opening the broker. The broker could be disconnected and in the process of being reconnected. This is not hard to fix, as we already have a lock protecting the creation of the registry: just don't create a new metric registry when attempting to reopen the broker.

Fix #2320 (again)
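Again as a sketch with the same simplified stand-in types as above (not the real Sarama code), this second fix amounts to making registry creation idempotent under the lock that `Open()` already takes:

```go
// Sketch only: reuse the existing registry on reopen instead of overwriting it.
func (b *broker) open() error {
	b.lock.Lock()
	defer b.lock.Unlock()

	if b.metricRegistry == nil {
		// First Open(): create the registry once.
		b.metricRegistry = &registry{}
	}
	// A reopen after a connection failure keeps the existing registry, so no
	// goroutine ever observes the field being rewritten mid-flight.
	return nil
}
```

Since readers only ever see the value written by the first `Open()`, the field behaves as write-once and needs no extra synchronization on the read side.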
@stuart-byma Can you try the fix in #2428?
#2428 resolves the problem, thx!
I'm on v1.40.1 (which I believe has the fix) but I'm still seeing the issue. Is there something in my code I should look for that might be causing this?
Also, why isn't some kind of mutex around that block of code needed? I would think we'd need to lock things if one thread is trying to read a field and another thread is trying to write it. The "if" doesn't prevent concurrent read/write, or two writes if a 2nd thread sneaks in between the "if" and the assignment.
A mutex specifically for the registry would kill the performance. From my understanding, it is not needed. The registry is only created on first use (protected by a lock), and after that it is never modified. It seems that a broker can be used without calling Open() first; I don't know how this could happen.
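As a hedged aside (this is not Sarama's code, just the general pattern): the "create once under a lock, then treat as read-only" guarantee described above is the same one `sync.Once` provides, and it avoids the naked check-then-set that the previous comment worries about without taking a mutex on every read:

```go
// General pattern only, not the library's implementation: initialize a shared
// field exactly once, then treat it as read-only so reads need no lock.
package main

import (
	"fmt"
	"sync"
)

type registry struct{ name string }

type broker struct {
	once           sync.Once
	metricRegistry *registry
}

// ensureRegistry is safe to call from any goroutine: sync.Once guarantees the
// initializing write happens at most once and is visible to every caller.
func (b *broker) ensureRegistry() *registry {
	b.once.Do(func() {
		b.metricRegistry = &registry{name: "broker-metrics"}
	})
	return b.metricRegistry
}

func main() {
	b := &broker{}
	var wg sync.WaitGroup
	for i := 0; i < 4; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_ = b.ensureRegistry() // concurrent first use is race-free
		}()
	}
	wg.Wait()
	fmt.Println(b.ensureRegistry().name)
}
```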
Versions
Configuration
Sample Test code
Logs
Problem Description
A unit test that confirms a ConsumerGroup wrapper successfully cancels after a context timeout began failing due to a data race when moving from v1.35.0 to v1.36.0.

In broker.go there are several reads of `b.conf.MetricRegistry` (the test in question fails in `handleResponsePromise`), and none of them are guarded by a lock. `Broker.Open()` writes to `b.conf`, which creates the race condition.