-
Notifications
You must be signed in to change notification settings - Fork 1.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(metrics): fix race when accessing metric registry #2409
Conversation
A race condition was introduced in 5b04c98 (feat(metrics): track consumer-fetch-response-size) when passing the metric registry around to get additional metrics. Notably, `handleResponsePromise()` could access the registry after the broker has been closed and is tentatively being reopened. This triggers a data race because `b.metricRegistry` is being set during `Open()` (as it is part of the configuration). We fix this by reverting the addition of `handleResponsePromise()` as a method to `Broker`. Instead, we provide it with the metric registry as an argument. An alternative would have been to get the metric registry before the `select` call. However, removing it as a method make it clearer than this function is not allowed to access the broker internals as they are not protected by the lock and the broker may not be alive any more. All the following calls to `b.metricRegistry` are done while the lock is held: - inside `Open()`, the lock is held, including inside the goroutine - inside `Close()`, the lock is held - `AsyncProduce()` has a contract that it must be called while the broker is open, we keep a copy of the metric registry to use inside the callback - `sendInternal()`, has a contract that the lock should be held - `authenticateViaSASLv1()` is called from `Open()` and `sendWithPromise()`, both of them holding the lock - `sendAndReceiveSASLHandshake()` is called from - `authenticateViaSASLv0/v1()`, which are called from `Open()` and `sendWithPromise()` I am unsure about `responseReceiver()`, however, it is also calling `b.readFull()` which accesses `b.conn`, so I suppose it is safe. This leaves `sendAndReceive()` which is calling `send()`, which is calling `sendWithPromise()` which puts a lock. We move the lock to `sendAndReceive()` instead. `send()` is only called from `sendAndReceiver()` and we put a lock for `sendWithPromise()` other caller. The test has been stolen from IBM#2393 from @samuelhewitt. IBM#2393 is an alternative proposal using a RW lock to protect `b.metricRegistry`. Fix IBM#2320
@dnwe This seems like a more appropriate fix to #2320, I like the classification of If everyone agrees, would be great to see this merged and I can close PR #2393 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @vincentbernat — great analysis and fix 👍🏻
@samuelhewitt yep agreed, looks like this PR should cover the necessary changes — thanks for all your work on this as well |
On failure, `Broker.Open()` can be called several times while a producer is running. In IBM#2409, it was assumed that AsyncProduce can only be called with an open broker, however, it should be read that a user should call it after opening the broker. The broker could be disconnected and in progress of being reconnected. This is not hard to fix as we already have a lock protecting the creation of the registry: just don't create a new metric registry when attempting to reopen the broker. Fix IBM#2320 (again)
) On failure, `Broker.Open()` can be called several times while a producer is running. In #2409, it was assumed that AsyncProduce can only be called with an open broker, however, it should be read that a user should call it after opening the broker. The broker could be disconnected and in progress of being reconnected. This is not hard to fix as we already have a lock protecting the creation of the registry: just don't create a new metric registry when attempting to reopen the broker. Fix #2320 (again)
A race condition was introduced in
5b04c98 (feat(metrics): track consumer-fetch-response-size) when passing the metric registry around to get additional metrics. Notably,
handleResponsePromise()
could access the registry after the broker has been closed and is tentatively being reopened. This triggers a data race becauseb.metricRegistry
is being set duringOpen()
(as it is part of the configuration).We fix this by reverting the addition of
handleResponsePromise()
as a method toBroker
. Instead, we provide it with the metric registry as an argument. An alternative would have been to get the metric registry before theselect
call. However, removing it as a method make it clearer than this function is not allowed to access the broker internals as they are not protected by the lock and the broker may not be alive any more.All the following calls to
b.metricRegistry
are done while the lock is held:Open()
, the lock is held, including inside the goroutineClose()
, the lock is heldAsyncProduce()
has a contract that it must be called while the broker is open, we keep a copy of the metric registry to use inside the callbacksendInternal()
, has a contract that the lock should be heldauthenticateViaSASLv1()
is called fromOpen()
andsendWithPromise()
, both of them holding the locksendAndReceiveSASLHandshake()
is called fromauthenticateViaSASLv0/v1()
, which are called fromOpen()
andsendWithPromise()
I am unsure about
responseReceiver()
, however, it is also callingb.readFull()
which accessesb.conn
, so I suppose it is safe.This leaves
sendAndReceive()
which is callingsend()
, which is callingsendWithPromise()
which puts a lock. We move the lock tosendAndReceive()
instead.send()
is only called fromsendAndReceiver()
and we put a lock forsendWithPromise()
other caller.The test has been stolen from #2393 from @samuelhewitt. #2393 is an alternative proposal using a RW lock to protect
b.metricRegistry
.Fix #2320