
High goroutine count after high client/message load #462

Open
bddckr opened this issue Mar 11, 2025 · 5 comments
Labels
needs investigation Identify cause / reproduce issue

Comments


bddckr commented Mar 11, 2025

Hey there! Thanks for this great MQTT broker. I'm a newbie to Go, so please bear with me while I explain the following 😄

@SimonMacIntyre and I have been doing some load tests and noticed that the threads (goroutine) count reported by the HTTP stats endpoint doesn't go down once clients disconnect (ungracefully).

This is what we've roughly been doing in this particular test:

  1. Connect thousands of clients.
  2. Each client subscribes to a specific topic.
  3. Once all clients are connected, we publish lots of messages to that single topic everyone's subscribed to.
  4. We keep it running and observe general performance details.
  5. Eventually we take down all the clients we've started. We do so ungracefully.
  6. We keep observing the threads count via Mochi's HTTP stats endpoint.

Once all clients were connected, the thread count plateaued at a high level. It then stayed at that level even after all the clients had been disconnected (ungracefully).

When we checked back in 2 hours later, the thread count was all the way back down to what it was before the test. We had instead expected it to drop right after all the clients were shown as disconnected by Mochi. We can't tell exactly when the thread count went down, but we can confirm it stayed high for at least 10 minutes after all clients were disconnected.
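
For context, the polling we do is essentially the following. This is just a minimal sketch; it assumes the HTTP stats listener is bound to localhost:8080 and exposes the goroutine count as a threads field in its JSON payload, so adjust those to your setup:

```go
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// brokerInfo captures only the field we watch from the stats JSON.
// Assumption: the goroutine count is exposed as "threads" in the payload.
type brokerInfo struct {
	Threads int64 `json:"threads"`
}

func main() {
	// Assumption: the HTTP stats listener is bound to localhost:8080.
	const statsURL = "http://localhost:8080/"

	for range time.Tick(5 * time.Second) {
		resp, err := http.Get(statsURL)
		if err != nil {
			log.Printf("stats request failed: %v", err)
			continue
		}

		var info brokerInfo
		if err := json.NewDecoder(resp.Body).Decode(&info); err != nil {
			log.Printf("decode failed: %v", err)
		} else {
			log.Printf("threads (goroutines): %d", info.Threads)
		}
		resp.Body.Close()
	}
}
```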

Does anyone have any idea as to what's happening? Is this a cause for concern?


bddckr commented Mar 11, 2025

I've checked all the places goroutines are started, and it seems Mochi is properly stopping them all.

That said, I believe what we've observed must be something related to the two defer statements seen here:

  • As we've polled the HTTP stats endpoint, we can confirm that the ClientsConnected counter went down to zero when we disconnected all clients.
  • However, I believe the cl.Stop() call might not be returning quickly.

Therefore, I wonder whether closing the connection is blocked for a long while. I came across this discussion and am now wondering whether Mochi is missing a call to SetLinger. In our case we would have expected SetLinger(0) to simply discard any unsent data. But I'm not 100% sure if that would work while still ensuring that clients get a DISCONNECT packet in general.
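
For illustration, what I have in mind is roughly this on the raw TCP connection. It's only a sketch (not Mochi's actual code) and assumes the client's connection is a plain *net.TCPConn:

```go
package example

import (
	"log"
	"net"
)

// discardOnClose asks the OS to drop any unsent data when the connection is
// closed, so Close() returns immediately instead of lingering to flush it.
// Sketch only: assumes the client's underlying connection is a *net.TCPConn.
func discardOnClose(conn net.Conn) {
	if tcp, ok := conn.(*net.TCPConn); ok {
		if err := tcp.SetLinger(0); err != nil {
			log.Printf("SetLinger failed: %v", err)
		}
	}
}
```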

I believe that would explain what happened in our scenario, as we've had tens of thousands of inflight messages according to the HTTP stats endpoint when we disconnected all the clients.

thedevop (Collaborator) commented

Mochi can’t immediately detect an ungraceful client disconnect. To detect one, it relies on two mechanisms:

  1. The MQTT protocol’s keepalive timeout, which is the keepalive value set by the client during the Connect handshake.
  2. The TCP protocol’s timeout at the OS level (both keepalive and retry mechanisms).

Neither defer s.Listeners.ClientsWg.Done() nor defer cl.Stop(nil) should cause the delay you observed; it's most likely due to the time it took to detect that the client is no longer active.
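
To illustrate mechanism 1: a broker typically treats a client as gone when nothing arrives within roughly 1.5x the negotiated keepalive. A sketch of that pattern (illustrative only, not Mochi's exact implementation):

```go
package example

import (
	"net"
	"time"
)

// readWithKeepalive reads packets until the client has been silent for longer
// than 1.5x its negotiated keepalive, then returns a timeout error.
// Illustrative sketch only, not Mochi's actual read loop.
func readWithKeepalive(conn net.Conn, keepalive time.Duration, handle func([]byte)) error {
	buf := make([]byte, 4096)
	for {
		// A client that sends nothing (not even a PINGREQ) within this window
		// is treated as gone, and the next Read returns a timeout error.
		if err := conn.SetReadDeadline(time.Now().Add(keepalive * 3 / 2)); err != nil {
			return err
		}
		n, err := conn.Read(buf)
		if err != nil {
			return err // timeout or connection error: tear the client down
		}
		handle(buf[:n])
	}
}
```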


bddckr commented Mar 12, 2025

We've set the keep-alive to 10 seconds (with great success in general), so I expect the MQTT timeout would kick in after ~15 seconds. However, the goroutine count stayed high for multiple minutes.

Also: The ClientsConnected count did go down quickly as expected (thanks to that low keep-alive, I'm sure). After that it still took minutes to stop those goroutines, at least according to that "thread count" reported via the HTTP endpoint.

The fact that the connected client count dropped but the thread count didn't is what gives me pause. Is that consistent with what you just said?


thedevop commented Mar 13, 2025

Given you saw the connected client count drop shortly after (~15s, based on the keep-alive of 10s), there should be nothing holding up attachClient from exiting. There are 3 deferred calls in that function, listed here in their actual execution order:

defer s.Info.ClientsConnected.Add(-1) // reduces the connected client count, which you saw take effect
defer cl.Stop(nil) // should return immediately, as in your case it was already called once before
defer s.Listeners.ClientsWg.Done() // completes immediately

One possible cause of what you observed is that the Threads count does not update in real time. Can you share the value you used for SysTopicResendInterval?
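
For reference, that interval is set when constructing the server, roughly like this (a sketch assuming the v2 Options field named above; the value is in seconds and defaults to 1):

```go
package example

import (
	mqtt "github.com/mochi-mqtt/server/v2"
)

// newServer builds a broker with an explicit $SYS refresh interval.
// Sketch only: assumes the v2 Options field named in this thread; the value
// is in seconds, and 1 is the default. Stats such as Threads are only
// refreshed on this interval, so they can lag reality by up to that long.
func newServer() *mqtt.Server {
	return mqtt.New(&mqtt.Options{
		SysTopicResendInterval: 1,
	})
}
```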

thedevop added the "needs investigation" (Identify cause / reproduce issue) label Mar 13, 2025

bddckr commented Mar 13, 2025

One possible cause of what you observed is that the Threads count does not update in real time. Can you share the value you used for SysTopicResendInterval?

We use the default, which is 1 second, I believe.


We're hoping to run some more load-tests today and I will report back with the findings. Thanks for your help so far! Much appreciated ❤
