Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Socket mode: Unhandled event 'server explicit disconnect' in state 'connecting' #2225

Closed
okovpashko opened this issue Sep 3, 2024 · 9 comments · Fixed by #2254
Closed

Socket mode: Unhandled event 'server explicit disconnect' in state 'connecting' #2225

okovpashko opened this issue Sep 3, 2024 · 9 comments · Fixed by #2254
Labels
auto-triage-skip Prevent this issue from being closed due to lack of activity dependencies Pull requests that update a dependency file semver:major
Milestone

Comments

@okovpashko
Copy link

@slack/bolt version

3.21.1

Your App and Receiver Configuration

e.g. const myApp = new App({ ... what options are you using? })

const app = new App({
  token: env.SLACK_BOT_TOKEN,
  socketMode: true,
  appToken: env.SLACK_APP_TOKEN,
  logger,
  logLevel: env.LOG_LEVEL,
  customRoutes,
  port: env.APP_PORT,
});

Node.js runtime version

v20.17.0

Steps to reproduce:

N/A

Expected result:

No unexpected disconnection errors

Actual result:

Regular disconnection errors:

Unhandled event 'server explicit disconnect' in state 'connecting'.

Here's the stack trace from Sentry

image

and the latest error history

image

Requirements

After upgrading Bolt.js from 3.19.0 to 3.21.1, we started receiving disconnection errors multiple times daily on various environments (prod and testing) of projects that use the Socket Mode.
At least one of those projects didn't have any other logic and/or infrastructure changes.

It looks like there's an error was introduced in one of the latest updates. Could you please confirm the issue or suggest where to dig if it's on our side?

@seratch seratch added question M-T: User needs support to use the project needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info and removed untriaged labels Sep 3, 2024
@seratch
Copy link
Member

seratch commented Sep 3, 2024

Hi @okovpashko, we're sorry for the disruption during the upgrade and thank you for taking the time to report this.

It seems there is an issue on the underlying @slack/socket-mode library side. To get more detailed information on the situation, could you share the outputs of grep "@slack/" package-lock.json for your project?

We're curious about the versions of the underlying @slack/ npm packages. You're probably using @slack/socket-mode@1.3.x, and indeed the pattern might be unhandled even with the latest version of the 1.3.x series: https://github.com/slackapi/node-slack-sdk/blob/%40slack/socket-mode%401.3.6/packages/socket-mode/src/SocketModeClient.ts#L267-L283

One thing I am wondering is if this issue might be a potential issue even with bolt-js 3.19 because there is no specific code on the bolt-js side to reconnect for Socket Mode disconnections. I'm not sure about the direct trigger that caused the new error pattern for your app yet, but upgrading @slack/socket-mode etc. this time might be a possible cause. To resolve this issue, @slack/socket-mode 1.3.x should make a patch release to enable it functional with this kind of race condition situation.

Aside from that, our team has a long-term plan to rewrite the Socket Mode client for Node.js users to eliminate these kinds of errors, and the underlying module 2.x is already published, although bolt-js does not support it yet. I don't have detailed information about the future release plans at this moment, but @filmaj will be able to share more details on this.

@seratch seratch removed the question M-T: User needs support to use the project label Sep 3, 2024
@okovpashko
Copy link
Author

@seratch thank you for the rapid response.

Here's my grep output:

         "@slack/bolt": "^3.21.1",
    "node_modules/@slack/bolt": {
      "resolved": "https://registry.npmjs.org/@slack/bolt/-/bolt-3.21.1.tgz",
        "@slack/logger": "^4.0.0",
        "@slack/oauth": "^2.6.3",
        "@slack/socket-mode": "^1.3.6",
        "@slack/types": "^2.11.0",
        "@slack/web-api": "^6.12.1",
    "node_modules/@slack/bolt/node_modules/@slack/logger": {
      "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-4.0.0.tgz",
    "node_modules/@slack/bolt/node_modules/path-to-regexp": {
    "node_modules/@slack/logger": {
      "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-3.0.0.tgz",
    "node_modules/@slack/oauth": {
      "resolved": "https://registry.npmjs.org/@slack/oauth/-/oauth-2.6.3.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "node_modules/@slack/socket-mode": {
      "resolved": "https://registry.npmjs.org/@slack/socket-mode/-/socket-mode-1.3.6.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "node_modules/@slack/types": {
      "resolved": "https://registry.npmjs.org/@slack/types/-/types-2.11.0.tgz",
    "node_modules/@slack/web-api": {
      "resolved": "https://registry.npmjs.org/@slack/web-api/-/web-api-6.12.1.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/types": "^2.11.0",
    "node_modules/@slack/web-api/node_modules/eventemitter3": {
    "node_modules/@slack/web-api/node_modules/is-stream": {
    "@slack/bolt": {
      "resolved": "https://registry.npmjs.org/@slack/bolt/-/bolt-3.21.1.tgz",
        "@slack/logger": "^4.0.0",
        "@slack/oauth": "^2.6.3",
        "@slack/socket-mode": "^1.3.6",
        "@slack/types": "^2.11.0",
        "@slack/web-api": "^6.12.1",
        "@slack/logger": {
          "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-4.0.0.tgz",
    "@slack/logger": {
      "resolved": "https://registry.npmjs.org/@slack/logger/-/logger-3.0.0.tgz",
    "@slack/oauth": {
      "resolved": "https://registry.npmjs.org/@slack/oauth/-/oauth-2.6.3.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "@slack/socket-mode": {
      "resolved": "https://registry.npmjs.org/@slack/socket-mode/-/socket-mode-1.3.6.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/web-api": "^6.12.1",
    "@slack/types": {
      "resolved": "https://registry.npmjs.org/@slack/types/-/types-2.11.0.tgz",
    "@slack/web-api": {
      "resolved": "https://registry.npmjs.org/@slack/web-api/-/web-api-6.12.1.tgz",
        "@slack/logger": "^3.0.0",
        "@slack/types": "^2.11.0",

I took a look at the Git history in our project and noticed that the @slack/socket-mode package was updated from 1.3.3 to 1.3.6 along with the @slack/bolt. Probably it might be the cause.

Anyway, your reply is very helpful as now I understand that it's not the issue on our side, you are aware of it and working on the fix.

@filmaj
Copy link
Contributor

filmaj commented Sep 3, 2024

Hello, this is unfortunately a design flaw with socket-mode 1.x. It is a duplicate of slackapi/node-slack-sdk#1787. You can find the details for this problem in that issue thread.

I don't think this is a problem that was introduced between socket-mode 1.3.3 and 1.3.6, but you can try locking to the earlier version to see if the problem goes away on your end. Several customers started reporting issues related to this functionality in May of this year, which prompted a long investigation. Ultimately, socket-mode v2 was released that completely changes the design of the underlying socket-mode module and has addressed the issue for affected customers. You can give socket-mode v2 a try by using bolt@4.0.0-rc.1 - but fair warning, it is still in release candidacy as we are still some distance away from having a 4.0 of bolt-js ready.

If possible, if you have debug logs available leading up to the problem to share, please post them in slackapi/node-slack-sdk#1787 - I am collecting such behaviours and trying to add them to the integration tests in socket-mode to write more realistic test scenarios to better help prevent regressions. Understanding the state of the state machine in socket-mode and the sequence of events sent by Slack's backend would be helpful.

@filmaj filmaj added semver:major auto-triage-skip Prevent this issue from being closed due to lack of activity dependencies Pull requests that update a dependency file and removed needs info An issue that is claimed to be a bug and hasn't been reproduced, or otherwise needs more info labels Sep 3, 2024
@filmaj filmaj added this to the 4.0.0 milestone Sep 3, 2024
@okovpashko
Copy link
Author

@filmaj thank you for the explanation. I posted the debug logs to the issue you mentioned.

@okovpashko
Copy link
Author

I've also just noticed the disconnection reason too_many_websockets. I'm wondering if there is an issue with for example not closing the previous connections on our end. We run bots in Kubernetes and I'm sure that we kill all the obsolete containers in the namespace.

@filmaj
Copy link
Contributor

filmaj commented Sep 3, 2024

I've also just noticed the disconnection reason too_many_websockets. I'm wondering if there is an issue with for example not closing the previous connections on our end.

Ah yes, definitely leftover / hanging network connections not cleanly closed would contribute to the backend sending too_many_websockets.

Some context: each Slack app ID is allowed a maximum of 10 concurrent websocket (socket-mode) connections. Some customers sometimes create too many app instances (scale their app beyond 10 instances) which can lead to this problem. Possibly, as you mention, not explicitly stopping the server or cleanly closing the network connection could, at least temporarily, cause the error as well (until the network connection becomes severed or the Slack backend realizes the connection is no longer established).

When Slack detects it has more than 10 concurrent websocket connections open with a single app, it will send a type: warning message to the app with reason: too_many_websockets. By default, all socket-mode implementations that we at Slack manage (i.e. node @slack/socket-mode but also the Python and Java versions) will accept this message from the server and try to re-establish a connection with Slack. That is what "event 'server explicit disconnect'" means in the error message: Slack sent a type:warning message to the app, and the app is trying to re-establish a connection. This could lead to an error loop behaviour: Slack sends a type:warning or type:disconnect message, socket-mode tries to re-establish a connection, the app is already at the maximum concurrent number of connections allowed, and so fails.

Ultimately, I suggest looking into the reason that Slack is sending too_many_websockets: that is the actual problem; the 'explicit server disconnect' exception is just a symptom of that problem.

Thank you for sharing logs in the other issue, I will inspect them and see if that can help me re-create a test scenario; once a test scenario is written, then it is much easier to experiment with different solutions. ❤️

@filmaj
Copy link
Contributor

filmaj commented Oct 2, 2024

FYI a release candidate with this fix is available in @slack/bolt@4.0.0-rc.2. It is a pre-release, however, so you will need to explicitly install that version, and since it is a breaking change, it is included with other breaking changes. However, there is a migration guide available to help guide you through the changes requiring other modifications to your app.

@filmaj
Copy link
Contributor

filmaj commented Oct 17, 2024

The fix here is available in the new major bolt v4 version, FYI.

@okovpashko
Copy link
Author

okovpashko commented Nov 4, 2024

@filmaj thank you for the update. We will migrate to the new version but I'm afraid that it will fix only the unhandled errors.

We still have no clue of the reason for the too_many_websockets response from Slack API.

The latest occurrence was this night and it looked like a regular reconnection request from Slack API:

2024-11-04 04:24:53.542	[App]: Received "disconnect" (reason: too_many_websockets) message - will attempt reconnect
2024-11-04 04:24:53.541	    at Receiver.emit (node:events:519:28)
2024-11-04 04:24:53.541	    at Receiver.receiverOnMessage (/app/node_modules/ws/lib/websocket.js:1070:20)
2024-11-04 04:24:53.541	    at WebSocket.emit (node:events:519:28)
2024-11-04 04:24:53.541	    at WebSocket.onMessage (/app/node_modules/ws/lib/event-target.js:132:16)
2024-11-04 04:24:53.541	    at SocketModeClient.onWebSocketMessage (/app/node_modules/@slack/socket-mode/dist/SocketModeClient.js:608:31)
2024-11-04 04:24:53.541	    at HierarchicalStateMachine.handle (/app/node_modules/finity/lib/core/HierarchicalStateMachine.js:72:24)
2024-11-04 04:24:53.541	    at TaskScheduler.enqueue (/app/node_modules/finity/lib/core/TaskScheduler.js:19:12)
2024-11-04 04:24:53.540	    at TaskScheduler.execute (/app/node_modules/finity/lib/core/TaskScheduler.js:29:7)
2024-11-04 04:24:53.540	    at /app/node_modules/finity/lib/core/HierarchicalStateMachine.js:79:33
2024-11-04 04:24:53.540	    at StateMachine.handleUnhandledEvent (/app/node_modules/finity/lib/core/StateMachine.js:76:13)
2024-11-04 04:24:53.540	Error: Unhandled event 'server explicit disconnect' in state 'connecting'.
2024-11-04 04:24:53.540	This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). The promise rejected with the reason:
2024-11-04 04:24:53.533	[App]: Received a message on the WebSocket: {"type":"disconnect","reason":"too_many_websockets","debug_info":{"host":"applink-13"}}
2024-11-04 04:24:53.521	[App]: Transitioning to state: connecting:handshaking
2024-11-04 04:24:53.521	[App]: Secondary WebSocket open event received (connection established)
2024-11-04 04:24:53.377	[App]: Transitioning to state: connecting:authenticated
2024-11-04 04:24:53.377	[App]: apiCall('apps.connections.open') end
2024-11-04 04:24:53.377	[App]: http request result: {"ok":true,"url":"wss://wss-primary.slack.com/link/?ticket=35474bba-278a-4922-ac35-5f2880319e2a&app_id=c6f978591b122ce59e8e3ae6bf0fb9de47ac943f7ff81e67f1cc76ad3f244a11","response_metadata":{"scopes":["connections:write"],"acceptedScopes":["connections:write"]}}
2024-11-04 04:24:53.374	[App]: http response received
2024-11-04 04:24:53.246	[App]: http request headers: {}
2024-11-04 04:24:53.246	[App]: http request body: {}
2024-11-04 04:24:53.246	[App]: http request url: https://slack.com/api/apps.connections.open
2024-11-04 04:24:53.246	[App]: apiCall('apps.connections.open') start
2024-11-04 04:24:53.246	[App]: Going to retrieve a new WSS URL ...
2024-11-04 04:24:53.246	[App]: Transitioning to state: connecting:authenticating
2024-11-04 04:24:53.246	[App]: Going to establish a new connection to Slack ...
2024-11-04 04:24:53.246	[App]: Transitioning to state: connecting
2024-11-04 04:24:53.246	[App]: Reconnecting to Slack ...
2024-11-04 04:24:53.246	[App]: Transitioning to state: reconnecting
2024-11-04 04:24:53.246	[App]: Terminated the heart beat job
2024-11-04 04:24:53.246	[App]: Cancelled the job waiting for ping from Slack
2024-11-04 04:24:53.246	[App]: Received "disconnect" (reason: warning) message - will attempt reconnect
2024-11-04 04:24:53.241	[App]: Received a message on the WebSocket: {"type":"disconnect","reason":"warning","debug_info":{"host":"applink-canary-0"}}

We have only one instance of the application with only one active connection. The time frame of the logs above is 2024-11-02 00:00:00 – 2024-11-04 04:25:00 which means that there were no operations on our backend since Saturday.

As I understand, Slack API drops connections every few hours causing all clients to reconnect.

Is it possible that there were other 9 active connections since Saturday?
If not, how could we get too_many_websockets if we had only one active connection?

Could you please suggest where to dig to debug this issue?


UPD: I can see that we indeed have 10 active connections

2024-11-04 04:25:44.776	[App]: Received a message on the WebSocket: {"type":"hello","num_connections":10,"debug_info":{"host":"applink-8","build_number":101,"approximate_connection_time":18060},"connection_info":{"app_id":"A012345"}}

Could you please suggest if there is a way to drop all unused connections?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-triage-skip Prevent this issue from being closed due to lack of activity dependencies Pull requests that update a dependency file semver:major
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants