-
Notifications
You must be signed in to change notification settings - Fork 663
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
socket-mode: Handling WS errors during handshake #2099
Conversation
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #2099 +/- ##
==========================================
+ Coverage 91.65% 91.89% +0.23%
==========================================
Files 38 38
Lines 10311 10320 +9
Branches 647 649 +2
==========================================
+ Hits 9451 9484 +33
+ Misses 848 824 -24
Partials 12 12
Flags with carried forward coverage won't be shown. Click here to find out more. |
this.logger.debug('WebSocket open event received (connection established)!'); | ||
this.websocket = ws; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not tracking the current websocket instance was a problem in the case the web socket server returns an error. Since only once the open
event was emitted would we keep track of the websocket (this line), any errors during the websocket handshake would call into disconnect
which, crucially, only terminates the underlying network resources for the tracked this.websocket
.
That was the underlying issue that caused the "doubling up" of reconnection attempts (see "Issue 1" described by @gdhardy1 in this comment): each time an error was encountered during websocket handshake, there would be an 'untracked' ws
instance still executing reconnect attempts in addition to a new one being spawned. Essentially websocket-connection-bombing ourselves.
If you remove the changes in this file, you will see the integration test I added at the end of this PR fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for finding this. I was able to patch this in and test it while we await a full fix. It stabilized things tremendously.
this.logger.debug('Continuing with reconnect...'); | ||
this.emit(State.Reconnecting); | ||
cb.apply(this).then(res); | ||
if (this.shuttingDown) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While not a problem brought up by users, I noticed a problem here when writing integration tests:
- if you turn on the socket mode client and it is seeing errors from the web socket server, and
- in between reconnect attempts you call
disconnect()
to stop the client,
.. then the client will keep trying to reconnect forever via this part of the code. Adding this guard here just in case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
makes for an awesome catch 🙏 ✨ Thanks for guarding against it!
@@ -61,10 +61,7 @@ function errorWithCode(error: Error, code: ErrorCode): CodedError { | |||
* A factory to create SMWebsocketError objects. | |||
*/ | |||
export function websocketErrorWithOriginal(original: Error): SMWebsocketError { | |||
const error = errorWithCode( | |||
new Error(`Failed to send message on websocket: ${original.message}`), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed "Failed to send message" as part of this error, as this error can be raised unrelated to message-sending. In this PR, I noticed that websocket handshake errors also used this code path, which made the error message misleading.
@@ -168,6 +169,47 @@ describe('Integration tests with a WebSocket server', () => { | |||
assert.isTrue(debugLoggerSpy.calledWith(sinon.match('Unable to parse an incoming WebSocket message'))); | |||
await client.disconnect(); | |||
}); | |||
it('should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409', async () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the test testing for the exponential reconnection doubling spiral bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Keeping track of what's been instantiated in this
is such a great fix!
I've tested this with a few common cases and error cases with a Bolt app using the RC from slackapi/bolt-js#2347, including:
- Running an app for a while
- Running multiple apps at the same time
- Running >10 apps at once for infinite retries due to
too_many_websockets
- Turning internet off and back on while apps are running
And found all of these to connect with the right retries! One socket connection process persisted after exit which was due to some tooling strangeness and scripting I had setup, but after kill
the num_connections
returned to the expected 1
😌 🎉
With the backend rollout resolving most 409
errors and reports from @gdhardy1 that this is working well, I think this is set to merge and release! 🚢 💨
Going to coordinate a few other changes before this, but I appreciate this PR so much ❤️
this.logger.debug('Continuing with reconnect...'); | ||
this.emit(State.Reconnecting); | ||
cb.apply(this).then(res); | ||
if (this.shuttingDown) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this
makes for an awesome catch 🙏 ✨ Thanks for guarding against it!
import sinon from 'sinon'; | ||
|
||
import { SlackWebSocket } from './SlackWebSocket'; | ||
proxyquire.noPreserveCache(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Superb test setups all throughout 👏 This adds so much confidence around the changes.
With some of the retry timing we might have to lookout for flakiness of tests due to the CI runners - nothing a rerun might not solve, but with the matrix setups I think our chances of this happening increase eversomuch 👾
@@ -90,7 +90,7 @@ describe('Integration tests with a WebSocket server', () => { | |||
await client.disconnect(); | |||
}); | |||
}); | |||
describe('catastrophic server behaviour', () => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥!
client.on('close', () => { | ||
closed++; | ||
}); | ||
// do not use await here, since `start()` won't return until the connection is established. we are intentionally testing connection establishment failure, so that will never finish. so, let's run this in a rogue "thread", e.g. fire off an async method and let it do its thing! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is awesome 👏
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks good to me as well 💯
Thank you @zimeg for manually stress testing these changes 🙏
I tested out our socket-mode-oauth example and seems like the socket gets established properly 👍
@WilliamBergamin 🙏 And thank you for checking with the example! Let's merge this great fix for the next release 🚀 |
This PR fixes #2094
If a websocket error is raised during WS connection establishment, and retries are on, the library today may start a runaway loop of doubling the number of WS connections as it attempts to retry. This was caused by the
SlackWebSocket
class not tracking a fresh new WS connection until the connection was established; if an error occurred during the handshake, a new WS connection would be created and the old one would not be cleaned up. The core of the fix in this PR involves tracking the WS connection at all times.