Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

socket-mode: Handling WS errors during handshake #2099

Merged
merged 5 commits into from
Dec 9, 2024
Merged

socket-mode: Handling WS errors during handshake #2099

merged 5 commits into from
Dec 9, 2024

Conversation

filmaj
Copy link
Contributor

@filmaj filmaj commented Nov 22, 2024

This PR fixes #2094

If a websocket error is raised during WS connection establishment, and retries are on, the library today may start a runaway loop of doubling the number of WS connections as it attempts to retry. This was caused by the SlackWebSocket class not tracking a fresh new WS connection until the connection was established; if an error occurred during the handshake, a new WS connection would be created and the old one would not be cleaned up. The core of the fix in this PR involves tracking the WS connection at all times.

@filmaj filmaj added bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented discussion M-T: An issue where more input is needed to reach a decision server-side-issue draft pkg:socket-mode applies to `@slack/socket-mode` backend Issue related to the Slack API labels Nov 22, 2024
Copy link

codecov bot commented Nov 22, 2024

Codecov Report

Attention: Patch coverage is 50.00000% with 7 lines in your changes missing coverage. Please review.

Project coverage is 91.89%. Comparing base (278514f) to head (1c035e3).
Report is 20 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2099      +/-   ##
==========================================
+ Coverage   91.65%   91.89%   +0.23%     
==========================================
  Files          38       38              
  Lines       10311    10320       +9     
  Branches      647      649       +2     
==========================================
+ Hits         9451     9484      +33     
+ Misses        848      824      -24     
  Partials       12       12              
Flag Coverage Δ
cli-hooks 95.24% <ø> (+0.01%) ⬆️
cli-test 94.48% <ø> (+0.01%) ⬆️
oauth 77.39% <ø> (ø)
socket-mode 61.82% <50.00%> (+3.59%) ⬆️
web-api 96.88% <ø> (ø)
webhook 96.65% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.

this.logger.debug('WebSocket open event received (connection established)!');
this.websocket = ws;
Copy link
Contributor Author

@filmaj filmaj Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not tracking the current websocket instance was a problem in the case the web socket server returns an error. Since only once the open event was emitted would we keep track of the websocket (this line), any errors during the websocket handshake would call into disconnect which, crucially, only terminates the underlying network resources for the tracked this.websocket.

That was the underlying issue that caused the "doubling up" of reconnection attempts (see "Issue 1" described by @gdhardy1 in this comment): each time an error was encountered during websocket handshake, there would be an 'untracked' ws instance still executing reconnect attempts in addition to a new one being spawned. Essentially websocket-connection-bombing ourselves.

If you remove the changes in this file, you will see the integration test I added at the end of this PR fail.

Copy link

@gdhardy1 gdhardy1 Nov 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for finding this. I was able to patch this in and test it while we await a full fix. It stabilized things tremendously.

this.logger.debug('Continuing with reconnect...');
this.emit(State.Reconnecting);
cb.apply(this).then(res);
if (this.shuttingDown) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not a problem brought up by users, I noticed a problem here when writing integration tests:

  • if you turn on the socket mode client and it is seeing errors from the web socket server, and
  • in between reconnect attempts you call disconnect() to stop the client,

.. then the client will keep trying to reconnect forever via this part of the code. Adding this guard here just in case.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes for an awesome catch 🙏 ✨ Thanks for guarding against it!

@@ -61,10 +61,7 @@ function errorWithCode(error: Error, code: ErrorCode): CodedError {
* A factory to create SMWebsocketError objects.
*/
export function websocketErrorWithOriginal(original: Error): SMWebsocketError {
const error = errorWithCode(
new Error(`Failed to send message on websocket: ${original.message}`),
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed "Failed to send message" as part of this error, as this error can be raised unrelated to message-sending. In this PR, I noticed that websocket handshake errors also used this code path, which made the error message misleading.

@@ -168,6 +169,47 @@ describe('Integration tests with a WebSocket server', () => {
assert.isTrue(debugLoggerSpy.calledWith(sinon.match('Unable to parse an incoming WebSocket message')));
await client.disconnect();
});
it('should maintain one serial reconnection attempt if WSS server sends unexpected HTTP response during handshake, like a 409', async () => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the test testing for the exponential reconnection doubling spiral bug.

@filmaj filmaj changed the title WIP: Handling 409 response from websocket server socket-mode: Handling WS errors during handshake Nov 26, 2024
@filmaj filmaj marked this pull request as ready for review November 26, 2024 15:39
@filmaj filmaj requested a review from a team November 26, 2024 15:39
Copy link
Member

@zimeg zimeg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Keeping track of what's been instantiated in this is such a great fix!

I've tested this with a few common cases and error cases with a Bolt app using the RC from slackapi/bolt-js#2347, including:

  • Running an app for a while
  • Running multiple apps at the same time
  • Running >10 apps at once for infinite retries due to too_many_websockets
  • Turning internet off and back on while apps are running

And found all of these to connect with the right retries! One socket connection process persisted after exit which was due to some tooling strangeness and scripting I had setup, but after kill the num_connections returned to the expected 1 😌 🎉

With the backend rollout resolving most 409 errors and reports from @gdhardy1 that this is working well, I think this is set to merge and release! 🚢 💨

Going to coordinate a few other changes before this, but I appreciate this PR so much ❤️

this.logger.debug('Continuing with reconnect...');
this.emit(State.Reconnecting);
cb.apply(this).then(res);
if (this.shuttingDown) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this makes for an awesome catch 🙏 ✨ Thanks for guarding against it!

import sinon from 'sinon';

import { SlackWebSocket } from './SlackWebSocket';
proxyquire.noPreserveCache();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Superb test setups all throughout 👏 This adds so much confidence around the changes.

With some of the retry timing we might have to lookout for flakiness of tests due to the CI runners - nothing a rerun might not solve, but with the matrix setups I think our chances of this happening increase eversomuch 👾

@@ -90,7 +90,7 @@ describe('Integration tests with a WebSocket server', () => {
await client.disconnect();
});
});
describe('catastrophic server behaviour', () => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥!

client.on('close', () => {
closed++;
});
// do not use await here, since `start()` won't return until the connection is established. we are intentionally testing connection establishment failure, so that will never finish. so, let's run this in a rogue "thread", e.g. fire off an async method and let it do its thing!
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome 👏

Copy link
Contributor

@WilliamBergamin WilliamBergamin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me as well 💯

Thank you @zimeg for manually stress testing these changes 🙏

I tested out our socket-mode-oauth example and seems like the socket gets established properly 👍

@zimeg
Copy link
Member

zimeg commented Dec 9, 2024

@WilliamBergamin 🙏 And thank you for checking with the example! Let's merge this great fix for the next release 🚀

@zimeg zimeg merged commit c389a75 into main Dec 9, 2024
56 of 57 checks passed
@zimeg zimeg deleted the handle-409 branch December 9, 2024 19:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backend Issue related to the Slack API bug M-T: A confirmed bug report. Issues are confirmed when the reproduction steps are documented discussion M-T: An issue where more input is needed to reach a decision draft pkg:socket-mode applies to `@slack/socket-mode` server-side-issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

WebSocket error occurred: Unexpected server response: 409
4 participants