Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Some dead websocket connections are never cleaned up #431

Closed
krajj7 opened this issue Jul 8, 2023 · 27 comments
Closed

Some dead websocket connections are never cleaned up #431

krajj7 opened this issue Jul 8, 2023 · 27 comments
Assignees
Labels

Comments

@krajj7
Copy link

krajj7 commented Jul 8, 2023

After upgrading from sente 1.17.0 to 1.18.1 I noticed that my app is accumulating dead websocket connections in the connected-uids atom that never go away, even after trying to close them by sending [:chsk/close]. One case when this seems to happen is when the browser is killed non-gracefully, ie. kill -9.

This can be simulated by using the example project in the repo (I am assuming Linux and using Chromium as the browser):

  1. git checkout https://github.com/ptaoussanis/sente.git; cd sente/example-project; lein cljsbuild once
  2. (for my JVM I have to add the dependency [com.google.guava/guava "25.1-jre"] in project.clj or it doesn't work)
  3. lein repl
  4. run (-main)
  5. Navigate your browser to the page printed in the REPL ("Web server is running at ...")
  6. Make sure it made a websocket connection (I haven't tested AJAX)
  7. Log in with some user-id on the page
  8. Running (:ws @connected-uids) in the REPL should show that user-id
  9. Kill the browser non-gracefully by running killall -KILL chromium
  10. Try to disconnect the client: (chsk-send! <uid> [:chsk/close])
  11. Wait 30 seconds and check (:ws @connected-uids)

With the example project from version 1.17.0 the client disappears from (:ws @connected-uids) after only a few seconds. With version 1.18.1 the client seems to stay indefinitely.

I realize that using :chsk/close is undocumented, as far as I understand, unresponsive clients should eventually be disconnected automatically, but this never seems to happen with 1.18.1 (when the browser is killed non-gracefully as above). With 1.17.0 automatic disconnects mostly worked but I found some cases where an explicit close was needed, which is why I'm using it.

@ptaoussanis ptaoussanis self-assigned this Jul 8, 2023
@ptaoussanis
Copy link
Member

@krajj7 Hi Jan,

I'm having difficulty reproducing this so far. I'm consistently seeing (:ws @connected-uids) displaying the expected #{} a couple seconds after the forced kill.

Are you 100% sure that you're running Sente 1.18.1 and not 1.18.0? Have you tried running lein clean before your build?

You can check the sente/sente-version var to make sure you don't have a stale build artifact, etc.

I realize that using :chsk/close is undocumented

Correct, but it should still work fine.

as far as I understand, unresponsive clients should eventually be disconnected automatically

Correct. The server has a loop that regularly tries to ping the client if the connection has been idle. That ping attempt should trigger a connection close if it fails.

If you've confirmed that you're really running 1.18.1 and are still seeing the issue - did you make any other changes to the reference example? E.g. did you change the default web server from http-kit, etc.?

Thanks!

@krajj7
Copy link
Author

krajj7 commented Jul 9, 2023

Hi, thank you for looking into this.

I re-tried today and found that I also cannot reliably reproduce with the given instructions, I am sorry about that. I can still trigger stuck connections with a few minutes of trying, but frustratingly, I haven't been able to find a reliable trigger.

I am 100% sure I'm using 1.18.1 and I didn't make any relevant changes in the example project. The only changes I made was the added guava dependency and commenting out auto-opening of the browser, since (.browse (java.awt.Desktop/getDesktop) (java.net.URI. uri)) causes an UnsupportedOperationException on my machine.

I was most successful triggering the problem with these steps (in the example project):

  1. Start the example project server, open the web page, make sure it's using a websocket connection
  2. Log in as "a"
  3. Freeze the browser by running killall -STOP chromium
  4. Spam the browser with a couple messages from the REPL: (doseq [x (range 250)] (chsk-send! "a" [:test/test]))
  5. Un-freeze the browser by doing killall -CONT chromium
  6. Quickly run (chsk-send! "a" [:chsk/close]) in the REPL while the queued-up messages are processing. The browser eventually reconnects, but sometimes this new connection becomes "immortal".
  7. Close the browser or log in with a different UID
  8. After a few seconds check (:ws @connected-uids). If the problem was triggered "a" will remain there forever.

The timing of step 6 seems to be critical. If it's done too late or too early the problem doesn't manifest. The message spam doesn't always seem to be neccessary, I'm not sure if it matters or not.

I'm sorry about these unreliable instructions. If possible please give it 5 minutes and let me know if you got a stuck connection. If not I will try to dive deeper and see if I can find the cause or something more helpful.

@krajj7
Copy link
Author

krajj7 commented Jul 9, 2023

I'll also add that besides the example project and chromium, I also observed this problem in my larger app which uses http-kit, the connections were non-localhost and it happens with Firefox as well.

@ptaoussanis
Copy link
Member

@krajj7 Hi Jan, thanks for the additional info. It's helpful to know if it's intermittent and/or seems to involve some kind of timing issue.

That indicates a different kind of problem than if it's consistently reproducible.

Will investigate further and come back to you 👍

@ptaoussanis
Copy link
Member

@krajj7 Really struggling to reproduce this on my end. And have gone over the relevant code a few times now without spotting any obvious explanation so far.

I'm sorry to prod on this, but would you please humour me and just explicitly confirm that you've run lein clean and confirmed that you're seeing [1 18 1] when you check sente/version?

1.18.0 had a known issue with connected-uids_ not being properly cleared after disconnects, and I just know from experience that it's super easy to unexpectedly end up with stale artifacts.

I need to get some sleep, but will continue digging first thing in the morning - I might be missing something obvious atm.

Apologies for all the trouble on this, and appreciate your assistance debugging!

Cheers

@ptaoussanis
Copy link
Member

Quick update: while I still haven't been able to successfully reproduce on my end yet, I have a couple ideas for some of the logic that could be made more robust. Will prepare a debug build with some improvements and extra logging and come back to you.

@krajj7
Copy link
Author

krajj7 commented Jul 10, 2023

I'm sorry to hear you weren't able to reproduce it. My repro instructions must still be missing something crucial, but I don't know what it is.

I can confirm I get [1 18 1] when I run sente/sente-version. I don't have 1.18.0 on my system at all. I managed to reproduce this after doing "lein clean" and also from a completely fresh checkout of the repository.

I'll try to look "under the hood" a bit more and see if I can discover something.

@ptaoussanis
Copy link
Member

I can confirm I get [1 18 1] when I run sente/sente-version. I don't have 1.18.0 on my system at all. I managed to reproduce this after doing "lein clean" and also from a completely fresh checkout of the repository.

Thanks Jan 🙏

I'll try to look "under the hood" a bit more and see if I can discover something.

I think I've spotted a possible cause of the trouble you're seeing, trying to confirm - will come back to you shortly.

ptaoussanis added a commit that referenced this issue Jul 10, 2023
Possible fix for [#431]?

Before this commit:
  It was possible for a `conns_` CAS update to fail on conn close,
  *while still updating* the entry's udt - leading to the responsible
  gc loop failing its udt check.

After this commit:
  CAS updates to `conns_` are properly atomic.
  Either sch and udt are both updated, or neither is.
ptaoussanis added a commit that referenced this issue Jul 10, 2023
@ptaoussanis
Copy link
Member

When you get an opportunity, would you please try [com.taoensso/sente "1.18.2-SNAPSHOT"], now on Clojars?

The dev reference project uses this version.

My hope is that the problem should be fixed here, but if it's not - could you please share the server-side logging output from that reference project?

Thanks so much for all your time+effort on this! Really sorry about all the trouble.

@krajj7
Copy link
Author

krajj7 commented Jul 10, 2023

Thanks so much for all your time+effort on this! Really sorry about all the trouble.

No problem at all, I'm glad to assist.

I tried "1.18.2-SNAPSHOT" and managed to trigger the issue.

The log is here: https://github.com/ptaoussanis/sente/files/12002682/bug.log
(the problematic session starts at 2023-07-10T13:06:11.290Z, I accidentally included the previous one as well)

This is roughly what I did:

  • Started server, opened example page, refreshed to get WS connection and logged in as "a"
  • Tried to get the connection stuck by freezing the browser, sending :chsk/close to "a", unfreezing
  • Logged in as "b", then back again as "a"
  • Noticed that "b" got stuck, closed browser
  • Waited two minutes, terminated the app

A few things to note:

  • I was trying to get "a" stuck, but in the end it was "b" that got stuck without ever sending :chsk/close to it (only sent that to "a"), so :chsk/close may not be important
  • This appears in the log around the time when I closed the browser for good (actually that may have been later, or possibly I restarted the browser, unfortunately I don't remember anymore):
    ws gc loop end {:uid "b", :gid "9483"}
    But as the logs show "b" still seems to be connected and being sent messages.
  • There also seems to be a stuck :sente/nil-uid AJAX connection that remains after the browser is closed, I'm not sure where exactly it came from.

Hope this helps. I'll be happy to do more experiments if needed.

ptaoussanis added a commit that referenced this issue Jul 10, 2023
The new implementation is more explicit, and includes a fix to
at least one bug in the previous implementation: it was previously
possible for a `conns_` CAS update to fail on conn close,
*while still updating* the entry's udt - leading to the responsible
gc loop failing its udt check.
ptaoussanis added a commit that referenced this issue Jul 10, 2023
@ptaoussanis
Copy link
Member

Hope this helps. I'll be happy to do more experiments if needed.

That did help, thanks! And much appreciated 🙏

Would you please try again with the current (updated) [com.taoensso/sente "1.18.2-SNAPSHOT"]?

@krajj7
Copy link
Author

krajj7 commented Jul 10, 2023

The new example project does (set-min-log-level! :report) which inhibits logging (I couldn't see what port the server used to open the page), so I changed that to (set-min-log-level! :trace)

Anyway the issue seems to be much easier to trigger now. I get stuck connections just by re-logging in with different user-ids, without doing anything else.

I made two logs, the first is really short:
https://github.com/ptaoussanis/sente/files/12004962/1.18.1-2_test1.log

  1. opened example page, logged in as "a"
  2. froze browser
  3. sent (chsk-send! "a" [:chsk/close])
  4. noticed connection not going away
  5. killed browser
  6. connection "a" is still there after a few minutes

Second log:
https://github.com/ptaoussanis/sente/files/12004969/1.18.1-2_test2.log

  1. opened example page, logged in as "1"
  2. logged in as "2"
  3. logged in as "3"
  4. logged in as "4"
  5. logged in as "5"
  6. closed browser
  7. there are two stuck connections

ptaoussanis added a commit that referenced this issue Jul 10, 2023
The new implementation is more explicit, and includes a fix to
at least one bug in the previous implementation: it was previously
possible for a `conns_` CAS update to fail on conn close,
*while still updating* the entry's udt - leading to the responsible
gc loop failing its udt check.
@ptaoussanis
Copy link
Member

Finally managed to see this on my end! For some reason took dozens of attempts on my system, still don't know why we've had such different experiences with this.

Anyway, it looks like the lingering connections were caused by the connection's :on-close event somehow firing before handshake. I didn't anticipate that as a possibility, so the connection logic was getting into an unexpected state and showing up as "immortal" as you put it.

I'll get the logic reworked and push another update.

Will then ask for your patience to test once again if possible, since it's been so difficult for me to actually generate the necessary conditions on my system.

ptaoussanis added a commit that referenced this issue Jul 11, 2023
Incl. notably some additional debugging tools for [#431]:

  - `connected-uids_` and `conns_` now both printed in loop
  - New buttons:
    - Print connected uids
    - Test repeated logins
ptaoussanis added a commit that referenced this issue Jul 11, 2023
Improvements include:

  - New internal "conn-id" concept that:
    1. No longer depends on http server implementations to
       properly implement identity
    2. Greatly improves logging output, easing debugging

  - General logging improvements to ease debugging
  - Now expose internal `conns_` state to ease debugging

  - Added server-side ping timeout to match client side,
    and to catch unexpected cases where http server is
    never able to identify a connection as broken.

  - Simplified internal API for updating `conns_` state.

  - More robust handling when events fire in unexpected
    order (e.g. :on-close firing before :on-open handshake).

  - Generally improved clarity and robustness.

Note: some relevant additions will also be made to the
reference example project to aid debugging related to
ptaoussanis added a commit that referenced this issue Jul 11, 2023
Improvements include:

  - New internal "conn-id" concept that:
    1. No longer depends on http server implementations to
       properly implement identity
    2. Greatly improves logging output, easing debugging

  - General logging improvements to ease debugging
  - Now expose internal `conns_` state to ease debugging

  - Added server-side ping timeout to match client side,
    and to catch unexpected cases where http server is
    never able to identify a connection as broken.

  - Simplified internal API for updating `conns_` state.

  - More robust handling when events fire in unexpected
    order (e.g. :on-close firing before :on-open handshake).

  - Generally improved clarity and robustness.

Note: some relevant additions will also be made to the
reference example project to aid debugging related to
@ptaoussanis
Copy link
Member

Okay, [com.taoensso/sente "1.18.2-alpha2"] is up on Clojars and includes a significant reworking of the connection management code.

Using this version, I seem to be unable to produce any lingering dead connections on my system. But since I also struggled with the previous version, it would be great to get confirmation from you if possible.

I've added some additional relevant debugging tools to the reference example on master, incl.:

  • connected-uids_ and conns_ are now both printed in the server loop every 10 secs
  • New buttons have been added:
    • Print connected uids
    • Test repeated logins

Hopefully everything works on your end.
As soon as you can get me a confirmation, I'll cut a stable point release.

@krajj7
Copy link
Author

krajj7 commented Jul 11, 2023

Tried 1.18.2-alpha2, made sure to lein clean etc., but unfortunately I am still seeing the problem. I got some websocket connections stuck as well as an AJAX one.

Here is a log that ends with two stuck websocket connections:
1.18.2-alpha2_ws.log

Here is a log that ends with one AJAX connection stuck:
1.18.2-alpha2_ajax.log

@ptaoussanis
Copy link
Member

Wow, that's so crazy ^^

Will go through your logs in a moment.

In the meantime- is there anything about your system that's unusual? It would be so helpful if I could reproduce this better on my end.

What browser and OS are you on? Are you running any browser plugins or other software that could be doing network conditioning? Are you changing any options or software in the reference example?

You mentioned earlier that you need to add Guava for your JVM to work - what JVM is that?

To clarify: this behaviour definitely shouldn't happen in any environment, it's a bug in Sente if this can occur. I just feel bad repeatedly wasting your time with updates that don't actually solve the problem.

Might be helpful if I try replicate your environment if possible.

@krajj7
Copy link
Author

krajj7 commented Jul 11, 2023

I don't think there's anything unusual about my setup. I'm not changing anything in the project except for logging (added file logger, :trace by default) and the guava lib. I don't use any unusual networking software or plugins.

HW: Lenovo Yoga Slim 7 15ITL05 laptop
OS: Debian testing
JVM: Java 17.0.8-ea OpenJDK 64-Bit Server VM (Debian packaged)
Browsers where I reproduced the problem (with the newest alpha):

  • Chromium Version 114.0.5735.198 (Official Build) with 20 various extensions
  • same but with an empty --temp-profile (no extensions)
  • Firefox 102.13.0esr without extensions

I don't mind continuing testing new versions as needed. I can see that it would be much better if you could reproduce the issue yourself, but I don't know what could help with that.

@ptaoussanis
Copy link
Member

👍 And just on the off-chance that it's relevant, why is Guava necessary in your environment?

@krajj7
Copy link
Author

krajj7 commented Jul 11, 2023

I think it's some dependency conflict. The server runs without the guava dep, but when I try to compile the javascript I get this error:

Execution error (NoSuchMethodError) at com.google.javascript.jscomp.deps.ModuleLoader/createRootPaths (ModuleLoader.java:257).
'java.util.stream.Collector com.google.common.collect.ImmutableSortedSet.toImmutableSortedSet(java.util.Comparator)'

By default [com.google.guava/guava "20.0"] is used, adding an explicit [com.google.guava/guava "25.1-jre"] dependency makes it work.

@ptaoussanis
Copy link
Member

ptaoussanis commented Jul 11, 2023

So to confirm: if you just clone the Sente repo without making any changes, go to the example project's path - and execute lein start - you see the NoSuchMethodError above?

Edit: if so, can you share the contents of your ~/.lein/profiles.clj?

@krajj7
Copy link
Author

krajj7 commented Jul 11, 2023

So to confirm: if you just clone the Sente repo without making any changes, go to the example project's path - and execute lein start - you see the NoSuchMethodError above?

Yes.

Edit: if so, can you share the contents of your ~/.lein/profiles.clj?

I do have a bunch of things in ~/.lein/profiles.clj:

{:repl {:plugins [ [cider/cider-nrepl "0.31.0"]
                  [lein-cljfmt "0.9.2"] ] }
 :user
 {:plugins [[lein-kibit "0.1.8"]
            [lein-cloverage "1.2.4"]
            [jonase/eastwood "1.4.0"]
            [lein-localrepo "0.5.4"]
            [lein-ancient "0.7.0"]
            [lein-nvd "2.0.0"] ]
   :dependencies [[inspector-jay "0.3"]
                  [cljfmt "0.9.2"]
                  [fipp "0.6.26"] ; cljfmt
                  [com.cemerick/pomegranate "1.1.0"]
                  [org.clojure/tools.namespace "1.4.4"]
                  [criterium "0.4.6"] ]
    :repl-options {:timeout 5184000 }}}

ptaoussanis added a commit that referenced this issue Jul 12, 2023
…ids`

Big thanks to @krajj7 for the report and huge assistance debugging!

As part of the investigation into this issue, and due to another recent
related issue (#429), I decided that we were overdue for a refactor
of Sente's connection management system.

The old system had grown overly complex, and left too much room for
edge cases and timing issues.

This commit introduces a major refactor of the system, with an
emphasis on robustness and improved observability.

Specific improvements include:

  - New internal "conn-id" concept that:
    1. No longer depends on http server implementations to
       properly implement identity
    2. Greatly improves logging output, easing debugging

  - General logging improvements to ease debugging
  - Now expose internal `conns_` state to ease debugging

  - Added server-side ping timeout to match client side,
    and to catch unexpected cases where http server is
    never able to identify a connection as broken.

    Note that this new feature is currently opt-in[1], but
    will be enabled by default in a future release.

  - Simplified internal API for updating `conns_` state.

  - More robust handling when events fire in unexpected
    order (e.g. :on-close firing before :on-open handshake).

  - Generally improved clarity and robustness.

Note: some related additions will also be made to the
reference example project in another commit.

[1] Provide a value (e.g. 5000) for the new `:ws-ping-timeout-ms`
    option to `make-channel-socket-server!`
ptaoussanis added a commit that referenced this issue Jul 12, 2023
Incl. notably some additional debugging tools for [#431]:

  - `connected-uids_` and `conns_` now both printed in loop
  - New buttons:
    - Print connected uids
    - Test repeated logins
@ptaoussanis
Copy link
Member

ptaoussanis commented Jul 12, 2023

Hi Jan- thanks for the additional info. Indeed, nothing sticks out as obviously relevant 👍

I've just pushed [com.taoensso/sente "1.18.2-alpha3"] to Clojars.

This includes a rewrite of the last code that I believe could have been causing problems.

Could you please try this and let me know? 🙏

I've also updated the reference example again to include some additional debugging tools - notably a toggle to simulate unreliable but unbroken connections.

@krajj7
Copy link
Author

krajj7 commented Jul 12, 2023

Hi Peter,

So far I haven't been able to trigger the issue with 1.18.2-alpha3, so that seems promising 👍

I'm going to try a longer experiment to confirm. I deployed the new alpha to my app where I originally noticed the problem and will script periodic browser freezes/restarts to happen during the night. That should test a lot of reconnects.

So I'll report tomorrow, unless I notice any problem sooner.

@ptaoussanis
Copy link
Member

I'm going to try a longer experiment to confirm. [...] So I'll report tomorrow, unless I notice any problem sooner.

Excellent, thank you so much! I'll wait on word from you then to push 1.18.2.

@krajj7
Copy link
Author

krajj7 commented Jul 13, 2023

I'm happy to report that my experiment didn't reveal any problems 👍

I tested re/connections to my app as well as the example project at the same time, more than 1800 sessions in total with random user-ids and none of them got stuck.

Thank you for working on this.

@ptaoussanis
Copy link
Member

That's great news, thanks so much for all your patience and assistance with debugging this!

ptaoussanis added a commit that referenced this issue Jul 13, 2023
…ids`

Big thanks to @krajj7 for the report and huge assistance debugging!

As part of the investigation into this issue, and due to another recent
related issue (#429), I decided that we were overdue for a refactor
of Sente's connection management system.

The old system had grown overly complex, and left too much room for
edge cases and timing issues.

This commit introduces a major refactor of the system, with an
emphasis on robustness and improved observability.

Specific improvements include:

  - New internal "conn-id" concept that:
    1. No longer depends on http server implementations to
       properly implement identity
    2. Greatly improves logging output, easing debugging

  - General logging improvements to ease debugging
  - Now expose internal `conns_` state to ease debugging

  - Added server-side ping timeout to match client side,
    and to catch unexpected cases where http server is
    never able to identify a connection as broken.

    Note that this new feature is currently opt-in[1], but
    will be enabled by default in a future release.

  - Simplified internal API for updating `conns_` state.

  - More robust handling when events fire in unexpected
    order (e.g. :on-close firing before :on-open handshake).

  - Generally improved clarity and robustness.

Note: some related additions will also be made to the
reference example project in another commit.

[1] Provide a value (e.g. 5000) for the new `:ws-ping-timeout-ms`
    option to `make-channel-socket-server!`
ptaoussanis added a commit that referenced this issue Jul 13, 2023
Incl. notably some additional debugging tools for [#431]:

  - `connected-uids_` and `conns_` now both printed in loop
  - New buttons:
    - Print connected uids
    - Test repeated logins
@ptaoussanis
Copy link
Member

[com.taoensso/sente "1.19.0"] is now up on Clojars, release notes here 👍

Note that because of the amount of code touched, I decided in hindsight to be cautious and release this work as part of a bigger version bump (1.19 instead of 1.18.2).

Closing for now, but please feel free to re-open if you do encounter any more related trouble.

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants