UI causes browser to run the system out of memory #5414

Closed
rbasralian opened this issue Apr 19, 2024 · 5 comments · Fixed by #5501

@rbasralian
Contributor

Something about the UI apparently causes Safari to use almost-inconceivable amounts of memory (the DHC UI is the only thing I had open in Safari here). It seems related to either stopping/restarting the DHC container or letting my computer go to sleep and wake back up (or maybe both). Given enough time, it causes my Mac to pop up a "your system is out of memory" dialog.

[screenshot: Safari memory usage]

Whatever causes this also causes the server to repeatedly log "No AuthenticationRequestHandler registered for type Bearer":

[screenshot: repeated "No AuthenticationRequestHandler registered for type Bearer" log messages]

Here is a server-side thread dump from when this was occurring: thread_dump.txt

Version info (this is a locally-built image off of commit 61ae61b6):

Engine Version: 0.34.0-SNAPSHOT
Web UI Version: 0.69.1
Java Version: 21.0.2
Barrage Version: 0.6.0
Browser Name: Safari 17.4
OS Name: macOS 10.15.7

I don't have browser logs from when this was occurring but will try to get them if it happens again.

rbasralian added the bug and triage labels Apr 19, 2024
@mattrunyon
Contributor

Did you have anything running in your UI?

@dsmmcken
Contributor

That's a lot of no auth request handler errors happening extremely fast, looking at the timestamps.

@niloc132
Member

The JS API is designed to make it impossible to accidentally queue up many requests like this on its own - until an auth refresh succeeds or fails, the next one won't be passed to setTimeout. Is it possible that something in the UI is using setInterval or the like, which might cause this when the tab resumes and dozens/hundreds of enqueued callbacks run to completion?

For what it's worth, the thread dump doesn't show any activity - and specifically nothing that shows a grpc call is presently being handled to refresh a token.
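
For illustration, here is a minimal TypeScript sketch of the scheduling pattern described above, in which the next refresh is only handed to setTimeout after the current attempt settles. The names refreshAuthToken and REFRESH_INTERVAL_MS are hypothetical, not the JS API's actual identifiers:

    // Sketch only - not the actual JS API code. At most one auth refresh is
    // ever pending, because the next timer is only set once the current
    // attempt succeeds or fails.
    const REFRESH_INTERVAL_MS = 10_000;

    // Hypothetical stand-in for the real token-refresh gRPC call.
    async function refreshAuthToken(): Promise<void> {
      // ...issue the refresh request and store the new token...
    }

    function scheduleNextRefresh(): void {
      setTimeout(() => {
        refreshAuthToken()
          .catch((err) => console.error('auth refresh failed', err))
          .finally(() => scheduleNextRefresh()); // chain only after settling
      }, REFRESH_INTERVAL_MS);
    }

    scheduleNextRefresh();

A setInterval-based refresh, by contrast, keeps firing on its own schedule regardless of whether the previous attempt has completed, which is the kind of pattern the comment above asks about.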

@niloc132
Member

We've found a way this could happen in the JS API: it requires a table to fail before the disconnect, and the disconnect has to be long enough for the auth token to expire. The client then gets into a bad state, trying to reconnect the subscription for the failed table in a microtask "loop", resulting in rapid calls to the server ... which then each fail with auth issues.

I'll move this to deephaven-core and handle it there.
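
As a rough illustration of the loop (hypothetical names, not the actual JS API code), the reconnect path effectively did something like this:

    // Sketch only. Because the retry is queued as a microtask rather than via
    // setTimeout, the browser never yields between attempts, and each attempt
    // also sends another request to the server.
    function resubscribe(trySubscribe: () => boolean): void {
      if (!trySubscribe()) {
        // Once the table has failed, trySubscribe() can never succeed, yet a
        // new attempt is queued immediately - effectively an infinite loop.
        queueMicrotask(() => resubscribe(trySubscribe));
      }
    }

The fix described in the commit below breaks the loop by checking the table's state before attempting to subscribe or refetch.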

niloc132 transferred this issue from deephaven/web-client-ui Apr 25, 2024
niloc132 assigned niloc132 and unassigned mattrunyon Apr 25, 2024
niloc132 added the jsapi label Apr 25, 2024
rcaudy added this to the 2. April 2024 milestone Apr 26, 2024
@niloc132
Member

Here's a simple set of steps to reproduce locally, simulating a network issue:

  • Start deephaven-core on one port, assuming 10000 for now
  • Forward another port to 10000 so that the app is available on a second port - I used ssh localhost -L 10001:localhost:10000 to also be able to connect to the app on 10001
  • Connect to the web IDE on port 10001 in a browser, http://localhost:10001
  • Start a command for a table that is doomed to fail, for example:
    from deephaven import time_table
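    # I is null when i == 15, so I.split(`,`) throws and the table fails after ~15 seconds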
    t = time_table("PT1s").update(["I = i == 15 ? null : `` + i", "J = I.split(`,`)"])
  • Wait for the table to load, then error out
  • End the SSH tunnel. Usually this involves ctrl-d to end the shell, then ctrl-c to ask ssh to disconnect without waiting for all forwarded ports to finish
  • Wait for the web IDE to acknowledge the disconnect, usually just a few seconds
  • Reconnect the ssh tunnel promptly - losing the auth token isn't required, as I said it was above (that is only needed to produce the repeated "No auth handler for type Bearer" errors)

Expected: the table stays broken, but the rest of the page keeps working.
Actual: the page freezes, with a constant stream of error messages on the server.

Note that this might not be quite the same as what @rbasralian originally found, but it does have many of the same characteristics - fixing this will hopefully prevent the constant reconnect loop originally seen (but not yet reproduced).

niloc132 added a commit that referenced this issue May 17, 2024
DHC reconnects are able to restore server streams to an existing session, but
most of the JS API was written to assume that a lost connection requires
rebuilding objects on the server by replaying operations. This fix handles the
case where a table failed and then a network error occurred, leaving the table
stuck and unable to reconnect because it had already failed.

Two bugs prevented this from working; in both, after some operation couldn't
be scheduled, a microtask would immediately try again, effectively producing
an infinite loop in the browser. Table subscriptions are fixed by first
checking whether the table is running and so can be subscribed. Table refetch
is fixed by using null for its fetcher: during a refetch, if the fetcher is
null, either fail right away with the existing failure message or succeed
right away.

This fix currently makes it possible for a failed table on a reconnected worker
to not signal that it is still failed - this will be addressed in a follow-up.

Fixes #5414
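
A hedged TypeScript sketch of the refetch guard described in the commit message above; the names (TableStub, Fetcher, failMessage) are illustrative, not the actual JS API identifiers:

    // Sketch only - illustrates "if the fetcher is null, fail or succeed right
    // away" instead of queueing another microtask retry.
    type Fetcher = () => Promise<void>;

    class TableStub {
      constructor(
        private fetcher: Fetcher | null,    // null: nothing to replay on reconnect
        private failMessage: string | null, // set once the table has failed
      ) {}

      async refetch(): Promise<void> {
        if (this.fetcher == null) {
          if (this.failMessage != null) {
            // Fail right away with the existing failure message.
            throw new Error(this.failMessage);
          }
          // Succeed right away - there is nothing to rebuild on the server.
          return;
        }
        await this.fetcher();
      }
    }

The subscription side is analogous: check whether the table is in a running state before subscribing, rather than retrying in a microtask.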
stanbrub pushed a commit that referenced this issue May 17, 2024