Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

White screen after socket reconnect #4253

Open
chriswi93 opened this issue Jan 22, 2025 · 10 comments · May be fixed by #4271
Open

White screen after socket reconnect #4253

chriswi93 opened this issue Jan 22, 2025 · 10 comments · May be fixed by #4271

Comments

@chriswi93
Copy link

chriswi93 commented Jan 22, 2025

Description

Hello,

at first I would like to take this opportunity to praise this great project. I think NiceGUI is a really great package for quickly developing a user interface with Python. Many thanks to the developers of this project!

However, we have been observing an issue with NiceGUI for some time now and we could not find a way to fix it. We hope someone here can help to fix this bug as we don't know what is going on under the hood of NiceGUI. The issue is quite critical for us and we have already put a lot of effort into debugging the issue, but have had little success so far. According to our observations, the issue is not limited to the latest versions and also occurs in earlier versions < 2.0.

The issue is quite difficult to debug. Below I will try to describe my observations as best as I can, but unfortunately I can't provide a minimal reproducible code example. It seems like every NiceGUI application is affected by this issue, even very simple applications, and so far I have not found a way to actually reproduce it.

Description

We have noticed for some time now that the socket connection of our NiceGUI application in the browser is regularly closed from time to time. Afterwards the entire content of the application in the browser is deleted and a white screen is shown to the user. Contrary to our expectations, a reload of the page is not triggered. Sometimes this happens after ten minutes, sometimes after three hours or later:

Image

There is no clear pattern that could indicate what is causing the issue. It is unclear why the connection is closed, as there does not appear to be an actual network problem. As you can see in the image, a new socket connection is successfully established immediately after the initial socket connection is closed. However, after the second socket connection is established, the server sends a “clear screen” instruction to the browser:

42["update",{"0":null,"1":null,"2":null,"3":null,"4":null,"5":null,"6":null,"7":null,"8":null,"9":null,"10":null,"11":null,"12":null,"13":null,"14":null,"15":null,"16":null,"17":null,"18":null,"19":null,"20":null,"21":null,"22":null,"23":null,"24":null,"25":null,"26":null,"27":null,"28":null,"29":null,"30":null,"31":null,"32":null,"33":null,"34":null,"35":null,"36":null,"37":null,"38":null,"39":null,"40":null,"41":null,"42":null,"43":null,"44":null,"45":null,"46":null,"47":null,"48":null,"49":null,"50":null,"51":null,"52":null,"53":null,"54":null,"55":null,"56":null,"57":null,"58":null,"59":null,"60":null,"61":null,"62":null,"63":null,"64":null,"65":null,"66":null,"67":null,"68":null,"69":null,"70":null,"71":null,"72":null,"73":null,"74":null,"75":null,"76":null,"77":null,"78":null,"79":null,"80":null,"81":null,"82":null,"83":null,"84":null,"85":null,"86":null,"87":null,"88":null,"89":null,"90":null,"91":null,"92":null,"93":null,"94":null,"95":null,"96":null,"97":null,"98":null,"99":null,"100":null,"101":null,"102":null,"103":null,"104":null,"105":null,"106":null,"107":null,"108":null,"109":null,"110":null,"111":null,"112":null,"113":null,"114":null,"115":null,"116":null,"117":null,"118":null,"119":null,"120":null,"121":null,"122":null,"123":null,"124":null,"125":null,"126":null,"127":null,"128":null,"129":null,"130":null,"131":null,"132":null,"133":null,"134":null,"135":null,"136":null,"137":null,"138":null,"139":null,"140":null,"141":null,"142":null,"143":null,"144":null,"145":null,"146":null,"147":null,"148":null,"149":null,"150":null,"151":null,"152":null,"153":null,"154":null,"155":null,"156":null,"157":null,"158":null,"159":null,"160":null,"161":null,"162":null,"163":null,"164":null,"165":null,"166":null,"167":null,"168":null,"169":null,"170":null,"171":null,"172":null,"173":null,"174":null,"175":null,"176":null,"177":null,"180":null,"181":null,"_id":10}]

This instruction causes the browser to delete the entire content of the application. Afterwards the content of the “app” div element in the browser is empty:

<div id="app" data-v-app=""><!----></div>

I tried to debug the issue by filtering out the “clear screen” instruction sent by the NiceGUI server:

socket_emit = socketio.async_server.AsyncServer.emit

async def socket_emit_wrapper(self, event, data=None, **kwargs):
    """Socket outbound traffic"""
    if data and event == "update":
        if isinstance(data, dict):
            if len(data) > 0 and all([val is None for key, val in data.items() if key != "_id"]):
                LOGGER.error("detected clear screen operation was not sent to client")
                return
    await socket_emit(self, event, data=data, **kwargs)

socketio.async_server.AsyncServer.emit = socket_emit_wrapper

Subsequently, it is indeed the case that the content of the page is no longer deleted after the initial socket connection was closed and a new socket connection ist established. However, all elements in the user interface are frozen and do not respond to new user inputs anymore. In the background it can be observed that the client sends messages to the server via the newly established socket connection, but the server does not respond.

Important: The socket connection is established, and ping events are exchanged between the client and server, but the NiceGUI server does no longer respond to events happening in the browser. I would guess that the connection with socketio was successfully established, but there is any issue such that NiceGUI does not recognize that there is a new connection with the same client id and this connection is still alive.

Test Case

The initial scenario is that I have several tabs open in my browser and I am waiting until the issue occurs. Let's take a look at what happens in the browser (example for the client ID 5d29b567-fbfc-4e3c-9895-93670ca6e1b9):

Initial Socket Connection

10:50:31.217 - first socket message received from server
...
11:56:56.458 - last socket message received from server (ping)
-> Socket connection is closed

Before the socket connection is closed, everything works as expected:
Image

New socket connection established after 67 minutes

11:57:05.103 - first socket message received from server
11:57:19.297 - update event with "null" message content received from server (causes a white screen for the user)

As you can see, a new socket connection is established just a few seconds after the first socket connection was closed. After 15 seconds, a “clear screen” instruction is then sent from the server to the client, which deletes the content of the entire application in the browser:
Image

Now let's take a look at what has happened on the server in the meantime:

10:50:30,886 - connected client id 5d29b567-fbfc-4e3c-9895-93670ca6e1b9: created 2 seconds ago
11:57:04,793 - connected client id 5d29b567-fbfc-4e3c-9895-93670ca6e1b9: created 3996 seconds ago
11:57:18,866 - disconnected client id 5d29b567-fbfc-4e3c-9895-93670ca6e1b9: created 4010 seconds ago

As you can see, a new socket connection for the same client-id is established after 67 minutes (11:57:04,793), without there having been a disconnect event beforehand. 15 seconds later, the socket connection is closed unexpectedly (11:57:18,866) and shortly before the “clear screen” instruction arrives at the browser (11:57:19.297). It is quite strange that a disconnect event for the client id happens on the server, but the newly established connection in the browser is still alive.

It is also interesting to mention that the same issue occurs almost simultaneously with another client id (11:57:02,799), but only after 82 minutes the client was created:

10:35:27,789 - connected client id 82576291-c9c1-4195-a4bc-03cdad303183: created 6 seconds ago
11:57:02,799 - connected client id 82576291-c9c1-4195-a4bc-03cdad303183: created 4901 seconds ago
11:57:18,466 - disconnected client id 82576291-c9c1-4195-a4bc-03cdad303183: created 4916 seconds ago

Steps to Reproduce

The issue occurs with every NiceGUI application I have deployed so far in our Azure Kubernetes environment, even for very simple applications. We have not yet found a way to reproduce the issue. The only option we have at the moment is to open the application in the browser and wait for the issue to occur.

We have tried the following to reproduce the issue:
1) Run the application locally
So far the issue has not yet occurred, but we will continue to test it over a longer period of time.

2) Chrome Dev Tools - enable offline mode
The popup "Connection is lost - Trying to Reconnect" is shown as expected. After disabling the offline mode, a new socket connection is established and everything works as expected.

I would guess the issue occurs because the socket connection is closed in the browser without there being an actual connection problem. We don't know why the connection is closed at all, but something must be different compared to 2)

Conclusions

  1. We found out that the “clear screen” instruction is always sent from the server to the client after a disconnect event. What is the reason for this?

  2. My guess about the issue is that the NiceGUI server only recognizes much later than the browser that the first connection has been closed and a disconnect event occurs with a delay of 15 seconds in the example above. In the meantime, however, a new connection already exists for the same client ID and, just as it usually happens after every disconnect, a “clear screen” instruction is sent to the client. However, this “clear screen” instruction is received by the newly established connection and not by the already closed connection. This results in the user seeing a white screen in the browser. It seems like the messages sent to the socket connections are mixed up and reconnection does not work as expected.

  3. After that NiceGUI does not seem to recognize the client ID anymore, although socket.io still keeps a connection open in the background. In the browser I can see that a socket connection has been established and it exchanges ping events with the server, but the NiceGUI server no longer responds to user input (for example, a click on a button).

A symptom of the issue might be the following error message that appears from time to time, although I cannot say for sure whether the error is related to the issue:

12:42:25,079 - nicegui:123 - ERROR - '910d9914-1c92-437c-9cbd-f927b288cdee'
Traceback (most recent call last):
  File "/opt/conda/envs/.../lib/python3.10/site-packages/nicegui/background_tasks.py", line 52, in _handle_task_result
    task.result()
  File "/opt/conda/envs/.../lib/python3.10/site-packages/nicegui/client.py", line 256, in handle_disconnect
    self.delete()
  File "/opt/conda/envs/.../lib/python3.10/site-packages/nicegui/client.py", line 314, in delete
    del Client.instances[self.id]
KeyError: '910d9914-1c92-437c-9cbd-f927b288cdee'

The error has already been reported here and apparently has not been fixed yet: 2915

Environment

Azure Kubernetes (Ingress Nginx), Docker Container
NiceGUI 2.9.1 (Python 3.10), but also observed with NiceGUI < 2.0
Tested Browsers: Edge, Chrome, Firefox

Socket Config

Observed in Browser:

pingInterval: 8000
pingTimeout: 4000

NiceGUI:

ui.run(..., reconnect_timeout=10.0)
@rodja
Copy link
Member

rodja commented Jan 23, 2025

Thank you so much for this in-depth issue description @chriswi93. We noticed this strange "white screen after idle time" but have not been able to pinpoint it yet. Thanks to your efforts, we now have very promising lead:

The "clear event" seems to be caused by

self.remove_all_elements()

inside of client.delete which is called whenever we do have a disconnect:

nicegui/nicegui/client.py

Lines 250 to 260 in 5032ca4

def handle_disconnect(self) -> None:
"""Wait for the browser to reconnect; invoke disconnect handlers if it doesn't."""
async def handle_disconnect() -> None:
await asyncio.sleep(self.page.resolve_reconnect_timeout())
for t in self.disconnect_handlers:
self.safe_invoke(t)
for t in core.app._disconnect_handlers: # pylint: disable=protected-access
self.safe_invoke(t)
if not self.shared:
self.delete()
self._disconnect_task = background_tasks.create(handle_disconnect())

The code is there to clean up resources which are not longer needed (because the client disconnected). But we missed the fact that a socket reconnect will also trigger a disconnect on the old connection. In such a case the resources are still needed and we should not clear them.

@falkoschindler
Copy link
Contributor

Very interesting! Yesterday we noticed this white screen problem on nicegui.io and tried reverting individual features. We thought the problem would have started with version 2.10, but today we could reproduce it with 2.9.1 as well. And apparently the problem also occurs with older NiceGUI versions, like in our field friend project.

We noticed another important detail: When one page is blank, other pages keep working normally. And it seems like only the index page is affected. This is important for testing because https://nicegui.io/documentation/ can still be reachable while https://nicegui.io/ returns a white page.

@chriswi93
Copy link
Author

@rodja Would be great if it helps to fix it!
However, at the moment I can not see how to fix the issue. Maybe something like a connection count for each client would make sense and elements for a client are only deleted if there are zero connections left. But in that case we should know for sure that every new connection ends in a disconnect event or at least is garbage collected after a while. At the moment I can not see a disconnect event happening directly after closing the "white screen" tab, but maybe this is the case because NiceGUI is not aware that socketio still holds a connection in the background.

@falkoschindler I am not sure if this is actually the same issue. We have other non index pages that suffer from the same behavior. I also tested to change @ui.page('/') to @ui.page('/index') and the same issue occurs after while. I don't think it is limited to the index page.

@falkoschindler
Copy link
Contributor

@chriswi93 You may be right and there are two similar but different bugs at play. Let's focus on the white page after a short reconnect.

I think @rodja is right and the disconnect handler is the only place which could clear the whole page once a connection has successfully been established. The docstring says:

Wait for the browser to reconnect; invoke disconnect handlers if it doesn't.

I don't understand the part "if it doesn't [reconnect]". There doesn't seem to be code for checking if the disconnect has been canceled.

As far as I can tell, the reconnect logic has been added in 0c3e522. It might have never worked correctly for reconnecting clients.

On the other hand, if the disconnect handler is utterly broken, why can't we reproduce the white page by toggling the network connection via the developer console? The UI seems to reconnect correctly. Especially PR #3199 wouldn't have been possible if clients always clear their content during a reconnect.

Is it a meticulous timing issue? Or does a reconnecting socket trigger the disconnect handler only under certain conditions?

@rodja
Copy link
Member

rodja commented Jan 25, 2025

Wait for the browser to reconnect; invoke disconnect handlers if it doesn't.

@falkoschindler the normal "disconnect" flow from the server perspective is like this: socketio becomes a disconnect event and calls our handle_disconnect function. There we start a cleanup task which is delayed until the reconnect timeout expires. If a new connection is established in the meantime, the handle_handshake function cancels the cleanup task and all is good.
The issue here is that somehow on hosted servers (probably having a reverse proxy) the connection is severed in a way that the browser establishes a new connection before the server becomes the disconnect event. Hence the cleanup task is not canceled.
Why can't we reproduce it locally? It only happens with remote servers after a long idle time in the tab. Maybe there are some reverse proxy behaviours which stop a client websocket connection but still holds on a bit for the server?

@falkoschindler
Copy link
Contributor

@rodja Ah I see! I didn't think about self._disconnect_task.
What if we get two disconnect events before the next handle_handshake is called? The _disconnect_task is overwritten and one of them won't be canceled. Two disconnects might be possible if await sio.enter_room(sid, client.id) takes some time and client.handle_handshake() is only called afterwards.

@falkoschindler falkoschindler linked a pull request Jan 25, 2025 that will close this issue
@falkoschindler
Copy link
Contributor

@chriswi93 I just created PR #4271, drafting an idea how to make sure all disconnect tasks get canceled as soon as a handshake succeeds. Can you, please, try if this branch still shows white pages from time to time? Thanks!

@falkoschindler falkoschindler linked a pull request Jan 25, 2025 that will close this issue
@rodja
Copy link
Member

rodja commented Jan 25, 2025

I'm not sure if #4271 will fix this issue. From what @chriswi93 described in #4253 (comment), there are two connect events followed by a disconnect. Therefore I think that it's not caused by one disconnect replacing another. If we could somehow determine the current socket-id for the client, the creation of the disconnect task could be skipped. But before adding such code I am still looking for a way to reproduce this locally...

@chriswi93
Copy link
Author

chriswi93 commented Jan 25, 2025

I agree with @rodja. I think the root cause is that in case of a reconnect the handshake is done before the disconnect.
I am almost sure that I have now understood what's going on:

Remember the timing of the last socket message received from the server for the initial socket connection in the browser:

11:56:56.458 - last socket message received from server (ping)

As far as I know the socket server can only recognize that a client has actually disconnected if it does not receive a ping for pingInterval + pingTimeout milliseconds. In my case this is 12 seconds. So in the worst case the server detects a disconnect at 11:56:56.458 + 12 seconds = 11:57:08.458.

Afterwards handle_disconnect() is called and the background task is created. The background task has to sleep for reconnect_timeout seconds. In my case this is 10 seconds. So the resources for the client are actually cleaned up at: 11:57:08.458 + 10 seconds = 11:57:18.458. Remember from the server logs that my disconnect handler was called at 11:57:18,866:

11:57:18,866 - disconnected client id 5d29b567-fbfc-4e3c-9895-93670ca6e1b9: created 4010 seconds ago

So the estimated time almost matches the observed time in the logs.

To summarize the background task to clean up the client resources in handle_disconnect() must have started at 11:57:08.458 while handle_handshake for the newly established socket connection was called much earlier at 11:57:04,793:

11:57:04,793 - connected client id 5d29b567-fbfc-4e3c-9895-93670ca6e1b9: created 3996 seconds ago

Therefore the code in handle_handshake had no chance to cancel the async task because the async task was created much later. I think the assumptions that were made when the code was written were that in case of a reconnect, handle_disconnect() is always called before handle_handshake for the same client id. But as we can see, this assumption does not always hold.

How to fix

To fix the issue I tested the code below. It overrides handle_handshake and handle_disconnect. Have a look at the commented lines in the code. The counter NUM_ACTIVE_CONNECTIONS is increased when a client connects and decreased when a client disconnects. For n active connections I skip the background task n-1 times. So one background task should be sufficient to clean up the resources for a client.

import nicegui.client

NUM_ACTIVE_CONNECTIONS = {}

handle_handshake = nicegui.client.Client.handle_handshake

def handle_handshake_wrapper(self, next_message_id) -> None:
    handle_handshake(self, next_message_id)
    # for each client: count active connections
    NUM_ACTIVE_CONNECTIONS[self.id] = NUM_ACTIVE_CONNECTIONS.get(self.id, 0) + 1
    logger.error("after handshake, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)

def handle_disconnect_wrapper(self) -> None:
    """Wait for the browser to reconnect; invoke disconnect handlers if it doesn't."""
    async def handle_disconnect() -> None:
        sleep_timeout = self.page.resolve_reconnect_timeout()
        logger.error("before sleep (%i), %i active connections, client id %s", sleep_timeout, NUM_ACTIVE_CONNECTIONS[self.id], self.id)
        await asyncio.sleep(self.page.resolve_reconnect_timeout())
        logger.error("after sleep (%i), %i active connections, client id %s", sleep_timeout, NUM_ACTIVE_CONNECTIONS[self.id], self.id)
        for t in self.disconnect_handlers:
            self.safe_invoke(t)
        for t in nicegui.client.core.app._disconnect_handlers:  # pylint: disable=protected-access
            self.safe_invoke(t)
        if not self.shared:
            logger.error("delete elements, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)
            self.delete()
    
    # if there are n active connections:
    # skip n - 1 disconnect events before actually cleaning up resources for the client
    if NUM_ACTIVE_CONNECTIONS[self.id] <= 1:
        logger.error("disconnect: start background task for cleanup, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)
        self._disconnect_task = nicegui.client.background_tasks.create(handle_disconnect())
    else:
        logger.error("disconnect: do nothing, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)
    NUM_ACTIVE_CONNECTIONS[self.id] -= 1

nicegui.client.Client.handle_handshake = handle_handshake_wrapper
nicegui.client.Client.handle_disconnect = handle_disconnect_wrapper

Results

A quick test for just one client id suggests that it works as expected. Here you can see the server logs:

Open the application in the browser:

2025-01-25 15:56:39,686 - ERROR - after handshake, 1 active connections, client id 4871385d-4bef-4fec-9913-04d5fafc7655
2025-01-25 15:56:39,773 - ERROR - connected client id 4871385d-4bef-4fec-9913-04d5fafc7655: created 4 seconds ago

A reconnect happens after 11 minutes (but client resources are not cleaned up!):

2025-01-25 16:07:51,584 - ERROR - after handshake, 2 active connections, client id 4871385d-4bef-4fec-9913-04d5fafc7655
2025-01-25 16:07:51,587 - ERROR - connected client id 4871385d-4bef-4fec-9913-04d5fafc7655: created 676 seconds ago
2025-01-25 16:07:54,648 - ERROR - disconnect: do nothing, 2 active connections, client id 4871385d-4bef-4fec-9913-04d5fafc7655

-> the application still works as expected and there is no white screen

The application browser tab is closed (and client resources are cleaned up properly):

2025-01-25 16:14:39,658 - ERROR - disconnect: start background task for cleanup, 1 active connections, client id 4871385d-4bef-4fec-9913-04d5fafc7655
2025-01-25 16:14:39,660 - ERROR - before sleep (10), 0 active connections, client id 4871385d-4bef-4fec-9913-04d5fafc7655
2025-01-25 16:14:49,661 - ERROR - after sleep (10), 0 active connections, client id 4871385d-4bef-4fec-9913-04d5fafc7655
2025-01-25 16:14:49,663 - ERROR - delete elements, 0 active connections, client id 4871385d-4bef-4fec-9913-04d5fafc7655
2025-01-25 16:14:49,669 - ERROR - disconnected client id 4871385d-4bef-4fec-9913-04d5fafc7655: created 1094 seconds ago

This was just one quick test. Our team will test this more extensively next week.
Please let me know if I have missed something.

@rodja
Copy link
Member

rodja commented Jan 25, 2025

Wonderful @chriswi93. Mainly to find a solution for #4270 I have deployed your suggested fix on https://nicegui.io. It looks like both white screen issues are fixed!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants