White screen after socket reconnect #4253
Comments
Thank you so much for this in-depth issue description @chriswi93. We noticed this strange "white screen after idle time" but have not been able to pinpoint it yet. Thanks to your efforts, we now have a very promising lead: The "clear event" seems to be caused by Line 315 in 5032ca4
inside of Lines 250 to 260 in 5032ca4
The code is there to clean up resources which are no longer needed (because the client disconnected). But we missed the fact that a socket reconnect will also trigger a disconnect on the old connection. In such a case the resources are still needed and we should not clear them. |
Very interesting! Yesterday we noticed this white screen problem on nicegui.io and tried reverting individual features. We thought the problem would have started with version 2.10, but today we could reproduce it with 2.9.1 as well. And apparently the problem also occurs with older NiceGUI versions, like in our field friend project. We noticed another important detail: When one page is blank, other pages keep working normally. And it seems like only the index page is affected. This is important for testing because https://nicegui.io/documentation/ can still be reachable while https://nicegui.io/ returns a white page. |
@rodja Would be great if it helps to fix it! @falkoschindler I am not sure if this is actually the same issue. We have other non index pages that suffer from the same behavior. I also tested to change |
@chriswi93 You may be right and there are two similar but different bugs at play. Let's focus on the white page after a short reconnect. I think @rodja is right and the disconnect handler is the only place which could clear the whole page once a connection has successfully been established. The docstring says:
I don't understand the part "if it doesn't [reconnect]". There doesn't seem to be code for checking if the disconnect has been canceled. As far as I can tell, the reconnect logic was added in 0c3e522. It might never have worked correctly for reconnecting clients. On the other hand, if the disconnect handler is utterly broken, why can't we reproduce the white page by toggling the network connection via the developer console? The UI seems to reconnect correctly. In particular, PR #3199 wouldn't have been possible if clients always cleared their content during a reconnect. Is it a subtle timing issue? Or does a reconnecting socket trigger the disconnect handler only under certain conditions? |
@falkoschindler the normal "disconnect" flow from the server perspective is like this: socketio receives a disconnect event and calls our |
@rodja Ah I see! I didn't think about |
@chriswi93 I just created PR #4271, drafting an idea for how to make sure all disconnect tasks get canceled as soon as a handshake succeeds. Could you please check whether this branch still shows white pages from time to time? Thanks! |
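For readers following along, the idea of canceling a pending cleanup as soon as a handshake succeeds can be sketched roughly as below. This is not the code from PR #4271; `ClientSketch`, its attributes and the short timeout are hypothetical stand-ins, and the real `Client` class in NiceGUI may be structured differently.

```python
import asyncio
from typing import Optional, Tuple


class ClientSketch:
    """Hypothetical stand-in for a client object; not the real nicegui.client.Client."""

    def __init__(self, reconnect_timeout: float) -> None:
        self.reconnect_timeout = reconnect_timeout
        self.deleted = False
        self._disconnect_task: Optional[asyncio.Task] = None

    def handle_handshake(self) -> None:
        # A successful handshake shows the browser is still there,
        # so a pending cleanup must not run.
        if self._disconnect_task is not None:
            self._disconnect_task.cancel()
            self._disconnect_task = None

    def handle_disconnect(self) -> None:
        async def cleanup() -> None:
            await asyncio.sleep(self.reconnect_timeout)
            self.deleted = True  # stands in for deleting the client's elements

        # Replace an older pending cleanup instead of leaving it running.
        if self._disconnect_task is not None:
            self._disconnect_task.cancel()
        self._disconnect_task = asyncio.create_task(cleanup())


async def demo() -> Tuple[bool, bool]:
    reconnecting = ClientSketch(reconnect_timeout=0.05)
    reconnecting.handle_disconnect()
    reconnecting.handle_handshake()  # the browser came back in time

    leaving = ClientSketch(reconnect_timeout=0.05)
    leaving.handle_disconnect()  # the browser tab was closed for good

    await asyncio.sleep(0.2)
    return reconnecting.deleted, leaving.deleted


result = asyncio.run(demo())
print(result)
```

Note that this only helps when the disconnect arrives before the new handshake. If the old socket's disconnect arrives after the handshake, there is nothing left to cancel at handshake time.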
I'm not sure if #4271 will fix this issue. From what @chriswi93 described in #4253 (comment), there are two connect events followed by a disconnect. Therefore I think that it's not caused by one disconnect replacing another. If we could somehow determine the current socket-id for the client, the creation of the disconnect task could be skipped. But before adding such code I am still looking for a way to reproduce this locally... |
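A minimal sketch of the socket-id idea, assuming the server can learn the socket id (`sid`) for both the handshake and the disconnect. The names `current_sid`, `on_handshake` and `on_disconnect` are made up for illustration and are not NiceGUI or python-socketio APIs.

```python
from typing import Dict, List

current_sid: Dict[str, str] = {}  # client id -> socket id of the latest handshake
cleaned_up: List[str] = []        # client ids whose resources were released


def on_handshake(client_id: str, sid: str) -> None:
    # Remember which socket currently belongs to the client.
    current_sid[client_id] = sid


def on_disconnect(client_id: str, sid: str) -> bool:
    """Return True if cleanup was triggered, False if the disconnect was stale."""
    if current_sid.get(client_id) != sid:
        # The disconnect belongs to an older socket; a newer one is in use.
        return False
    del current_sid[client_id]
    cleaned_up.append(client_id)  # stands in for scheduling the real cleanup
    return True


# Event order described in the issue: two connects followed by one disconnect.
on_handshake("client-a", "sid-1")
on_handshake("client-a", "sid-2")
stale = on_disconnect("client-a", "sid-1")  # late disconnect of the old socket
final = on_disconnect("client-a", "sid-2")  # the tab is actually closed
print(stale, final, cleaned_up)
```

With this bookkeeping the late disconnect of the first socket is ignored, and only the disconnect of the most recent socket leads to cleanup.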
I agree with @rodja. I think the root cause is that in case of a reconnect the handshake is done before the disconnect. Remember the timing of the last socket message received from the server for the initial socket connection in the browser:
As far as I know the socket server can only recognize that a client has actually disconnected if it does not receive a ping for Afterwards
So the estimated time almost matches the observed time in the logs. To summarize the background task to clean up the client resources in
Therefore the code in
How to fix
To fix the issue I tested the code below. It overrides

    import asyncio
    import logging

    import nicegui.client

    # added: the original snippet assumed that a logger already exists
    logger = logging.getLogger(__name__)

    # number of active socket connections per client id
    NUM_ACTIVE_CONNECTIONS = {}

    handle_handshake = nicegui.client.Client.handle_handshake


    def handle_handshake_wrapper(self, next_message_id) -> None:
        handle_handshake(self, next_message_id)
        # for each client: count active connections
        NUM_ACTIVE_CONNECTIONS[self.id] = NUM_ACTIVE_CONNECTIONS.get(self.id, 0) + 1
        logger.error("after handshake, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)


    def handle_disconnect_wrapper(self) -> None:
        """Wait for the browser to reconnect; invoke disconnect handlers if it doesn't."""
        async def handle_disconnect() -> None:
            sleep_timeout = self.page.resolve_reconnect_timeout()
            logger.error("before sleep (%i), %i active connections, client id %s", sleep_timeout, NUM_ACTIVE_CONNECTIONS[self.id], self.id)
            await asyncio.sleep(sleep_timeout)
            logger.error("after sleep (%i), %i active connections, client id %s", sleep_timeout, NUM_ACTIVE_CONNECTIONS[self.id], self.id)
            for t in self.disconnect_handlers:
                self.safe_invoke(t)
            for t in nicegui.client.core.app._disconnect_handlers:  # pylint: disable=protected-access
                self.safe_invoke(t)
            if not self.shared:
                logger.error("delete elements, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)
                self.delete()

        # if there are n active connections:
        # skip n - 1 disconnect events before actually cleaning up resources for the client
        if NUM_ACTIVE_CONNECTIONS[self.id] <= 1:
            logger.error("disconnect: start background task for cleanup, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)
            self._disconnect_task = nicegui.client.background_tasks.create(handle_disconnect())
        else:
            logger.error("disconnect: do nothing, %i active connections, client id %s", NUM_ACTIVE_CONNECTIONS[self.id], self.id)
        NUM_ACTIVE_CONNECTIONS[self.id] -= 1


    # monkey-patch the client so that the wrappers are used
    nicegui.client.Client.handle_handshake = handle_handshake_wrapper
    nicegui.client.Client.handle_disconnect = handle_disconnect_wrapper

Results
A quick test for just one client id suggests that it works as expected. Here you can see the server logs:
Open the application in the browser:
A reconnect happens after 11 minutes (but client resources are not cleaned up!):
-> the application still works as expected and there is no white screen The application browser tab is closed (and client resources are cleaned up properly):
This was just one quick test. Our team will test this more extensively next week. |
Wonderful @chriswi93. Mainly to find a solution for #4270 I have deployed your suggested fix on https://nicegui.io. It looks like both white screen issues are fixed! |
Description
Hello,
First, I would like to take this opportunity to praise this great project. I think NiceGUI is a really great package for quickly developing a user interface with Python. Many thanks to the developers of this project!
However, we have been observing an issue with NiceGUI for some time now and we could not find a way to fix it. We hope someone here can help to fix this bug as we don't know what is going on under the hood of NiceGUI. The issue is quite critical for us and we have already put a lot of effort into debugging the issue, but have had little success so far. According to our observations, the issue is not limited to the latest versions and also occurs in earlier versions < 2.0.
The issue is quite difficult to debug. Below I will try to describe my observations as best as I can, but unfortunately I can't provide a minimal reproducible code example. It seems like every NiceGUI application is affected by this issue, even very simple applications, and so far I have not found a way to actually reproduce it.
Description
We have noticed for some time now that the socket connection of our NiceGUI application in the browser is closed from time to time. Afterwards the entire content of the application in the browser is deleted and a white screen is shown to the user. Contrary to our expectations, a reload of the page is not triggered. Sometimes this happens after ten minutes, sometimes after three hours or later:
There is no clear pattern that could indicate what is causing the issue. It is unclear why the connection is closed, as there does not appear to be an actual network problem. As you can see in the image, a new socket connection is successfully established immediately after the initial socket connection is closed. However, after the second socket connection is established, the server sends a “clear screen” instruction to the browser:
This instruction causes the browser to delete the entire content of the application. Afterwards the content of the “app” div element in the browser is empty:
I tried to debug the issue by filtering out the “clear screen” instruction sent by the NiceGUI server:
With this filter in place, the content of the page is indeed no longer deleted after the initial socket connection is closed and a new socket connection is established. However, all elements in the user interface are frozen and no longer respond to user input. In the background it can be observed that the client sends messages to the server via the newly established socket connection, but the server does not respond.
Important: The socket connection is established, and ping events are exchanged between the client and server, but the NiceGUI server no longer responds to events happening in the browser. I would guess that the connection with socketio was successfully established, but that NiceGUI does not recognize that there is a new connection with the same client id and that this connection is still alive.
Test Case
The initial scenario is that I have several tabs open in my browser and I am waiting until the issue occurs. Let's take a look at what happens in the browser (example for the client ID 5d29b567-fbfc-4e3c-9895-93670ca6e1b9):
Initial Socket Connection
Before the socket connection is closed, everything works as expected:
New socket connection established after 67 minutes
As you can see, a new socket connection is established just a few seconds after the first socket connection was closed. After 15 seconds, a “clear screen” instruction is then sent from the server to the client, which deletes the content of the entire application in the browser:
Now let's take a look at what has happened on the server in the meantime:
As you can see, a new socket connection for the same client id is established after 67 minutes (11:57:04,793), without a preceding disconnect event. About 15 seconds later, the server registers an unexpected disconnect (11:57:18,866), shortly before the “clear screen” instruction arrives at the browser (11:57:19.297). It is quite strange that a disconnect event for the client id happens on the server while the newly established connection in the browser is still alive.
It is also worth mentioning that the same issue occurs almost simultaneously with another client id (11:57:02,799), but only 82 minutes after that client was created:
Steps to Reproduce
The issue occurs with every NiceGUI application I have deployed so far in our Azure Kubernetes environment, even for very simple applications. We have not yet found a way to reproduce the issue. The only option we have at the moment is to open the application in the browser and wait for the issue to occur.
We have tried the following to reproduce the issue:
1) Run the application locally
So far the issue has not yet occurred, but we will continue to test it over a longer period of time.
2) Chrome Dev Tools - enable offline mode
The popup "Connection is lost - Trying to Reconnect" is shown as expected. After disabling the offline mode, a new socket connection is established and everything works as expected.
I would guess the issue occurs because the socket connection is closed in the browser without there being an actual connection problem. We don't know why the connection is closed at all, but something must be different compared to 2)
Conclusions
We found out that the “clear screen” instruction is always sent from the server to the client after a disconnect event. What is the reason for this?
My guess about the issue is that the NiceGUI server only recognizes much later than the browser that the first connection has been closed and a disconnect event occurs with a delay of 15 seconds in the example above. In the meantime, however, a new connection already exists for the same client ID and, just as it usually happens after every disconnect, a “clear screen” instruction is sent to the client. However, this “clear screen” instruction is received by the newly established connection and not by the already closed connection. This results in the user seeing a white screen in the browser. It seems like the messages sent to the socket connections are mixed up and reconnection does not work as expected.
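To make the suspected ordering concrete, the sequence can be replayed with a deliberately naive policy in which every disconnect for a client clears the page. This is a simplified model of the guess above, not NiceGUI's actual code; the event tuples and the `replay` helper are made up for illustration.

```python
from typing import List, Set, Tuple

Event = Tuple[str, str]  # (kind, socket id)


def replay(events: List[Event]) -> Tuple[Set[str], bool]:
    """Replay socket events for one client and report open sockets and page state."""
    open_sockets: Set[str] = set()
    page_cleared = False
    for kind, sid in events:
        if kind == "connect":
            open_sockets.add(sid)
        elif kind == "disconnect":
            open_sockets.discard(sid)
            # Naive policy: any disconnect clears the page, without checking
            # whether another socket for the same client is still open.
            page_cleared = True
    return open_sockets, page_cleared


# Order seen in the server log: the new connection is established first,
# and the disconnect of the old connection is only registered afterwards.
observed = [("connect", "sid-1"), ("connect", "sid-2"), ("disconnect", "sid-1")]
open_sockets, page_cleared = replay(observed)
print(open_sockets, page_cleared)
```

In this model the second socket is still open when the page is cleared, which matches the symptom: a live connection showing a white screen.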
After that NiceGUI does not seem to recognize the client ID anymore, although socket.io still keeps a connection open in the background. In the browser I can see that a socket connection has been established and it exchanges ping events with the server, but the NiceGUI server no longer responds to user input (for example, a click on a button).
A symptom of the issue might be the following error message that appears from time to time, although I cannot say for sure whether the error is related to the issue:
The error has already been reported here and apparently has not been fixed yet: #2915
Environment
Azure Kubernetes (Ingress Nginx), Docker Container
NiceGUI 2.9.1 (Python 3.10), but also observed with NiceGUI < 2.0
Tested Browsers: Edge, Chrome, Firefox
Socket Config
Observed in Browser:
NiceGUI: