Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

gtkperf hangs gtk3 server #2211

Closed
totaam opened this issue Mar 13, 2019 · 19 comments
Closed

gtkperf hangs gtk3 server #2211

totaam opened this issue Mar 13, 2019 · 19 comments
Labels

Comments

@totaam
Copy link
Collaborator

totaam commented Mar 13, 2019

Issue migrated from trac ticket # 2211

component: server | priority: critical | resolution: upstream

2019-03-13 12:53:41: antoine created the issue


Running gtkperf -a in a loop, and running xpra info in another loop in parallel eventually causes the server to hang.
It shows new connection but times them all out.

Running gtkperf -a in a loop alone does not crash.

Running xpra info alone in a loop does not crash or leak any memory.

@totaam
Copy link
Collaborator Author

totaam commented Mar 13, 2019

2019-03-13 12:58:53: antoine changed status from new to assigned

@totaam
Copy link
Collaborator Author

totaam commented Mar 13, 2019

2019-03-13 12:58:53: antoine edited the issue description

@totaam
Copy link
Collaborator Author

totaam commented Mar 18, 2019

2019-03-18 15:11:36: antoine commented


The server goes AWOL following this message in the server log:

(Xpra:7593): Gdk-WARNING **: 21:51:39.364: GdkWindow 0x400d81 unexpectedly destroyed
/usr/lib64/python3.7/site-packages/xpra/x11/models/window.py:327: Warning: g_object_get_qdata: assertion 'G_IS_OBJECT (object)' failed
  self.corral_window = None
/usr/lib64/python3.7/site-packages/xpra/x11/models/window.py:327: Warning: g_object_set_qdata_full: assertion 'G_IS_OBJECT (object)' failed
  self.corral_window = None
/usr/lib64/python3.7/site-packages/xpra/x11/models/window.py:327: Warning: g_object_remove_toggle_ref: assertion 'G_IS_OBJECT (object)' failed
  self.corral_window = None
2019-03-18 21:51:48,306 Warning: timeout on screen updates for window 1,
2019-03-18 21:51:48,306  already delayed for more than 15 seconds

(Xpra:7593): Gdk-WARNING **: 21:51:53.976: GdkWindow 0x400def unexpectedly destroyed
2019-03-18 21:51:55,098 Error: connection timed out: unix-domain socket:/run/user/1000/xpra/desktop-20

Even with python2, things can degenerate, though not as fast and not as badly.

@totaam
Copy link
Collaborator Author

totaam commented Mar 18, 2019

2019-03-18 15:12:17: antoine uploaded file info.txt (81.8 KiB)

xpra info of a hung python2 server after ~10000 gtkperf windows are shown

@totaam
Copy link
Collaborator Author

totaam commented Mar 18, 2019

2019-03-18 15:30:44: antoine uploaded file info-py3.txt (172.7 KiB)

python3 server just before it hung

@totaam
Copy link
Collaborator Author

totaam commented Mar 18, 2019

2019-03-18 16:38:22: antoine uploaded file verify-main-thread.patch (6.9 KiB)

verify that we are in the main thread in more places

@totaam
Copy link
Collaborator Author

totaam commented Mar 18, 2019

2019-03-18 16:39:47: antoine commented


r22114 adds extra thread checking to the gtk X11 context manager code, and the patch above adds yet more checks. But none of those trigger any warnings.

Could it be a race condition that is made more likely by the extra delay introduced by the "xpra info" query?

@totaam
Copy link
Collaborator Author

totaam commented Mar 19, 2019

2019-03-19 02:10:41: antoine changed priority from blocker to critical

@totaam
Copy link
Collaborator Author

totaam commented Mar 19, 2019

2019-03-19 02:10:41: antoine commented


When the server seems stuck, it is consuming a lot of CPU time in the X11 event loop.
Simply killing the gtkperf client loop and waiting for things to settle eventually brings it back to life, so I'm lowering the priority as it's not a crash, "just" an X11 DoS.

Is GTK3 that much slower? (New ticket: #2219) Less efficient? Gets more events?

The hang is often preceded by:

2019-03-18 23:56:53,637 Warning: timeout on screen updates for window 1,
2019-03-18 23:56:53,637  already delayed for more than 15 seconds

Which could happen if the main thread is too busy to answer.

How can we avoid getting in this situation, or detect it and deal with it?
Maybe as part of the move to wayland, we can move away from GTK? We already have window models, do we really need those GDK windows to wrap the X11 window?

@totaam
Copy link
Collaborator Author

totaam commented Aug 20, 2019

2019-08-20 06:38:17: antoine commented


GTK3 is now slower: #2219#comment:3

@totaam
Copy link
Collaborator Author

totaam commented Sep 5, 2019

2019-09-05 07:17:57: antoine uploaded file server-last-info-nokeyboard.txt (84.0 KiB)

last server info captured before it becomes unresponsive, without keyboard data

@totaam
Copy link
Collaborator Author

totaam commented Sep 5, 2019

2019-09-05 10:11:17: antoine commented


Different symptoms this time:

  • server goes unresponsive and shows:
2019-09-05 13:10:10,898 Error: connection timed out: unix-domain socket:/run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:11,309 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:12,312 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:12,312 New unix-domain connection received on /run/xpra/localhost.localdomain-2
2019-09-05 13:10:12,390 Warning: timeout on screen updates for window 1,
2019-09-05 13:10:12,390  already delayed for more than 15 seconds
2019-09-05 13:10:28,884 Warning: timeout on screen updates for window 2472,
2019-09-05 13:10:28,884  already delayed for more than 15 seconds
2019-09-05 13:10:31,200 Warning: timeout on screen updates for window 1,
2019-09-05 13:10:31,200  already delayed for more than 15 seconds
2019-09-05 13:10:32,311 Error: connection timed out: unix-domain socket:/run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:32,735 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:33,320 Error: connection timed out: unix-domain socket:/run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:33,321 Error: connection timed out: unix-domain socket:/run/xpra/localhost.localdomain-2
2019-09-05 13:10:33,415 Missing property or wrong property type _MOTIF_WM_HINTS (motif-hints)
2019-09-05 13:10:33,736 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:33,736 New unix-domain connection received on /run/xpra/localhost.localdomain-2
2019-09-05 13:10:40,357 Missing property or wrong property type WM_TRANSIENT_FOR (window)
2019-09-05 13:10:40,361 Missing property or wrong property type _NET_WM_WINDOW_TYPE (['atom'])
2019-09-05 13:10:49,568 Warning: timeout on screen updates for window 2617,
2019-09-05 13:10:49,568  already delayed for more than 15 seconds
2019-09-05 13:10:53,737 Error: connection timed out: unix-domain socket:/run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:54,170 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:54,737 Error: connection timed out: unix-domain socket:/run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:54,737 Error: connection timed out: unix-domain socket:/run/xpra/localhost.localdomain-2
2019-09-05 13:10:55,115 Warning: timeout on screen updates for window 2657,
2019-09-05 13:10:55,115  already delayed for more than 15 seconds
2019-09-05 13:10:55,169 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:10:55,169 New unix-domain connection received on /run/xpra/localhost.localdomain-2
2019-09-05 13:10:58,143 Warning: timeout on screen updates for window 1,
2019-09-05 13:10:58,144  already delayed for more than 15 seconds
^C
2019-09-05 13:11:09,004 sound source got signal SIGINT

And another one:

2019-09-05 13:19:13,201 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:19:13,204 New unix-domain connection received on /run/xpra/localhost.localdomain-2
2019-09-05 13:19:13,607 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2

(Xpra:21966): Gdk-WARNING **: 13:19:14.578: GdkWindow 0x400a59 unexpectedly destroyed
2019-09-05 13:19:14,607 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:19:14,608 New unix-domain connection received on /run/xpra/localhost.localdomain-2
/usr/lib64/python3.7/site-packages/xpra/x11/models/window.py:327: Warning: g_object_get_qdata: assertion 'G_IS_OBJECT (object)' failed
  self.corral_window = None
/usr/lib64/python3.7/site-packages/xpra/x11/models/window.py:327: Warning: g_object_set_qdata_full: assertion 'G_IS_OBJECT (object)' failed
  self.corral_window = None
/usr/lib64/python3.7/site-packages/xpra/x11/models/window.py:327: Warning: g_object_remove_toggle_ref: assertion 'G_IS_OBJECT (object)' failed
  self.corral_window = None
2019-09-05 13:19:21,559 processed structured info request from unix-domain socket:/run/user/1000/xpra/localhost.localdomain-2 in 8ms
2019-09-05 13:19:22,041 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:19:23,042 New unix-domain connection received on /run/xpra/localhost.localdomain-2
2019-09-05 13:19:23,042 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2

(Xpra:21966): Gdk-WARNING **: 13:19:31.031: GdkWindow 0x400bda unexpectedly destroyed

(Xpra:21966): Gdk-WARNING **: 13:19:39.869: GdkWindow 0x400c5b unexpectedly destroyed
2019-09-05 13:19:43,062 Error: connection timed out: unix-domain socket:/run/user/1000/xpra/localhost.localdomain-2
2019-09-05 13:19:43,462 New unix-domain connection received on /run/user/1000/xpra/localhost.localdomain-2

This seems to happen when we reach around 2200 to 2500 windows.

r23718 allows us to see how busy the UI thread is, logging the time it takes to service a timer call with a zero delay.
Normally, that's around 0.10ms on my system. When running this stress test, it creeps up and by the time the server stops responding, the delay is around 35ms. Roughly 350 times higher..
Meanwhile the info requests are still being processed around 7ms. That's until the server no longer processes socket connection attempts.

The one thing that always seems to happen before the server goes AWOL is this:

(Xpra:31569): Gdk-WARNING **: 16:07:11.444: GdkWindow 0x400bf7 unexpectedly destroyed

The GTK3 code looks like this:

if (!GDK_WINDOW_DESTROYED (window))
{
  if (GDK_WINDOW_TYPE(window) != GDK_WINDOW_FOREIGN)
     g_warning ("GdkWindow %#lx unexpectedly destroyed", GDK_WINDOW_XID (window));
  _gdk_window_destroy (window, TRUE);
}

The window that is triggering this warning is not a foreign window, so it must be our corral window?


  • right around the same time, the client showed:
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/xpra/client/gl/gl_window_backing_base.py", line 368, in redraw
    with self.gl_context():
AttributeError: __enter__
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/xpra/client/gl/gl_window_backing_base.py", line 368, in redraw
    with self.gl_context():
AttributeError: __enter__
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/xpra/client/gl/gl_window_backing_base.py", line 368, in redraw
    with self.gl_context():
AttributeError: __enter__
Traceback (most recent call last):
  File "/usr/lib64/python3.7/site-packages/xpra/client/gl/gl_window_backing_base.py", line 368, in redraw
    with self.gl_context():
AttributeError: __enter__

This should be fixed by r23712 and this is probably unrelated as the server problems occur even without having a client connected.

@totaam
Copy link
Collaborator Author

totaam commented Sep 5, 2019

2019-09-05 10:41:49: antoine changed title from gtkperf + xpra info hangs to gtkperf + xpra info hangs gtk3 server

@totaam
Copy link
Collaborator Author

totaam commented Sep 5, 2019

2019-09-05 10:41:49: antoine commented


This does not cause any problems with python2 / GTK3, where the UI thread service time remains constant and usually lower than 0.10ms.
Something might be leaking in GTK3.

@totaam
Copy link
Collaborator Author

totaam commented Sep 5, 2019

2019-09-05 14:41:46: antoine changed title from gtkperf + xpra info hangs gtk3 server to gtkperf hangs gtk3 server

@totaam
Copy link
Collaborator Author

totaam commented Sep 5, 2019

2019-09-05 14:41:46: antoine commented


xpra info shows us that the server has locked up, but it isn't required to trigger it.

The good news is that if the gtkperf loop is stopped quickly enough, the server does recover and can service new clients, including xpra info.

@totaam
Copy link
Collaborator Author

totaam commented Sep 8, 2019

2019-09-08 10:23:46: antoine changed status from assigned to closed

@totaam
Copy link
Collaborator Author

totaam commented Sep 8, 2019

2019-09-08 10:23:46: antoine set resolution to upstream

@totaam
Copy link
Collaborator Author

totaam commented Sep 8, 2019

2019-09-08 10:23:46: antoine commented


I don't know how to debug this further.
It really looks like a regression in GTK, so I'm closing this as 'upstream'.

Applications that have a behaviour similar to gtkperf should be considered pathological and are not going to be supported with the GTK3 server.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant