client memory leak #457

Closed
totaam opened this issue Nov 15, 2013 · 40 comments

Comments

@totaam
Collaborator

totaam commented Nov 15, 2013

After running the client for many hours with glxgears at full throttle, the client eventually ran out of memory and got killed by the oom-killer.
Unfortunately, I was looking for a server leak with "xpra info" (none to be found), so I don't have much data.
I'll have to try again with different encodings to see if this is encoding related or something else. This is not sound related as I had --no-speaker.
The memory usage seems to go up slowly, at a rate of about 1MB per minute.

@totaam
Collaborator Author

totaam commented Nov 15, 2013

Good news is that this is encoding related and not some obscure python object leak: x264 leaks but none of the other encodings do (I've tried vpx, rgb and png).

So the problem clearly is in the dec_avcodec code. Will try to deal with this together with #415 (also in dec_avcodec).

Temporary workaround is to use another encoding... sadly, x264 is the best one.

@totaam
Collaborator Author

totaam commented Nov 16, 2013

2013-11-16 23:27:25: madmonk commented


Thanks for looking into this.

Perhaps you could consider doing the sort of thing they do in mythtv which I believe is either statically linking the library or shipping their own version of the library. I believe they also cited a multitude of possible issues with having to support different versions of libraries from different distributions.

I think this would cut your support overhead as well as potentially making the application more stable, which would be great if you are expecting to use it for a few years without any major problems due to upgrades etc. I assume it would also make it easier to add any custom patches required for the library.

This also brings me back to the suggestion of having a test harness or some sort of automated testing that would attempt to look for issues like this or any regressions in the future. Easier said than done I know!

@totaam
Collaborator Author

totaam commented Nov 26, 2013

... statically linking the library or shipping their own version ...

No, not when the distributions in question already ship a version we can run against: there is no way we can shoulder the maintenance cost of static builds and bug-fix/security updates for those.

it would also make it easier to add any custom patches required for the library

We do not require any such patches.

What we should do is report the bug to the vendor and hope that they fix it.

In this case, it does look like an Ubuntu bug once again: I cannot reproduce this on other distros (though I haven't tried them all), so I am lowering the priority and will probably release 0.10.10 as it is.

@totaam
Collaborator Author

totaam commented Dec 7, 2013

r4872 fixes the memleak with ffmpeg2; the fix for the old libav versions found in Ubuntu is likely to be similar.

@totaam
Collaborator Author

totaam commented Dec 12, 2013

2013-12-12 15:54:55: totaam uploaded file old-libav.patch (1.1 KiB)

updated patch that doesn't seem to leak

@totaam
Collaborator Author

totaam commented Dec 12, 2013

found nothing with valgrind, but when I removed part of the old-libav patch (see old-libav.patch), it seems to not leak anymore.

Please test, there are new beta packages available which include this patch.

@totaam
Collaborator Author

totaam commented Dec 16, 2013

2013-12-16 09:46:04: totaam commented


OK, that took ages to test properly and narrow down as repeated runs produced different results.

Here's how I ended up testing to maximize the chances of reproducing the problems:

  • apply the updated libav-nofree patch (see r4955) or the plain old-libav patch, plus whatever patches are needed to build (usually pixfmt and no0RGB)
  • I used a 0.11 beta server running on Fedora 19 (version and distro should not be relevant)
  • client applications: an xterm and glxgears (glxgears to generate enough frames to be able to spot the memory leak - assuming it is frame related..)
  • forcing client encoding to use h264 (since that's the encoding that seemed to cause the leak - see comment:2)
  • running each client for at least 10 minutes and measuring memory usage (which should remain about the same once a few frames have been decoded for each window, just giving it enough time to warm some buffers and build some of the internal statistics)
  • maximizing glxgears then unmaximize, repeat 20 times to be sure
  • disabling sound (..)

The detailed per-distro per-build-type results can be found in #477

I eventually found that the memory would only leak if sound was enabled and there were sound overruns.

OTOH, potential solutions:

  • fixing the actual leak (probably requires fixing gstreamer's buffer handling) - unlikely
  • when we get overruns, increase the buffer size so it is less likely to happen again
  • ?


It looked like it was encoding related and distro specific, probably because:

  • h264 causes more CPU load, which makes it more likely that sound data will accumulate
  • older distros are using older versions of everything, which usually means they are slower and more buggy

@totaam
Collaborator Author

totaam commented Dec 20, 2013

2013-12-20 01:37:08: afarr commented


Trying to use glxgears on my fedora 19 server xpra session xterms produces the error Error: couldn't get an RGB, Double-buffered visual

Trying to find fix for glxgears led down a rabbit hole (from which I eventually fled with no success).

Testing instead by running google-chrome playing a video:

Windows does not seem to have a leak. Starts at about 100 MB, after 20 minutes 112 MB.
Maximize/un-maximize: ... when un-maximizing the memory shows 140-150 MB, which drops within a second to 120 (+/-) MB (presumably dropping memory use with un-maximized vs. maximized).
No appreciable difference between the first and the 25th iteration.

(running 2 windows puts memory at 150 MB/ 170 MB when one is maximized)

OSX likewise seems to have no leaks.
0.11.0-r4940 client runs at about 96 MB both before and after 20 minutes (with a laptop, one displayed google-chrome running the above, one minimized background google-chrome running the same in the background).
Memory only hopped between about 99 and 94 MB as I maximized/un-maximized 25 times.
With 2 windows displaying the javascript, maximizing/un-maximizing one then the other 25 times, memory ranged from 100-107 MB.

Still working on getting a suitable fedora client to test.

@totaam
Collaborator Author

totaam commented Dec 20, 2013

2013-12-20 01:49:23: totaam commented


As per comment:7, the leak seems to be triggered by sound overruns:

I eventually found that the memory would only leak if sound was enabled and there were sound overruns.

Did you get any overruns? Did you try to trigger any?

You can use:

XPRA_SOUND_QUEUE_TIME=50 xpra attach...

This lowers the size of the sound queue, which should make it easier to get overruns (the default value is currently 450).
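For scripted test runs, the same environment override could be prepared from Python. This is only a sketch: `sound_queue_env` is a hypothetical helper name, and the only thing taken from the thread is the `XPRA_SOUND_QUEUE_TIME` variable itself.

```python
import os


def sound_queue_env(queue_time, base_env=None):
    """Return a copy of the environment with XPRA_SOUND_QUEUE_TIME set.

    The original mapping is left untouched, so several runs with
    different queue sizes can be prepared side by side.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["XPRA_SOUND_QUEUE_TIME"] = str(queue_time)
    return env


# A small explicit base environment keeps the example reproducible.
env = sound_queue_env(50, base_env={"PATH": "/usr/bin"})
print(env["XPRA_SOUND_QUEUE_TIME"])
```

The returned mapping can be passed as `env=` to `subprocess.Popen` together with whatever `xpra attach` arguments are normally used.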

@totaam
Collaborator Author

totaam commented Dec 21, 2013

2013-12-21 02:25:47: afarr commented


With the Windows client (0.11.0-r4940), setting XPRA_SOUND_QUEUE_TIME=50, there are regular speaker resets and the memory seems to climb rather surprisingly.

  • First test (with hulu playing in google-chrome)
    starts: 56 MB
    10 minutes: 154 MB
    20 minutes: 237 MB
    30 minutes: 302 MB
    50 minutes: 413 MB and the speakers turned off and wouldn't restart
    memory kept rising, even with speakers refusing to restart
    105 minutes: 591 MB
    132 minutes connection lost - server crashed and had to be killed (kill -15)

  • Second test
    start: 80 MB
    20 minutes: 256 MB
    35 minutes: 321 MB
    50 minutes: 414 MB
    60 minutes: 459 MB
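
To compare runs like the two above, the growth can be reduced to a single MB/hour figure. The sketch below simply extrapolates from the first and last sample of each run; `leak_rate_mb_per_hour` is a hypothetical helper, and a first-to-last estimate ignores any warm-up at the start of the run.

```python
def leak_rate_mb_per_hour(samples):
    """Extrapolate growth in MB/hour from the first and last (minutes, MB) samples."""
    (t0, m0), (t1, m1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("need at least two samples with increasing time")
    return (m1 - m0) / (t1 - t0) * 60.0


# (minutes, MB) readings copied from the two tests above.
test1 = [(0, 56), (10, 154), (20, 237), (30, 302), (50, 413), (105, 591)]
test2 = [(0, 80), (20, 256), (35, 321), (50, 414), (60, 459)]

for name, samples in (("test 1", test1), ("test 2", test2)):
    print(f"{name}: {leak_rate_mb_per_hour(samples):.0f} MB/hour")
```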


With osx client, I couldn't cause speaker re-starts. Even with XPRA_SOUND_QUEUE_TIME=5, memory didn't rise significantly after 30 minutes (from 190 to 195 MB).

@totaam
Collaborator Author

totaam commented Dec 21, 2013

2013-12-21 03:07:50: totaam commented


there are regular speaker resets and the memory seems to climb rather surprisingly

OK, so this tells us that restarts caused by overruns do cause a memory leak.
Let's hope this is the only one.


[..] and the speakers turned off and wouldn't restart [..]

Any sound debug info?


server crashed and had to be killed

How did it crash? It has to be alive to be killed..
Is this reproducible? What was in the log? Memory usage? What version was this running?
Assuming it is just stuck, can we get a gdb backtrace of when it is stuck?


With osx client, I couldn't cause speaker re-starts

Any sound debug log? We used to get overruns reliably, what has changed?

@totaam
Collaborator Author

totaam commented Dec 26, 2013

2013-12-26 23:41:49: afarr commented


Testing again with win client 0.11.0 r4940 I was unable to reproduce the speaker re-start failures or the server lock-ups, and thus couldn't get any logs. (ISP was having problems the day of the initial tests, which might have been contributing more to the issues than I'd expected.)

I discovered that I wasn't using the command line with the osx client correctly, hence the lack of speaker re-starts. Correctly lowering the XPRA_SOUND_QUEUE_TIME to 50 does cause speaker restarts, and there is still no sign of a memory leak with osx.

Using 0.11.0 r4998 osx client and fedora 19 r5006 server with XPRA_SOUND_QUEUE_TIME=50 I got the following:

start 74 MB
10 minutes 84 MB
20 minutes 108 MB
30 minutes 120 MB
40 minutes 127 MB
50 minutes 102 MB
60 minutes 112 MB
70 minutes 112 MB
80 minutes 110 MB
90 minutes 119 MB

repeating with XPRA_SOUND_QUEUE_TIME=10, to be extra thorough, I got about the same:

start 77 MB
10 minutes 95 MB
20 minutes 107 MB
30 minutes 116 MB
40 minutes 99 MB
50 minutes 100 MB
60 minutes 104 MB

It looks like osx is behaving, here at least.

@totaam
Collaborator Author

totaam commented Dec 27, 2013

2013-12-27 18:49:46: afarr commented


I just noticed that my second round of tests with windows client (with a behaving ISP) didn't post. Correcting that oversight, I'll post them again. The memory still grew, but it was a far more reasonable rate of growth.

win 0.11.0 r5016
test 1: connection sound_time=50
0 minutes 84 MB
10 minutes 109 MB
25 minutes 134 MB
35 minutes 163 MB
50 minutes 191 MB
60 minutes 196 MB
80 minutes 221 MB
95 minutes 256 MB
120 minutes 265 MB

test 2
0 minutes 86 MB
10 minutes 157 MB
25 minutes 176 MB
30 minutes 178 MB
40 minutes 193 MB
50 minutes 211 MB
60 minutes 219 MB
70 minutes 248 MB
80 minutes 258 MB

@totaam
Collaborator Author

totaam commented Dec 28, 2013

2013-12-28 15:15:29: totaam commented


It leaks, but what we want to confirm is what causes the leak.

ie: Does it leak without sound? Does it leak with sound but without sound overruns? How many overruns were there during those tests? Did they coincide with the increase? Does stopping the sound stop the leak? etc

@totaam
Collaborator Author

totaam commented Jan 1, 2014

2014-01-01 00:06:19: afarr commented


Testing is becoming really drawn out, erratic, and confusing.

It seems like low XPRA_SOUND_QUEUE_TIME settings do indeed cause more memory growth, but yet it seems like the default sound setting is actually, inexplicably, more efficient than actually using the --no-speaker setting.

With default settings I got the following:

0 minutes 116 MB
15 minutes 121 MB
30 minutes 127 MB

With XPRA_SOUND_QUEUE_TIME=50 I got:

0 minutes 116 MB
15 minutes 135 MB
30 minutes 156 MB

While with --no-speaker (engaged from client-side) I got:

0 minutes 100 MB
20 minutes 130 MB
30 minutes 158 MB

Just out of curiosity I also tried --no-pulseaudio server-side... which didn't disable the sound, surprisingly, and turned up the following:

0 minutes 92 MB
10 minutes 95 MB
20 minutes 99 MB
30 minutes 102 MB

Once I'd tried that though, I was no longer able to disable speakers on the client-side with --no-speaker until after I'd rebooted the server VM.

Trying --no-speaker from the server side gave the following:

0 minutes 85 MB
10 minutes 98 MB
20 minutes 105 MB
30 minutes 122 MB

The output was hard to reproduce however, as another attempt with --no-speaker --opengl=on on the client-side gave the following:

0 minutes 87 MB
10 minutes 91 MB
20 minutes 99 MB
30 minutes 108 MB
40 minutes 113 MB

Open GL doesn't seem to be a large contributor long term though, trying with --no-speaker --opengl=off client-side came up with this:

0 minutes 68 MB
10 minutes 81 MB
20 minutes 95 MB
30 minutes 107 MB
40 minutes 123 MB
50 minutes 127 MB
60 minutes 133 MB

which isn't hugely different than with the --opengl=on.

Setting XPRA_SOUND_QUEUE_TIME=50 --opengl=off came up with this:

0 minutes 70 MB
10 minutes 83 MB
20 minutes 111 MB
30 minutes 151 MB
40 minutes 179 MB
50 minutes 198 MB
60 minutes 218 MB

which is extremely similar to what I got with defaults (which, I think, is opengl=on... though now I might have to go back and double check that).

I did notice a little difference between memory when running one long video vs. running a player that loaded a number of shorter videos over the same period of time.

(--no-speaker, long video)
0 minutes 100 MB
20 minutes 130 MB
30 minutes 158 MB

(--no-speaker, lots of shorts)
0 minutes 100 MB
20 minutes 159 MB
30 minutes 174 MB

I was basically testing by streaming hulu.com (streaming tv), and I did notice that the memory would often jump by 2-5 MB whenever a commercial loaded. After SIGINTing the chrome browser from an xpra session xterm and leaving the xpra session to run with no video, just inactive xterms, I noted that the memory didn't grow, but it also didn't diminish.

With the XPRA_SOUND_QUEUE_TIME=50 I saw (& heard) a lot of speaker restarts. With the default setting I still saw them occasionally. As expected, I did not see any with --no-speaker, at least not when the flag worked.

Anyway, does any of this give you a hint? Would you like me to try out other codecs? Try other flags? Let me know if anything comes to mind.

@totaam
Collaborator Author

totaam commented Jan 1, 2014

2014-01-01 03:33:43: totaam commented


It seems like low XPRA_SOUND_QUEUE_TIME settings do indeed cause more memory growth, but yet it seems like the default sound setting is actually, inexplicably, more efficient than actually using the --no-speaker setting.

That would be very odd.

The problem with the figures above is that I can't see how many overruns there were for each test, so we can't correlate overruns with leaks. My guess is that each overrun will cause a small leak, either a fixed amount of memory, or maybe the amount of buffers in the queue at the time, or both.
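
If overrun counts were recorded alongside each memory reading, the "fixed amount per overrun" guess could be checked with a straight-line fit. The sketch below is a plain least-squares fit in pure Python; `fit_per_overrun` is a hypothetical helper and the sample values are made up for illustration, they are not measurements from this ticket.

```python
def fit_per_overrun(points):
    """Least-squares fit of memory_mb = base + per_overrun * overruns."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    if sxx == 0:
        raise ValueError("overrun counts must vary between samples")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    per_overrun = sxy / sxx
    base = mean_y - per_overrun * mean_x
    return base, per_overrun


# HYPOTHETICAL samples: (cumulative overruns, client memory in MB)
samples = [(0, 100.0), (10, 105.0), (20, 110.0), (40, 120.0)]
base, per_overrun = fit_per_overrun(samples)
print(f"base={base:.1f} MB, leak per overrun={per_overrun:.2f} MB")
```

A slope close to zero would point away from overruns as the cause, while a poor fit would suggest the leak depends on something else, such as the number of buffers queued at the time.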

Just out of curiosity I also tried --no-pulseaudio server-side... which didn't disable the sound, surprisingly [...]

It isn't meant to, the man page says:

--no-pulseaudio
  Disables the starting of a pulseaudio server with the session.

If a pulseaudio server already exists, it makes no difference. In fact, in that case, you should see an error message when not using the flag. (see the pulseaudio warning note in the FAQ)

Once I'd tried that though, I was no longer able to disable speakers on the client-side with --no-speaker until after I'd rebooted the server VM.

I don't understand what that means: no longer able ... Also, why would rebooting a VM affect the process running inside it?

Trying --no-speaker from the server side gave the following:

Using the flag server side or client side should not make any difference, from the help page:
These options can be used to turn certain features on or off, they can be specified on the client or on the server, but the client cannot enable them if they are disabled on the server.

I did notice a little difference between memory when running one long video vs. running a player that loaded a number of shorter videos over the same period of time.

Could be caused by key frames in h264 decoding, which take up more space, and are meant to be kept around longer.

Anyway, does any of this give you a hint? Would you like me to try out other codecs? Try other flags? Let me know if anything comes to mind.

Yes, I think there may well be two different leaks:

  • a sound leak
  • a video leak

So let's start by isolating each one. Please try a different picture encoding (ie: plain rgb or jpeg). First with --no-speaker and --opengl=off: I am hoping you won't be seeing any leak at all in that case.

Assuming this behaves ok, turn things on again one at a time to confirm which specific setting(s) really do cause a leak.
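
The "one at a time" plan can be written down as a list of client option sets, so that each run differs from the quiet baseline by exactly one setting. This is a sketch only: the flags are the ones used in this thread, the `h264` encoding name is taken from the wording above and may differ between versions, and the helper names are hypothetical.

```python
# Quiet baseline: nothing that is suspected of leaking is enabled.
BASELINE = {"encoding": "rgb", "speaker": False, "opengl": False}
# The value each setting takes when it is the one being tested.
SUSPECTS = {"encoding": "h264", "speaker": True, "opengl": True}


def to_args(config):
    """Turn a configuration into xpra client options."""
    args = [f"--encoding={config['encoding']}"]
    if not config["speaker"]:
        args.append("--no-speaker")
    args.append("--opengl=on" if config["opengl"] else "--opengl=off")
    return args


def one_at_a_time(baseline, suspects):
    """Yield the baseline, then one configuration per suspect setting."""
    yield dict(baseline)
    for key, value in suspects.items():
        config = dict(baseline)
        config[key] = value
        yield config


for config in one_at_a_time(BASELINE, SUSPECTS):
    print(" ".join(to_args(config)))
```

Only if a single-setting run leaks is it worth combining settings, which keeps the number of long test runs small.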

Another thing that is worth trying is sound on its own, with the video window minimized so we don't generate any video frame decoding at all. Bearing in mind that the lack of video traffic may cause fewer sound overruns, which makes the results harder to interpret if we don't show how many overruns there were in both cases.

Please also post the output of: Encoding_Info.exe

If the leak is h264 decoding related, it is then worth checking if the same leak occurs with vpx, and if not we have at least the option of trying ffmpeg2 decoding instead (#415). Later on, it may also be worth trying vpx decoding via ffmpeg instead of native: that would tell us if the leak is in the ffmpeg core (buffer handling), or the h264 decoding code (the easiest way of doing this will be to delete the vpx codec from library.zip - or building the win32 installer with vpx turned off).

One reassuring fact is that I did not see any memory leaks on any of the *nix client platforms that I tested before releasing 0.10.10: 4 versions of Ubuntu, 3 versions of Debian, 3 versions of Fedora, some BSDs - the most interesting ones amongst these are: Ubuntu 12.04 (no h264 decoding), Ubuntu 13.10 (latest libav), Fedora 19 (current ffmpeg1) and Fedora 20 or later (ffmpeg2). But then again, I only tested with glxgears, you may want to verify this assumption by running your usual test process against some of those to confirm.

@totaam
Collaborator Author

totaam commented Jan 2, 2014

2014-01-02 19:17:54: afarr commented


Just a couple quick clarifications for the moment (in case there's something more than mere pedantry involved).

Once I'd tried that though, I was no longer able to disable speakers on the client-side with --no-speaker until after I'd rebooted the server VM.

I was no longer able... means that, once I had tried the --no-pulseaudio then, after a control-c SIGINT to shut down the server xpra session on the given display, when I started a new xpra server session and then tried to attach with a client-side windows xpra session... the --no-speaker option no longer had any effect (there was still sound).

Stopping and starting the server (and client) xpra sessions did not fix the issue, and neither did a fedora 19 server-side killall pulseaudio. I finally had to reboot the fedora server, before the --no-speaker flag would work, client-side, as expected. (I suppose I will have to repeat that some time and see if it is something reproducible, or if it was just a result of a (VM) server running too long.)

As for the number of speaker resets (a far more potentially useful bit of information), with the default sound queue setting, I would say there were resets maybe every 5 or 10 minutes at most... and often none for 20 minutes or more at a time. Meanwhile, with the queue set to 50 there were often speaker restarts every couple of seconds, sometimes multiple per second. Obviously, when --no-speaker worked, then there were no speaker restarts.

Anyway, I'll dig in and try some permutations now.

@totaam
Collaborator Author

totaam commented Jan 3, 2014

2014-01-03 02:50:23: afarr commented


Testing with --encoding=rgb, the memory usage stayed in the 30 - 80 MB range over the course of about an hour of streaming video. The results were ball park similar with --no-speaker, XPRA_SOUND_QUEUE_TIME=50, or with the default sound setting. In all these cases the opengl was off, but the earlier tests showed no particular difference with opengl on (although I could test that explicitly to be sure if you think it might help).

With sound queue at default, I was getting 2 - 5 speaker restarts per hour, with sound queue set to 50 I was getting a speaker restart every 2 seconds or so (and it wasn't uncommon for the speakers to fail to restart for a few minutes). Again, there was no appreciable difference between the two (69 MB at 40 minutes with default sound, 72 MB at 40 minutes with XPRA_SOUND_QUEUE_TIME=50).

Some testing with --encoding=vpx showed about the same as with rgb, and there wasn't any appreciable difference with or without speakers (64 MB at 20 minutes with --no-speaker vs. 63 MB at 20 minutes with speakers at default).

Just for completeness, the Encoding_Info.exe produced this:

codecs/csc modules found:
* PIL                  : True       <module 'PIL' from 'C:\Program Files (x86)\Xpra\library.zip\PIL\__init__.pyc'>
* enc_vpx              : True       <module 'xpra.codecs.vpx.encoder' from 'C:\Program Files (x86)\Xpra\xpra.codecs.vpx.encoder.pyd'>
* dec_vpx              : True       <module 'xpra.codecs.vpx.decoder' from 'C:\Program Files (x86)\Xpra\xpra.codecs.vpx.decoder.pyd'>
* enc_x264             : False
* enc_nvenc            : False
* csc_swscale          : True       <module 'xpra.codecs.csc_swscale.colorspace_converter' from 'C:\Program Files (x86)\Xpra\xpra.codecs.csc_swscale.colorspace_converter.pyd'>
* csc_cython           : True       <module 'xpra.codecs.csc_cython.colorspace_converter' from 'C:\Program Files (x86)\Xpra\xpra.codecs.csc_cython.colorspace_converter.pyd'>
* csc_opencl           : False
* csc_nvcuda           : False
* dec_avcodec          : True       <module 'xpra.codecs.dec_avcodec.decoder' from 'C:\Program Files (x86)\Xpra\xpra.codecs.dec_avcodec.decoder.pyd'>
* enc_webp             : True       <module 'xpra.codecs.webm.encode' from 'C:\Program Files (x86)\Xpra\library.zip\xpra\codecs\webm\encode.pyc'>
* enc_webp_lossless    : True       <module 'xpra.codecs.webm.encode' from 'C:\Program Files (x86)\Xpra\library.zip\xpra\codecs\webm\encode.pyc'>
* webp_bitmap_handlers : True       <module 'xpra.codecs.webm.handlers' from 'C:\Program Files (x86)\Xpra\library.zip\xpra\codecs\webm\handlers.pyc'>
* dec_webp             : True       <module 'xpra.codecs.webm.decode' from 'C:\Program Files (x86)\Xpra\library.zip\xpra\codecs\webm\decode.pyc'>

codecs versions:
* cython               : (0, 1)
* PIL                  : 1.1.7
* avcodec              : (54, 92, 100)
* vpx                  : v1.2.0
* webp                 : 0.2.2
* swscale              : (2, 2, 100)

Press Enter to close

@totaam
Collaborator Author

totaam commented Jan 3, 2014

2014-01-03 16:17:21: totaam uploaded file avcodec_nothreading.patch (1.0 KiB)

disables threading in avcodec decoding

@totaam
Collaborator Author

totaam commented Jan 3, 2014

2014-01-03 16:18:38: totaam commented


First, for comment:21

then tried to attach with a client-side windows xpra session... the --no-speaker option no longer had any effect (there was still sound).

I don't understand how that is possible. Are you sure the sound is coming from xpra and not some other sound forwarding system (VirtualBox or other)?
Please create a new ticket and include full command lines used, debug information from both xpra info and the session info dialog (showing sound codecs and sound status).


Now back to the leak.

Let me try to recap the previous comments. Please confirm.

To reproduce the leak, we need a combination of things:

  • sound enabled and getting overruns (can be forced using XPRA_SOUND_QUEUE_TIME=50)
  • h264 encoding: using other encodings prevents the leak
  • win32 client (or Ubuntu?) - OSX is not affected with the exact same settings

Things to try next:

  • is this a regression? did the same thing occur with v0.7.x or older versions?
  • does threading make any difference? (apply [/attachment/ticket/457/avcodec_nothreading.patch] to client)
  • does ffmpeg2 fix the problem? (see [ffmpeg 2.0 support #415#comment:6])

@totaam
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 01:08:52: afarr commented


I'll confirm your summary.

  • With 0.11 clients with sound enabled and getting overruns the leak is evident.
  • Encodings other than h264 display no sign of the leak.
  • OSX displays no sign of the leak even with the same settings.

I would also note, however, that with h264 the memory leak is evident even with the --no-speaker option.

Testing with 0.9.(various) and 0.10.(various) also shows no sign of the leak, with default sound queue or with XPRA_SOUND_QUEUE_TIME=50. (It looks like it is a regression.)

I'll try to get some testing done with the threading as soon as I can.

@totaam
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 02:28:27: afarr commented


Testing with the avcodec_nothreading.patch with a 0.11.0 r5153 win client with default SOUND_QUEUE_TIME (but about 20 speaker re-starts in 40 minutes) it looks like the memory growth is rather stable, also.

0 minutes 103 MB
10 minutes 110 MB
20 minutes 116 MB
30 minutes 120 MB
40 minutes 123 MB

@totaam
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 04:01:31: totaam uploaded file avcodec-v10-to-v11.diff (19.8 KiB)

diff between v0.10 version of the codec and v0.11

@totaam
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 04:47:49: totaam uploaded file avcodec-noopaque.patch (1.6 KiB)

partial revert of r4815: use frame.data instead of opaque field

@totaam
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 06:23:50: totaam commented


And... the plot thickens: I am now seeing the leak on Fedora 19, without sound enabled, and with or without the noopaque patch below. I don't think I saw it before when I did a thorough testing of many distros (for example for #477), but maybe this leak is in ffmpeg and not in the libav fork used by Debian and Ubuntu, or maybe it is too slow to notice? Taking ownership of the ticket now that I can reproduce it reliably (I hope). I am still puzzled by the effect that sound restarts seem to have on this bug (OTOH: a threading issue? an existing issue made worse by the extra load?)

On Fedora 19, 0.10.11 is also affected by this leak (and maybe leaking a tad less).
So the next step will be to really identify which versions and which distros leak and by how much, because at this point it looks like all the assumptions from previous comments were wrong.


(probably not very relevant but still)

Before I found out that this is not a regression (at least on Fedora, probably others too, contrary to comment:24), I had spent a lot of time inspecting the code changes in avcodec: [/attachment/ticket/457/avcodec-v10-to-v11.diff] shows the full diff between current v0.10.x branch and current trunk. Summary of noteworthy changes:

  • now using read-write memory (but we never write to it, this is just for opencl uploading which fails with read-only buffers)
  • added vp8 and vp9 decoding support
  • using the "opaque" member for the frame reference (seemed a likely culprit)
  • using strong types in functions (cdef types rather than python types)
  • misc: debugging added, whitespace, error handling, ..

This patch: [/attachment/ticket/457/avcodec-noopaque.patch] does not seem to make any noticeable difference.

Full list of changesets affecting this file since v0.10.0: r5105, r5004, r4997, r4990, r4989, r4933, r4851, r4816, r4815, r4812, r4810, r4786, r4707, r4667, r4666, r4661, r4657, r4652, r4533, r4480, r4345, r4292, r4245, r4232, r4196, r4180, r4178, r4175, r4174, r4172, r4122. (that's fewer than 32 changesets, and a fair few are clearly not particularly interesting or are broken, so it should take no more than 5 attempts to bisect)

@totaam
Collaborator Author

totaam commented Jan 9, 2014

2014-01-09 09:48:56: totaam commented


Some resident memory usage statistics (RES in MB) as seen with top on Fedora 19, time is measured in CPU time (as shown by top - which is probably a fairer comparison than wall clock time, and easier to collect too).
Running two windows: xterm + glxgears, using x264 encoding and --no-mmap and --no-speaker switches.

||Xpra Version||SHR||0 min||2 min||4 min||6 min||8 min||10 min||20 min||30 min||40 min||loss rate in MB/hour||h264 Kiloframes/hour||
||0.7.8||27||61||61||61|| || || || || || ||0||890||
||0.8.9||27||67||68||68|| || || || || || ||0||796||
||0.9.8||27||73||74||74||74||74||74||74||74||74||0||810||
||0.10.11||37||101||103||107||113||118||125||160||192||216||182||810||
||0.10.11 OpenGL||58||127||130||141||144||145||147||162||181||202||112||740||
||trunk r5155||39||96||100||102||105||111||115||156||210||260||228||826||
||trunk r5155 OpenGL||66||136||146||153||158||162||169||196||228||262||189||780||

Will continue to update this table. An easy way to get those numbers is to leave this command running in a terminal with a long history (or send it to a log file):

while true; do sleep 5; top -c -n 1 | grep attach; done

The rate per hour is simply extrapolated from the largest sample collected. The frames per hour can be obtained the same way with:

xpra info| grep frames | egrep "h264|x264"
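
On Linux, the `top` polling loop above could also be replaced by reading the resident set size directly from `/proc`, which is easier to log and post-process. This is a sketch under assumptions: it relies on the `VmRSS` line of `/proc/<pid>/status`, and finding the pid of the `xpra attach` process is left to the caller.

```python
import re


def parse_vmrss_mb(status_text):
    """Extract resident memory (VmRSS) in MB from /proc/<pid>/status text."""
    match = re.search(r"^VmRSS:\s+(\d+)\s+kB$", status_text, re.MULTILINE)
    if match is None:
        return None  # kernel threads and zombies have no VmRSS line
    return int(match.group(1)) / 1024.0


def read_rss_mb(pid):
    """Read the current resident memory of a process (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_vmrss_mb(f.read())


# Abbreviated example of what the status file contains.
sample = "Name:\tpython\nVmRSS:\t  103424 kB\nThreads:\t4\n"
print(parse_vmrss_mb(sample))
```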

Still TODO:

  • bisect 0.9.x to 0.10.x
  • test vpx via avcodec: does it also leak with 0.10.x and later?
  • test with rgb/png for a long time (still without sound): no leak for sure?
  • confirm that there is a separate sound leak: run with a non-leaky encoding (png/rgb?) and cause overruns
  • could this be CSC related rather than encoding related (or a combination of)?

Not directly related to this ticket:

  • why does opengl handle fewer frames?
  • we have more than doubled the resident memory since 0.7.x... why?
  • 0.9.x seems to choose a higher quality setting (no subsampling?) than 0.10.x and later

@totaam
Collaborator Author

totaam commented Jan 10, 2014

2014-01-10 06:35:14: totaam commented


As per comment:28, bisecting 0.9.x to 0.10.x:

  • r3800: good (stable at 124MB)
  • r4240: good (133MB)
  • r4640: bad: 135MB to 139MB in 2 mins
  • r4440: bad: 135MB to 138MB in 2 mins
  • r4340: bad: 152MB to 156MB in 2 mins (big: opengl was on)
  • r4290: good (129MB)
  • r4325: bad: 155MB to 157MB in 2 mins (big: opengl was on)
  • r4310: good (119MB)
  • r4318: bad: 118MB to 122MB in 2 mins (114 to 117MB with --no-speaker)
  • r4314: bad: 114MB to 118MB in 2 mins (--no-speaker)
  • skipped irrelevant (non-trunk) changesets...
  • r4310: good (114MB) - tested again with --no-speaker
  • r4311: bad: 114MB to 117MB in 2 mins

So the "bad" changeset is r4311, but this only fixes the auto-refresh code, so the leak must be in the auto-refresh code.
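
The manual bisection above follows the usual binary search between a known-good and a known-bad revision. As a sketch, with the real "run the client and watch the memory for 2 minutes" step replaced by a stand-in predicate (the r4311 threshold is simply the result reported above, and `first_bad` is a hypothetical helper):

```python
def first_bad(good, bad, is_bad):
    """Binary search for the first bad revision.

    `good` must test good and `bad` must test bad; returns the first
    bad revision and the number of tests that were needed.
    """
    tests = 0
    while bad - good > 1:
        mid = (good + bad) // 2
        tests += 1
        if is_bad(mid):
            bad = mid
        else:
            good = mid
    return bad, tests


def leaks(revision):
    # Stand-in for the manual leak test; threshold taken from this ticket.
    return revision >= 4311


revision, tests = first_bad(3800, 4640, leaks)
print(f"first bad revision: r{revision} after {tests} tests")
```

This assumes every revision in the range can be built and tested and that the leak never disappears again once introduced; in practice non-trunk or broken changesets have to be skipped by hand, as was done above.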

@totaam
Collaborator Author

totaam commented Jan 10, 2014

2014-01-10 07:47:37: totaam changed status from assigned to new

@totaam
Collaborator Author

totaam commented Jan 10, 2014

2014-01-10 07:47:37: totaam changed owner from totaam to afarr

@totaam
Collaborator Author

totaam commented Jan 10, 2014

2014-01-10 07:47:37: totaam commented


Running with --auto-refresh-delay=0 prevents the leak and confirms it.
Since the auto-refresh uses a different encoding, I then looked at the encodings used for the auto-refresh.

Using webp as primary encoding does leak, and fast too! 4GB of memory in just over 1 minute with glxgears!

So r5159 ensures we don't use webp unless requested (will backport it), and I've created #491 to fix this particular leak.

Now, for this ticket afarr:

  • please review comment:26 onwards (see also #415 "ffmpeg 2.0 support", comment:12)
  • is the leak fixed (at least without sound enabled)? or was the sound somehow causing more webp frames and therefore more leaks?
    (if there is still a leak with any encoding and sound restarts, let's put it in a separate ticket pointing back to this one)

@totaam
Collaborator Author

totaam commented Jan 15, 2014

2014-01-15 18:54:09: maxmylyn commented


Meant to leave this yesterday, but I guess it didn't save.

Anyways:

I'm running 5178 in Windows and watched a twitch.tv stream for a while and took down the memory usage from Task Manager (top isn't available in Windows) every 5 minutes. Here's the relevant data:

|| Default everything ||
|| Time || Memory Usage (in Megabytes) ||
|| Start || 114.9mb ||
|| 5 minutes || 119.9mb ||
|| 10 minutes || 121.7mb ||
|| 15 minutes || 122.3mb ||
|| 20 minutes || 123.9mb ||
|| 25 minutes || 122.9mb ||
|| 30 minutes || 125.6mb ||
|| 35 minutes || 125.7mb ||
|| 40 minutes || 122.3mb ||
|| 45 minutes || 122.3mb ||
|| 50 minutes || 122.4mb ||
|| 55 minutes || 122.6mb ||
|| 60 minutes || 123.0mb ||

I restarted Xpra and went back to the same stream and reran it with --no-speaker as one of the arguments.

|| With --no-speaker ||
|| Time || Memory Usage (in Megabytes) ||
|| Start || 116.7mb ||
|| 5 minutes || 119.2mb ||
|| 10 minutes || 119.3mb ||
|| 15 minutes || 119.2mb ||
|| 20 minutes || 120.0mb ||
|| 25 minutes || 120.1mb ||
|| 30 minutes || 120.3mb ||
|| 35 minutes || 120.0mb ||
|| 40 minutes || 119.6mb ||
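
To put a number on tables like these, one option is a least-squares slope over the samples. This is only a sketch: the function name is made up for illustration, and the readings are copied by hand from the --no-speaker table above:

```python
def slope_mb_per_minute(samples):
    """Least-squares slope of (minutes, megabytes) samples."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_m = sum(m for _, m in samples) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in samples)
    if sxx == 0:
        raise ValueError("samples must span more than one point in time")
    sxy = sum((t - mean_t) * (m - mean_m) for t, m in samples)
    return sxy / sxx


# Readings from the --no-speaker run: (minutes, MB)
no_speaker = [
    (0, 116.7), (5, 119.2), (10, 119.3), (15, 119.2), (20, 120.0),
    (25, 120.1), (30, 120.3), (35, 120.0), (40, 119.6),
]

# Growth rate in MB per hour
print(slope_mb_per_minute(no_speaker) * 60)
```

The first reading is taken before the client has warmed up, so it pulls the slope upwards. Dropping it gives a fairer picture of steady-state behaviour.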

@totaam
Collaborator Author

totaam commented Jan 17, 2014

2014-01-17 09:29:25: totaam commented


I am most definitely seeing a leak, with mplayer, firefox or even just an xterm running top.

It goes up regularly, but does go back down if I minimize the window. So freeing the avcodec context is enough to free the memory.

Can you please try the latest beta build to see if the problem can be reproduced on your system with the same build I have used (latest beta):
http://xpra.org/beta/windows/

@totaam
Collaborator Author

totaam commented Jan 17, 2014

2014-01-17 22:25:40: maxmylyn commented


I ran it through a couple different things to test the memory usage:

first test:

twitch.tv stream in Firefox:

|| Time || Memory Usage (in Megabytes) ||
|| Start || 108.8 ||
|| 5 minutes || 145.7 ||
|| 10 minutes || 145.4 ||
|| 15 minutes || 146.5 ||
|| 20 minutes || 145.7 ||
|| 25 minutes || 144.1 ||
|| 30 minutes || 144.6 ||

I hit the fullscreen button on the twitch viewer and it caused some pretty awful looking screen flickering. It was bad enough I had to un-fullscreen it. So then I ran the browser as fullscreen and left it like that for another 15 minutes:

|| Time || Memory Usage (in Megabytes) ||
|| Start || 143.2mb ||
|| 5 minutes || 143.2mb ||
|| 10 minutes || 143.2mb ||

Quick note: CPU usage was very similar.

At around 11 minutes the server disconnected me, so we powered it off to give it more memory and hard drive space for a longer video file. After an hour or two we got it up and running again, so I loaded up an episode of Top Gear I had sitting on my hard drive. The file is supposed to be 1 hour and 2 minutes long, but about 15 minutes in, mplayer exits saying it hit the end of the file. I'm thinking it didn't convert properly from .mkv. Interestingly, Windows Media Player won't open the file, but VLC will on my laptop.

I digress, but here are the logs from watching the same file in non-fullscreen, native 1280x720 window size, and "fullscreen". Interestingly, Xpra wouldn't maximize the mplayer window to fullscreen; I'll attach a screenshot to demonstrate.

|| 1280 x 720 window size ||
|| Time || Memory Usage (in Megabytes) ||
|| Start || 118.0mb ||
|| 5 minutes || 132.0mb ||
|| 10 minutes || 131.5mb ||

|| "Fullscreen" - Somewhere around 1700 x 1080 ||
|| Time || Memory Usage (in Megabytes) ||
|| Start || 141.9mb ||
|| 5 minutes || 143.2mb ||
|| 10 minutes || 143.3mb ||

In both tests, the memory started to climb regularly, but after around 5 minutes it held steady to within a megabyte or two. I haven't been able to produce a memory leak.
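
The "climbs, then holds steady" pattern can be checked mechanically by discarding the warm-up readings and looking at the spread of the rest. A small sketch, where the function name, the number of warm-up readings and the tolerance are all assumptions chosen for illustration:

```python
def holds_steady(readings_mb, warmup=1, tolerance_mb=2.0):
    """True if the readings after the warm-up stay within tolerance_mb."""
    steady = readings_mb[warmup:]
    if len(steady) < 2:
        raise ValueError("need at least two readings after the warm-up")
    return max(steady) - min(steady) <= tolerance_mb


# twitch.tv stream in Firefox, one reading every 5 minutes (MB)
twitch = [108.8, 145.7, 145.4, 146.5, 145.7, 144.1, 144.6]
print(holds_steady(twitch, warmup=1, tolerance_mb=3.0))  # prints True

# A steadily growing series fails the same check
print(holds_steady([135.0, 137.0, 139.0, 141.0], warmup=1, tolerance_mb=3.0))  # prints False
```

A spread check like this will miss a very slow leak over a short run, so it complements a longer test and does not replace one.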

@totaam
Collaborator Author

totaam commented Jan 17, 2014

2014-01-17 22:26:14: maxmylyn uploaded file Xpra not going full fullscreen.png (2118.2 KiB)

This is a crop of the Xpra window not quite going fullscreen.

@totaam
Collaborator Author

totaam commented Jan 18, 2014

2014-01-18 03:43:47: totaam commented


maxmylyn: there is nothing wrong with fullscreen in this case, this is how mplayer works unless you add the -zoom command line argument (when hardware scaling is not available IIRC). Even if there was, I don't see what this has to do with this bug. FYI: there is a ticket for multihead and fullscreen #496, or create a new one if this is a different issue. Is this in a virtual machine by any chance? Is the "twitch viewer" a flash player or html5?

@totaam
Collaborator Author

totaam commented Jan 20, 2014

2014-01-20 17:49:28: maxmylyn commented


The server we are running off of is a virtual machine, but the client is my laptop, which runs Windows 7 natively. And the "twitch viewer" is Flash currently but they're working on switching over to HTML5 for performance reasons.

@totaam
Collaborator Author

totaam commented Jan 31, 2014

2014-01-31 22:27:12: afarr commented


With a Windows 7 client running r5319, an idle xterm and a second xterm running top, my memory only climbs from 74 MB to 76 MB after half an hour.

Starting glxgears from the idle xterm bumps the memory up to 85 MB, but after another half an hour it had only risen to 86 MB (glxgears ran steadily at 525-535 FPS and seemed to be using about 115% CPU, according to top).

Interestingly, accidentally maximizing the xterm running top, despite promptly restoring its size, did bump the memory up to 96 MB... though another 10 minutes of running left it still at 96 MB.

I get the impression it is not video related. I'll see about running something with sound and the window minimized and see what happens.

@totaam
Collaborator Author

totaam commented Feb 8, 2014

2014-02-08 00:30:47: maxmylyn commented


Tested with the r5383 Win7 client. I left the KDFC (a local classical radio station) webplayer running in the background for an hour playing music. The memory only changed by 32 kilobytes, from 107.024 MB to 107.056 MB.

link for the interested:
http://www.kdfc.com/pages/15744854.php?

@totaam
Collaborator Author

totaam commented Mar 17, 2014

See also #535 and #465.

@totaam
Collaborator Author

totaam commented Apr 8, 2014

2014-04-08 16:40:01: totaam commented


Note: as of r6046, the leak detection code can also be used from the client. This is documented on the wiki: Debugging Memory Leaks (/wiki/Debugging#MemoryLeaks)

I believe this particular memory leak may well be a race in dec_avcodec2 (which would explain why it only happened under specific circumstances), will follow up in #465.

Closing. Feel free to re-open if you have more information.

FYI: in the future, we may move to statically linked dec_avcodec2 decoder builds for all Debian / Ubuntu, as this reduces the maintenance cost of having 2 versions of the "libav" based decoder.
