Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

high server CPU usage can cause sound "Dropped samples" #1067

Closed
totaam opened this issue Dec 30, 2015 · 24 comments
Closed

high server CPU usage can cause sound "Dropped samples" #1067

totaam opened this issue Dec 30, 2015 · 24 comments
Labels

Comments

@totaam
Copy link
Collaborator

totaam commented Dec 30, 2015

Issue migrated from trac ticket # 1067

component: sound | priority: major | resolution: fixed

2015-12-30 09:52:34: antoine created the issue


Seen this on a heavily loaded system:

sound source pipeline warning: Can't record audio fast enough
sound source pipeline warning: ("gstaudiobasesrc.c(866): gst_audio_base_src_create (): \
    /GstPipeline:pipeline0/GstPulseSrc:pulsesrc0:\n \
    Dropped 10584 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.",)

This warning can get repeated hundreds of times too.

The usual way of dealing with this problem is to add a queue element after the source.
Problem is that this is likely to increase the sound latency, which could mess with av-sync (#835).
We could make the queue quite small and leaky, but this could lead to choppy sound instead... At least this would get rid of the warnings.

@totaam
Copy link
Collaborator Author

totaam commented Dec 30, 2015

2015-12-30 10:15:29: antoine changed owner from antoine to afarr

@totaam
Copy link
Collaborator Author

totaam commented Dec 30, 2015

2015-12-30 10:15:29: antoine commented


Done in r11533 using a small leaky queue (50ms max by default).
We take the level of this queue into account when calculating the av-sync delay, so this should not be causing av-sync problems. (and can be extended further to calculate the total server side sound pipeline latency if we can get to it - maybe now that we use gstreamer 1.x?)
The code re-uses the queue utility code, so the size of this queue can be changed using:

XPRA_SOUND_QUEUE_SOURCE_TIME=VALUE_IN_MILLIS

The current server-side sound source pipeline queue time can be seen with:

xpra info | grep speaker.queue.cur

@afarr: this is a FYI, please close.
Hopefully we will never see the "Dropped samples" again after this change.

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 02:07:50: afarr changed status from new to closed

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 02:07:50: afarr set resolution to fixed

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 02:07:50: afarr commented


Did some tests with 0.17.0 r11535 against fedora 23 0.17.0 r11580.

I was able to get one single instance of the "Dropped samples" error... while running sudo stress --cpu 4 --timeout 66 on the server, with multiple browsers running video, the session info playing the Graphs, and with a -d printing debug flag on client side.

I couldn't get another instance though, even running the stress up to 12 (sudo stress --cpu 12 --timeout 66) and dropping the sound queue to 50 ms, once I turned off the debugging.

Since I assume that one repro isn't worth troubling over, I'll close this.

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 19:02:47: afarr changed status from closed to reopened

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 19:02:47: afarr removed resolution (was fixed)

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 19:02:47: afarr commented


Not sure if I closed this too soon, but leaving a session running with video playing overnight, I was able to trigger the pipeline warnings again (probably when the client sleep cycle started to kick in), along with a new warning and a couple of tracebacks.

For some context, I left the session to continue idly at about 18:30, got a flurry of the warnings and traceback between 21:07 and 08:30... and didn't wake the client up until about 09:28. I'll re-open this and hand it to you to decide what, if anything, to do about it. (Also worth noting, this was tested with an OSX client and the sleep cycle did not trigger any client crashes or problem reports, which I recall there was a ticket about, though I can't find the ticket itself... and the sound was still running come the morning, despite server warnings.)

/usr/lib64/python2.7/site-packages/xpra/server/window/video_subregion.py:79: Warning: Source ID 3002163 was not found when attempting to remove it
  self.source_remove(rt)
2016-01-04 21:07:18,688 using 'png' as non-video fallback using <bound method WindowVideoSource.pillow_encode of WindowVideoSource(3 : (821, 562))>
2016-01-04 21:24:22,283 using 'png' as non-video fallback using <bound method WindowVideoSource.pillow_encode of WindowVideoSource(4 : (622, 567))>
2016-01-04 23:27:56,040 sound source pipeline warning: Can't record audio fast enough
2016-01-04 23:28:10,265 sound source pipeline warning: ("gstaudiobasesrc.c(866): gst_audio_base_src_create (): /GstPipeline:pipeline0/GstPulseSrc:pulsesrc0:\nDropped 10584 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.",)
2016-01-04 23:28:27,530 sound source pipeline warning: Can't record audio fast enough
2016-01-04 23:28:34,249 sound source pipeline warning: ("gstaudiobasesrc.c(866): gst_audio_base_src_create (): /GstPipeline:pipeline0/GstPulseSrc:pulsesrc0:\nDropped 11025 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.",)
2016-01-04 23:28:35,216 sound source pipeline warning: Can't record audio fast enough

...

2016-01-04 23:29:26,337 sound source pipeline warning: Can't record audio fast enough
2016-01-04 23:29:26,337 sound source pipeline warning: ("gstaudiobasesrc.c(866): gst_audio_base_src_create (): /GstPipeline:pipeline0/GstPulseSrc:pulsesrc0:\nDropped 13230 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.",)
2016-01-04 23:29:26,337 sound source pipeline warning: Can't record audio fast enough
2016-01-04 23:29:26,337 sound source pipeline warning: ("gstaudiobasesrc.c(866): gst_audio_base_src_create (): /GstPipeline:pipeline0/GstPulseSrc:pulsesrc0:\nDropped 10584 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.",)
2016-01-04 23:29:26,337 sound source pipeline warning: Can't record audio fast enough
2016-01-04 23:29:26,337 sound source pipeline warning: ("gstaudiobasesrc.c(866): gst_audio_base_src_create (): /GstPipeline:pipeline0/GstPulseSrc:pulsesrc0:\nDropped 21168 samples. This is most likely because downstream can't keep up and is consuming samples too slowly.",)
2016-01-05 01:59:50,544 error on window 4
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 467, in recalculate_delays
    len(maximized_wids)>0 and wid not in maximized_wids)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 115, in calculate_batch_delay
    WindowSource.calculate_batch_delay(self, has_focus, other_is_fullscreen, other_is_maximized)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 812, in calculate_batch_delay
    calculate_batch_delay(self.wid, self.window_dimensions, has_focus, other_is_fullscreen, other_is_maximized, self.is_OR, self.soft_expired, self.batch_config, self.global_statistics, self.statistics)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/batch_delay_calculator.py", line 39, in calculate_batch_delay
    factors = statistics.get_factors(low_limit, batch.delay)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_stats.py", line 122, in get_factors
    factors.append(calculate_for_average(metric, avg1MB, recent1MB, weight_offset=1.0, weight_div=div))
  File "xpra/server/cystats.pyx", line 149, in xpra.server.cystats.calculate_for_average (xpra/server/cystats.c:3239)
ValueError: cannot convert float NaN to integer
2016-01-05 08:35:32,715 error on window 1
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 467, in recalculate_delays
    len(maximized_wids)>0 and wid not in maximized_wids)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 115, in calculate_batch_delay
    WindowSource.calculate_batch_delay(self, has_focus, other_is_fullscreen, other_is_maximized)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 812, in calculate_batch_delay
    calculate_batch_delay(self.wid, self.window_dimensions, has_focus, other_is_fullscreen, other_is_maximized, self.is_OR, self.soft_expired, self.batch_config, self.global_statistics, self.statistics)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/batch_delay_calculator.py", line 39, in calculate_batch_delay
    factors = statistics.get_factors(low_limit, batch.delay)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_stats.py", line 122, in get_factors
    factors.append(calculate_for_average(metric, avg1MB, recent1MB, weight_offset=1.0, weight_div=div))
  File "xpra/server/cystats.pyx", line 149, in xpra.server.cystats.calculate_for_average (xpra/server/cystats.c:3239)
ValueError: cannot convert float NaN to integer
2016-01-05 08:35:32,716 error on window 2
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 467, in recalculate_delays
    len(maximized_wids)>0 and wid not in maximized_wids)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 115, in calculate_batch_delay
    WindowSource.calculate_batch_delay(self, has_focus, other_is_fullscreen, other_is_maximized)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 812, in calculate_batch_delay
    calculate_batch_delay(self.wid, self.window_dimensions, has_focus, other_is_fullscreen, other_is_maximized, self.is_OR, self.soft_expired, self.batch_config, self.global_statistics, self.statistics)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/batch_delay_calculator.py", line 39, in calculate_batch_delay
    factors = statistics.get_factors(low_limit, batch.delay)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_stats.py", line 122, in get_factors
    factors.append(calculate_for_average(metric, avg1MB, recent1MB, weight_offset=1.0, weight_div=div))
  File "xpra/server/cystats.pyx", line 149, in xpra.server.cystats.calculate_for_average (xpra/server/cystats.c:3239)
ValueError: cannot convert float NaN to integer
2016-01-05 08:35:51,964 using 'png' as non-video fallback using <bound method WindowVideoSource.pillow_encode of WindowVideoSource(3 : (821, 562))>
2016-01-05 09:31:13,148 Warning: delayed region timeout
2016-01-05 09:31:13,149  region is 15 seconds old, will retry - bad connection?
...

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 19:03:09: afarr changed status from reopened to new

@totaam
Copy link
Collaborator Author

totaam commented Jan 5, 2016

2016-01-05 19:03:09: afarr changed owner from afarr to antoine

@totaam
Copy link
Collaborator Author

totaam commented Jan 6, 2016

2016-01-06 03:39:30: antoine changed owner from antoine to afarr

@totaam
Copy link
Collaborator Author

totaam commented Jan 6, 2016

2016-01-06 03:39:30: antoine commented


A good catch in there:

/usr/lib64/python2.7/site-packages/xpra/server/window/video_subregion.py:79: Warning: Source ID 3002163 was not found when attempting to remove it
  self.source_remove(rt)

is a valid complaint... it's a harmless race condition which I have tried to make less likely in r11600. It may still get hit, but we can just ignore it. (preventing it completely would be too expensive: it would require locking)

And another one:

using 'png' as non-video fallback using <bound method WindowVideoSource.pillow_encode of WindowVideoSource(3 : (821, 562))>

Happens when the video encoder gets removed between the time we check it and the time we use it.
This should be quite rare, but you seem to have hit it twice in 20 minutes.
That's something to keep an eye on, in case it happens too much: this would be a sign that we are re-cycling video contexts too quickly, which is very expensive. Being able to reproduce this would help.


This one is more difficult: ValueError: cannot convert float NaN to integer.
I believe that r11601 should prevent that, but since I can't reproduce this here, I can't be certain.

All 3 problems may be related, is there anything special about this setup that could help me reproduce?

@totaam
Copy link
Collaborator Author

totaam commented Jan 6, 2016

2016-01-06 19:01:00: afarr commented


Not sure if there was anything particular to this test. Same versions as listed in comment:2. As I recall, since I was running the stress with timers, there wouldn't've been any stress running. It was an osx 10.9..4 laptop, running a firefox window launched from an xterm (playing videos) and a google-chrome window also launched from an xterm (playing awful cnn video).

I presumed that the laptop client had tried to go to sleep long before the server started outputting the warnings - but a quick check shows that I have the display set to turn off after a little over an hour but set to prevent computer from sleeping, not set to put hard disks to sleep, and set to wake for wi fi access - though I believe I was running wired... so I suppose it was just the display that had turned off, without the machine itself having gone to sleep.

No idea what actively triggers the errors though.

@totaam
Copy link
Collaborator Author

totaam commented Jan 6, 2016

2016-01-06 21:15:10: afarr commented


Ohh... just remembered a detail that I deal with so much that I'd almost forgotten how relevant it might be... was testing with an Intel Iris graphics card, which means I had to force opengl with --opengl=on.

Maybe Intel is striking again?

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2016

2016-01-07 01:19:52: afarr commented


Hmm... managed to hit the ValueError: cannot convert float NaN to integer error again.

Was trying more of the same testing, with 0.16.1 11604 osx client and 0.16.1 r11607 fedora 23 server (running lots of video while doing other things) - but then tried passing the session to a windows client (0.17.0 r11588, mac-mini bootcamping windows, also running Intel 4000 graphics card, using --opengl=on to force) then grabbing it back to the osx client.

Upon grabbing session back to the osx client, I got a new traceback (will include here in case it helps with this), though trying to repeat I didn't get the error again:

2016-01-06 14:42:18,208 Disconnecting client '10.0.11.162:60418':
2016-01-06 14:42:18,219  new client (this session does not allow sharing)
2016-01-06 14:42:18,226 error on window 4
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 469, in recalculate_delays
    ws.reconfigure()
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 895, in reconfigure
    self.update_encoding_options(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 661, in update_encoding_optio                  ns
    self.check_pipeline_score(force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 717, in check_pipeline_score
    scores = self.get_video_pipeline_options(encoding, w, h, self.pixel_format, force_reload)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 802, in get_video_pipeline_op                  tions
    csc_specs = self.video_helper.get_csc_specs(src_format)
AttributeError: 'NoneType' object has no attribute 'get_csc_specs'
2016-01-06 14:42:18,265 sound source stopping
2016-01-06 14:42:18,295 xpra client disconnected.
2016-01-06 14:42:18,303 Python/Gtk2 Mac OSX client version 0.16.1-11604
2016-01-06 14:42:18,303  connected from 'Schadenfreude.local' as 'Schadenfreude'

I let the session continue, but looking at it later noticed that I got the ValueError: cannot convert float NaN to integer error server side at about the same time I got some long timer warnings from the UI thread...

Server:

2016-01-06 15:52:09,646 error on window 1
Traceback (most recent call last):
  File "/usr/lib64/python2.7/site-packages/xpra/server/source.py", line 468, in recalculate_delays
    len(maximized_wids)>0 and wid not in maximized_wids)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_video_source.py", line 115, in calculate_batch_delay
    WindowSource.calculate_batch_delay(self, has_focus, other_is_fullscreen, other_is_maximized)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_source.py", line 817, in calculate_batch_delay
    calculate_batch_delay(self.wid, self.window_dimensions, has_focus, other_is_fullscreen, other_is_maximized, self.is_OR, self.soft      _expired, self.batch_config, self.global_statistics, self.statistics)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/batch_delay_calculator.py", line 39, in calculate_batch_delay
    factors = statistics.get_factors(low_limit, batch.delay)
  File "/usr/lib64/python2.7/site-packages/xpra/server/window/window_stats.py", line 121, in get_factors
    factors.append(calculate_for_average(metric, avg1MB, recent1MB, weight_offset=1.0, weight_div=div))
  File "xpra/server/cystats.pyx", line 149, in xpra.server.cystats.calculate_for_average (xpra/server/cystats.c:3239)
    info = {"avg"   : int(1000.0*avg),
ValueError: cannot convert float NaN to integer

Client:

2016-01-06 15:52:06,642 Warning: long timer waiting time,
2016-01-06 15:52:06,643  UI thread polling waited 5.1 seconds longer than intended (5.6 vs 0.5)
2016-01-06 15:52:06,643 UI thread is now blocked
2016-01-06 15:52:06,648 UI thread is running again, resuming

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2016

2016-01-07 01:22:56: afarr commented


As a probably separate issue, though probably not worth a ticket of its own, as I was passing the session back and forth between clients, the 0.17.0 windows client output the following warning (which I suspect is just a result of connecting 0.17 client to 0.16.1 server, though wanted to mention it) upon each new connection.

2016-01-06 17:18:12,679 Warning: bencode import failed: No module named 'xpra.net.bencode'

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2016

2016-01-07 01:28:46: antoine changed status from new to assigned

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2016

2016-01-07 01:28:46: antoine changed owner from afarr to antoine

@totaam
Copy link
Collaborator Author

totaam commented Jan 7, 2016

2016-01-07 01:28:46: antoine commented


  • the bencode warning is the one I was talking about in switch to gstreamer 1.x on win32 #1041#comment:3 and we can safely ignore it
  • the AttributeError: 'NoneType' object has no attribute 'get_csc_specs' should be fixed in r11610 (low priority backport as this is harmless)
  • the ValueError: cannot convert float NaN to integer is bugging me: I see no way of getting this error unless avg is NaN, and avg=avg_value/div so either avg_value is NaN or div is zero? (this is Cython not Python code..), but:
  • div # logp(max(rss, minspeed)/minspeed) with minspeedfloat(128*1024), so div should always be >0
  • avg = 1.0*1024*1024/self.avg_send_speed which comes after the newly added guard checking that avg_send_speed is >0...

@totaam
Copy link
Collaborator Author

totaam commented Jan 28, 2016

2016-01-28 20:55:52: antoine commented


Note: comment:3 onwards are not related to sound and should have been recorded somewhere else... oh well.

We now have follow up in: #1103

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2016

2016-03-16 05:26:14: antoine changed status from assigned to closed

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2016

2016-03-16 05:26:14: antoine set resolution to fixed

@totaam
Copy link
Collaborator Author

totaam commented Mar 16, 2016

2016-03-16 05:26:14: antoine commented


Let's close this one for now and re-open if we hit it again reliably under "normal" circumstances.

@totaam totaam closed this as completed Mar 16, 2016
@totaam
Copy link
Collaborator Author

totaam commented Apr 11, 2016

2016-04-11 02:54:11: antoine commented


The actual queue time env var XPRA_SOUND_QUEUE_SOURCE_TIME was not being honoured, r12371 fixes that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant