
Risk of buffer underruns during seek with keylock enabled #11142

Open
daschuer opened this issue Dec 21, 2022 · 9 comments

@daschuer
Member

Bug Description

This is due to the crossfading plus the settling time of the pitch shift library.

Discussed in
#11120 (comment)

Solution: We need to move the crossfading to the input buffer.

Version

all

OS

all

@daschuer
Member Author

#11120 makes the issue worse.

@daschuer
Member Author

Since #11120 is merged, let's continue the discussion here:

@robbert-vdh

This is roughly three times the CPU.

Not quite. Keep in mind that Rubber Band processes the input in windows of n samples at a time. With the R2 algorithm (and default window size settings) that's 2048 samples, and with the R3 algorithm that's 4096 samples. If the Rubberband process function needs to produce, say, 128 samples, then it always needs to run 2048/4096 samples through the algorithm before it starts producing output. So when you add the padding the next window after the first one will be processed a little sooner, but after a reset the time stretcher will still only process a single window's worth of samples.
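For context, this is roughly how the windowed real-time API gets driven — a minimal sketch against the public RubberBand::RubberBandStretcher interface, not Mixxx's actual EngineBufferScaleRubberBand code; the buffer size, option flags, and variable names are illustrative only:

```cpp
#include <rubberband/RubberBandStretcher.h>

#include <algorithm>
#include <cstddef>
#include <vector>

int main() {
    using RubberBand::RubberBandStretcher;
    RubberBandStretcher stretcher(44100, 2,
            RubberBandStretcher::OptionProcessRealTime);
    stretcher.setTimeRatio(1.08); // e.g. slow the track down with keylock on

    constexpr std::size_t kBlock = 256; // engine callback size in frames
    std::vector<float> inL(kBlock), inR(kBlock);
    std::vector<float> outL(kBlock), outR(kBlock);
    const float* in[2] = {inL.data(), inR.data()};

    stretcher.reset(); // what a seek effectively does: the analysis window is now empty

    std::size_t produced = 0;
    while (produced < kBlock) {
        if (stretcher.available() <= 0) {
            // Right after reset() the stretcher has to consume a whole analysis
            // window (2048 frames for R2, 4096 for R3) before available() > 0,
            // so the first block after a seek pays for several process() calls
            // at once.
            // ... refill inL/inR with kBlock fresh frames from the reader ...
            stretcher.process(in, kBlock, false);
            continue;
        }
        float* out[2] = {outL.data() + produced, outR.data() + produced};
        const std::size_t want = std::min<std::size_t>(
                stretcher.available(), kBlock - produced);
        produced += stretcher.retrieve(out, want);
    }
    return 0;
}
```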

The 2048-frame window is moved by 512 frames and processed again. This means every process() call analyzes 2048 frames to produce 512 output frames. It takes ~0.3 ms here in the best case, but can also take ~0.7 ms; I am not sure what is blocking in the latter case. The first process() calls are "free", because Rubber Band does not yet have enough samples to fill the first window.
This means consuming the start padding of 1024 frames will require two extra process() calls, so the initial assumption is correct. This is even worse in a low-latency setup with a 256-frame buffer, as the following measurements show.
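As a quick back-of-the-envelope check of the two-extra-calls estimate, using the R2 figures above (2048-frame window, 512-frame hop, 1024 frames of start padding):

```latex
\frac{\text{start padding}}{\text{hop size}} = \frac{1024}{512} = 2 \ \text{extra } \texttt{process()}\ \text{calls},
\qquad 2 \times 2048 = 4096 \ \text{extra frames analyzed per reset.}
```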

I have tested it with ramping disabled.

With Padding:

debug [Engine] 1003030 ns 256
debug [Engine]  284299 ns 256
debug [Engine]    3877 ns 256
debug [Engine]  299122 ns 256
debug [Engine]    8488 ns 256
debug [Engine]  318902 ns 256
debug [Engine]    7167 ns 256
debug [Engine]  402462 ns 256
debug [Engine]    6564 ns 256
debug [Engine]  739872 ns 256

Without padding:

debug [Engine] 436958 ns 256
debug [Engine]  11097 ns 256
debug [Engine] 468216 ns 256
debug [Engine]  10362 ns 256
debug [Engine] 499175 ns 256
debug [Engine]   5371 ns 256
debug [Engine] 491943 ns 256
debug [Engine]   4381 ns 256
debug [Engine] 594607 ns 256
debug [Engine]  11932 ns 256
debug [Engine] 541240 ns 256

I am considering disabling the padding via a compile flag or a preferences option.

Do you think we can distinguish use cases when the padding is required/affordable?

@daschuer
Member Author

Without Hyperthreading:

debug [Engine] 1408800 ns 256
debug [Engine]  527792 ns 256
debug [Engine]    5734 ns 256
debug [Engine]  598796 ns 256
debug [Engine]    6280 ns 256
debug [Engine]  537126 ns 256
debug [Engine]   13375 ns 256
debug [Engine]  600413 ns 256
debug [Engine]    8464 ns 256
debug [Engine]  426023 ns 256
debug [Engine]    3929 ns 256
debug [Engine]  308489 ns 256
debug [Engine]    4276 ns 256
debug [Engine]  392389 ns 256
debug [Engine]   11705 ns 256
debug [Engine]  468086 ns 256

This is the result with -8 % and Hyperthreading disabled:

debug [Engine] 1531296 ns 235.52
debug [Engine]  366774 ns 235.52
debug [Engine]  337107 ns 235.52
debug [Engine]  350946 ns 235.52
debug [Engine]  328551 ns 235.52
debug [Engine]  311551 ns 235.52
debug [Engine]  132047 ns 235.52
debug [Engine]  163263 ns 235.52
debug [Engine] 1221300 ns 235.52
debug [Engine] 1161088 ns 235.52
debug [Engine] 1062449 ns 235.52
debug [Engine] 1237705 ns 235.52
debug [Engine] 1107989 ns 235.52
debug [Engine] 1121478 ns 235.52
debug [Engine] 1155890 ns 235.52
debug [Engine] 1130431 ns 235.52
debug [Engine] 1097135 ns 235.52
debug [Engine] 1216932 ns 235.52
debug [Engine] 1117292 ns 235.52
  • 8 % (this is the worst case; guaranteed underflow :-( )
debug [Engine]  5460367 ns 384
debug [Engine]  1105931 ns 384
debug [Engine]  1123574 ns 384
debug [Engine]  1091340 ns 384
debug [Engine]   512441 ns 384
debug [Engine]   191527 ns 384
debug [Engine]   283019 ns 384
debug [Engine]   461562 ns 384
debug [Engine]   878107 ns 384
debug [Engine]   440988 ns 384
debug [Engine]   508900 ns 384
debug [Engine]   503065 ns 384
debug [Engine]   532310 ns 384
debug [Engine]   598499 ns 384
debug [Engine]   267653 ns 384
debug [Engine]   273036 ns 384
debug [Engine]   292423 ns 384
debug [Engine]   545478 ns 384
debug [Engine]   311417 ns 384

Workaround: disable padding:

debug [Engine] 489506 ns 384
debug [Engine] 651018 ns 384
debug [Engine] 107713 ns 384
debug [Engine] 502916 ns 384
debug [Engine] 720728 ns 384
debug [Engine] 718260 ns 384
debug [Engine] 730008 ns 384
debug [Engine] 581509 ns 384
debug [Engine] 228926 ns 384
debug [Engine] 307710 ns 384
debug [Engine] 575784 ns 384

@robbert-vdh
Contributor

I also ran the numbers and they don't quite line up with yours (the averages aren't very useful for the 128-sample period size, since during most of the periods the algorithm won't be doing anything, so just look at the maxima):

Markdown version:

| Period size (frames) | Variant | after_reset | count | mean (ms) | std (ms) | min (ms) | 25% (ms) | 50% (ms) | 75% (ms) | max (ms) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 128 | R2, no padding | False | 2419 | 0.0727065 | 0.133185 | 0.00022 | 0.00034 | 0.000431 | 0.000882 | 0.493744 |
| 128 | R2, no padding | True | 5 | 0.287703 | 0.0282409 | 0.255233 | 0.272776 | 0.282395 | 0.298305 | 0.329804 |
| 128 | R2, with padding | False | 2481 | 0.0729526 | 0.133512 | 0.00024 | 0.00041 | 0.000561 | 0.000921 | 0.493744 |
| 128 | R2, with padding | True | 5 | 0.833309 | 0.0617512 | 0.727096 | 0.833327 | 0.857293 | 0.8715 | 0.87733 |
| 128 | R2, with padding, not dropped | False | 2223 | 0.0733349 | 0.133695 | 0.000311 | 0.000441 | 0.000521 | 0.000802 | 0.352317 |
| 128 | R2, with padding, not dropped | True | 5 | 0.291496 | 0.00976628 | 0.280551 | 0.286562 | 0.286813 | 0.300349 | 0.303204 |
| 128 | R3, no padding | False | 2594 | 0.213112 | 0.231615 | 0.00017 | 0.00034 | 0.0007215 | 0.45601 | 0.763324 |
| 128 | R3, no padding | True | 6 | 1.3241 | 1.93127 | 0.507581 | 0.520713 | 0.543102 | 0.564447 | 5.26601 |
| 128 | R3, with padding | False | 2557 | 0.200975 | 0.217931 | 0.00024 | 0.00035 | 0.000621 | 0.434111 | 0.615986 |
| 128 | R3, with padding | True | 5 | 3.97221 | 1.66891 | 3.18784 | 3.19398 | 3.24011 | 3.28224 | 6.95685 |
| 128 | R3, with padding, not dropped | False | 2890 | 0.204037 | 0.221124 | 0.000221 | 0.000391 | 0.000551 | 0.444145 | 0.677943 |
| 128 | R3, with padding, not dropped | True | 6 | 1.33815 | 2.1098 | 0.447025 | 0.473365 | 0.486972 | 0.492613 | 5.64462 |
| 512 | R2, no padding | False | 717 | 0.28882 | 0.090682 | 0.000741 | 0.302723 | 0.315217 | 0.323542 | 0.494165 |
| 512 | R2, no padding | True | 5 | 0.572012 | 0.0172997 | 0.541575 | 0.576831 | 0.578114 | 0.578625 | 0.584916 |
| 512 | R2, with padding | False | 390 | 0.284081 | 0.0887271 | 0.000802 | 0.30554 | 0.312736 | 0.3185 | 0.441525 |
| 512 | R2, with padding | True | 4 | 1.03961 | 0.176308 | 0.776189 | 1.02348 | 1.11987 | 1.136 | 1.14252 |
| 512 | R2, with padding, not dropped | False | 532 | 0.279834 | 0.0905585 | 0.000561 | 0.293586 | 0.310157 | 0.315898 | 0.460772 |
| 512 | R2, with padding, not dropped | True | 5 | 0.55508 | 0.019411 | 0.536274 | 0.545993 | 0.549159 | 0.556754 | 0.587221 |
| 512 | R3, no padding | False | 768 | 0.806678 | 0.165787 | 0.344442 | 0.809307 | 0.86953 | 0.894798 | 1.19747 |
| 512 | R3, no padding | True | 4 | 2.44791 | 2.44351 | 1.17635 | 1.17734 | 1.25182 | 2.52239 | 6.11167 |
| 512 | R3, with padding | False | 216 | 0.787303 | 0.167003 | 0.338521 | 0.788928 | 0.864116 | 0.888765 | 1.07252 |
| 512 | R3, with padding | True | 5 | 4.77156 | 1.90596 | 3.81508 | 3.85655 | 3.97673 | 4.03203 | 8.17743 |
| 512 | R3, with padding, not dropped | False | 465 | 0.789699 | 0.162433 | 0.367195 | 0.799863 | 0.854567 | 0.879203 | 1.11548 |
| 512 | R3, with padding, not dropped | True | 5 | 1.26595 | 0.0762737 | 1.15859 | 1.2433 | 1.25025 | 1.32523 | 1.35239 |

(table source)

The after_reset entries are the timings for the first scaleBuffer() call after a reset, and the other ones are the timings for the other scaleBuffer() calls. "with padding" is the new behavior, "with padding, not dropped" sets m_remainingPaddingInOutput to 0 at the start of retrieveAndDeinterleave so the processed padding is not thrown away, and "no padding" also doesn't add the padding after the reset (so the reset function just calls m_pRubberBand->reset()). There's a 2x difference between dropping the padding and not dropping the padding, which is kind of expected since it needs to compute more output (and remember, the algorithm works with overlap-add/an STFT).

Even without padding the first call after a reset sometimes takes significantly longer. Maybe it's possible to avoid calling reset() at all here?

@robbert-vdh
Contributor

Oh yeah and:

I am considering to disable the padding via a compile flag or a preferences option.

Do you think we can distinguish use cases when the padding is required/affordable?

The padding is always needed, or the output will fade in over the course of the first 2048 produced samples (refer back to that squared Hann function I posted in the PR).
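For reference, and assuming the standard Hann window definition for R2's 2048-sample grains,

```latex
w(n) = \tfrac{1}{2}\left(1 - \cos\frac{2\pi n}{N}\right) = \sin^{2}\!\frac{\pi n}{N}, \qquad 0 \le n < N = 2048,
```

so without the start padding the earlier overlapping grains are simply missing from the overlap-add sum, and (as the squared-Hann plot referenced above suggests) the start of the output is attenuated, only reaching full level after roughly one window length of produced samples.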

@robbert-vdh
Contributor

But yes, when properly taking care of the padding you do need to process two additional overlapping windows, and that will indeed increase the DSP load on the first scaleBuffer() call after a reset. So a DSP load spike is expected. I don't think there's a way around this.

@daschuer
Member Author

The current situation is:

  • Seeking a track with keylock enabled will likely introduce a tempo-changing dropout.
    Before:
  • Seeking a track with keylock enabled introduced a notable loudness dip.

Related:

  • Random seeks will introduce a loudness dip anyway, because of cache misses in the caching reader.
  • Loops are not affected, because the input signal is looped with crossfading.
  • When cueing in a track, you repeatedly seek in headphones only; this must not change the tempo of the playing track or create crackling.
  • Enabling keylock will introduce an underflow risk.

For now I consider that we have fixed one problem by introducing another, worse one.

As a band-aid I would like to disable the padding by default.

Later we may introduce a better solution:

  • Enable it only for tracks contributing to master.
  • Disable it for random seeks, where cache misses are likely anyway.
  • Fade the input signal (delays seeks by the Rubber Band latency).
  • Move Rubber Band to a worker thread (may suffer concurrency issues if the buffer is not read in time).

@robbert-vdh
Contributor

Random seeks will introduce a loudness dip anyway, because of cache misses in the caching reader

Also, why does Mixxx always stream the file even if the decoded file would only take up a fraction of the available system memory? I feel like just reading the entire track upfront if it's, say, below 100 MB when decoded to PCM samples would help seek performance a lot and remove a source of realtime-unsafeness that causes real world performance issues.
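To make the suggestion concrete, the decoded size is easy to estimate up front; here is a hypothetical helper (the function name and the 100 MB threshold are illustrative, not an existing Mixxx API):

```cpp
#include <cstdint>

// Hypothetical check: is the fully decoded track small enough to read into
// RAM up front instead of streaming it through the caching reader?
bool shouldPreloadWholeTrack(double durationSeconds,
        std::uint32_t sampleRate,
        std::uint32_t channels) {
    // Decoded PCM size in bytes, assuming 32-bit float samples.
    const double decodedBytes =
            durationSeconds * sampleRate * channels * sizeof(float);
    constexpr double kMaxPreloadBytes = 100.0 * 1024.0 * 1024.0; // ~100 MB
    return decodedBytes <= kMaxPreloadBytes;
}
```

A 5-minute stereo track at 44.1 kHz decodes to about 300 × 44100 × 2 × 4 ≈ 106 MB, so typical tracks would sit right around such a threshold.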

@daschuer
Member Author

I agree that this is not optimal and can be improved. The goal was originally to be ready to play a track immediately after throwing it onto a deck, and there were concerns about long tracks, which can theoretically be unlimited in length.
Currently we even have concurrent disk access between the caching reader and the analyzer, which makes the issue worse.
I would be happy if someone could change that.
Ideas:

  • Use the caching reader for the analyzer
  • Cache the whole track up to a reasonable size

This will free up a lot of resources, especially if you need to analyze a track during a performance for some reason.
