[BUG][BYT-NOCODEC][BDW_WSB_RT286]Input/output error when simultaneous-playback-capture / multiple-pipeline-playback #3170
Comments
BSW with the onboard MAX98090 codec in I2S mode also has this issue with sof-dev (9eb3d58) + master (5564a90). Failed at 39/50.
@slawblauciak sof-byt-nocodec.tplg is created from "sof-cht-nocodec.m4", updated in #3080 to remove SRC for Baytrail and CherryTrail by @plbossart |
@Liviali155 can you retry with the SOF PR #3245? The symptoms of no dmesg error, no trace and the -EIO error seem completely aligned with my findings on those platforms.
After applying #3245, the issue can still be reproduced on BSW with the onboard MAX98090 codec in I2S mode and on BYT MB with nocodec.
@Liviali155 can you retry with both #3245 and #3257? Somehow I have the feeling we have the same problem of not having enough memory, resulting in some sort of underflow error. It's not clear, e.g., why we have these repeated sof-logger messages in both issues.
Seems to me like the firmware gets lost with bad pointers and can't recover in both #3170 and #3171 @mmaka1 @lgirdwood FYI |
@Liviali155 Also wondering if 61a2c75 ('dma: dw: fix locking and calculations in dw_dma_get_data_size') has an impact on multiple pipelines. This changes the behavior for dmic; I am trying to see if reverting it can help. Can you also try on your side?
@Liviali155 Adding PR #3245 and #3257 and reverting 61a2c75 seems to solve the issue for 100 iterations. Branch here: https://github.com/plbossart/sof/tree/fix/multi-pipelines
@plbossart I used https://github.com/plbossart/sof/tree/fix/multi-pipelines (commit: d9c6405a) to test. The issue can still be reproduced on byt-nocodec; it failed at 287/1000. The reproduction rate seems lower than before.
Ack @Liviali155, same on my side. My first run worked for 100 iterations, but on the second a failure happened on iteration 9/1000. We need to figure this one out, something's not right with scheduling/concurrency. @lgirdwood FYI
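To put numbers on why a clean 100-iteration run is weak evidence of a fix, here is a rough sketch. It assumes each iteration fails independently with the same probability p (a geometric model, which this bug may well not follow), and estimates p as 1/k from a first failure at iteration k. The function names are made up for illustration.

```python
import math


def mle_failure_rate(first_failure_iteration: int) -> float:
    """Estimate the per-iteration failure probability as p = 1/k.

    Assumption: iterations are independent and identically distributed,
    and the first failure was seen on iteration k (1-based).
    """
    if first_failure_iteration < 1:
        raise ValueError("iteration index must be >= 1")
    return 1.0 / first_failure_iteration


def prob_clean_run(p: float, iterations: int) -> float:
    """Probability that `iterations` consecutive iterations all pass."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    return (1.0 - p) ** iterations


def iterations_for_confidence(p: float, confidence: float = 0.95) -> int:
    """Smallest run length n that hits at least one failure with the
    given probability, i.e. 1 - (1 - p)**n >= confidence."""
    if not 0.0 < p < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("p and confidence must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))


# Example with the 287/1000 result reported above.
p = mle_failure_rate(287)
print(f"estimated p = {p:.5f}")
print(f"P(100 iterations all pass) = {prob_clean_run(p, 100):.2f}")
print(f"iterations needed for 95% detection = {iterations_for_confidence(p)}")
```

Under these assumptions, with the failure rate implied by 287/1000, a 100-iteration run still passes roughly 70% of the time, so longer runs are needed before calling anything fixed.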
I don't know how to turn on mixer debug? Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline |
Oh, I think individual debug is still blocked on the UUID PR. It may just be easier to change trace_dbg to trace_err in mixer_copy().
Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue? I do suspect that the mixer is not correctly getting the state of its sources/sinks and this results in 0 bytes to copy, but then I've no idea why it does not complain about over/underruns...
@lgirdwood the topology is there, it's generated from the cht m4.
I see the PGA component reporting 0 available frames when this bug occurs... |
Please let me know if you see the DMA complaining about under/overruns. If we don't see this, then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.
@lgirdwood no, I don't see any of those |
I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers? |
all byt/cht topologies use DMA schedulers, it's not limited to the nocodec case. |
I'm curious about scheduling domains. In some cases (like in UP2 case) they seem to be freely replaceable - you can use one or another. In other cases (like BYT nocodec) only one works. What are the conditions for each domain to be applicable? And it does look like the DMA scheduling domain has got some issues. |
@lyakh in principle they should do the same thing, that is schedule timely pipeline work, but there are some differences in implementation, synchronisation and maybe IRQ runlevel. i.e. they are both triggered on IRQs, the timer domain schedules all work in order, whereas the DMA domain probably schedules on DMA IRQ and this may be asynchronous to other pipelines (and could block on other work finishing). |
The DMA scheduling doesn't work for HDaudio link DMAs: no interrupts are generated, so you HAVE to use the timer-based scheduling for all HDAudio pipelines.
And to build on this, even for Baytrail in master mode, the legacy closed-source firmware did not use DMA interrupts but 1ms external timer ticks, so I will assert that for the SSP, using the timer or the DMA interrupt is essentially the same. I think the choice was more a case of not having to validate baytrail/cherrytrail: initially all scheduling was DMA based for early platforms and it stayed that way due to code inertia.
@lgirdwood @plbossart thanks! I've tried blatantly replacing the DMA domain with the timer domain in the BYT nocodec topology and it isn't even loading now - DW DMA errors out with some missing configuration. Investigating. |
Hm, so DMA scheduling is seen more as a legacy mode that is still supported than as the recommended way?
If the interface is slave to an external device, the DMA scheduling is required. When the interface is master and synchronous with the timer tick, switching the two is a revalidation effort but I don't see how the performance might differ on paper. But as @lyakh shows above, in practice there might be implementation issues. Edit: to be clear, for Intel only the SSP can be slave to an external clock, the HDaudio, DMIC and SoundWire interfaces are all clock masters and the clocks are synchronous with the timers. |
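The rule described in the last few comments can be sketched as a small lookup. This only encodes the "on paper" rule as stated in this thread; the names (`Domain`, `allowed_domains`, the interface strings) are hypothetical and do not correspond to anything in the SOF tree or topology syntax.

```python
from enum import Enum


class Domain(Enum):
    TIMER = "timer"
    DMA = "dma"


# Per the comment above: on Intel only the SSP can be slave to an external
# clock; HDaudio, DMIC and SoundWire are always clock masters.
CLOCK_MASTER_ONLY = {"hda", "dmic", "soundwire"}


def allowed_domains(interface: str, clock_master: bool = True) -> set:
    """Return the scheduling domains usable on paper for an interface.

    Assumption: this mirrors the discussion only, not validated behaviour.
    """
    iface = interface.lower()
    if iface in CLOCK_MASTER_ONLY and not clock_master:
        raise ValueError(f"{interface} cannot be a clock slave")
    if iface == "hda":
        # HDaudio link DMAs generate no interrupts: timer scheduling only.
        return {Domain.TIMER}
    if not clock_master:
        # Interface is slave to an external device: DMA scheduling required.
        return {Domain.DMA}
    # Master and synchronous with the timer tick: either domain on paper,
    # switching is a revalidation effort.
    return {Domain.TIMER, Domain.DMA}


print(allowed_domains("ssp", clock_master=False))
print(allowed_domains("hda"))
```

Whether both domains actually work for a given platform is a separate question, which is what the rest of this thread is about.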
To recap: I've found out that all "legacy" platforms (BYT, CHT, BDW, etc.) use DMA scheduling. An attempt to switch byt-nocodec to timer scheduling failed with firmware errors, which I have since tried to debug and fix. I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in [...]
This doesn't fail with DMA scheduling because then the [...] This doesn't fail on non-legacy platforms, because they don't specify [...] I tried fixing the above problem by allocating the necessary minimum (3) of SG elements in [...] So, it looks like "legacy platforms" have multiple problems with timer-driven scheduling. It might be our best option ATM to make this a hard rule somewhere and try to fix DMA scheduling, which we need anyway.
This is a rule for when DMA uses HW LLI mode (since there is a race between writing back LL descriptors and resetting them when 2 periods are used). This rule should not apply for SW LLI on BYT. |
CI observed this issue again on BSW_CYN_MAX98090 and BYT_MB_NOCODEC. |
@keqiaozhang is there a way we can bisect to see when the problem re-appeared? |
@lgirdwood This issue remains visible in recent Intel daily tests, it's still a problem. |
The buffer warnings 'no bytes to produce' and 'no bytes to consume' don't appear these days. Example:
We are still seeing this in recent daily report. |
In inner daily runs 9751 and 9715, when running check-playback/check-capture on BDW_WSR_RT286, an I/O error happens on the first playback.
@XiaoyunWu6666 I suspect we are over the MCPS budget, given that both are HiFi2 and will use the generic C processing with the frag API. @singalsu fyi - let's retest this after all the frag API users have been fixed.
FWIW we seem to have an interrupt issue on Broadwell thesofproject/linux#3400 |
Still happening in daily 10146?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50 Start Time: 2022-02-11 22:27:26 UTC |
Known issue thesofproject/sof#3170 has been polluting the test results for years. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
As Broadwell (BDW) is a very old platform, we lower its priority and will not fix issues with multi-pipeline test cases on BDW. |
The test still failed yesterday in 10402?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50, let's close this when thesofproject/sof-test#863 is merged so we stop testing this every day like someone is assigned to it. We had only 6 distinct failures in 10402 and this was one of them. |
@marc-hb I think we can close this again since https://github.com/intel-innersource/drivers.audio.ci.sof-framework/pull/185 got merged and also see current test 10444
Describe the bug
Input/output error when running the simultaneous-playback-capture test.
After the error occurred, all the pipelines can still work.
To Reproduce
1. Run "sudo reboot" to reboot the system
2. cd sof-test
3. cd test-case
4. export TPLG=sof-byt-nocodec.tplg
5. ./simultaneous-playback-capture.sh -l 100
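The steps above run a single round. Below is a minimal sketch for repeating the round and counting how many rounds fail. It assumes the test script exits with a non-zero status when it hits the Input/output error, which has not been verified here. `count_failed_rounds` is a hypothetical helper and not part of sof-test.

```python
import subprocess


def count_failed_rounds(cmd, rounds):
    """Run `cmd` `rounds` times and return how many runs exited non-zero.

    Assumption: a non-zero exit status means the round hit the error.
    """
    failed = 0
    for _ in range(rounds):
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
        if result.returncode != 0:
            failed += 1
    return failed


# Example, to be run from the sof-test/test-case directory with TPLG exported:
# failed = count_failed_rounds(["./simultaneous-playback-capture.sh", "-l", "100"], 5)
# print(f"{failed}/5 rounds failed")
```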
Reproduction Rate
1 round: failed at 13/100
Expected behavior
No error occurs
Impact
Input/output error when running the simultaneous-playback-capture test with aplay (0,0) and arecord (0,0).
No error in dmesg, no error in the SOF error trace.
Environment
Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
Kernel: {sof-dev fa7850de}
SOF: {master:318dc9f7}
Name of the topology file
Topology: {sof-byt-nocodec.tplg }
Name of the platform(s) on which the bug is observed.
Platform: {BYT MB with nocodec}
dmesg0713.log
sof-logger0713.log