Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG][BYT-NOCODEC][BDW_WSB_RT286]Input/output error when simultaneous-playback-capture / multiple-pipeline-playback #3170

Closed
Liviali155 opened this issue Jul 13, 2020 · 39 comments · Fixed by #3339
Assignees
Labels
BSW Braswell bug Something isn't working as expected BYT Applies to Baytrail platform Intel Linux Daily tests This issue can be found in internal Linux daily tests P3 Low-impact bugs or features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc)

Comments

@Liviali155
Copy link

Liviali155 commented Jul 13, 2020

Describe the bug
Input/output error when do simultaneous-playback-capture test
After error occured all the pipeline can work

To Reproduce
1."sudo reboot" to reboot system
2.cd sof-test
3.cd test-case
4.export TPLG=sof-byt-nocodec.tplg
5../simultaneous-playback-capture.sh -l 100

Reproduction Rate
1 round: failed at 13/100

Expected behavior
No error occured

Impact
Input/output error when do simultaneous-playback-capture test of aplay(0,0) and arecord (0,0)

Capture

ubuntu@jf-byt-mb-nocodec-1:~$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: sofnocodec [sof-nocodec], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofnocodec [sof-nocodec], device 1: PCM Deep Buffer (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0


ubuntu@jf-byt-mb-nocodec-1:~$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: sofnocodec [sof-nocodec], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0

No error dmesg,no error sof error trace

Environment
Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
Kernel: {sof-dev fa7850de}
SOF: {master:318dc9f7}
Name of the topology file
Topology: {sof-byt-nocodec.tplg }
Name of the platform(s) on which the bug is observed.
Platform: {BYT MB with nocodec}

dmesg0713.log
sof-logger0713.log

@Liviali155 Liviali155 added bug Something isn't working as expected BYT Applies to Baytrail platform P2 Critical bugs or normal features labels Jul 13, 2020
@Liviali155
Copy link
Author

On BSW with onboard codec MAX98090 in I2S mode also has this issue with sof-dev(9eb3d58)+master(5564a90)
Capture

Failed at 39/50

ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: max98090 [sof-bytcht max98090], device 1: PCM Deep Buffer (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0


ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

dmesg0714.log
sof-logger0714.log

@Liviali155 Liviali155 added the BSW Braswell label Jul 14, 2020
@mengdonglin
Copy link
Collaborator

@slawblauciak sof-byt-nocodec.tplg is created from "sof-cht-nocodec.m4", updated in #3080 to remove SRC for Baytrail and CherryTrail by @plbossart

@plbossart
Copy link
Member

@Liviali155 can you retry with the SOF PR #3245

The symptoms of no dmesg error, no trace and the -EIO error seem completely aligned with my findings on those platforms.

@Liviali155
Copy link
Author

After applied #3245,issue still can be reproduced on BSW with onboard codec MAX98090 in I2S mode and BYT MB with nocodec
Capture
dmesg0729.log
sof-logger0729.log

@plbossart
Copy link
Member

@Liviali155 can you retry with both #3245 and #3257

Somehow I have the feeling we have the same problem of not having enough memory resulting in some sort of underflow error. It's not clear e.g. why we have these repeated sof-logger messages in both issues

[241094039.427083] (241094032.000000) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241094042.916667] (        3.489583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
[241095038.958333] (      996.041687) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241095043.697917] (        4.739583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5

Seems to me like the firmware gets lost with bad pointers and can't recover in both #3170 and #3171

@mmaka1 @lgirdwood FYI

@plbossart
Copy link
Member

@Liviali155 Also wondering if 61a2c75 (' dma: dw: fix locking and calculations in dw_dma_get_data_size') has an impact on multiple pipelines. This changes the behavior for dmic, I am trying to see if reverting it can help. Can you also try on your side.

@plbossart
Copy link
Member

@Liviali155 Adding PR#3245, @3257 and reverting 61a2c75 seems to solve the issue for 100 iterations. branch here: https://github.com/plbossart/sof/tree/fix/multi-pipelines
I will launch a longer test on my side.

@Liviali155
Copy link
Author

Liviali155 commented Jul 30, 2020

@plbossart Used https://github.com/plbossart/sof/tree/fix/multi-pipelines(commit:d9c6405a) to test,issue still can be reproduced on byt-nocodec , failed at 287/1000, seems the reproduce rate is lower than before
Capture

sof-logger0730.log
dmesg0730.log

@plbossart
Copy link
Member

Ack @Liviali155, same on my side. My first run worked for 100 iteration, but on the second a failure happened on iteration 9/1000.

We need to figure this one out, something's not right with scheduling/concurrency. @lgirdwood FYI

@lgirdwood
Copy link
Member

lgirdwood commented Aug 4, 2020

and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns, I'm also wondering if mixers is used on the other platforms.

I don't know how to turn on mixer debug?

Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline

@lgirdwood
Copy link
Member

and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns, I'm also wondering if mixers is used on the other platforms.

I don't know how to turn on mixer debug?

Oh, I think individual debug is still blocked on UUID PR, Maybe just be easier to change trace_dbg to trace_err in mixer_copy().

Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline

Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue ?

I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....

@lyakh
Copy link
Collaborator

lyakh commented Aug 12, 2020

Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue ?

@lgirdwood the topology is there, it's generated from the cht m4.

I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....

I see the PGA component reporting 0 available frames when this bug occurs...

@lgirdwood
Copy link
Member

I do suspect that mixer is not correctly getting the state of it's sources/sinks and this results in 0 bytes to copy, but then I've no idea why does not complain about over/under runs....

I see the PGA component reporting 0 available frames when this bug occurs...

Please let me know if you see the DMA complaining about uder/overruns. If we dont see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.

@lyakh
Copy link
Collaborator

lyakh commented Aug 13, 2020

Please let me know if you see the DMA complaining about uder/overruns. If we dont see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.

@lgirdwood no, I don't see any of those

@lyakh
Copy link
Collaborator

lyakh commented Aug 13, 2020

I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?

@plbossart
Copy link
Member

I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?

all byt/cht topologies use DMA schedulers, it's not limited to the nocodec case.

@lyakh
Copy link
Collaborator

lyakh commented Aug 13, 2020

I'm curious about scheduling domains. In some cases (like in UP2 case) they seem to be freely replaceable - you can use one or another. In other cases (like BYT nocodec) only one works. What are the conditions for each domain to be applicable? And it does look like the DMA scheduling domain has got some issues.

@lgirdwood
Copy link
Member

lgirdwood commented Aug 13, 2020

@lyakh in principle they should do the same thing, that is schedule timely pipeline work, but there are some differences in implementation, synchronisation and maybe IRQ runlevel. i.e. they are both triggered on IRQs, the timer domain schedules all work in order, whereas the DMA domain probably schedules on DMA IRQ and this may be asynchronous to other pipelines (and could block on other work finishing).

@plbossart
Copy link
Member

The DMA scheduling doesn't work for HDaudio link DMAs, not interrupts are generated so you HAVE to use the timer-based scheduling for all HDAudio pipelines.
For SSP and DMIC, I think the two cases are equivalent, but the timer might be more efficient since there's only one tick and you can take care of all pipelines. I have never seen any data showing that 1 ms interrupt is actually a problem though.

@plbossart
Copy link
Member

And to build on this, even for Baytrail in master mode, the legacy closed-source firmware did not use DMA interrupts but also a 1ms external timer ticks, so I will assert that for the SSP using the timer or the DMA interrupt is essentially the same. I think the choice was more a case of not having to validate baytrail/cherrytrail, initially all scheduling was DMA based for early platforms and it stayed that way due to code inertia.

@lyakh
Copy link
Collaborator

lyakh commented Aug 13, 2020

@lgirdwood @plbossart thanks! I've tried blatantly replacing the DMA domain with the timer domain in the BYT nocodec topology and it isn't even loading now - DW DMA errors out with some missing configuration. Investigating.
EDIT: it is loading, it's failing later when trying to configure the first pipeline.

@paulstelian97
Copy link
Collaborator

Hm so DMA scheduling is looked at more of as a legacy that is supported rather than the recommended way?

@plbossart
Copy link
Member

plbossart commented Aug 13, 2020

Hm so DMA scheduling is looked at more of as a legacy that is supported rather than the recommended way?

If the interface is slave to an external device, the DMA scheduling is required. When the interface is master and synchronous with the timer tick, switching the two is a revalidation effort but I don't see how the performance might differ on paper. But as @lyakh shows above, in practice there might be implementation issues.

Edit: to be clear, for Intel only the SSP can be slave to an external clock, the HDaudio, DMIC and SoundWire interfaces are all clock masters and the clocks are synchronous with the timers.

@lyakh
Copy link
Collaborator

lyakh commented Aug 14, 2020

To recap: I've found out that all "legacy" platforms (BYT, CHT, BDW, etc.) use DMA scheduling. An attempt to switch byt-nocodec to timer scheduling failed with firmware errors, which I since then have tried to debug and fix.

I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in dw_dma_set_config() with an error:

ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1

This doesn't fail with DMA scheduling because then the .irq_disabled flag isn't set and the driver then doesn't even check how many elements the configuration specifies.

This doesn't fail on non-legacy platforms, because they don't specify CONFIG_HOST_PTABLE, then the COMP_ATTR_HOST_BUFFER host / PCM attribute isn't set, so in create_local_elems() the hd->config SG array is allocated with buffer_count (5) elements and not with 1.

I tried fixing the above problem by allocating the necessary minimum (3) of SG elements in create_local_elems() and then also by changing DW_DMA_BUFFER_PERIOD_COUNT for !CONFIG_HW_LLI case to 3 too. That fixed two instances of DMA configuration failure but then the firmware failed later anyway.

So, it looks like "legacy platforms" have multiple problems with timer-driven scheduling. It might be our best option ATM to make this a hard rule somewhere and try to fix DMA scheduling which we need anyway.

@lgirdwood
Copy link
Member

I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in dw_dma_set_config() with an error:

ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1

This doesn't fail with DMA scheduling because then the .irq_disabled flag isn't set and the driver

This is a rule for when DMA uses HW LLI mode (since there is a race between writing back LL descriptors and resetting them when 2 periods are used). This rule should not apply for SW LLI on BYT.

@keqiaozhang keqiaozhang reopened this Jan 5, 2021
@plbossart
Copy link
Member

@keqiaozhang is there a way we can bisect to see when the problem re-appeared?

@XiaoyunWu6666 XiaoyunWu6666 added the Intel Linux Daily tests This issue can be found in internal Linux daily tests label Jun 10, 2021
@plbossart
Copy link
Member

@lgirdwood This issue remains visible in recent Intel daily tests, it's still a problem.

@lgirdwood lgirdwood self-assigned this Jun 11, 2021
@XiaoyunWu6666
Copy link
Contributor

XiaoyunWu6666 commented Jun 17, 2021

[241094039.427083] (241094032.000000) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241094042.916667] (        3.489583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
[241095038.958333] (      996.041687) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241095043.697917] (        4.739583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5

buffer warning 'no bytes to produce' and ' no bytes to consume' don't appear these days .
but we can still get ' WARN dai_copy(): nothing to copy' from dai

example :
inner dailytest 4705;model=BDW_WSB_RT286;testcase=multiple-pipeline-playback-50

@XiaoyunWu6666 XiaoyunWu6666 changed the title [BUG][BYT-NOCODEC]Input/output error when do simultaneous-playback-capture test [BUG][BYT-NOCODEC][BDW_WSB_RT286]Input/output error when simultaneous-playback-capture / multiple-pipeline-playback Jun 18, 2021
@keyonjie
Copy link
Contributor

We are still seeing this in recent daily report.

@XiaoyunWu6666
Copy link
Contributor

in inner daily 9751 and 9715, when check-playback/check-capture on BDW_WSR_RT286 , IO error happen in the first play

@lgirdwood
Copy link
Member

@XiaoyunWu6666 I'm suspicious we have over budget MCPS given that both are HiFi2 and will use the generic C processing with the frag API. @singalsu fyi - lets retest this again after all the frag APIs users have been fixed.

@plbossart
Copy link
Member

FWIW we seem to have an interrupt issue on Broadwell thesofproject/linux#3400

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 14, 2022

Still happening in daily 10146?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50

Start Time: 2022-02-11 22:27:26 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 98119478
SOF Branch: main
SOF Commit: b8954754f055

@marc-hb marc-hb added P1 Blocker bugs or important features and removed P2 Critical bugs or normal features labels Feb 15, 2022
marc-hb added a commit to marc-hb/sof-test that referenced this issue Feb 19, 2022
Known issue thesofproject/sof#3170 has been
polluting the test results for years.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@mengdonglin mengdonglin added P3 Low-impact bugs or features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc) and removed P1 Blocker bugs or important features labels Feb 22, 2022
@mengdonglin
Copy link
Collaborator

As Broadwell (BDW) is a very old platform, we lower its priority and will not fix issues with multi-pipeline test cases on BDW.

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 22, 2022

The test still failed yesterday in 10402?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50, let's close this when thesofproject/sof-test#863 is merged so we stop testing this every day like someone is assigned to it.

We had only 6 distinct failures in 10402 and this was one of them.

@marc-hb marc-hb reopened this Feb 22, 2022
@XiaoyunWu6666
Copy link
Contributor

@marc-hb I think we can close this again since https://github.com/intel-innersource/drivers.audio.ci.sof-framework/pull/185 got merged and also see current test 10444
image
[console]

test case multiple-pipeline-capture-50.sh is SKIP!
Catch ignore field of test-case: won't fix https://github.com/thesofproject/sof/issues/3170!

marc-hb added a commit to thesofproject/sof-test that referenced this issue Feb 28, 2022
Known issue thesofproject/sof#3170 has been
polluting the test results for years.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
BSW Braswell bug Something isn't working as expected BYT Applies to Baytrail platform Intel Linux Daily tests This issue can be found in internal Linux daily tests P3 Low-impact bugs or features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

10 participants