[BUG][BYT-NOCODEC][BDW_WSB_RT286]Input/output error when simultaneous-playback-capture / multiple-pipeline-playback #3170

Liviali155 · 2020-07-13T10:35:49Z

Describe the bug
An input/output error occurs when running the simultaneous-playback-capture test.
After the error occurs, all the pipelines can still work.

To Reproduce
1. "sudo reboot" to reboot the system
2. cd sof-test
3. cd test-case
4. export TPLG=sof-byt-nocodec.tplg
5. ./simultaneous-playback-capture.sh -l 100

Reproduction Rate
1 round: failed at 13/100

Expected behavior
No error occurs.

Impact
Input/output error when running the simultaneous-playback-capture test with aplay (0,0) and arecord (0,0)

ubuntu@jf-byt-mb-nocodec-1:~$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: sofnocodec [sof-nocodec], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: sofnocodec [sof-nocodec], device 1: PCM Deep Buffer (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0


ubuntu@jf-byt-mb-nocodec-1:~$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: sofnocodec [sof-nocodec], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0

No errors in dmesg, no errors in the SOF error trace.

Environment
Branch name and commit hash of the 2 repositories: sof (firmware/topology) and linux (kernel driver).
Kernel: {sof-dev fa7850de}
SOF: {master:318dc9f7}
Name of the topology file
Topology: {sof-byt-nocodec.tplg }
Name of the platform(s) on which the bug is observed.
Platform: {BYT MB with nocodec}

dmesg0713.log
sof-logger0713.log

Liviali155 · 2020-07-14T02:32:52Z

BSW with onboard codec MAX98090 in I2S mode also has this issue, with sof-dev (9eb3d58) + master (5564a90).

Failed at 39/50

ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ aplay -l
**** List of PLAYBACK Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 0: max98090 [sof-bytcht max98090], device 1: PCM Deep Buffer (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0


ubuntu@jf-bsw-cyn-max98090-3:~/sof-test/test-case$ arecord -l
**** List of CAPTURE Hardware Devices ****
card 0: max98090 [sof-bytcht max98090], device 0: PCM (*) []
  Subdevices: 1/1
  Subdevice #0: subdevice #0
card 1: Device [USB Audio Device], device 0: USB Audio [USB Audio]
  Subdevices: 1/1
  Subdevice #0: subdevice #0

dmesg0714.log
sof-logger0714.log

mengdonglin · 2020-07-14T08:01:36Z

@slawblauciak sof-byt-nocodec.tplg is created from "sof-cht-nocodec.m4", updated in #3080 to remove SRC for Baytrail and CherryTrail by @plbossart

plbossart · 2020-07-28T23:59:35Z

@Liviali155 can you retry with the SOF PR #3245?

The symptoms of no dmesg error, no trace and the -EIO error seem completely aligned with my findings on those platforms.

Liviali155 · 2020-07-29T03:00:06Z

After applying #3245, the issue can still be reproduced on BSW with onboard codec MAX98090 in I2S mode and on BYT MB with nocodec.

dmesg0729.log
sof-logger0729.log

plbossart · 2020-07-29T20:23:14Z

@Liviali155 can you retry with both #3245 and #3257?

Somehow I have the feeling we have the same problem of not having enough memory, resulting in some sort of underflow error. It's not clear, e.g., why we have these repeated sof-logger messages in both issues:

[241094039.427083] (241094032.000000) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241094042.916667] (        3.489583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
[241095038.958333] (      996.041687) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241095043.697917] (        4.739583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5

Seems to me like the firmware gets lost with bad pointers and can't recover in both #3170 and #3171

@mmaka1 @lgirdwood FYI
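
The repeated sof-logger warnings quoted above are easier to compare across logs when grouped by function and component pair. A minimal Python sketch, assuming only the line format shown in this comment (the regular expression and the `count_stalls` helper are illustrative and not part of sof-logger):

```python
import re
from collections import Counter

# Pattern derived from the four sample lines quoted in this thread;
# other firmware versions may format these warnings differently.
LINE_RE = re.compile(
    r"(?P<func>comp_update_buffer_(?:consume|produce))\(\), "
    r"no bytes to (?:consume|produce), "
    r"source->comp\.id = (?P<src>\d+), source->comp\.type = \d+, "
    r"sink->comp\.id = (?P<sink>\d+), sink->comp\.type = \d+"
)

# The log lines quoted above, copied verbatim.
SAMPLE = [
    "[241094039.427083] (241094032.000000) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6",
    "[241094042.916667] (        3.489583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5",
    "[241095038.958333] (      996.041687) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6",
    "[241095043.697917] (        4.739583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5",
]


def count_stalls(lines):
    """Count 'no bytes' warnings per (function, source id, sink id)."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            key = (match["func"], int(match["src"]), int(match["sink"]))
            counts[key] += 1
    return counts


if __name__ == "__main__":
    for key, n in sorted(count_stalls(SAMPLE).items()):
        print(key, n)
```

To use it on a real log, pass an open file instead of `SAMPLE`, e.g. `count_stalls(open("sof-logger0729.log"))`.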

plbossart · 2020-07-29T20:45:47Z

@Liviali155 Also wondering if 61a2c75 ('dma: dw: fix locking and calculations in dw_dma_get_data_size') has an impact on multiple pipelines. This changes the behavior for dmic; I am trying to see if reverting it can help. Can you also try on your side?

plbossart · 2020-07-29T20:54:19Z

@Liviali155 Adding PR #3245 and #3257 and reverting 61a2c75 seems to solve the issue for 100 iterations. Branch here: https://github.com/plbossart/sof/tree/fix/multi-pipelines
I will launch a longer test on my side.

Liviali155 · 2020-07-30T03:36:45Z

@plbossart I tested with https://github.com/plbossart/sof/tree/fix/multi-pipelines (commit: d9c6405a). The issue can still be reproduced on byt-nocodec: it failed at 287/1000. The reproduction rate seems lower than before.

sof-logger0730.log
dmesg0730.log

plbossart · 2020-07-30T14:50:07Z

Ack @Liviali155, same on my side. My first run worked for 100 iterations, but on the second a failure happened on iteration 9/1000.

We need to figure this one out, something's not right with scheduling/concurrency. @lgirdwood FYI
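
The first-failure iterations reported so far (13/100, 287/1000, 9/1000) vary widely, so a clean 100-iteration run is weak evidence of a fix. A rough Python sketch, assuming each iteration fails independently with a fixed probability `p` (an assumption this thread has not established): a single first failure at iteration `k` gives the estimate `p = 1/k`, and the chance of a clean run of `n` iterations is `(1 - p)**n`. The helper names are illustrative.

```python
def estimate_p(first_failure_iteration):
    """Point estimate of the per-iteration failure probability,
    assuming independent iterations (geometric model)."""
    return 1.0 / first_failure_iteration


def prob_clean_run(p, iterations):
    """Probability that `iterations` consecutive iterations all pass."""
    return (1.0 - p) ** iterations


if __name__ == "__main__":
    # First-failure iterations reported in this thread.
    for k in (13, 287, 9):
        p = estimate_p(k)
        print(f"first failure at {k}: p ~ {p:.4f}, "
              f"P(100 clean iterations) ~ {prob_clean_run(p, 100):.4f}")
```

Under this model, a failure rate matching the 287/1000 result would still let about 70% of 100-iteration runs pass, so longer runs are needed to compare branches.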

lgirdwood · 2020-08-04T16:00:39Z (edited by plbossart)

> and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns? I'm also wondering if mixers are used on the other platforms.

I don't know how to turn on mixer debug?

Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline.

lgirdwood · 2020-08-05T09:31:54Z

> > and will result in the above warnings. @plbossart can you turn on mixer debug and see if this aligns? I'm also wondering if mixers are used on the other platforms.

> I don't know how to turn on mixer debug?

Oh, I think individual debug is still blocked on the UUID PR. It may just be easier to change trace_dbg to trace_err in mixer_copy().

> Mixers are used in all platforms, but here indeed the use of the mixer is different: it's the first element in a 'DAI' pipeline.

Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue?

I do suspect that the mixer is not correctly getting the state of its sources/sinks and this results in 0 bytes to copy, but then I've no idea why it does not complain about over/under runs....

lyakh · 2020-08-12T15:47:07Z

> Yep, but I no longer see a sof-byt-nocodec.m4 topology in master, so could your topology binary be stale here and potentially causing an issue?

@lgirdwood the topology is there, it's generated from the cht m4.

> I do suspect that the mixer is not correctly getting the state of its sources/sinks and this results in 0 bytes to copy, but then I've no idea why it does not complain about over/under runs....

I see the PGA component reporting 0 available frames when this bug occurs...

lgirdwood · 2020-08-13T09:08:29Z

> > I do suspect that the mixer is not correctly getting the state of its sources/sinks and this results in 0 bytes to copy, but then I've no idea why it does not complain about over/under runs....

> I see the PGA component reporting 0 available frames when this bug occurs...

Please let me know if you see the DMA complaining about under/overruns. If we don't see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.

lyakh · 2020-08-13T12:39:18Z

> Please let me know if you see the DMA complaining about under/overruns. If we don't see this then the data could be getting stuck in the PGA or mixer. Btw, there may be some PCM converter between DMA and PGA that could block too.

@lgirdwood no, I don't see any of those

lyakh · 2020-08-13T13:20:35Z

I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?

plbossart · 2020-08-13T13:25:15Z

> I'm wondering why the generic nocodec topology sof-cht-nocodec.m4 uses DMA schedulers?

all byt/cht topologies use DMA schedulers, it's not limited to the nocodec case.

lyakh · 2020-08-13T13:57:47Z

I'm curious about scheduling domains. In some cases (like in UP2 case) they seem to be freely replaceable - you can use one or another. In other cases (like BYT nocodec) only one works. What are the conditions for each domain to be applicable? And it does look like the DMA scheduling domain has got some issues.

lgirdwood · 2020-08-13T14:29:58Z

@lyakh in principle they should do the same thing, that is schedule timely pipeline work, but there are some differences in implementation, synchronisation and maybe IRQ runlevel. i.e. they are both triggered on IRQs, the timer domain schedules all work in order, whereas the DMA domain probably schedules on DMA IRQ and this may be asynchronous to other pipelines (and could block on other work finishing).

plbossart · 2020-08-13T14:33:09Z

The DMA scheduling doesn't work for HDaudio link DMAs: no interrupts are generated, so you HAVE to use the timer-based scheduling for all HDAudio pipelines.
For SSP and DMIC, I think the two cases are equivalent, but the timer might be more efficient since there's only one tick and you can take care of all pipelines. I have never seen any data showing that 1 ms interrupt is actually a problem though.

plbossart · 2020-08-13T14:35:18Z

And to build on this, even for Baytrail in master mode, the legacy closed-source firmware did not use DMA interrupts but 1 ms external timer ticks, so I will assert that for the SSP, using the timer or the DMA interrupt is essentially the same. I think the choice was more a case of not having to validate Baytrail/Cherrytrail: initially all scheduling was DMA based for early platforms and it stayed that way due to code inertia.

lyakh · 2020-08-13T14:59:12Z

@lgirdwood @plbossart thanks! I've tried blatantly replacing the DMA domain with the timer domain in the BYT nocodec topology and it isn't even loading now - DW DMA errors out with some missing configuration. Investigating.
EDIT: it is loading, it's failing later when trying to configure the first pipeline.

paulstelian97 · 2020-08-13T15:07:46Z

Hm, so DMA scheduling is seen more as a supported legacy than as the recommended way?

plbossart · 2020-08-13T15:30:47Z

> Hm, so DMA scheduling is seen more as a supported legacy than as the recommended way?

If the interface is slave to an external device, the DMA scheduling is required. When the interface is master and synchronous with the timer tick, switching the two is a revalidation effort but I don't see how the performance might differ on paper. But as @lyakh shows above, in practice there might be implementation issues.

Edit: to be clear, for Intel only the SSP can be slave to an external clock, the HDaudio, DMIC and SoundWire interfaces are all clock masters and the clocks are synchronous with the timers.
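
The rules stated in the last few comments can be summarized as a small lookup. This is a Python sketch of the discussion only, not firmware code; the `Interface` enum and the `allowed_domains` function are hypothetical names, and the "either domain" result is the on-paper expectation, which the BYT nocodec experiment above shows may not hold in practice.

```python
from enum import Enum


class Interface(Enum):
    SSP = "ssp"
    DMIC = "dmic"
    HDAUDIO = "hdaudio"


def allowed_domains(interface, clock_slave=False):
    """Return the scheduling domains described as usable in this thread."""
    if clock_slave and interface is not Interface.SSP:
        # Per the thread, on Intel only the SSP can be slave to an external clock.
        raise ValueError("only the SSP can be a clock slave")
    if interface is Interface.HDAUDIO:
        # HDaudio link DMAs generate no interrupts, so timer scheduling is required.
        return {"timer"}
    if clock_slave:
        # Interface clocked by an external device: DMA scheduling is required.
        return {"dma"}
    # Clock master, synchronous with the timer tick: equivalent on paper.
    return {"timer", "dma"}


if __name__ == "__main__":
    print(allowed_domains(Interface.SSP, clock_slave=True))
    print(allowed_domains(Interface.HDAUDIO))
    print(allowed_domains(Interface.DMIC))
```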

lyakh · 2020-08-14T12:15:40Z

To recap: I've found out that all "legacy" platforms (BYT, CHT, BDW, etc.) use DMA scheduling. An attempt to switch byt-nocodec to timer scheduling failed with firmware errors, which I have since tried to debug and fix.

I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in dw_dma_set_config() with an error:

ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1

This doesn't fail with DMA scheduling because then the .irq_disabled flag isn't set and the driver then doesn't even check how many elements the configuration specifies.

This doesn't fail on non-legacy platforms, because they don't specify CONFIG_HOST_PTABLE, then the COMP_ATTR_HOST_BUFFER host / PCM attribute isn't set, so in create_local_elems() the hd->config SG array is allocated with buffer_count (5) elements and not with 1.

I tried fixing the above problem by allocating the necessary minimum (3) of SG elements in create_local_elems() and then also by changing DW_DMA_BUFFER_PERIOD_COUNT for !CONFIG_HW_LLI case to 3 too. That fixed two instances of DMA configuration failure but then the firmware failed later anyway.

So, it looks like "legacy platforms" have multiple problems with timer-driven scheduling. It might be our best option ATM to make this a hard rule somewhere and try to fix DMA scheduling which we need anyway.
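
The failure condition described in this comment can be modelled in a few lines. The Python sketch below is a simplified illustration, not the firmware logic: the function names are hypothetical, the values 1, 3 and 5 are the element counts quoted above, and it assumes that timer scheduling is what sets `irq_disabled`.

```python
MIN_ELEMS_IRQ_DISABLED = 3  # "necessary minimum (3)" quoted in the comment above


def host_elem_count(host_ptable, buffer_count=5):
    """SG elements allocated for the host buffer in this simplified model:
    1 when CONFIG_HOST_PTABLE is set (legacy platforms), else buffer_count."""
    return 1 if host_ptable else buffer_count


def set_config_ok(host_ptable, timer_scheduling):
    """Mimic the check behind 'not enough elems for config with irq disabled'."""
    irq_disabled = timer_scheduling  # assumption: the flag is not set with DMA scheduling
    if not irq_disabled:
        # With DMA scheduling the element count is not checked at all.
        return True
    return host_elem_count(host_ptable) >= MIN_ELEMS_IRQ_DISABLED


if __name__ == "__main__":
    for host_ptable in (True, False):
        for timer in (True, False):
            print(f"host_ptable={host_ptable} timer={timer} "
                  f"ok={set_config_ok(host_ptable, timer)}")
```

In this model only the legacy-platform plus timer-scheduling combination is rejected, which matches the observation that the error shows up on BYT once the topology is switched to the timer domain.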

lgirdwood · 2020-08-14T12:49:20Z

> I've found the reason why this doesn't work: the firmware DW DMA driver fails to set configuration in dw_dma_set_config() with an error:

> ERROR dw_dma_set_config(): dma 1 channel 1 not enough elems for config with irq disabled 1

> This doesn't fail with DMA scheduling because then the .irq_disabled flag isn't set and the driver

This is a rule for when DMA uses HW LLI mode (since there is a race between writing back LL descriptors and resetting them when 2 periods are used). This rule should not apply for SW LLI on BYT.

keqiaozhang · 2021-01-05T06:04:16Z

CI observed this issue again on BSW_CYN_MAX98090 and BYT_MB_NOCODEC.
http://sof-ci.sh.intel.com/#/result/planresultdetail/1441?model=BSW_CYN_MAX98090&testcase=simultaneous-playback-capture-50
http://sof-ci.sh.intel.com/#/result/planresultdetail/1441?model=BYT_MB_NOCODEC&testcase=simultaneous-playback-capture-50

plbossart · 2021-01-12T20:09:42Z

@keqiaozhang is there a way we can bisect to see when the problem re-appeared?

plbossart · 2021-06-10T15:36:43Z

@lgirdwood This issue remains visible in recent Intel daily tests, it's still a problem.

XiaoyunWu6666 · 2021-06-17T07:59:05Z

[241094039.427083] (241094032.000000) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241094042.916667] (        3.489583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5
[241095038.958333] (      996.041687) c0 buffer       3.18        src/audio/buffer.c:211  comp_update_buffer_consume(), no bytes to consume, source->comp.id = 16, source->comp.type = 5, sink->comp.id = 1, sink->comp.type = 6
[241095043.697917] (        4.739583) c0 buffer       1.5         src/audio/buffer.c:172  comp_update_buffer_produce(), no bytes to produce, source->comp.id = 1, source->comp.type = 6, sink->comp.id = 2, sink->comp.type = 5

The buffer warnings 'no bytes to produce' and 'no bytes to consume' don't appear these days,
but we can still get 'WARN dai_copy(): nothing to copy' from the DAI.

Example:
internal daily test 4705;model=BDW_WSB_RT286;testcase=multiple-pipeline-playback-50

keyonjie · 2021-11-16T03:08:01Z

We are still seeing this in recent daily report.

XiaoyunWu6666 · 2022-01-27T02:52:09Z

In internal daily tests 9751 and 9715, when running check-playback/check-capture on BDW_WSB_RT286, an I/O error happens on the first playback.

lgirdwood · 2022-01-27T15:05:04Z

@XiaoyunWu6666 I suspect we are over the MCPS budget, given that both are HiFi2 and will use the generic C processing with the frag API. @singalsu FYI: let's retest this after all the frag API users have been fixed.

plbossart · 2022-01-27T16:07:36Z

FWIW we seem to have an interrupt issue on Broadwell thesofproject/linux#3400

marc-hb · 2022-02-14T21:17:41Z

Still happening in daily 10146?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50

Start Time: 2022-02-11 22:27:26 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 98119478
SOF Branch: main
SOF Commit: b8954754f055

mengdonglin · 2022-02-22T08:37:43Z

As Broadwell (BDW) is a very old platform, we lower its priority and will not fix issues with multi-pipeline test cases on BDW.

marc-hb · 2022-02-22T18:24:54Z

The test still failed yesterday in 10402?model=BDW_WSB_RT286&testcase=multiple-pipeline-capture-50, let's close this when thesofproject/sof-test#863 is merged so we stop testing this every day like someone is assigned to it.

We had only 6 distinct failures in 10402 and this was one of them.

XiaoyunWu6666 · 2022-02-23T06:58:13Z

@marc-hb I think we can close this again since https://github.com/intel-innersource/drivers.audio.ci.sof-framework/pull/185 got merged and also see current test 10444

[console]

test case multiple-pipeline-capture-50.sh is SKIP!
Catch ignore field of test-case: won't fix https://github.com/thesofproject/sof/issues/3170!

Liviali155 added bug Something isn't working as expected BYT Applies to Baytrail platform P2 Critical bugs or normal features labels Jul 13, 2020

Liviali155 added the BSW Braswell label Jul 14, 2020

mengdonglin assigned lyakh Aug 11, 2020

lgirdwood closed this as completed in #3339 Aug 25, 2020

keqiaozhang reopened this Jan 5, 2021

iuliana-prodan mentioned this issue Jan 29, 2021

ll_schedule: refining the scheduling policy to on demand #3768

Merged

XiaoyunWu6666 added the Intel Linux Daily tests This issue can be found in internal Linux daily tests label Jun 10, 2021

lgirdwood self-assigned this Jun 11, 2021

XiaoyunWu6666 changed the title ~~[BUG][BYT-NOCODEC]Input/output error when do simultaneous-playback-capture test~~ [BUG][BYT-NOCODEC][BDW_WSB_RT286]Input/output error when simultaneous-playback-capture / multiple-pipeline-playback Jun 18, 2021

marc-hb added P1 Blocker bugs or important features and removed P2 Critical bugs or normal features labels Feb 15, 2022

marc-hb added a commit to marc-hb/sof-test that referenced this issue Feb 19, 2022

multiple-pipeline.sh: temporarily disable capture on BDW

83f669c

Known issue thesofproject/sof#3170 has been polluting the test results for years. Signed-off-by: Marc Herbert <marc.herbert@intel.com>

marc-hb mentioned this issue Feb 19, 2022

multiple-pipeline.sh: temporarily disable capture on BDW thesofproject/sof-test#863

Merged

mengdonglin added P3 Low-impact bugs or features won't fix This will not be worked on atm (e.g. a bug closed for lack of user request, hardware etc) and removed P1 Blocker bugs or important features labels Feb 22, 2022

mengdonglin closed this as completed Feb 22, 2022

XiaoyunWu6666 mentioned this issue Feb 22, 2022

[BUG] IPC timed out when multiple-pause-resume on BDW_WSB_RT286 #4859

Closed

marc-hb reopened this Feb 22, 2022

XiaoyunWu6666 closed this as completed Feb 23, 2022

marc-hb added a commit to thesofproject/sof-test that referenced this issue Feb 28, 2022

multiple-pipeline.sh: temporarily disable capture on BDW

ae42182

Known issue thesofproject/sof#3170 has been polluting the test results for years. Signed-off-by: Marc Herbert <marc.herbert@intel.com>
