Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[LNL] hang, timeout or crash in pause-release #5109

Open
marc-hb opened this issue Jul 17, 2024 · 18 comments
Open

[LNL] hang, timeout or crash in pause-release #5109

marc-hb opened this issue Jul 17, 2024 · 18 comments
Labels
bug Something isn't working LNL Applies to Lunar Lake platform

Comments

@marc-hb
Copy link
Collaborator

marc-hb commented Jul 17, 2024

I spent a long time looking for an existing bug and I found many similar issues (see list below) but I think this one is either brand new or never filed yet. Part of the problem was that the Expect part of the pause-resume test was utterly buggy; that didn't help. Now that I rewrote it in thesofproject/sof-test#1218 we can finally pay less attention to the test code and a bit more attention to the product.

The WARNING: received == PAUSE == while in state recording! Ignoring. message is very surprising. It showed up more than once.
EDIT: this message is now gone since "fail fast" test fix thesofproject/sof-test#1226. Never look past the very first error which is usually now file descriptor in bad state.

2024-07-16T13:31:16-07:00

Linux Branch: topic/sof-dev
Commit: 1998ade4783a
Kconfig Branch: master
Kconfig Commit: 8189104a4f38
SOF Branch: main
Commit: 3051607efb4f
Zephyr Commit:650227d8c47f

https://sof-ci.01.org/softestpr/PR1218/build632/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50
https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html
https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html
https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50
https://sof-ci.01.org/sofpr/PR9305/build6551/devicetest/index.html

https://sof-ci.01.org/softestpr/PR1224/build686/devicetest/index.html

https://sof-ci.01.org/sofpr/PR9335/build6748/devicetest/index.html

The logs don't all look the same but here's a typical one:

t=7412 ms: cmd3 arecord Port0 2nd Capture: (16/50) Found volume ### | xx%, recording for 38 ms
t=7538 ms: cmd1 arecord Port0: (19/50) Found   === PAUSE ===  ,  pausing for 36 ms
t=7420 ms: cmd3 arecord Port0 2nd Capture: (16/50) Found   === PAUSE ===  ,  pausing for 22 ms
t=7553 ms: cmd3 arecord Port0 2nd Capture: (17/50) Found volume ### | xx%, recording for 47 ms
t=7553 ms: cmd3 arecord Port0 2nd Capture: WARNING: received == PAUSE == while in state recording! Ignoring.
t=7794 ms: cmd1 arecord Port0: (20/50) Found volume ### | xx%, recording for 36 ms
t=7919 ms: cmd1 arecord Port0: (20/50) Found   === PAUSE ===  ,  pausing for 44 ms
t=8170 ms: cmd1 arecord Port0: (21/50) Found volume ### | xx%, recording for 27 ms
t=8295 ms: cmd1 arecord Port0: (21/50) Found   === PAUSE ===  ,  pausing for 26 ms
t=8546 ms: cmd1 arecord Port0: (22/50) Found volume ### | xx%, recording for 29 ms


t=18714 ms: cmd1 arecord Port0: (49/50) Found volume ### | xx%, recording for 20 ms
t=18839 ms: cmd1 arecord Port0: (49/50) Found   === PAUSE ===  ,  pausing for 39 ms
t=19090 ms: cmd1 arecord Port0: (50/50) Found volume ### | xx%, recording for 29 ms
t=19215 ms: cmd1 arecord Port0: (50/50) Found   === PAUSE ===  ,  pausing for 33 ms
t=19215 ms: cmd1 arecord Port0: WARNING: volume was always 00%!
t=19215 ms: cmd1 arecord Port0: SUCCESS: /home/ubuntu/sof-test/case-lib/apause.exp arecord -D hw:0,0 -r 48000 -c 2 -f S16_LE -vv -i /dev/null

2024-07-16 07:53:47 UTC [REMOTE_ERROR] Still have expect process not finished after wait for 250
11568 expect /home/ubuntu/sof-test/case-lib/apause.exp cmd3 arecord Port0 2nd Capture 50 20 30 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null
11570 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null
2024-07-16 07:53:47 UTC [REMOTE_INFO] Starting func_exit_handler(1)
2024-07-16 07:53:47 UTC [REMOTE_ERROR] Starting func_exit_handler(), exit status=1, FUNCNAME stack:
2024-07-16 07:53:47 UTC [REMOTE_ERROR]  main()  @  /home/ubuntu/sof-test/test-case/multiple-pause-resume.sh
2024-07-16 07:53:48 UTC [REMOTE_INFO] pkill -TERM -f mtrace-reader.py
2024-07-16 07:53:48 UTC [REMOTE_INFO] nlines=24148 /home/ubuntu/sof-test/logs/multiple-pause-resume/2024-07-16-07:45:35-16732/mtrace.txt
+ grep -B 2 -A 1 -i --word-regexp -e ERR -e ERROR -e '' -e OSError /home/ubuntu/sof-test/logs/multiple-pause-resume/2024-07-16-07:45:35-16732/mtrace.txt
2024-07-16 07:53:50 UTC [REMOTE_INFO] ktime=1779 sof-test PID=10390: ending
2024-07-16 07:53:50 UTC [REMOTE_WARNING] Process(es) started by /home/ubuntu/sof-test/test-case/multiple-pause-resume.sh are still active, killing these process(es):
2024-07-16 07:53:50 UTC [REMOTE_WARNING] Catch pid: 11568 expect /home/ubuntu/sof-test/case-lib/apause.exp cmd3 arecord Port0 2nd Capture 50 20 30 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null
2024-07-16 07:53:50 UTC [REMOTE_WARNING] Kill cmd:'expect /home/ubuntu/sof-test/case-lib/apause.exp cmd3 arecord Port0 2nd Capture 50 20 30 arecord -D hw:0,12 -r 48000 -c 2 -f S16_LE -vv -i /dev/null' by kill -9
/home/ubuntu/sof-test/case-lib/hijack.sh: line 204: 11568 Killed                  "$TOPDIR"/case-lib/apause.exp "$shortname" "$repeat_count" "$rnd_min" "$rnd_range" "$cmd" -D "$dev" -r "$rate" -c "$channel" -f "$fmt" -vv -i "$file"
2024-07-16 07:53:50 UTC [REMOTE_INFO] Test Result: FAIL!
[ 1682.256786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1683.280786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1684.304788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2120, overruns 0
[ 1685.328786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2128, overruns 0
[ 1686.352788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1687.376788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2123, overruns 0
[ 1688.400788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2121, overruns 0
[ 1689.424788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2343, overruns 0
[ 1690.448788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2120, overruns 0
[ 1691.472788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1692.496788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2130, overruns 0
[ 1693.520788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2122, overruns 0
[ 1694.544788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1695.568788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1696.592788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2128, overruns 0
[ 1697.616788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0
[ 1698.640788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2123, overruns 0
[ 1699.664788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2119, overruns 0
[ 1700.688786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2116, overruns 0
[ 1701.712788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1702.736788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2124, overruns 0
[ 1703.760788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2127, overruns 0
[ 1704.784788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1705.808786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0
[ 1706.832788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2117, overruns 0
[ 1707.856786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0
[ 1708.880786] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2128, overruns 0
[ 1709.904788] <inf> ll_schedule: zephyr_domain_thread_fn: ll core 0 timer avg 2079, max 2126, overruns 0

etc.

cc:

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 17, 2024

The kernel logs are unfortunately missing in https://sof-ci.01.org/softestpr/PR966/build659/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100, I don't know why. I checked the kernel logs on the device directly and something strange caught my attention: it looks like the DMIC SFX2 pipeline under pause/release test is being freed and restarted WHILE the test is still running!?

The firmware logs are available on the web but I don't know how to match the timestamps.

...
t=45791 ms: arecord: (84/100) Found   === PAUSE ===  ,  pausing for 173 ms
t=46092 ms: arecord: (85/100) Found volume ### | xx%, recording for 152 ms
t=46217 ms: arecord: (85/100) Found   === PAUSE ===  ,  pausing for 190 ms
t=50287 ms: arecord: (86/100) Found volume ### | xx%, recording for 166 ms
t=50304 ms: arecord: (86/100) Found   === PAUSE ===  ,  pausing for 119 ms
t=50436 ms: arecord: (87/100) Found volume ### | xx%, recording for 150 ms

t=50437 ms: arecord: WARNING: received == PAUSE == while in state recording! Ignoring.

<abrupt log stop while test is still running>

Should that WARNING be fatal? Maybe it would help collect the logs?

Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x46010004|0x6: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x12000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.14 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget module-copier.14.2 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules gain.19.1:0 -> module-copier.19.1:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget gain.19.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules module-copier.19.1:0 -> gain.20.1:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x46020004|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x66000000|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x46020004|0x10006: MOD_UNBIND
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x12010000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12010000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.19 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget module-copier.19.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_route_free: sof-audio-pci-intel-lnl 0000:00:1f.3: unbind modules gain.20.1:0 -> host-copier.28.capture:0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget gain.20.1 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx      : 0x12020000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx reply: 0x32000000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_log_header: sof-audio-pci-intel-lnl 0000:00:1f.3: ipc tx done : 0x12020000|0x0: GLB_DELETE_PIPELINE
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget pipeline.20 freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_widget_free_unlocked: sof-audio-pci-intel-lnl 0000:00:1f.3: widget host-copier.28.capture freed
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_pcm_prepare: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: prepare stream 28 dir 1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_pcm_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: pcm: hw params stream 28 dir 1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_hw_params: sof-audio-pci-intel-lnl 0000:00:1f.3: FW Poll Status: reg[0x80]=0x40000 successful
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-lnl 0000:00:1f.3: period_bytes:0x17700
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof_intel_hda_common:hda_dsp_stream_setup_bdl: sof-audio-pci-intel-lnl 0000:00:1f.3: periods:4
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_prepare_copier_module: sof-audio-pci-intel-lnl 0000:00:1f.3: copier dai-copier.DMIC.NoCodec-6.capture, type 28
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:snd_sof_get_nhlt_endpoint_data: sof-audio-pci-intel-lnl 0000:00:1f.3: dai index 0 nhlt type 2 direction 1 dev type 0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: Looking for configuration:
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3:   vbus_id=0 link_type=2 dir=1, dev_type=0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3:   ch=4 fmt=32 rate=48000 (vbps is ignored for DMIC 32bit format)
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:intel_nhlt_get_endpoint_blob: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint count=8
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_check_ep_match: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint: vbus_id=0 link_type=2 dir=1 dev_type = 0
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_get_specific_cfg: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint format count=1
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_intel_dspcfg:nhlt_get_specific_cfg: sof-audio-pci-intel-lnl 0000:00:1f.3: Endpoint format: ch=4 fmt=32/32 rate=48000
Jul 16 20:52:43 jf-lnlm-rvp-nocodec-3 kernel: snd_sof:sof_ipc4_init_input_audio_fmt: sof-audio-pci-intel-lnl 0000:00:1f.3: matched audio format index for 48000Hz, 32bit, 4 channels: 2

Jul 16 21:13:04 jf-lnlm-rvp-nocodec-3 ubuntu[11727]: test failed, CI session https://sof-ci.ostc.intel.com/api/session/379913 now 
trying to reboot DUT ubuntu@jf-lnlm-rvp-nocodec-3.jf.intel.com over ssh

@plbossart
Copy link
Member

The DMIC SFX2 pipeline is a branch from a common part, so it's a special case that could be problematic

image

I am not sure how well DPCM deals with such capture branches, and it wouldn't be surprising if the SOF states are incorrect.

Need @ranj063 and @ujfalusi to chime in.

@ujfalusi
Copy link
Collaborator

This is not just DMIC SFX2, example: https://sof-ci.01.org/softestpr/PR812/build656/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

I'm not sure what we are looking at here, but if I read it right the test fails because the aplay hw:0,2 is failing to die, right?
In the kernel log the last trigger to this stream is a START (at 852.134468).
hw:0,3 is toggling between PAUSE and RUNNING t hen finally hw:0,3 is closed, but I see no attempt to stop hw:0,2

What is interesting regarding to hw:0,2 is:
844.769414 open
844.769906 hw_params
844.797258 start
... toggling between paused/running
850.271098 stopped
852.114512 prepare + hw+params w/o closing it
852.134468 start

I don't know what the test is doing, but it does not appear to be trying to stop the playback.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 18, 2024

I don't know what the test is doing, but it does not appear to be trying to stop the playback.

This test passes in all other configurations and I think it may even it passes in that configuration sometimes. EDIT: for instance it was perfectly fine in daily 44012 (July 18th)

(Thanks for analyzing the logs!)

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 19, 2024

This failure looks a little bit different:

https://sof-ci.01.org/sofpr/PR9304/build6579/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

t=1547 ms: cmd1 arecord Amp feedback: (3/50) Found   === PAUSE ===  ,  pausing for 47 ms
t=1806 ms: cmd1 arecord Amp feedback: (4/50) Found volume ### | xx%, recording for 36 ms
t=1931 ms: cmd1 arecord Amp feedback: (4/50) Found   === PAUSE ===  ,  pausing for 39 ms
t=2140 ms: cmd3 aplay Jack Out: (2/50) Found   === PAUSE ===  ,  pausing for 32 ms
t=2140 ms: cmd3 aplay Jack Out: ERROR: found VOLUME while paused!
t=2140 ms: cmd3 aplay Jack Out: aplay: do_pause:1567: pause push error: File descriptor in bad state

It does not seem to have any overrun!?

@marc-hb
Copy link
Collaborator Author

marc-hb commented Jul 19, 2024

This looks like memory corruption of some kind. Not the first time I see it. No overrun there.

https://sof-ci.01.org/sofpr/PR9314/build6595/devicetest/index.html?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50

[ 1225.549551] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa01149c0 0xa00eabf4U priority 0 flags 0x0
[ 1225.549570] <inf> ll_schedule: zephyr_domain_register: zephyr_domain_register domain->type 1 domain->clk 5 domain->ticks_per_ms 38400 period 1000
[ 1225.549668] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.550661] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.551661] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
[ 1225.552383] <inf> ipc: ipc_cmd: rx	: 0x13000004|0x0
[ 1225.552391] <inf> pipe: pipeline_trigger: pipe:0 0x0 pipe trigger cmd 8
[ 1225.552396] <inf> ll_schedule: zephyr_ll_task_schedule_common: task add 0xa0115400 0xa00eabf4U priority 0 flags 0x0
[ 1225.552700] <inf> host_comp: host_get_copy_bytes_normal: comp:1 0x10004 no bytes to copy, available samples: 0, free_samples: 760
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
�����������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������
������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������������

In the same software configuration, the HDA model has mass overruns:

https://sof-ci.01.org/sofpr/PR9314/build6595/devicetest/index.html?model=LNLM_RVP_HDA&testcase=multiple-pause-resume-50


This one has this and then silence, no logs.

t=28189 ms: arecord: (67/100) Found volume ### | xx%, recording for 186 ms
t=28206 ms: arecord: (67/100) Found   === PAUSE ===  ,  pausing for 131 ms
t=28337 ms: arecord: (68/100) Found volume ### | xx%, recording for 139 ms
t=28337 ms: arecord: WARNING: received == PAUSE == while in state recording! Ignoring.

https://sof-ci.01.org/softestpr/PR1222/build676/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100

@ranj063
Copy link
Collaborator

ranj063 commented Jul 26, 2024

Here's my observation with one of the failure instances (https://sof-ci.01.org/linuxpr/PR5106/build4026/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=multiple-pause-resume-50)

The multiple pause resume combination that gets started is aplay on Port 0 (stream 0) and the DeepBuffer Port 0 (stream 31). Pause/release iterations start almost simultaneously for both the streams and proceed normally up until the stream 31 finishes 15 iterations. After it is released in the 15th iteration, the FW runs into an overrun causing aplay to try to recover from the xrun and from then onwards all we see is the stream prepare, stream trigger start followed by an immediate stream trigger stop indicating that aplay cannot recover from the xrun but it keeps trying endlessly. But parallelly, stream 0 continues with its pause/release iterations without any issues.

So the question is why does the FW run into an overrun after 15 pause/releases? I am afraid this issue looks very similar to the issue #5080 in the sense that there's a random xrun in the FW after many iterations and there's nothing unusual in the sequences leading up to it.

@plbossart
Copy link
Member

what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?

If the xrun happens only in a case where pipelines are mixed, this could be a case of
a) an initial condition where real-time deadlines are not met and
b) a problematic state management where the state of the common part is not well handled.

IIRC we tested pause_push/pause_release with one stream playing, when all streams getting mixed are paused/released at the same time there could be races conditions left and right.

@ujfalusi
Copy link
Collaborator

@plbossart, what is interesting is that the issue (afaik) only can be reproduced on LNL.

@plbossart
Copy link
Member

plbossart commented Jul 26, 2024

we've seen in the SoundWire case errors that were different on LNL, the timing of the transitions seems a bit different and that seems to open the door to race conditions we didn't see or notice before.

@ranj063
Copy link
Collaborator

ranj063 commented Jul 26, 2024

what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?

@plbossart yes, there're several instances of the xrun happening with pause/release with just one stream (DMIC Raw)

@ranj063
Copy link
Collaborator

ranj063 commented Jul 26, 2024

what happens if the "DeepBuffer Port 0" is skipped in the tests, do we see an xrun, ever?

If the xrun happens only in a case where pipelines are mixed, this could be a case of a) an initial condition where real-time deadlines are not met and b) a problematic state management where the state of the common part is not well handled.

IIRC we tested pause_push/pause_release with one stream playing, when all streams getting mixed are paused/released at the same time there could be races conditions left and right.

whats interesting in this particular case is that the Port 0 stream is going on with its test just fine while the other stream gets stuck randomly. My initial suspicion was on link DMA but now I feel like it might be the host DMA thats the problem as the logs show that after releasing we're reporting that there's nothing available to copy from the host DMA buffer. But this makes it even harder to understand because in the case of pause/release, the host DMA is left untouched, it just keeps running forever until the pipeline is completely stopped.

marc-hb added a commit to marc-hb/sof-test that referenced this issue Aug 1, 2024
Experience with thesofproject/linux#5109 shows
that this warning never seems harmless: the test ends up timing out and
failing anyway.

So, better failing fast for clearer and better logs.

Also increase the log level of press_space() to avoid state confusion.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 1, 2024

thesofproject/sof-test#1226 makes a huge difference, please review.

marc-hb added a commit to thesofproject/sof-test that referenced this issue Aug 1, 2024
Experience with thesofproject/linux#5109 shows
that this warning never seems harmless: the test ends up timing out and
failing anyway.

So, better failing fast for clearer and better logs.

Also increase the log level of press_space() to avoid state confusion.

Signed-off-by: Marc Herbert <marc.herbert@intel.com>
@marc-hb marc-hb changed the title [LNL] hang, timeout or crash in pause-release + mass overruns [LNL] hang, timeout or crash in pause-release Aug 3, 2024
@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 3, 2024

The failure logs have simplified and changed A LOT since my "fail fast" test fixes in thesofproject/sof-test#1226 was merged.

Summary of the results of daily test 44477 (Start Time: 2024-08-02 13:07:34 UTC)

  • No more xrun messages. So these were likely due to the test that kept trying to use the stream even after the first problem occured.
  • A few failures with aplay: do_pause:1567: pause push error: File descriptor in bad state
  • Some non zero exit status failures without any error message. Maybe the logs show some interesting behaviors and warnings?
  • A couple timeouts
  • LNLM_RVP_HDA had zero failure, 100% green that time!
  • Everything outside LNL has kept a 100% pass rate since the new apause.exp rewrite a few weeks ago. Failures have been specific to LNL for all that time.

@marc-hb marc-hb added the bug Something isn't working label Aug 3, 2024
@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 5, 2024

This failure has a file descriptor in bad state and an overrun:

https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100
Same in daily 44550?model=LNLM_RVP_NOCODEC&testcase=check-pause-resume-capture-100

This one has only a "non-zero exit status"

https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_HDA&testcase=multiple-pause-resume-50

This one passed on the same system:

https://sof-ci.01.org/sofpr/PR9323/build6836/devicetest/index.html?model=LNLM_RVP_HDA&testcase=check-pause-resume-capture-100

HDA seems to have a higher pass rate and NOCODEC the worst, example:
https://sof-ci.01.org/linuxpr/PR5133/build4270/devicetest/index.html
NOCODEC has more to test and the test runs longer.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 7, 2024

In these runs, some tests passed and some tests failed in every LNL model: https://sof-ci.01.org/sofpr/PR9351/build6896/devicetest/index.html , https://sof-ci.01.org/linuxpr/PR5110/build4324/devicetest/index.html

So failures seem intermittent in every configuration.

@marc-hb
Copy link
Collaborator Author

marc-hb commented Aug 20, 2024

Daily run 44926 is interesting with 5 failures out of 6 test runs and they don't all look the same.

Same in https://sof-ci.01.org/sofpr/PR9299/build7192/devicetest/index.html

EDIT: in daily run 44967?model=LNLM_SDW_AIOC&testcase=multiple-pause-resume-50, the whole system crashed and become unresponsive. The journalctl logs end abruptly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working LNL Applies to Lunar Lake platform
Projects
None yet
Development

No branches or pull requests

4 participants