-
Notifications
You must be signed in to change notification settings - Fork 325
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BUG] timeout on STREAM_SD_OFFSET read when suspend-resume on CML Helios #4558
Comments
@slawblauciak what makes you think it's a kernel issue? There's nothing that changed recently. @lgirdwood Can you please look into this? |
@XiaoyunWu6666 The same problem affects other CML devices, it's not just CML_HEL_RT5682 It's an across-the-board problem for all CML devices using I2S/HDAudio or SoundWire links. That's very very suspicious. |
Reproduced in 10 iterations on CML SKU 09C6 (same as 0983)
kernel SHA1: 85bfa34a88d6 |
|
@XiaoyunWu6666 @marc-hb @fredoh9 @bardliao I would appreciate it if you can try to reproduce the issue on a CML device. My results show that there is a firmware regression since I don't see any issue with the latest kernel/older firmware. More details for the record - @lgirdwood FYI test 5464 July 26 GOOD so basically I bisected from yesterday's tested SOF commit |
@plbossart thanks for bisect , can you confirm if the trace patch or the pipeline patch is bad ? |
I don't understand the question @lgirdwood. My tests show what the first bad commit was, I don't know if the last patches on pipelines are correct or not. It could also be that my 'good' tests are 'green failures', and the issue happens with an earlier commit. |
Sorry, was not clear. I mean is the bisect clear cut or vague - I agree though this is a FW issue.
OR
I guess what you are saying it could be either above two commits (and needs further investigation) ? |
no, I am saying: the problem started with 0def905 or earlier. I don't have a 100% confidence in the 'good' results, but I trust the tests when they tell me the DSP boot failed. |
@XiaoyunWu6666 are you able to do some further investigation. OR redo the bisect with 20 iterations (based on @plbossart being able to reproduce with 10). Thanks ! |
Note that sometimes 10-20 iterations is not enough. I just ran a check which didn't fail in 50, and my last took 36 with the SOF main branch
|
With these two commands, I was able to run 100 iterations without issues.
test command:
|
And with these 3 commands I get an error in 11 iterations
I would say this confirms the bisect and rules out all following commits after 0def905 |
I can reproduce relatively quickly to on EDIT: I just reproduced at commit 98e8016, before the commits above. Note the DSP is permanently wedged after a failure, More precisely, after failure,
Then sof-test/kmod/tools/sof_insert.sh fails like this:
|
As I wrote in the previous comment, with So something else changed recently. I used the kernel that was already there, didn't touch it: Note I also tried to use a remote PDU to powercycle the device and not just reboot it (afraid of something like #4419) but I don't know effective that was. EDIT: the remote PDU cuts the power for about 1s. The battery has been disconnected an all devices. I know it's effective because the RTC clock loses track of time.
... while everything else is fine. So how could this be related to generic trace changes? |
@marc-hb Same problem with v1.8, that's hard to believe. After reverting two commits above @plbossart mentioned, I'm not able to reproduce the problem.
I used this command, |
Looking at daily test results 5539 I see the failures happened in two steps on some platforms and only the first step on others.
It was hard to believe for me too but I triple check that by watermarking the build with a
Use the same parameters as in the description. |
I can still reproduce the issue when 57ee04f and 0def905 are reverted on my CML SKU 098F device.
And test with latest v1.8.x firmware from https://github.com/thesofproject/sof-bin/tree/main/v1.8.x is still on going, so far so good. Edit: The test result with latest v1.8.x firmware is PASS on my side. Edit: My kernel commit is thesofproject/linux@85bfa34 |
So v1.8 would have been validated with a different kernel version. It's also possible the reproducability is dependent on the kernel version too. Has anyone tried the kernel from the v1.8 tag date with HEAD and v1.8 ? |
the SOF v1.8 was tested with this kernel: https://github.com/thesofproject/linux/tree/release-sof-1.8-20210524 |
So we do the following for stream START|STOP ret = snd_sof_dsp_read_poll_timeout(sdev,
HDA_DSP_HDA_BAR,
sd_offset, run,
((run & dma_start) == dma_start),
HDA_DSP_REG_POLL_INTERVAL_US,
HDA_DSP_STREAM_RUN_TIMEOUT);
if (ret < 0) {
dev_err(sdev->dev,
"error: %s: cmd %d: timeout on STREAM_SD_OFFSET read\n",
__func__, cmd);
return ret;
} Where the wait delay is 300uS
The DSP can take up to the next LL tick (up to 1ms) to do the stream DMA START | STOP. Can we try increasing the timeout to 1100uS (including some overhead). This could be too long for kernel trigger atomic context so we could also remove this polling check here and place it in the PCM position callback e.g. if (time_from_stream_trigger > 1100 && bit_no_set)
printk(KERN_ERR "dma not started); OR alternatively rely on the ALSA pointer timout of 100ms as indication of stream DMA status. |
We've had issues with firmware boot for a while now, see the query https://github.com/thesofproject/linux/issues?q=is%3Aissue+is%3Aopen+label%3ACSE What I see from the logs is that we seem to have different patterns of failure a. the pattern in this bug
b. a different pattern in kmod tests Marc's post above and in thesofproject/linux#3056
c. and yet another pattern in thesofproject/linux#3045
So I really wonder if we have multiple causes of failure. The changes with the DMA trace are fairly reproducible, it's rather unclear why. I will re-test with v1.8, I am not completely sure it's the same problem - I was e.g. unable to reproduce the issue with more than 100 iteration with just the 2 commits reverted. |
I am not fully convinced @lgirdwood, the key point is that we only see this issue in suspend-resume tests. If we had a problem with DMAs we would have seen it in regular playback/capture tests. It's the suspend-resume part that's really important. |
The trace uses the DMA too and is on the LL 1ms tick. Can you try and extend the delay ? |
Thanks @bardliao , obvious differences between compilers are C calling ABI, code speed, layout of TEXT and DATA. I can see 2 really interesting things from the data
i.e. GCC
and XCC
|
Some things to try
|
|
Note it's mostly optional at the developer level already. Otherwise yes, automation always uses it and depends on it right now. Discussed in thesofproject/sof-test#726
Testing less always finds fewer bugs ;-) |
sof-trace-gcc.txt
Yes, they are the same topology. I think it is because we have two capture devices and GCC failed at the first capture test and XCC finished two capture tests. And the dmesg log is truncated, that's why you see different stream ID.
|
@XiaoyunWu6666 please close on Monday if all is good. |
Please don't close this even when the tests pass. Reverting important features without a clue why they failed is not a solution. Set it to the lowest priority if you want. |
in inner daily [id:5635] CML Helios is good ,let's lower the priority of this issue. |
Close it now since relevant feature was reverted in #4578 Since this issue takes only several rounds to reproduce in testcase check-suspend-resume-with-capture-5 , we'll test how long it will take additionally to add this case in PR test. |
BTW, I can still reproduce the issue with #4578 merged in GCC build. That could be a different issue. |
happen in inner daily 6362 |
@XiaoyunWu6666 what's the reproducability now and is this fatal ? |
@lgirdwood
Concerning the number of testcases that this issue affects , it is not that fatal (about one each time). But its appearance causes the testcase to fail |
@XiaoyunWu6666 thanks for the update. |
The failure in 6362 looks different to me:
This could mean this issue hasn't happened since revert #4578 |
@XiaoyunWu6666 @1994lwz could you please check if thesofproject/linux#3166 helps fix this issue? |
Describe the bug
timeout on STREAM_SD_OFFSET read when suspend-resume on CML
inner daily /5515?model=CML_HEL_RT5682&testcase=check-suspend-resume-with-capture-5
To Reproduce
-S 1 -w 3
save some time while preserving an acceptable ssh experience. For faster reproduction on a local system try-S 1 -w 1
?-m capture
is critical:-m playback
never reproduced even after 100 iterations.Environment
Start Time: 2021-07-26 21:24:47 UTC
End Time: 2021-07-27 04:58:29 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 2113dc7f
SOF Branch: main
SOF Commit: 57ee04f
Topology:sof-cml-rt1011-rt5682.tplg
Platform:CML_HEL_RT5682
Screenshots or console output
[console]
The text was updated successfully, but these errors were encountered: