Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dma-trace: Align dma trace buffer correctly, fixed Zephyr dtrace issue #5329

Merged
merged 3 commits into from
Feb 10, 2022

Conversation

jsarha
Copy link
Contributor

@jsarha jsarha commented Feb 8, 2022

The first commit should be the the only one needed to fix #5120 .
The second commit should make sure the same problem does not bite us again. The third commit is pretty much optional, but I think the fix is correct and it is also cleaning up the code a bit.

@jsarha jsarha changed the title Issue 5120 fix dma-trace: Align dma trace buffer correctly, fixes https://github.com/thesofproject/sof/issues/5120 Feb 8, 2022
@jsarha jsarha changed the title dma-trace: Align dma trace buffer correctly, fixes https://github.com/thesofproject/sof/issues/5120 dma-trace: Align dma trace buffer correctly, fixed dtrace issue Feb 8, 2022
@jsarha jsarha changed the title dma-trace: Align dma trace buffer correctly, fixed dtrace issue dma-trace: Align dma trace buffer correctly, fixed Zephyr dtrace issue Feb 8, 2022
src/trace/dma-trace.c Show resolved Hide resolved
src/include/sof/lib/dma.h Show resolved Hide resolved
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The first one needs to be made platform-specific, but otherwise look good and test results show this is fixing #5120 which has been causing a lot of CI/validation pain.

src/trace/dma-trace.c Show resolved Hide resolved
src/include/sof/lib/dma.h Show resolved Hide resolved
@lgirdwood lgirdwood added this to the v2.1 milestone Feb 9, 2022
Force DMA buffer to be correctly aligned. If the buffer is only
64-byte aligned, when using hda-dma, then the last 64-bytes of the DMA
buffer is not transferred correctly to host side.

Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
Add check for DMA buffer address alignment to hda_dma_set_config().
Wrongly aligned DMA buffer does not work correctly.

Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
It is possible that that in trace_work() sof_ipc_dma_trace_posn ipc
message reaches host side sof driver before the dma_copy_to_host_nowait()
is finished. Change the dma_copy_to_host_nowait() implementation to
dma_copy_to_host() by adding DMA_COPY_BLOCKING flag to dma_copy()
command.

The flag is already there on the !defined(CONFIG_DMA_GW) version of
the function and the old name of the function was a bit misleading.

This commit changes also probe.c probe_task() so that it also uses the
new blocking DMA copy. I could not see any justification to use nowait
version there either.

Signed-off-by: Jyri Sarha <jyri.sarha@intel.com>
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems very reasonable from my perspective. @teburd will want to note this behavior for the Zephyr driver too. In Zephyrese that would probably be expressed as an assertion in the driver where it picks up the buffer pointers I guess, though a soft error isn't impossible (not sure if there's a return code where needed in our DMA API).

So... do we get to claim that Zephyr saw this problem more severely because its allocator is more fragmentation-resistant and thus more likely to produce a misaligned output buffer? It pleases me to think that, anyway.

@teburd
Copy link
Contributor

teburd commented Feb 9, 2022

Seems very reasonable from my perspective. @teburd will want to note this behavior for the Zephyr driver too. In Zephyrese that would probably be expressed as an assertion in the driver where it picks up the buffer pointers I guess, though a soft error isn't impossible (not sure if there's a return code where needed in our DMA API).

So... do we get to claim that Zephyr saw this problem more severely because its allocator is more fragmentation-resistant and thus more likely to produce a misaligned output buffer? It pleases me to think that, anyway.

Yes, I had this in mind and had an assertion on alignment in fact in a first pass as I noted the alignment mentioned in the docs, before I got sucked into a rabbit hole of getting a test working.

@jsarha jsarha requested a review from kv2019i February 10, 2022 06:37
@marc-hb
Copy link
Collaborator

marc-hb commented Feb 10, 2022

This is fascinating: the bug disappeared in the latest daily test run without merging this PR. The bug saw the PR, got scared and ran away. It's magic.

  • In daily run ID 10012 on Feb 7th, the DMA corruption impacted TGLU_UP_HDA_ZEPHYR and TGLH_RVP_NOCODEC_ZEPHYR massively. Most tests were failing and even the ones that passed (e.g.: check-playback-all-formats on TGLH_RVP_NOCODEC_ZEPHYR) were still showing this error at an extremely regular interval:
error: log_entry_address      0x6b0 is not in dictionary range!
warn: Seeking forward 4 bytes at a time until re-synchronize.
Start Time: 2022-02-07 22:27:29 UTC
Kernel Branch: topic/sof-dev
Kernel Commit:  03175849
SOF Branch: main
SOF Commit: 99b4bdffd279
Zephyr Commit: b374dc6a7fd8
  • Now moving 2 days later to daily run ID 10079, the DMA corruption is gone. There are still a couple failures including some sof-logger failures but much much fewer and they all look different. I searched all the logs of all the tests across all platforms and now only some CML platforms show the log_entry_address error above. It's never in any Zephyr test.
Start Time: 2022-02-09 22:27:36 UTC
Kernel Branch: topic/sof-dev
Kernel Commit: 8bf95cb0
SOF Branch: main
SOF Commit: 9983ab67a9fe
Zephyr Commit: 41a63f48fdc8

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 10, 2022

I cherry-picked this PR on top of the current a61ff1f and I can still reproduce the issue instantly with Zephyr 2f5976d674ff, my APL UP2 and sof-logger "hardening" #5340 (as a system configuration sanity check, XTOS is fine with the exact same SOF git commit)

@jsarha
Copy link
Contributor Author

jsarha commented Feb 10, 2022

XTOS works just by change, or actually because of its different heap implementation. If one enforces the DMA buffer to be only 64-byte aligned but not 128-byte aligned (the way it usually is in the Zephyr version), the xtos version fails as miserably as Zephyr.

@marc-hb
Copy link
Collaborator

marc-hb commented Feb 10, 2022

My main point was: this PR does not fix the issue at all in my (Zephyr) configuration.

I'm testing different Zephyr versions now.

EDIT: I combined this PR with all the SOF and Zephyr commits used by the daily runs references above. In every combination the sof-logger prints error: log_entry_address 0x.... is not in dictionary range! like clockwork.

This is with kernel 920d034ba4f2 in case that matters. I don't have any TGL device at home.

@jsarha
Copy link
Contributor Author

jsarha commented Feb 10, 2022

I tried to put together a branch with all the mentioned pieces, even used the same linux hash, and tested that on my TGL board, but can not reproduce the problem. @marc-hb could you share exact branches and tests you run so I could try again? Also (a remote) access to same piece of HW could be helpful. My debug hacks here should help in finding out what is still wrong with the dtrace.

Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @jsarha , looks good! I think even there are stilll more issues, this seems like a step to the right direction and makes the DMA trace code more robust.

@lgirdwood
Copy link
Member

Ack - this fixes the HW alignment as mentioned in the spec. We can now focus on why its not working with @marc-hb setup.

@lgirdwood lgirdwood merged commit 443b21d into thesofproject:main Feb 10, 2022
@marc-hb
Copy link
Collaborator

marc-hb commented Feb 10, 2022

I cherry-picked this PR ...

Apologies, last night I was cherry-picking only the last commit in this PR :-( When cherry-picking the first, alignment commit then then problem does instantly go away in my configuration: from systematic corruption to none. Yesterday was a way too long day.

It's still fascinating that the issue went mostly away even before this PR was merged (involving zero fat finger of mine) but that's not new: we always knew this corruption was "randomly deterministic"!

Thanks a lot @jsarha , you've hopefully just made a huge difference to test results.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] [ZEPHYR] corruption of the end of the DMA trace ring buffer ("sof-logger was already dead")
7 participants