Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dma-trace: Fixes aimed to make the re-configuration robust and cleaner #4879

Closed

Conversation

ujfalusi
Copy link
Contributor

Within #4860 from @keyonjie there were a fix for memory leak in dtrace which would have broke the SOF client support - we need to be able to remove and insert the dtrace, re-configuration is normal.

The problem is valid, but it has to be handled in a different way, this PR is providing that with a cleaner (I hope) implementation and a the same time makes the dtrace reconfiguration a bit more nicer.

I have tested it with the SOF client support (thesofproject/linux#3136):
run aplay -Dplughw:1,0 -fdat /dev/urandom to keep the DSP on and run in loop:

sudo rmmod snd_sof_dma_trace
sleep 1
sudo modprobe snd_sof_dma_trace
sleep 1

@ujfalusi ujfalusi requested a review from akloniex as a code owner October 15, 2021 06:42
@ujfalusi ujfalusi mentioned this pull request Oct 15, 2021
@ujfalusi ujfalusi changed the title dmaa-trace: Fixes aimed to make the re-configuration robust and cleaner dma-trace: Fixes aimed to make the re-configuration robust and cleaner Oct 15, 2021
mtrace_printf(LOG_LEVEL_ERROR, "dma_trace_enable: buffer_init failed");
goto out;
/* Allocate and initialize the dma trace buffer if needed */
if (!d->dmatb.addr) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably this check should be moved within dma_trace_buffer_alloc() ?

@ujfalusi ujfalusi force-pushed the peter/pr/dtrace_init_fixes_01 branch from b335a12 to bb3e8b4 Compare October 15, 2021 07:41
@ujfalusi
Copy link
Contributor Author

Changes since v1:

  • move the check for d->dmatb.addr within dma_trace_buffer_alloc()

@lgirdwood
Copy link
Member

@kkarask @wszypelt looks like an unrelated DMIC failure ?

@ujfalusi ujfalusi marked this pull request as draft October 15, 2021 11:34
@ujfalusi
Copy link
Contributor Author

@kkarask @wszypelt looks like an unrelated DMIC failure ?

Almost all of them has the error: can't set params for DMA for trace -110 line.
I will try to reproduce it on my UP2 board, works w/o issue on my tgl-h.

Moving to draft for now.

@lgirdwood
Copy link
Member

@kkarask @wszypelt looks like an unrelated DMIC failure ?

Almost all of them has the error: can't set params for DMA for trace -110 line. I will try to reproduce it on my UP2 board, works w/o issue on my tgl-h.

Moving to draft for now.

Ok, btw I think I only noticed a single red box on QB this mroning and now there are more related to Stream start. Lets try again.

@lgirdwood
Copy link
Member

SOFCI TEST

@ujfalusi
Copy link
Contributor Author

It can only be the last commit which introduces any functional changas, but it should only have effect in case of reconfiguration and I don't see that.
Too bad that the trace (non dma) has no information for us. Hmm, the buffer allocation failure is printed tr_trace().

So far this is not really doing what it supposed to do.

The other thing which might be of issue is that the check for the addr might need to be inside of a lock?

@lgirdwood
Copy link
Member

@ujfalusi best to bisect the PR and we can incrementally merge.

@ujfalusi ujfalusi force-pushed the peter/pr/dtrace_init_fixes_01 branch 3 times, most recently from d5ab186 to 9a101ae Compare October 19, 2021 09:52
@ujfalusi ujfalusi marked this pull request as ready for review October 19, 2021 11:18
@ujfalusi
Copy link
Contributor Author

I'm not certain if it is related to this PR, with the latest devixe test (https://sof-ci.01.org/sofpr/PR4879/build10753/devicetest/):
TGLH_RVP_HDA - check-suspend-resume-5.sh : TIMEOUT
CML_HEL_RT5682 - check-suspend-resume-with-capture-5.sh : FAIL

On the previous one (https://sof-ci.01.org/sofpr/PR4879/build10750/devicetest/) without the two topmost patch:
TGLH_RVP_HDA - check-suspend-resume-with-playback-5.sh : TIMEOUT
CML_SKU0955_HDA - check-suspend-resume-with-capture-5.sh : FAIL

Scheduled another PR test run.

@ujfalusi
Copy link
Contributor Author

The PR test run (7496) also have two failures:
CML_HEL_RT5682 and CML_SKU0955_HDA - check-suspend-resume-with-capture-5.sh : FAIL

The two dmesg starts identical with a firmware boot issue:

kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x80000]=0x5000001 timedout
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: cl_copy_fw: timeout HDA_DSP_SRAM_REG_ROM_STATUS read
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: FW Poll Status: reg[0x160]=0x140002 timedout
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: hda_dsp_stream_trigger: cmd 0 on -- (Playback, stream_tag: 1): timeout on STREAM_SD_OFFSET read
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: DMA trigger stop failed
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (2)
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unknown ROM status value 05000001
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: extended rom status:  0x5000001 0x0 0x0 0x0 0x0 0x0 0x1811102 0x0
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: load fw failed ret: -110
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to start DSP
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump start ]------------
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: fw_state: SOF_FW_BOOT_IN_PROGRESS (2)
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: unknown ROM status value 01009701
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: extended rom status:  0x1009701 0x1d003c 0x1 0x0 0x10000 0xc00 0x180030 0x0
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: ------------[ DSP dump end ]------------
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: error: failed to boot DSP firmware after resume -110
kernel: PM: dpm_run_callback(): pci_pm_resume+0x0/0x80 returns -110
kernel: sof-audio-pci-intel-cnl 0000:00:1f.3: PM: failed to resume async: error -110

But no issue on TGLH_RVP_HDA this time.

@ujfalusi
Copy link
Contributor Author

Another PR test (7500) have again different failure pattern:
CML_RVP_SDW - check-suspend-resume-with-playback-5.sh and check-suspend-resume-with-capture-5.sh : TIMEOUT

Looks like the board does not wake up and the playback version have:

atkbd serio0: Failed to enable keyboard on isa0060/serio0

in the logs.

@ujfalusi
Copy link
Contributor Author

I have tried to bisect via CI the CML TIMEOUT we have in this PR but it was inconclusive as with an empty commit on top of main I see even more failures: https://sof-ci.01.org/sofpr/PR4897/build10772/devicetest/

@ujfalusi
Copy link
Contributor Author

SOFCI TEST

@ujfalusi ujfalusi force-pushed the peter/pr/dtrace_init_fixes_01 branch from 9a101ae to 750c177 Compare October 22, 2021 11:59
@@ -321,6 +328,9 @@ static int dma_trace_start(struct dma_trace_data *d)
d->dc.chan = NULL;
err = dma_copy_set_stream_tag(&d->dc, d->stream_tag);
}

/* Re-initialize the dtrace buffer */
dma_trace_buffer_init(d, NULL);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can a memory leak happen? Is start() called without a stop(), because the latter does free the buffer? Is stop() never called in such a case, just a repeated start()? Maybe worth mentioning this in the commit message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyakh, this PR kind of precedes the stop() from @ranj063.
But if you have old kernel that will not going to call it.
I think this is more like a safety net for the sake of completeness?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record: it is SOF_IPC_TRACE_DMA_FREE and the dma_trace_disable() for the trace stop which if it is called then stops/put the DMA channel and free up the buffer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me check w/o this series tomorrow against the SOF clients version. I might be able just close this PR.

dma_trace_buffer_init() can only fail if the rballoc fails so there is no
need to call dma_trace_buffer_free() as the buffer has not been allocated

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
The current dma_trace_buffer_init() which allocates and initializes the
dtrace buffer is re-tasked to simply (re-)initialize the dtrace buffer
along with placing the markers at the same function.

The allocation of the buffer is split out to dma_trace_buffer_alloc() which
calls the dma_trace_buffer_init() to get the buffer initialized.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
…nfig

We should stop the task before the dtrace channel is stopped when we
need to re-configure it.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Do not attempt to allocate the dtrace buffer again if it has been already
allocated.
This only possible if we are re-configuring the dtrace which implies that
we also must have the DMA channel.

In such case, skip the allocation and do a re-init of the buffer after the
DMA channel has been stopped.

Reported-by: Keyon Jie <yang.jie@linux.intel.com>
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
@ujfalusi ujfalusi force-pushed the peter/pr/dtrace_init_fixes_01 branch from 750c177 to 81aff7b Compare October 26, 2021 12:55
@ujfalusi
Copy link
Contributor Author

Changes since v2:

  • no need to set the end_addr if we are re-configuring the dtrace
  • Extend the comment to clarify that the memory leak is unlikely to happen with proper host implementation

@ujfalusi ujfalusi marked this pull request as draft October 27, 2021 06:19
@ujfalusi
Copy link
Contributor Author

I still see two TIMEOUTs on the tests which makes me a bit nervous. I don't see how this PR can cause it to be honest, but let me try to clarify if it is really due to this or something else.

Please do not merge it!

@lgirdwood
Copy link
Member

@ujfalusi can you rebase and repush, this will retest with latest revert

@marc-hb
Copy link
Collaborator

marc-hb commented Oct 27, 2021

Please do not merge it!

Drafts cannot be merged, I mean not without deliberately changing them to non-drafts.

@ujfalusi can you rebase and repush, this will retest with latest revert

No need to rebase and create git range-diff noise, an zero change commit --amend is enough to change the SHA1 and retest everything merged with the latest.

I keep saying this because I notice people keep missing that tests are always performed on a moving target whether they want that or not. Asking to rebase maintains this misconception.

Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The situation is complicated because the current design (before this PR) does not make sense: dma_trace_init_early() pretends to

Do this early so we can log at initialization time even before the DMA runs. The rest happens later in dma_trace_init_complete() and dma_trace_enable()

But in fact what dma_trace_init_early() does is nowhere near enough because it does not initialize the DMA trace buffer (= what this PR touches).

So why is dma_trace_init_early() lying and why do we have this very gradual and more initialization split over multiple functions called at different points in time? Very good question. The answer is: because old commit eca2089 was merged without commit message and without code review. Before that commit, dma_trace_init_early() used to do what it says and earlier DMA traces (which still exist!) where not silently discarded. For more details check Restore early tracing #4334 which I want to go and try to revive now that a bunch of DMA issues have been fixed.

As #4334 is incompatible with this (and apparently new to people here) I'm defensively blocking this for now until someone proves that #4334 is a bad idea and this is a better one.

Back to the initial point: why should the DMA trace buffer be freed and re-allocated? Why not make it immutable?

@ujfalusi
Copy link
Contributor Author

The situation is complicated because the current design (before this PR) does not make sense: dma_trace_init_early() pretends to

Do this early so we can log at initialization time even before the DMA runs. The rest happens later in dma_trace_init_complete() and dma_trace_enable()

Yes, this comment is completely and without a doubt not stand as of today, regardless of this PR.

But in fact what dma_trace_init_early() does is nowhere near enough because it does not initialize the DMA trace buffer (= what this PR touches).

I agree, with the current code you can not store the dtrace messages between dma_trace_init_early() and dma_trace_enable() (which is executed when SOF_IPC_TRACE_DMA_PARAMS /_EXT is received), where we can do and configure the DMA.

So why is dma_trace_init_early() lying and why do we have this very gradual and more initialization split over multiple functions called at different points in time? Very good question. The answer is: because old commit eca2089 was merged without commit message and without code review. Before that commit, dma_trace_init_early() used to do what it says and earlier DMA traces (which still exist!) where not silently discarded. For more details check Restore early tracing #4334 which I want to go and try to revive now that a bunch of DMA issues have been fixed.

What you need to take into account is that we now have a new IPC (SOF_IPC_TRACE_DMA_FREE) which will free up the buffer via dma_trace_disable() and this IPC is sent when the host releases the trace (before suspend or with the SOF client version when you remove the module).

As #4334 is incompatible with this (and apparently new to people here) I'm defensively blocking this for now until someone proves that #4334 is a bad idea and this is a better one.

I don't see any incompatibility, this PR is not freeing up the buffer. It is making sure that we re-use it if it has been already allocated - which is now unlikely as the dma_trace_disable() will free it up.
In fact this PR could be a better starting point for trying to get the early dtrace storage available as the dma_trace_enable() will not going to double allocate memory, it will re-use the buffer.
Having said that, there are couple of gotchas needs to be addressed, like what is the correct way to handle the w_ptr, r_ptr, avail and whats up with the markers in case of dtrace reconfiguration.
Fwiw when I tried to leave those as they were, the sof-logger output got pretty confused so in this PR I just reset them and place the marker at the start.

Back to the initial point: why should the DMA trace buffer be freed and re-allocated? Why not make it immutable?

It is a good question.
On one hand it kind of make sense to have the dtrace buffer's lifespan equal to the firmware's, allocate once and never free it up.
On the other hand it might make sense to have it released if it is not in use, dtrace is stopped?

If we release the buffer while the dtrace is disabled, not in use then we do loose the messages (and I guess we can not fall them back to mtrace).
If we keep the buffer and we have no dtrace enabled on the host (or the host disables, enables it runtime) we need to track that as well and do_the_right_thing(tm) when next time it gets enabled.

Either way, I don't see this PR in any way blocking the early dtrace or I would rather re-name it to nonstop dtrace support.

@marc-hb marc-hb self-requested a review October 28, 2021 06:38
@ujfalusi
Copy link
Contributor Author

I gave up with the bisecting of TIMEOUT on CML, it just does not worth the time (#4925).
https://sof-ci.01.org/sofpr/PR4925/build10867/devicetest/
and
https://sof-ci.01.org/sofpr/PR4925/build10884/devicetest/
Is the result of exactly the same patch content (reverting the top two patch from this series.).
One has TIMEOUT, the other does not.

I'll move it out from draft and let's continue the discussion around this.

@ujfalusi ujfalusi marked this pull request as ready for review October 28, 2021 10:25
@lgirdwood
Copy link
Member

@ujfalusi @marc-hb ping - any update, could go on the back burner though whilst we do ipc4

@marc-hb
Copy link
Collaborator

marc-hb commented Nov 10, 2021

@ujfalusi correct me but I believe this one is a nice to have compared to some other trace PRs in flight like for instance thesofproject/linux#3136 (or IPC4). One significant trace change at a time.

@ujfalusi
Copy link
Contributor Author

@ujfalusi correct me but I believe this one is a nice to have compared to some other trace PRs in flight like for instance thesofproject/linux#3136 (or IPC4). One significant trace change at a time.

I think the memleak can still happen if the fuzzer is used but not with Linux. I did run a test with current sof-dev and with the client support without this PR and all looked fine (we do release the trace buffer and DMA channel on dtrace free).

@ujfalusi
Copy link
Contributor Author

Closing this PR as #5106 contains most of the fixes but it is extending it further to plug one more leak.

@ujfalusi ujfalusi closed this Dec 17, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants