Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dma-trace: Full support for re-configuration and make dmatb persistent from the time it is allocated #5106

Merged
merged 9 commits into from
Feb 18, 2022

Conversation

ujfalusi
Copy link
Contributor

Hi,

This PR is replacing #4879.

Since last time I have noticed another path for a memory leak which was a hard one to crack:
Within dma_trace_start() when CONFIG_DMA_GW is set we were allocating a dma_sg (to SOF_MEM_ZONE_SYS) and we would do that for every start. The config is only used in that function to configure the DMA channel, but the dma_sg can not be freed up as it is allocated to SOF_MEM_ZONE_SYS. Other zone allocation failed on my tgl-h laptop.

The solution is to make dmatb and this dma_sg persistent from the point they are allocated.
This comes with a bonus that we can still collect dtrace logs to SOF firmware for some extent (if the would be dtrace client is removed we would still have some memory of the events).

The other change compared to #4879 is that I don't reset the firmware dmatb, only the host pointers as when the re-configuration happens it is indicating that we are going to have a new host destination along with 0 offsets.

lyakh
lyakh previously requested changes Dec 20, 2021
@@ -291,6 +302,8 @@ static int dma_trace_start(struct dma_trace_data *d)
}

if (d->dc.chan) {
unsigned int flags;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just noticed, that this file uses unsigned int for spin-lock flags. It should actually be uint32_t. Not something to change in this PR, but a separate PR could update that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll fix it up.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replaced with k_spinlock_key_t as we have switched lock API...

spin_lock_irq(&d->lock, flags);

d->old_host_offset = 0;
d->posn.host_offset = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

above dma_stop() or dma_copy_set_stream_tag() could've failed. Don't you need to abort in some of those cases? I'd reformat this function with the usual gotos to make error handling more obvious

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lyakh, we do abort at line 339.

This is a bit tricky. I would say that we can (and need) to reset the the host_offset in any case, so this can be moved beyond line 341.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved after the error check/abort.

@marc-hb
Copy link
Collaborator

marc-hb commented Dec 24, 2021

SOFCI TEST

https://sof-ci.01.org/sofpr/PR5106/build11466/devicetest/?model=TGLU_UP_HDA_ZEPHYR&testcase=check-playback-all-formats looks like issue #5120

#5120 seems "randomly deterministic", meaning it always happen in some configurations and never in other configurations. So it is possible that this PR causes a deterministic TGLU_UP_HDA_ZEPHYR regression for no obvious reason.

@lgirdwood lgirdwood added this to the v2.1 milestone Dec 25, 2021
Copy link
Contributor

@jsarha jsarha left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me (when other comments are addressed), but you'll need to rebase this after my dtrace buffer alignment PR.

@ujfalusi ujfalusi force-pushed the peter/pr/dtrace_persistent_01 branch from 6a858e2 to 115743a Compare February 14, 2022 08:12
@ujfalusi
Copy link
Contributor Author

Changes since v1 (I think):

  • rebased on main
  • addressed comments from @lyakh

Copy link
Member

@lgirdwood lgirdwood left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ujfalusi some red showing on Zephyr APL, can you split this PR into more parts, based on risk and change size. This way it should be more obvious which parts are causing the CI to report issues. Thanks.

@ujfalusi
Copy link
Contributor Author

@ujfalusi some red showing on Zephyr APL, can you split this PR into more parts, based on risk and change size. This way it should be more obvious which parts are causing the CI to report issues. Thanks.

@lgirdwood, we have exactly the same failures present in unrelated kernel pr, like this: https://sof-ci.01.org/linuxpr/PR3424/build7156/devicetest/

I can split the PR into two, but they can not go in parallel as the patches are depending on each other.

@lgirdwood
Copy link
Member

lgirdwood commented Feb 14, 2022

@ujfalusi no need to split, just seen the other PR showing red too in the same place for APL Zephyr and this has been happening for a while now, we must have been lucky on Friday when the CI consistently showed green.

d->old_host_offset = 0;
d->posn.host_offset = 0;

k_spin_unlock(&d->lock, key);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry, not sure I understand the use of locking well here. I looked through dma-trace.c and I couldn't understand what this lock is actually protecting. As for these host_offset fields - looks like in other locations they are accessed and modified without holding the lock?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

True, there is no need for the lock.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you going to remove the lock ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I will.

key = k_spin_lock(&d->lock);

rfree(buffer->addr);
memset(buffer, 0, sizeof(*buffer));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this needed? I see, that it is just moved from another location, just wondering if we know why we need this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is to reset the variables within buffer.

@lyakh lyakh dismissed their stale review February 16, 2022 07:29

issue addressed, thanks

@lgirdwood
Copy link
Member

SOFCI TEST

@lgirdwood
Copy link
Member

Rerun the CI since this is a trace change.

Copy link
Collaborator

@marc-hb marc-hb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder what effect it will have on

I haven't found the time for a full review yet sorry.

*/
#define SOF_BANNER_COMMON \
"FW ABI 0x%x DBG ABI 0x%x tags SOF:" SOF_GIT_TAG ZEPHYR_VER_OPT \
" src hash 0x%08x (ldc hash " META_QUOTE(SOF_SRC_HASH) ")"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you please move these macros down, back together with the only place they are used?

@ujfalusi
Copy link
Contributor Author

I wonder what effect it will have on

* #4334

I haven't found the time for a full review yet sorry.

I think a simple revert as it is in #4334 will no longer going to cut it to have early dtrace buffer as the dma_trace_buffer_init() now needs d->dc.dmac which is initialized in dma_trace_init_complete().
#4334 also does not address the memory leak this PR is fixing.

We should revisit the early dtrace buffer allocation, I agree. I would also add a Kconfig option to make it selectable with opt-in. I would also add mtrace for wrap notification when the host is not yet (or withdrawn it) provided a host buffer to flush the dtrace buffer to.

dma_trace_buffer_init() can only fail if the rballoc fails so there is no
need to call dma_trace_buffer_free() as the buffer has not been allocated

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Printing the error via dtrace is not going to work when the memory for
the dtrace is not allocated.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Move the tag printing out from dma_trace_enable() to
dma_trace_buffer_init()
to be close proximity of the buffer allocation for easier reading.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
…nfig

We should stop the task before the dtrace channel is stopped when we
need to re-configure it.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
@ujfalusi ujfalusi force-pushed the peter/pr/dtrace_persistent_01 branch from 115743a to 6698e43 Compare February 17, 2022 10:25
@ujfalusi
Copy link
Contributor Author

Changes since v2:

  • drop the pointless locking as pointed out by @lyakh
  • DO not separate the macro used from the tag print from the printing, suggested by @marc-hb

@ujfalusi
Copy link
Contributor Author

I have found other memory leak. I will update this PR with an additional patch (the host buffer's elem_array is not freed on error and on dtrace_free)

We need to free up the allocated SG in case of a failure after a successful
ipc_process_host_buffer() call, which does the allocation.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
Do not attempt to allocate the dtrace buffer again if it has been already
allocated.
This only possible if we are re-configuring the dtrace which implies that
we also must have the DMA channel.

In such case, skip the allocation and do a re-init of the buffer after the
DMA channel has been stopped.

Reported-by: Keyon Jie <yang.jie@linux.intel.com>
Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
When the dtrace is disabled (via the SOF_IPC_TRACE_DMA_FREE IPC message) we
need to also free up the allocated dma_sg_elem_array to avoid leaking
memory on the next dtrace start.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
…ocations

dma_sg_alloc() is called with SOF_MEM_ZONE_SYS which can not be freed up.

dma config or start error skip the free as it is not going to work anyways.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
…tion

Do not free the dmatb so that the trace code can continue collecting
information continuously in it's buffer.

We also have another memory leak (when CONFIG_DMA_GW is set) which can be
handled this way:
Every time the dma_trace_start() is called we allocate a new dma_sg. This
is fine when the dma_trace_start() is only called once right after the
DSP finished booting, but multiple calls would eventually going to lead to
OOM.

Move the dma_sg allocation at the same place where the dmatb is allocated
to make it persistent as well.

Signed-off-by: Peter Ujfalusi <peter.ujfalusi@linux.intel.com>
@ujfalusi ujfalusi force-pushed the peter/pr/dtrace_persistent_01 branch from 6698e43 to 8a051c2 Compare February 17, 2022 12:35
@ujfalusi
Copy link
Contributor Author

Changes since v3:

  • two new patches added to fix other memory leaks:
    • ipc: ipc3: handler: Fix error handling in ipc_dma_trace_config()
    • dma-trace: Fix memory leak of host elem_array on reconfiguration

@marc-hb marc-hb requested a review from jsarha February 17, 2022 22:23
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants