Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Thread analyzer adsp fix #71408

Merged

Conversation

jsarha
Copy link

@jsarha jsarha commented Apr 11, 2024

Due to Intel ADSP's cache architecture the Zephyr thread analyzer does not work out of the box on SOF platform. The core specific caches are not synchronized between the cores and accessing thread date from other core usually causes a crash. This PR adds on option for thread analyzer thread to only "analyze" threads on the same core its running on, and starts thread for each core.

@jsarha
Copy link
Author

jsarha commented Apr 11, 2024

@ujfalusi , @kv2019i , @lyakh and @lgirdwood , could you give this a brief SOF internal review, before I mark this ready for review.

@kv2019i
Copy link
Collaborator

kv2019i commented Apr 12, 2024

I think the code look good for wider review. I think the rationale could be expanded a bit. Which piece of data was problematic without this patch? Zephyr allocates most of the its data uncached on xtensa as well, so a bit surprising the caches are causing a problem here on Zephyr side. E.g. global static data is uncached by default and should be safe to use across cores (evein on these xtensa platforms). It would help to pinpoint the exact data object (and where it is accessed) that is causing problems, and explain why these additions are needed.

@jsarha jsarha force-pushed the thread-analyzer-adsp-fix branch from ecd5d4e to 8958b2c Compare April 12, 2024 10:14
@jsarha
Copy link
Author

jsarha commented Apr 12, 2024

I think the code look good for wider review. I think the rationale could be expanded a bit. Which piece of data was problematic without this patch? Zephyr allocates most of the its data uncached on xtensa as well, so a bit surprising the caches are causing a problem here on Zephyr side. E.g. global static data is uncached by default and should be safe to use across cores (evein on these xtensa platforms). It would help to pinpoint the exact data object (and where it is accessed) that is causing problems, and explain why these additions are needed.

The kernel thread list can be looped through, but it appeared that calling k_thread_runtime_stats_all_get() hung the DSP (I never got any crash, the DSP just became dead), also asking isr stack from neighboring core hung the DSP.

There was a build error for non SMP build, I'll try to fix all the automatic testing issues and the mark this ready for review.

@jsarha jsarha force-pushed the thread-analyzer-adsp-fix branch 2 times, most recently from d1abd85 to adedca7 Compare April 15, 2024 07:33
@jsarha jsarha marked this pull request as ready for review April 15, 2024 19:49
Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kv2019i Surely the problem here is that on SMP builds with adsp, the thread stacks which this utility needs to inspect are in cached/CPU-local memory. So the data it sees will be an essentially arbitrary mix of stale, previously-flushed cache blocks and not the sequentially consistent data it expects. But agree that the comments should be clearer on what this is doing.

Nitpick-wise: Might be worth tying this directly to CONFIG_KERNEL_COHERENCE, which is the actual feature that enables the cache-incoherent thread stacks. If you don't have that =y there's no need for the per-CPU thread analyzer.

Likewise this is using CONFIG_SCHED_CPU_MASK APIs, which is an optional feature that should be listed in a "depends on" in kconfig.

But on the whole this is a reasonable and straightforward workaround for the problem.


static K_THREAD_STACK_ARRAY_DEFINE(analyzer_thread_stacks, CONFIG_MP_MAX_NUM_CPUS,
CONFIG_THREAD_ANALYZER_AUTO_STACK_SIZE);
static struct k_thread analyzer_thread[CONFIG_MP_MAX_NUM_CPUS];
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't know that you need N separate threads, it would be enough to have just one that sequentially re-pins itself to each core in sequence, via e.g. a k_timer or something (you can't pin a running thread, so it can't move itself directly).

But then I guess that requires knowing which CPUs are actually unsuspended, which would be difficult.

Copy link
Author

@jsarha jsarha Apr 16, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should in theory be possible to survive with two threads. One all the time at core 0, and the other being pinned by the core 0 thread on different cores, synchronized with semaphores, and having some timeout mechanism to skip suspended cores... but then again I don't think the save is worth all that complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FWIW, the savings is very significant. A thread stack is the largest kernel data structure you're going to allocate. The DSPs are big but ~8k of stack memory on MTL is still a big waste. And again, I don't see that this is so complicated? Just pend the thread for each change of CPU and bounce to a k_timer to do the migration. You'll pay a cost of 0-1 ticks (MTL is still 12kHz ticks I think?), which seems reasonable for a debug feature.

@jsarha
Copy link
Author

jsarha commented Apr 16, 2024

@kv2019i Surely the problem here is that on SMP builds with adsp, the thread stacks which this utility needs to inspect are in cached/CPU-local memory. So the data it sees will be an essentially arbitrary mix of stale, previously-flushed cache blocks and not the sequentially consistent data it expects. But agree that the comments should be clearer on what this is doing.

Nitpick-wise: Might be worth tying this directly to CONFIG_KERNEL_COHERENCE, which is the actual feature that enables the cache-incoherent thread stacks. If you don't have that =y there's no need for the per-CPU thread analyzer.

I tried setting CONFIG_KERNEL_COHERENCE=n and taking away my changes, but the FW still crashed with the change. So there is something else to it than just that.

Likewise this is using CONFIG_SCHED_CPU_MASK APIs, which is an optional feature that should be listed in a "depends on" in kconfig.

Thanks, I'll add the dependency.

But on the whole this is a reasonable and straightforward workaround for the problem.

@jsarha jsarha force-pushed the thread-analyzer-adsp-fix branch 2 times, most recently from d271f5d to ca6a826 Compare April 19, 2024 15:01
@jsarha
Copy link
Author

jsarha commented Apr 26, 2024

@nashif , Liam suggested me to ping you about this PR.

@andyross
Copy link
Contributor

andyross commented May 3, 2024

I tried setting CONFIG_KERNEL_COHERENCE=n and taking away my changes, but the FW still crashed with the change.

No, I meant predicate this feature so that it gets enabled when COHERENCE=y. You can't meaningfully run SMP kernels on intel_adsp without COHERENCE (well, you can, but it maps all memory as uncached -- this used to work but might have bitrotten).

@jsarha jsarha force-pushed the thread-analyzer-adsp-fix branch 2 times, most recently from 8d5390d to 8c570a5 Compare May 6, 2024 17:54
@jsarha
Copy link
Author

jsarha commented May 10, 2024

Then why don't you simply pass each pinned thread its CPU number as an extra
argument? Problem solved, no deceptive syscall needed.

Sure, I can do that. Then, instead of adding k_thread_foreach_current_cpu() I would need to implement something like k_thread_foreach_filter_by_cpu(k_thread_user_cb_t user_cb, void *user_data, int cpu). Would you like that better?

Oh, and I would need k_thread_runtime_stats_get_by_cpu() or something too.

Applied the change suggested here.

@jsarha jsarha force-pushed the thread-analyzer-adsp-fix branch from 4d69f51 to bbbc478 Compare May 15, 2024 16:19
@jsarha
Copy link
Author

jsarha commented May 15, 2024

Fix CI errors by adding couple of ARG_UNUSED().

@jsarha
Copy link
Author

jsarha commented May 21, 2024

There, in order to get benefit from uncoherent per cpu cache, all
processes are pinned to single cpu only. That also the reason why thread
analyzer thread can not go around peeking the thread statistics from a
neighboring cpu, but each cpu should have their own thread analyzer thread.

Then why don't you simply pass each pinned thread its CPU number as an extra argument? Problem solved, no deceptive syscall needed.

@npitre this is now done. Is there something more I should do here to get this merged?

@jsarha jsarha force-pushed the thread-analyzer-adsp-fix branch from bbbc478 to 273f9cd Compare May 22, 2024 19:50
@jsarha
Copy link
Author

jsarha commented May 22, 2024

Oops, k_thread_foreach_filter_by_cpu() implementation had wrong name. Apparently I only tested the unlocked version. Now fixed.

andyross
andyross previously approved these changes May 22, 2024
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jsarha Latter two patches ok, but one clarifying question on first patch. There seems to be a semantics difference between locked and unlocked versions. Am I missing something?

kernel/thread_monitor.c Outdated Show resolved Hide resolved
kernel/thread_monitor.c Show resolved Hide resolved
subsys/debug/thread_analyzer.c Outdated Show resolved Hide resolved
Jyri Sarha added 2 commits May 27, 2024 22:24
…sion

Add functions k_thread_foreach_unlocked_filter_by_cpu() and
k_thread_foreach_filter_by_cpu() to loop through the threads on the
specified cpu only.

Signed-off-by: Jyri Sarha <jyri.sarha@linux.intel.com>
Add k_thread_runtime_stats_cpu_get() to get runtime statistics of
the specified core.

Signed-off-by: Jyri Sarha <jyri.sarha@linux.intel.com>
…tely

Add implementation to analyze threads on each cpu separately. This
feature can be enabled with THREAD_ANALYZER_AUTO_SEPARATE_CORES Kconfig
option. If enabled, an analyzer thread is started for each cpu, and
the threads will only analyze thread on the cpu its running on.

This feature is needed for Intel ADSP platform, where cpu specific
caches are not synchronized between the cpu. It is also probably
needed by other platforms having CONFIG_KERNEL_COHERENCE=y, so default
to THREAD_ANALYZER_AUTO_SEPARATE_CORES=y for those platform.

Signed-off-by: Jyri Sarha <jyri.sarha@linux.intel.com>
@jsarha jsarha force-pushed the thread-analyzer-adsp-fix branch from 5bce654 to 17e7a84 Compare May 27, 2024 19:27
Copy link
Collaborator

@kv2019i kv2019i left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, looks good now!

@jsarha
Copy link
Author

jsarha commented May 31, 2024

@andyross , there was one-line bug-fix, but otherwise this should be the same as the version you approved.

@jsarha
Copy link
Author

jsarha commented Jul 8, 2024

Argh, forgot about that this is still lying around. @andyross , a gentle ping!

@lgirdwood I'am not sure how to get this moving.

@jsarha
Copy link
Author

jsarha commented Jul 8, 2024

@peter-mitsis , could you spare time to review this?

@peter-mitsis
Copy link
Collaborator

@jsarha - Taking a look at this again.

Copy link
Contributor

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still looks good here

@nashif nashif added this to the v4.0.0 milestone Jul 12, 2024
@aescolar
Copy link
Member

@jsarha @andyross @nashif is this meant to be backported to 3.7?

@jsarha
Copy link
Author

jsarha commented Jul 29, 2024

@jsarha @andyross @nashif is this meant to be backported to 3.7?

@aescolar No, we should not have any need for back-porting, but I would very much like to see this merged for the next release.

@andyross
Copy link
Contributor

@jsarha @andyross @nashif is this meant to be backported to 3.7?

@kv2019i @lyakh might have input too. But broadly SOF tracks Zephy upstream closely. At least so far there hasn't been an attempt to tie a particular SOF release to a Zephyr LTS. And honestly even if there were, this is a fix to a debug feature that isn't used in production audio firmware, so seems like it would be low priority.

(Just FWIW: I'm fairly conservative about bugfix backports. Security bugs, 100%. Bugs reported by actual shipping apps, absolutely. But just general fixes? There are too many and IMHO the churn caused by the fixes themselves are a higher risk than the bugs they fix in most cases.)

@fabiobaltieri fabiobaltieri merged commit 1b6e0f6 into zephyrproject-rtos:main Jul 30, 2024
27 checks passed
Copy link

Hi @jsarha!
Congratulations on getting your very first Zephyr pull request merged 🎉🥳. This is a fantastic achievement, and we're thrilled to have you as part of our community!

To celebrate this milestone and showcase your contribution, we'd love to award you the Zephyr Technical Contributor badge. If you're interested, please claim your badge by filling out this form: Claim Your Zephyr Badge.

Thank you for your valuable input, and we look forward to seeing more of your contributions in the future! 🪁

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

10 participants