
[BUG] the semantics of enter_critical_section is unclear #14593

Open
1 task done
yamt opened this issue Nov 1, 2024 · 9 comments
Labels
Arch: all Issues that apply to all architectures Area: Api API Issues Area: Kernel Kernel issues OS: Mac Issues related to MacOS (building system, etc) Type: Bug Something isn't working

Comments

@yamt
Contributor

yamt commented Nov 1, 2024

Description / Steps to reproduce the issue

https://github.com/apache/nuttx/blob/master/Documentation/implementation/critical_sections.rst

Without going into the details of the implementation of enter_critical_section() suffice it to say that it (1) disables interrupts on the local CPU and (2) uses spinlocks to assure exclusive access to a code sequence across all CPUs.

Since users of enter_critical_section don't seem to have a way to limit its scope,
I guess it implies enter_critical_section works as a single global lock covering the whole system.
Is that right?

Also, it isn't clear to me whether enter_critical_section is supposed to disable interrupts on remote CPUs.
The comment in irq_csection.c says

Take the CPU IRQ lock and disable interrupts on all CPUs.

On the other hand, another comment in the file says

NOTE 1: Ideally this should disable interrupts on all CPUs, but most
architectures only support disabling interrupts on the local CPU.

They seem to contradict each other.

On which OS does this issue occur?

[OS: Mac]

What is the version of your OS?

macOS 14.7

NuttX Version

master

Issue Architecture

[Arch: all]

Issue Area

[Area: Api], [Area: Kernel]

Verification

  • I have verified before submitting the report.
@yamt yamt added the Type: Bug Something isn't working label Nov 1, 2024
@github-actions github-actions bot added Arch: all Issues that apply to all architectures Area: Api API Issues Area: Kernel Kernel issues OS: Mac Issues related to MacOS (building system, etc) labels Nov 1, 2024
@acassis
Contributor

acassis commented Nov 1, 2024

@anchao could you please help to shed some light here?

@anchao
Contributor

anchao commented Nov 4, 2024

In the current implementation, enter_critical_section() is a global lock used to protect critical sections between multiple CPUs under SMP case. In performance evaluations, the frequent calls to enter_critical_section() under SMP have posed significant challenges for the system real-time capability. @hujun260 's work focuses on optimizing SMP performance on NuttX, and we'll see that many of his contributions are related to critical section protection and interrupt refactor.

Here are a few PRs under review, @acassis @yamt
#14626
#14623

@hujun260
Contributor

hujun260 commented Nov 4, 2024

I guess it implies enter_critical_section works as a single global lock covering the whole system.
Is that right?

It can be understood that way.

However, because the enter_critical_section function is complex, it can encompass any region,
and in particular a context switch may occur inside the protected region, which can defeat the protection.
This is a weakness of the critical section.

The current recommendation is that, unless you are protecting the tcb/group,
good programming practice is to avoid critical sections.

Also, it isn't clear to me whether enter_critical_section is supposed to disable interrupts on remote CPUs. The comment in irq_csection.c says

Take the CPU IRQ lock and disable interrupts on all CPUs.

on the other hand, other comment in the file says

NOTE 1: Ideally this should disable interrupts on all CPUs, but most
architectures only support disabling interrupts on the local CPU.

They seem to contradict each other.

enter_critical_section only disables interrupts on the local CPU; it cannot disable interrupts on remote CPUs.

@yamt
Contributor Author

yamt commented Nov 5, 2024

I guess it implies enter_critical_section works as a single global lock covering the whole system.
Is that right?

It can be understood that way.

However, because the enter_critical_section function is complex, it can encompass any region, and in particular a context switch may occur inside the protected region, which can defeat the protection. This is a weakness of the critical section.

The current recommendation is that, unless you are protecting the tcb/group, good programming practice is to avoid critical sections.

What do you recommend using instead?
A mutex? A semaphore? A spinlock?

What do you think about introducing somewhat higher-level primitives for the kernel? E.g. a Solaris-like adaptive mutex.

Also, it isn't clear to me whether enter_critical_section is supposed to disable interrupts on remote CPUs. The comment in irq_csection.c says

Take the CPU IRQ lock and disable interrupts on all CPUs.

on the other hand, other comment in the file says

NOTE 1: Ideally this should disable interrupts on all CPUs, but most
architectures only support disabling interrupts on the local CPU.

They seem to contradict each other.

enter_critical_section only disables interrupts on the local CPU; it cannot disable interrupts on remote CPUs.

So, the current semantics of enter_critical_section are:

  • it disables interrupts on the calling CPU.
  • it does NOT disable interrupts on remote CPUs.
  • it acquires a system-global, recursive lock, which is automatically released/re-acquired when the current thread blocks/wakes up (similar to the big kernel lock in traditional Unix).

Is that right?

@hujun260
Contributor

hujun260 commented Nov 5, 2024

What do you recommend using instead?
A mutex? A semaphore? A spinlock?

Which synchronization strategy to use depends on the program logic.
I believe that if the protected region executes quickly and without context switching,
a spinlock is the preferred choice, and recursive locks should not be used.

What do you think about introducing somewhat higher-level primitives for the kernel? E.g. a Solaris-like adaptive mutex.

I don't think an adaptive mutex is a good idea.

A general-purpose lock strategy can easily lead to abuse and a decline in code quality.
We need more precise lock behavior, especially in an RTOS, where precise timing control
and good programming practices are required.

Critical sections in NuttX are a case in point: they are used everywhere
and significantly degrade performance in complex SMP scenarios.

So, the current semantics of enter_critical_section are:

  • it disables interrupts on the calling CPU.
  • it does NOT disable interrupts on remote CPUs.
  • it acquires a system-global, recursive lock, which is automatically released/re-acquired when the current thread blocks/wakes up (similar to the big kernel lock in traditional Unix).

Is that right?

Yes.
Additionally, the old enter_critical_section implementation was more complex than the current one,
as it also handled sync-pause strategies. We have already made some simplifications.
Looking ahead, we want to remove the recursive strategy and the automatic release/re-acquire strategy within the critical section, turning it into a simple spinlock. But this will take a relatively long time.

@pussuw
Contributor

pussuw commented Nov 5, 2024

So, the current semantics of enter_critical_section are:

* it disables interrupts on the calling CPU.

Yes

* it does NOT disable interrupts on remote CPUs.

Yes

* it acquires a system-global, recursive lock, which is automatically released/re-acquired when the current thread blocks/wakes up (similar to the big kernel lock in traditional Unix).

Yes

Is that right?

The release/re-acquire logic is a bit hard to follow, at least for me. I'm stuck with my SMP integration for MPFS + BUILD_KERNEL because I'm getting random crashes / DEBUGASSERT()s from irq_csection() (as well as deadlocks). There is something odd going on with this logic.

EDIT: It is somehow tied to the fact that we have two "types" of context switches: from ISR and from thread. There is some kind of hole in the recursive irqlock handling related to this.

The issue might be that we have a task-specific tcb.irqcount (which is IMO the correct way to do the recursion, as critical sections are task-specific state) and a global per-CPU irqcount, g_cpu_nestcount, which is supposed to supersede the task-specific recursion counter "somehow".

@hujun260
Contributor

hujun260 commented Nov 6, 2024

Is that right?

Yes.

The issue might be that we have a task-specific tcb.irqcount (which is IMO the correct way to do the recursion, as critical sections are task-specific state) and a global per-CPU irqcount, g_cpu_nestcount, which is supposed to supersede the task-specific recursion counter "somehow".

When an interrupt occurs, g_cpu_nestcount is used as the recursion counter because a context switch may happen and this_task() will change, so tcb->irqcount cannot be used.

@yamt
Contributor Author

yamt commented Nov 6, 2024

What do you recommend using instead?
A mutex? A semaphore? A spinlock?

Which synchronization strategy to use depends on the program logic. I believe that if the protected region executes quickly and without context switching, a spinlock is the preferred choice, and recursive locks should not be used.

I feel a bare spinlock is too primitive for many cases.

What do you think about introducing somewhat higher-level primitives for the kernel? E.g. a Solaris-like adaptive mutex.

I don't think an adaptive mutex is a good idea.

A general-purpose lock strategy can easily lead to abuse and a decline in code quality. We need more precise lock behavior, especially in an RTOS, where precise timing control and good programming practices are required.

Although it doesn't need to be an adaptive mutex (that was just an example),
it would be nice to have a small set of APIs that can cover 90% of use cases.

Critical sections in NuttX are a case in point: they are used everywhere and significantly degrade performance in complex SMP scenarios.

Well, a typical implementation of an adaptive mutex would be far better than a system-global critical section,
even for an RTOS.

@pussuw
Contributor

pussuw commented Nov 6, 2024

When an interrupt occurs, g_cpu_nestcount is used as the recursion counter because a context switch may happen and this_task() will change, so tcb->irqcount cannot be used.

Yes, this much I was able to figure out when I tried to remove g_cpu_nestcount.

There is still a hole somewhere in the acquire/release logic, as I can see random DEBUGASSERTs and deadlocks which happen when:

  • CPUx tries to take the big kernel lock from ISR
  • CPUy tries to take the big kernel lock from thread

And very likely a context switch is what triggers the issue. This happens more often under high-load scenarios, like system bootup. In our case the system boot uses a lot of CPU; it also spawns a lot of processes that start, run, and exit almost immediately.

I'll keep investigating.
