sem: change sem wait to atomic operation #14465

Merged: 1 commit merged into apache:master on Oct 29, 2024

zyfeier commented Oct 23, 2024

Summary

Add a fast path to sem_wait: use atomic operations to guarantee the atomicity of semcount updates, so the fast path does not depend on a critical section.

Measured on a robot board:
Before the change:
nxmutex_lock cost: 78 ns
nxmutex_unlock cost: 82 ns

After the change:
nxmutex_lock cost: 28 ns
nxmutex_unlock cost: 14 ns
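
For readers unfamiliar with the approach, the fast path described in the summary can be sketched with C11 atomics as follows. This is a minimal illustration and not the code from this PR: `demo_mutex_t`, `demo_mutex_trylock_fast`, and `demo_mutex_unlock_fast` are hypothetical names, the count convention (1 = free, 0 = held) is taken from the review snippets, and the strong compare-exchange is used only to keep the sketch deterministic.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a mutex-type semaphore: 1 = free, 0 = held. */

typedef struct
{
  atomic_short count;
} demo_mutex_t;

/* Fast lock: succeeds only if the count moves from 1 to 0 in one atomic
 * step. On failure the caller would fall back to a slow path (not shown).
 */

static bool demo_mutex_trylock_fast(demo_mutex_t *m)
{
  short expected = 1;

  assert(m != NULL);
  return atomic_compare_exchange_strong_explicit(&m->count, &expected, 0,
                                                 memory_order_acquire,
                                                 memory_order_relaxed);
}

/* Fast unlock: succeeds only if the count is exactly 0. Any other value
 * is left untouched for a slow path (assumed, not shown).
 */

static bool demo_mutex_unlock_fast(demo_mutex_t *m)
{
  short expected = 0;

  assert(m != NULL);
  return atomic_compare_exchange_strong_explicit(&m->count, &expected, 1,
                                                 memory_order_release,
                                                 memory_order_relaxed);
}
```

Because neither function disables interrupts or enters a critical section, the uncontended case costs a single atomic instruction sequence, which is the kind of saving the measurements above suggest.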

Impact

semaphore

Testing

bes board: monkey test passes
sabre-6quad:smp: ostest passes

Signed-off-by: zhangyuan29 <zhangyuan29@xiaomi.com>
github-actions bot added the labels "Area: OS Components" and "Size: M" on Oct 23, 2024

nuttxpr commented Oct 23, 2024

[Experimental Bot, please feedback here]

The PR summary is somewhat informative, but the rest of the information is insufficient to determine if it meets NuttX requirements. Here's a breakdown:

What's Good:

  • Summary: Clearly states the technical changes made. The performance improvement mentioned is a good addition.
  • Testing: Mentions specific boards used for testing.

What's Missing:

  • Summary:
    • Why? No explanation of the problem the PR solves. Why were the original semaphore operations slow? Why is atomicity important in this context?
    • Issue reference: Is there a related NuttX issue?
  • Impact:
    • User Impact: While the summary mentions semaphore changes, it doesn't explain if users need to adapt their code. Will existing semaphore usage still work correctly?
    • Other Impacts: The PR lacks details about potential impacts on build systems, hardware, documentation, security, and compatibility.
  • Testing:
    • Incomplete Information: The PR needs to specify the host operating system, CPU architecture, and compiler used for testing.
    • Vague Descriptions: "bes board with monkey test pass" and "sabre-6quad:smp with ostest pass" are too vague. Provide specific details about the tests performed and what "pass" means in each context.
    • Missing Logs: The PR should include actual testing logs from before and after the changes, as the template requests.

Conclusion:

The PR needs significant improvements to meet NuttX requirements. It needs to be more descriptive, particularly in the "Impact" section and provide specific evidence of testing.

anchao left a comment

LGTM, just some minor comments

if (sem->flags & SEM_TYPE_MUTEX)
{
short old = 0;
if (atomic_compare_exchange_weak_explicit(NXSEM_COUNT(sem), &old, 1,

if (sem->flags & SEM_TYPE_MUTEX)
{
short old = 1;
if (atomic_compare_exchange_weak_explicit(NXSEM_COUNT(sem), &old, 0,

Contributor commented

ditto

*/

#if !defined(CONFIG_PRIORITY_INHERITANCE) && !defined(CONFIG_PRIORITY_PROTECT)
if (sem->flags & SEM_TYPE_MUTEX)

Contributor commented

why not move SEM_TYPE_MUTEX related logic inside mutex?

Contributor Author commented

If the atomic operation on the count were implemented in the lib mutex, mutex_lock could return early through fast_lock without updating sem->semcount. That, in turn, could cause mutex_unlock to fail.

{
sem->semcount = 1;

Contributor commented

I think this PR should be split in two: one that changes the semaphore count type to atomic, and another that adds the fast mutex.

Contributor commented

If fast mutexes are implemented, the implementation should be in user space (i.e. libc) so that memory-protected builds benefit as well. In those builds, taking the lock (in our case, a semaphore) goes through a syscall, which is a very expensive operation.

Contributor Author commented

Yes, this patch only optimizes mutex performance in the kernel; the userspace optimization will be implemented in subsequent patches.

* Pre-processor Definitions
****************************************************************************/

#define NXSEM_COUNT(s) ((FAR atomic_short *)&(s)->semcount)

pussuw commented Oct 23, 2024

Should semcount be atomic_short to begin with? What happens if the compiler or architecture cannot handle data types narrower than the natural width atomically without locking?

Contributor Author commented

In C code, the type of atomic_short is always short.

pussuw commented Oct 23, 2024

Yes, but my point is: is short guaranteed to be atomic without locking? atomic_short can be forwarded to stdatomic, which can use locks to implement the read-modify-write. On some architectures, only the natural data width can be handled atomically by the hardware.

So should we use u64 as semcount for 64-bit architectures, u32 for 32-bit architectures, etc.?
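
One way to probe this question on a given toolchain is the standard `ATOMIC_SHORT_LOCK_FREE` macro from C11 `<stdatomic.h>`. The sketch below is only an illustration: it assumes the toolchain ships a real C11 `<stdatomic.h>` (which may not hold for every NuttX configuration), and the `demo_` function names are hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* ATOMIC_SHORT_LOCK_FREE is defined by C11:
 *   0 = atomic_short is never lock-free
 *   1 = atomic_short is sometimes lock-free
 *   2 = atomic_short is always lock-free
 */

static int demo_short_lock_free_level(void)
{
  return ATOMIC_SHORT_LOCK_FREE;
}

/* Hypothetical policy helper: only treat the fast path as safe when the
 * implementation promises that atomic_short never takes a lock.
 */

static bool demo_short_always_lock_free(void)
{
  int level = demo_short_lock_free_level();

  assert(level >= 0 && level <= 2);
  return level == 2;
}
```

A result other than 2 would suggest that operations on `atomic_short` may be implemented with a lock on that target, which is the situation the comment above is concerned about.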

* else try to get it in slow mode.
*/

#if !defined(CONFIG_PRIORITY_INHERITANCE) && !defined(CONFIG_PRIORITY_PROTECT)

Contributor commented

So the fast path is usable only when priority inheritance is not used? For mutexes you could actually set the holder atomically without locking (by setting PID) although for semaphores this is not the case (semaphores can have several holders).

Contributor Author commented

Yes, the fast path cannot be used when priority inheritance is enabled, because the holder functions need to enter a critical section.

Contributor commented

For semaphores this is true, as semaphores can have multiple holders, but for mutexes you can set PID atomically without locking / critical section, as mutexes will only have 1 holder.
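
The idea of recording a single holder atomically can be sketched as a compare-exchange on a PID field. This is only an illustration of the suggestion above, not NuttX code: `demo_pi_mutex_t`, `DEMO_NO_HOLDER`, and both functions are hypothetical, and real priority inheritance would need further bookkeeping that is not shown.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_NO_HOLDER (-1)  /* Hypothetical marker: nobody holds the mutex */

typedef struct
{
  atomic_int holder;         /* PID of the single holder, or DEMO_NO_HOLDER */
} demo_pi_mutex_t;

/* Claim the mutex by atomically swapping DEMO_NO_HOLDER for our PID. */

static bool demo_pi_mutex_claim(demo_pi_mutex_t *m, int pid)
{
  int expected = DEMO_NO_HOLDER;

  assert(m != NULL && pid >= 0);
  return atomic_compare_exchange_strong_explicit(&m->holder, &expected, pid,
                                                 memory_order_acquire,
                                                 memory_order_relaxed);
}

/* Release succeeds only for the PID that currently holds the mutex. */

static bool demo_pi_mutex_release(demo_pi_mutex_t *m, int pid)
{
  int expected = pid;

  assert(m != NULL && pid >= 0);
  return atomic_compare_exchange_strong_explicit(&m->holder, &expected,
                                                 DEMO_NO_HOLDER,
                                                 memory_order_release,
                                                 memory_order_relaxed);
}
```

This works for a mutex because there is exactly one holder to record; a counting semaphore with several holders cannot be represented by a single field in this way.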

xiaoxiang781216 commented

@pussuw should we merge this patch and optimize priority inheritance and semaphore later?

pussuw commented Oct 29, 2024

@pussuw should we merge this patch and optimize priority inheritance and semaphore later?

Fine by me

xiaoxiang781216 merged commit befe298 into apache:master on Oct 29, 2024
26 checks passed

tmedicci commented

Hi @zyfeier, this PR broke esp32-devkitc:sotest. The device can't boot anymore:

Steps to reproduce

make -j distclean && ./tools/configure.sh esp32-devkitc:sotest &&  make -j bootloader &&  make flash EXTRAFLAGS="-Wno-cpp -Werror" ESPTOOL_PORT=/dev/ttyUSB0 ESPTOOL_BINDIR=./ -s -j$(nproc) && minicom -D /dev/ttyUSB0

Output:

rst:0x1 (POWERON_RESET),boot:0x13 (SPI_FAST_FLASH_BOOT)
configsip: 0, SPIWP:0xee
clk_drv:0x00,q_drv:0x00,d_drv:0x00,cs0_drv:0x00,hd_drv:0x00,wp_drv:0x00
mode:DIO, clock div:2
load:0x3ffb20a0,len:3604
load:0x40080000,len:24576
entry 0x400827ac
*** Booting NuttX ***
I (27) boot: chip revision: v3.0
I (28) boot.esp32: SPI Speed      : 40MHz
I (28) boot.esp32: SPI Mode       : DIO
I (29) boot.esp32: SPI Flash Size : 8MB
I (33) boot: Enabling RNG early entropy source...
dram: lma 0x00001020 vma 0x3ffb20a0 len 0xe14    (3604)
iram: lma 0x00001e3c vma 0x40080000 len 0x6000   (24576)
padd: lma 0x00007e48 vma 0x00000000 len 0x81b0   (33200)
imap: lma 0x00010000 vma 0x400e0000 len 0x17804  (96260)
padd: lma 0x0002780c vma 0x00000000 len 0x880c   (34828)
dmap: lma 0x00030020 vma 0x3f400020 len 0x7f20   (32544)
total segments stored 6
A__esp32_start: ESP32 chip revision is v3.0
Bxtensa_user_panic: User Exception: EXCCAUSE=0003 task: Idle_Task
dump_assert_info: Current Version: NuttX  10.4.0 befe29801f Oct 31 2024 17:56:57 xtensa
dump_assert_info: Assertion failed user panic: at file: common/xtensa_assert.c:180 task: Idle_Task process: Kernel 0x400e15d8
up_dump_register:    PC: 400f0a59    PS: 00060d30
up_dump_register:    A0: 800e392d    A1: 3ffb0af0    A2: 40086000    A3: 00040000
up_dump_register:    A4: 00040000    A5: 40086000    A6: 00040001    A7: 00040001
up_dump_register:    A8: 00000000    A9: ffff0000   A10: 00000001   A11: 00000000
up_dump_register:   A12: 3ffb0cc4   A13: 00000000   A14: 00000000   A15: 00000001
up_dump_register:   SAR: 00000020 CAUSE: 00000003 VADDR: 40086000
up_dump_register:  LBEG: 400e4f50  LEND: 400e4f5d  LCNT: 00000000
dump_stacks: ERROR: Stack pointer is not within the stack
dump_stackinfo: User Stack:
dump_stackinfo:   base: 0
dump_stackinfo:   size: 00000000
stack_dump: 0: 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000
dump_tasks:    PID GROUP PRI POLICY   TYPE    NPX STATE   EVENT      SIGMASK          STACKBASE  STACKSIZE   COMMAND

The crash's call stack is not useful. Can you please take a look?

zyfeier commented Nov 1, 2024

@tmedicci We don't have an ESP32 board. Can we reproduce it using QEMU?

zyfeier commented Nov 1, 2024

@tmedicci We cannot reproduce this issue on other boards or in QEMU with the Xtensa architecture, and we do not have an ESP32 board. Could you provide the ELF file, or help debug with J-Link to gather more information? Thanks.

tmedicci commented Nov 1, 2024

Hi! Of course, check the backtrace using the J-Link:

#0  nxsem_wait (sem=0x40086140) at semaphore/sem_wait.c:263
#1  0x400e3e0a in nxmutex_lock (mutex=0x40086140) at misc/lib_mutex.c:253
#2  0x400e54b4 in mm_lock (heap=0x40086140) at mm_heap/mm_lock.c:97
#3  0x400e5330 in mm_addregion (heap=0x40086140, heapstart=0x400862b8, heapsize=105800)
    at mm_heap/mm_initialize.c:112
#4  0x400e5471 in mm_initialize (name=<optimized out>, heapstart=0x400862b8, heapsize=105800)
    at mm_heap/mm_initialize.c:287
#5  0x400e64b9 in esp32_iramheap_initialize () at chip/esp32_iramheap.c:62
#6  0x400e64a1 in up_extraheaps_init () at chip/esp32_extraheaps.c:66
#7  0x400e1b6a in nx_start () at init/nx_start.c:603
#8  0x4008291a in __esp32_start () at chip/esp32_start.c:293
#9  __start () at chip/esp32_start.c:358

It fails when running atomic_compare_exchange_weak_explicit while adding the IRAM heap. Note that this heap region is accessed through the instruction bus, and any access that is not word-aligned triggers an exception. When the exception triggers, it is handled here, and execution is supposed to resume from the point where it was first triggered. Somehow, this function breaks that mechanism.

zyfeier commented Nov 4, 2024

@tmedicci Could you please help test if this patch #14625 can fix the issue? Thanks.

yamt added a commit to yamt/incubator-nuttx that referenced this pull request Nov 13, 2024
Regressions caused by signedness issues in
"sem: change sem wait to atomic operation".
(apache#14465)

An alternative would be to make these atomic macros propagate
signedness using the typeof() GCC/clang extension. I'm not inclined
to do so because typeof is not very portable. As we are unlikely to be
able to require "real" C11 atomics in the foreseeable future, maybe we
should use a different set of names from C11 to avoid confusion.

/* The following operations must be performed with interrupts
* disabled because sem_post() may be called from an interrupt
* handler.
*/

flags = enter_critical_section();

- sem_count = sem->semcount;
+ sem_count = atomic_fetch_add(NXSEM_COUNT(sem), 1);

/* Check the maximum allowable value */

if (sem_count >= SEM_VALUE_MAX)
{

Contributor commented

The overflowed value is already visible to other threads at this point, isn't it?
Isn't that a problem?

Contributor commented

I guess it's safer to use compare-exchange.
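
A compare-exchange loop can check the limit before the new value is published, so other threads never observe a count above the maximum. The sketch below illustrates that suggestion under stated assumptions: `demo_sem_post_checked` and `DEMO_SEM_VALUE_MAX` are hypothetical names (the limit is set to 0x7fff here for illustration only), and the wake-up of waiters is omitted.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define DEMO_SEM_VALUE_MAX 0x7fff  /* Hypothetical limit for this sketch */

/* Increment the count only if the result stays within the limit.
 * Returns false, leaving the count unchanged, when it would overflow.
 */

static bool demo_sem_post_checked(atomic_short *count)
{
  short old;

  assert(count != NULL);
  old = atomic_load_explicit(count, memory_order_relaxed);

  do
    {
      if (old >= DEMO_SEM_VALUE_MAX)
        {
          return false;  /* Nothing was written, so nothing leaked out */
        }

      /* On failure, 'old' is refreshed with the current value and the
       * limit check is repeated before the next attempt.
       */
    }
  while (!atomic_compare_exchange_weak_explicit(count, &old,
                                                (short)(old + 1),
                                                memory_order_release,
                                                memory_order_relaxed));

  return true;
}
```

Unlike a fetch-and-add followed by a check, this variant performs the check first and only commits the increment if the observed value is still current.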

chirping78 commented

Using JTAG to single step through the nxsem_wait function, it shows that it's not an alignment issue, but the s32c1i instruction caused the exception.

   0x400f0468 <+196>:   or      a14, a13, a9
   0x400f046b <+199>:   or      a8, a9, a9
   0x400f046e <+202>:   wsr.scompare1   a14
=> 0x400f0471 <+205>:   s32c1i  a8, a12, 0
   0x400f0474 <+208>:   beq     a8, a14, 0x400f0480 <nxsem_wait+220>
   0x400f0477 <+211>:   or      a14, a9, a9
   0x400f047a <+214>:   and     a9, a8, a11
   0x400f047d <+217>:   bne     a14, a9, 0x400f0468 <nxsem_wait+196>

At this point, the register values are as expected:

(gdb) p/x $a14
$3 = 0x40001
(gdb) p/x $a12
$4 = 0x40085fd0
(gdb) p/x $a8
$5 = 0x40000

The sem value is 1, and here the code wants to write 0 to it.

But the problem might be that the IRAM is not compatible with the s32c1i instruction.
@tmedicci you may need to check this with the IC designer.

If this is the case, one solution might be:

  • in esp32_iramheap_initialize, do not call mm_initialize directly, since mm_initialize places the memory manager struct in the heap's own header, i.e. a sem will end up in IRAM;
  • instead, allocate the memory manager struct from another data heap, such as Umem, and then use that struct to manage the IRAM heap.
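
The register values shown above (compare value 0x40001, new value 0x40000) are consistent with a 16-bit compare-exchange being carried out as a 32-bit compare-exchange on the containing word. A minimal sketch of that technique follows; it assumes the 16-bit field sits at a known bit offset inside the word, which is an assumption for illustration and not something confirmed in this thread, and `demo_cas16_in_word` is a hypothetical name.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Compare-exchange a 16-bit lane inside a 32-bit word using only 32-bit
 * atomic operations. 'shift' selects the lane (0 = low half, 16 = high).
 */

static bool demo_cas16_in_word(_Atomic uint32_t *word, unsigned shift,
                               uint16_t expected, uint16_t desired)
{
  uint32_t mask;
  uint32_t old;
  uint32_t updated;

  assert(word != NULL && (shift == 0 || shift == 16));
  mask = (uint32_t)0xffff << shift;
  old  = atomic_load_explicit(word, memory_order_relaxed);

  for (; ; )
    {
      /* Give up if the 16-bit lane does not hold the expected value. */

      if ((uint16_t)((old & mask) >> shift) != expected)
        {
          return false;
        }

      /* Keep the other lane as it is and replace only our lane. */

      updated = (old & ~mask) | ((uint32_t)desired << shift);
      if (atomic_compare_exchange_weak_explicit(word, &old, updated,
                                                memory_order_acq_rel,
                                                memory_order_relaxed))
        {
          return true;
        }

      /* 'old' now holds the current word; retry with fresh contents. */
    }
}
```

With a word of 0x00040001 and a request to change the low lane from 1 to 0, the full-word exchange is from 0x00040001 to 0x00040000, matching the values in a14 and a8.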

chirping78 commented

@tmedicci I found this statement in the "Xtensa® LX7 Microprocessor Data Book":

S32C1I instructions may target cached, cache-bypass, and data RAM memory locations. 
S32C1I instructions are not permitted to access memory addresses in data ROM,
instruction memory or the address region allocated to the XLMI port. Attempts to direct
the S32C1I at these addresses will cause an exception.

yamt commented Nov 15, 2024

Depending on how the instruction is actually used, it might or might not be easy to
implement enough emulation in the trap handler.
It might be difficult if the same memory can also be modified with a non-trapping instruction (s32i).
Maybe someone needs to do a feasibility study by reading the disassembly of the relevant code.

masayuki2009 added a commit to masayuki2009/incubator-nuttx that referenced this pull request Nov 19, 2024
Summary:
- In apache#14465,
  atomic_compare_exchange_weak_explicit() was newly introduced
  in semaphore. However, cxd56xx has an issue with the API
  if SMP is enabled (see up_testset2 in cxd56_testset.c).
- This commit fixes the issue by using LIBC_ARCH_ATOMIC.

Impact:
- Only cxd56xx SoCs in SMP mode.

Testing:
- Tested with spresense:smp, spresense:wifi_smp
- NOTE: If DEBUG_ASSERTIONS is enabled, an assertion is triggered.
  I think this might be a separate issue.

Signed-off-by: Masayuki Ishikawa <Masayuki.Ishikawa@jp.sony.com>
xiaoxiang781216 pushed a commit that referenced this pull request Nov 19, 2024
Summary:
- In #14465,
  atomic_compare_exchange_weak_explicit() was newly introduced
  in semaphore. However, cxd56xx has an issue with the API
  if SMP is enabled (see up_testset2 in cxd56_testset.c).
- This commit fixes the issue by using LIBC_ARCH_ATOMIC.

Impact:
- Only cxd56xx SoCs in SMP mode.

Testing:
- Tested with spresense:smp, spresense:wifi_smp
- NOTE: If DEBUG_ASSERTIONS is enabled, an assertion is triggered.
  I think this might be a separate issue.

Signed-off-by: Masayuki Ishikawa <Masayuki.Ishikawa@jp.sony.com>