Random crash on the L475 due to work->handler set to NULL #49963
Labels: area: Wi-Fi, bug, platform: STM32, priority: low
Describe the bug
While developing an application that uses Zephyr, I observe a random crash that appears to be caused by a race condition in the eswifi driver used by the disco L475 iot1 board. I managed to reproduce the crash with a small sample. The sample runs two loops, each of which calls the sntp_simple function to create a socket, send and receive some data, and then close it.
The logs of the crash:
[00:00:33.841,000] os: ***** USAGE FAULT *****
[00:00:33.841,000] os: Illegal use of the EPSR
[00:00:33.841,000] os: r0/a1: 0x200024b0 r1/a2: 0xe000ed00 r2/a3: 0x200024b0
[00:00:33.841,000] os: r3/a4: 0x00000000 r12/ip: 0x00000000 r14/lr: 0x080020ed
[00:00:33.841,000] os: xpsr: 0x60000000
[00:00:33.841,000] os: s[ 0]: 0x00000000 s[ 1]: 0x00000000 s[ 2]: 0x00000000 s[ 3]: 0x000000
[00:00:33.841,000] os: s[ 4]: 0x00000000 s[ 5]: 0x00000000 s[ 6]: 0x00000000 s[ 7]: 0x000000
[00:00:33.841,000] os: s[ 8]: 0x00000000 s[ 9]: 0x00000000 s[10]: 0x00000000 s[11]: 0x000000
[00:00:33.841,000] os: s[12]: 0x00000000 s[13]: 0x00000000 s[14]: 0x00000000 s[15]: 0x000000
[00:00:33.841,000] os: fpscr: 0x00000000
[00:00:33.841,000] os: Faulting instruction address (r15/pc): 0x00000000
[00:00:33.841,000] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:33.841,000] os: Current thread: 0x20001c68 (unknown)
[00:00:34.008,000] os: Halting system
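The register dump above is consistent with a call through a NULL function pointer. On Cortex-M, bit 0 of a branch target selects Thumb state, and bit 24 of xPSR (EPSR.T) records it; branching to 0x00000000 clears that bit, which is reported as "Illegal use of the EPSR". The following is only a minimal sketch of that check; the helper names are hypothetical and the constants are copied from the log.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit 24 of xPSR is the Thumb state bit (EPSR.T) on Cortex-M. */
#define XPSR_T_BIT (UINT32_C(1) << 24)

/* True when the stacked xPSR shows that the core left Thumb state. */
static bool thumb_bit_cleared(uint32_t xpsr)
{
    return (xpsr & XPSR_T_BIT) == 0;
}

/* True when a branch target carries the Thumb bit (bit 0). */
static bool has_thumb_target_bit(uint32_t target)
{
    return (target & UINT32_C(1)) != 0;
}

/* Checks the values from the fault dump (hypothetical helper). */
void check_fault_dump(void)
{
    assert(thumb_bit_cleared(UINT32_C(0x60000000)));     /* xpsr from the log */
    assert(!has_thumb_target_bit(UINT32_C(0x00000000))); /* pc from the log */
    assert(has_thumb_target_bit(UINT32_C(0x080020ed)));  /* lr is a valid Thumb address */
}
```

The lr value is a normal Thumb return address, while pc is 0 with the T bit cleared, which is what a branch to a zeroed function pointer would be expected to leave behind.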
It seems that closing the socket with eswifi_socket_close() (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_socket_offload.c#L401) from one thread clears the eswifi_off_socket structure (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_offload.h#L29) holding function pointers used by the other thread. This can corrupt the data in the k_work structure used in the work_queue_main() function (https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/work.c#L633). The handler in this case can be set to NULL, which causes the crash at this line: https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/work.c#L668 (this is where the lr/r14 register points to).
I gathered some logs from the debugger that show that this is indeed the case:
Breakpoint 2, work_queue_main (workq_ptr=0x20001c68 <eswifi0+32>, p2=,
p3=) at /home/mzajac/Repo/Internal/zephyr-root/zephyr/kernel/work.c:668
668 handler(work);
(gdb) p handler
$1 = (k_work_handler_t) 0x0
(gdb) p work
$2 = (struct k_work *) 0x200024b0 <eswifi0+2152>
(gdb) p *work
$3 = {node = {next = 0x0}, handler = 0x0, queue = 0x20001c68 <eswifi0+32>, flags = 1}
(gdb) bt
#0 work_queue_main (workq_ptr=0x20001c68 <eswifi0+32>, p2=, p3=)
at /home/mzajac/Repo/Internal/zephyr-root/zephyr/kernel/work.c:668
#1 0x08016c5e in z_thread_entry (entry=0x8002069 <work_queue_main>, p1=,
p2=, p3=)
at /home/mzajac/Repo/Internal/zephyr-root/zephyr/lib/os/thread_entry.c:36
#2 0xaaaaaaaa in ?? ()
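One plausible interleaving that leads to this state can be modelled in plain C without Zephyr. This is only a sketch: the types, field names, and the single-slot queue below are simplified, hypothetical stand-ins and do not mirror the real k_work or eswifi_off_socket layouts.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct work_item;
typedef void (*work_handler_t)(struct work_item *work);

/* Hypothetical stand-in for a work item embedded in a socket. */
struct work_item {
    work_handler_t handler;
};

/* Hypothetical stand-in for the per-socket driver structure. */
struct offload_socket {
    void *context;
    struct work_item read_work;
};

/* Single-slot queue: it keeps a pointer into the socket structure. */
struct work_queue {
    struct work_item *pending;
};

static int handler_calls;

static void read_handler(struct work_item *work)
{
    (void)work;
    handler_calls++;
}

static void submit(struct work_queue *q, struct work_item *w)
{
    q->pending = w;
}

/* Model of close(): zeroes the socket, including the queued work item. */
static void close_socket_with_memset(struct offload_socket *s)
{
    memset(s, 0, sizeof(*s));
}

/* Returns 1 if a handler ran, 0 if nothing was pending, and -1 if the
 * handler was NULL (the point where the real queue thread would fault). */
static int process_one(struct work_queue *q)
{
    struct work_item *w = q->pending;

    if (w == NULL) {
        return 0;
    }
    q->pending = NULL;
    if (w->handler == NULL) {
        return -1;
    }
    w->handler(w);
    return 1;
}

/* Walks through the interleaving (hypothetical helper). */
void race_model(void)
{
    struct offload_socket s = { .context = &s, .read_work = { read_handler } };
    struct work_queue q = { NULL };

    submit(&q, &s.read_work);       /* one thread queues work */
    close_socket_with_memset(&s);   /* the other thread closes the socket */
    assert(process_one(&q) == -1);  /* queue thread finds handler == NULL */
}
```

The queue still references the work item after the structure has been zeroed, so the queue thread reads a NULL handler from memory it legitimately owns a pointer to.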
The handler is set to NULL, but the work item 0x200024b0 was in the work queue. This also explains why the pc/r15 register is set to 0x0.
I also checked that removing the memset from https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_offload.c#L400 and https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_socket_offload.c#L430 prevents the crash from showing up, but it leads to a situation in which no more sockets can be created (probably socket->context still needs to be set to NULL in both cases).
Target platform: disco L475 iot1 board
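A possible direction, sketched here in plain C with hypothetical type and field names rather than the driver's real ones, is to release the socket by resetting only the fields that mark it as free and leaving the embedded work item untouched. This is an untested illustration of the idea, not a proposed patch.

```c
#include <assert.h>
#include <stddef.h>

struct work_item;
typedef void (*work_handler_t)(struct work_item *work);

/* Hypothetical stand-in for a work item embedded in a socket. */
struct work_item {
    work_handler_t handler;
};

enum socket_state {
    SOCKET_STATE_NONE = 0,
    SOCKET_STATE_CONNECTED
};

/* Hypothetical stand-in for the per-socket driver structure. */
struct offload_socket {
    void *context;
    enum socket_state state;
    struct work_item read_work;
};

static void read_handler(struct work_item *work)
{
    (void)work;
}

static void socket_init(struct offload_socket *s, void *context)
{
    s->context = context;
    s->state = SOCKET_STATE_CONNECTED;
    s->read_work.handler = read_handler;
}

/* Marks the socket as reusable without zeroing the work item, so a
 * work item that is still queued keeps a valid handler. */
static void socket_release(struct offload_socket *s)
{
    s->context = NULL;
    s->state = SOCKET_STATE_NONE;
    /* s->read_work is intentionally left as it is. */
}

static int socket_is_free(const struct offload_socket *s)
{
    return s->context == NULL;
}

/* Shows the intended effect (hypothetical helper). */
void release_example(void)
{
    struct offload_socket s;
    int dummy_context;

    socket_init(&s, &dummy_context);
    socket_release(&s);
    assert(socket_is_free(&s));                   /* slot can be reused */
    assert(s.read_work.handler == read_handler);  /* handler survives */
}
```

A handler that runs after the release would still need to notice that the socket is closed, so this only removes the NULL call. In the driver itself, cancelling the pending work before clearing the structure (for example with k_work_cancel_sync()) might be another option, but that has not been verified here.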
To Reproduce
Use the attached project, crash.zip, and build it for the board:
west build -b disco_l475_iot1
Impact
Randomly crashes the application during startup.
Environment (please complete the following information):
Probably irrelevant but:
and the important part:
Additional context
This seems to be the same problem that was reported in #24283, but the solution proposed there does not really fix the problem.