Description
Describe the bug
During development of an application that uses Zephyr, I observed a random crash that seems to be caused by a race condition in the eswifi driver used by the disco L475 iot1 board. I managed to reproduce the crash with a small sample. The sample runs two loops, each of which calls the sntp_simple function to create a socket, send and receive some data, and then close the socket. The crash logs:
[00:00:33.841,000] os: ***** USAGE FAULT *****
[00:00:33.841,000] os: Illegal use of the EPSR
[00:00:33.841,000] os: r0/a1: 0x200024b0 r1/a2: 0xe000ed00 r2/a3: 0x200024b0
[00:00:33.841,000] os: r3/a4: 0x00000000 r12/ip: 0x00000000 r14/lr: 0x080020ed
[00:00:33.841,000] os: xpsr: 0x60000000
[00:00:33.841,000] os: s[ 0]: 0x00000000 s[ 1]: 0x00000000 s[ 2]: 0x00000000 s[ 3]: 0x000000
[00:00:33.841,000] os: s[ 4]: 0x00000000 s[ 5]: 0x00000000 s[ 6]: 0x00000000 s[ 7]: 0x000000
[00:00:33.841,000] os: s[ 8]: 0x00000000 s[ 9]: 0x00000000 s[10]: 0x00000000 s[11]: 0x000000
[00:00:33.841,000] os: s[12]: 0x00000000 s[13]: 0x00000000 s[14]: 0x00000000 s[15]: 0x000000
[00:00:33.841,000] os: fpscr: 0x00000000
[00:00:33.841,000] os: Faulting instruction address (r15/pc): 0x00000000
[00:00:33.841,000] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:33.841,000] os: Current thread: 0x20001c68 (unknown)
[00:00:34.008,000] os: Halting system
It seems that closing a socket with eswifi_socket_close() (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_socket_offload.c#L401) from one thread clears the eswifi_off_socket structure (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_offload.h#L29), which holds function pointers used by the other thread. This can corrupt the k_work structure that is processed in the work_queue_main() function (https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/work.c#L633). The handler in this case can be set to NULL, which causes the crash at this line: https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/work.c#L668 (this is where the lr/r14 register points to).
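The suspected sequence can be illustrated with a small, deterministic sketch in plain C. It does not use Zephyr at all: fake_work, fake_socket and every function below are hypothetical stand-ins for k_work, eswifi_off_socket and the driver code, reduced to the fields needed to show the problem. The two "threads" are simulated by calling the functions in the problematic order.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins, NOT the real Zephyr or eswifi types. */
struct fake_work;
typedef void (*fake_work_handler_t)(struct fake_work *work);

struct fake_work {
    fake_work_handler_t handler;
    bool pending;              /* true while the item sits in the queue */
};

struct fake_socket {
    void *context;
    struct fake_work work;     /* work item embedded in the socket */
};

static int handled_count;

static void read_handler(struct fake_work *work)
{
    (void)work;
    handled_count++;
}

/* Thread A: queue the socket's work item. The queue keeps a pointer
 * into the socket structure. */
static struct fake_work *submit_work(struct fake_socket *sock)
{
    assert(sock != NULL);
    sock->work.handler = read_handler;
    sock->work.pending = true;
    return &sock->work;
}

/* Thread B: close the socket by clearing the whole structure,
 * embedded work item included. */
static void close_with_memset(struct fake_socket *sock)
{
    memset(sock, 0, sizeof(*sock));
}

/* Work queue thread: this sketch reports a NULL handler instead of
 * calling it. In the report, the call is made with handler == NULL,
 * which is the jump to address 0 seen in the fault dump. */
static bool run_queued(struct fake_work *queued)
{
    if (queued->handler == NULL) {
        return false;
    }
    queued->handler(queued);
    return true;
}
```

Calling submit_work(), then close_with_memset(), then run_queued() on the pointer returned by submit_work() makes run_queued() return false: the item is still referenced by the queue, but its handler has been zeroed.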
I gathered some logs from the debugger that show that this is indeed the case:
Breakpoint 2, work_queue_main (workq_ptr=0x20001c68 <eswifi0+32>, p2=,
p3=) at /home/mzajac/Repo/Internal/zephyr-root/zephyr/kernel/work.c:668
668 handler(work);
(gdb) p handler
$1 = (k_work_handler_t) 0x0
(gdb) p work
$2 = (struct k_work *) 0x200024b0 <eswifi0+2152>
(gdb) p *work
$3 = {node = {next = 0x0}, handler = 0x0, queue = 0x20001c68 <eswifi0+32>, flags = 1}
(gdb) bt
#0 work_queue_main (workq_ptr=0x20001c68 <eswifi0+32>, p2=, p3=)
at /home/mzajac/Repo/Internal/zephyr-root/zephyr/kernel/work.c:668
#1 0x08016c5e in z_thread_entry (entry=0x8002069 <work_queue_main>, p1=,
p2=, p3=)
at /home/mzajac/Repo/Internal/zephyr-root/zephyr/lib/os/thread_entry.c:36
#2 0xaaaaaaaa in ?? ()
The handler is set to NULL, but the work item 0x200024b0 was in the work queue. This also explains why the pc/r15 register is set to 0x0.
I also checked that removing the memset from https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_offload.c#L400 and https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_socket_offload.c#L430 prevents the crash, but then no more sockets can be created (probably socket->context still needs to be set to NULL in both cases).
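One possible direction, shown here only as an untested sketch in plain C, is to reset individual fields on close instead of clearing the whole structure, so that the embedded work item keeps its handler. The types and the names close_selective() and socket_is_free() are hypothetical, and the state field is an assumption added for illustration. Only the context field comes from the observation above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, NOT the real Zephyr or eswifi types. */
struct fake_work;
typedef void (*fake_work_handler_t)(struct fake_work *work);

struct fake_work {
    fake_work_handler_t handler;
};

struct fake_socket {
    void *context;             /* NULL marks the slot as free */
    int state;                 /* assumed per-socket state field */
    struct fake_work work;     /* may still be referenced by the queue */
};

static void noop_handler(struct fake_work *work)
{
    (void)work;
}

/* Reset only the fields that mark the socket as unused and leave the
 * embedded work item untouched. */
static void close_selective(struct fake_socket *sock)
{
    assert(sock != NULL);
    sock->context = NULL;
    sock->state = 0;
    /* sock->work is deliberately not cleared here. */
}

/* Assumed allocation check: a slot is reusable once context is NULL. */
static bool socket_is_free(const struct fake_socket *sock)
{
    return sock->context == NULL;
}
```

After close_selective(), the slot reports as free while the work item's handler is still valid. This sketch does not model synchronization: in the real driver, a work item that is still pending at close time would probably also need to be cancelled or waited for (Zephyr provides k_work_cancel_sync() for regular work items), otherwise a stale handler could run against a reused socket.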
Target platform: disco L475 iot1 board
To Reproduce
Using the attached project.
crash.zip
- Update the west configuration to point to the west.yml included in the project.
- Set the Kconfig variables CONFIG_WIFI_SSID and CONFIG_WIFI_PASS.
- Build the project for the disco L475 iot1 board (e.g. west build -b disco_l475_iot1).
- Flash the L475 board.
- Wait some time to see the error (from the tests I made, it seems that, depending on the Wi-Fi router, the crash can happen almost instantly or only after a short wait).
Impact
Randomly crashes the application during startup.
Environment (please complete the following information):
Probably irrelevant but:
- Linux
- Toolchain: Zephyr 3.1.0, SDK 0.14.2
And the important part:
- disco L475 iot1 board
Additional context
This seems to be the same problem that was reported in #24283, but the solution proposed there doesn't really fix the problem.