Random crash on the L475 due to work->handler set to NULL #49963

Closed
mzajac-avs opened this issue Sep 6, 2022 · 3 comments
Assignees
Labels
area: Wi-Fi Wi-Fi bug The issue is a bug, or the PR is fixing a bug platform: STM32 ST Micro STM32 priority: low Low impact/importance bug

Comments

@mzajac-avs
Contributor

mzajac-avs commented Sep 6, 2022

Describe the bug
During development of an application that uses Zephyr, I observed a random crash that appears to be caused by a race condition in the eswifi driver used by the disco L475 iot1 board. I managed to reproduce the crash with a small sample. The sample runs two loops, each of which calls the sntp_simple function to create a socket, send and receive some data, and then close it. The crash log:

[00:00:33.841,000] os: ***** USAGE FAULT *****
[00:00:33.841,000] os: Illegal use of the EPSR
[00:00:33.841,000] os: r0/a1: 0x200024b0 r1/a2: 0xe000ed00 r2/a3: 0x200024b0
[00:00:33.841,000] os: r3/a4: 0x00000000 r12/ip: 0x00000000 r14/lr: 0x080020ed
[00:00:33.841,000] os: xpsr: 0x60000000
[00:00:33.841,000] os: s[ 0]: 0x00000000 s[ 1]: 0x00000000 s[ 2]: 0x00000000 s[ 3]: 0x000000
[00:00:33.841,000] os: s[ 4]: 0x00000000 s[ 5]: 0x00000000 s[ 6]: 0x00000000 s[ 7]: 0x000000
[00:00:33.841,000] os: s[ 8]: 0x00000000 s[ 9]: 0x00000000 s[10]: 0x00000000 s[11]: 0x000000
[00:00:33.841,000] os: s[12]: 0x00000000 s[13]: 0x00000000 s[14]: 0x00000000 s[15]: 0x000000
[00:00:33.841,000] os: fpscr: 0x00000000
[00:00:33.841,000] os: Faulting instruction address (r15/pc): 0x00000000
[00:00:33.841,000] os: >>> ZEPHYR FATAL ERROR 0: CPU exception on CPU 0
[00:00:33.841,000] os: Current thread: 0x20001c68 (unknown)
[00:00:34.008,000] os: Halting system

It seems that closing the socket with eswifi_socket_close() (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_socket_offload.c#L401) from one thread clears the eswifi_off_socket structure (https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_offload.h#L29), which holds function pointers used by the other thread. This can corrupt the k_work structure that work_queue_main() (https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/work.c#L633) is processing. The handler can then be NULL, which causes the crash at this line: https://github.com/zephyrproject-rtos/zephyr/blob/main/kernel/work.c#L668 (this is where the lr/r14 register points).
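
To make the suspected sequence concrete, here is a minimal host-side sketch in plain C. It does not use Zephyr at all: `work_item`, `fake_socket`, `submit()` and `run_pending()` are hypothetical stand-ins for k_work, eswifi_off_socket and the work queue, and the two threads are replaced by a fixed ordering of calls. It only illustrates how a memset over a structure that embeds a queued work item leaves the queue holding an item with a NULL handler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-ins for k_work and eswifi_off_socket. */
struct work_item;
typedef void (*work_handler_t)(struct work_item *work);

struct work_item {
    struct work_item *next;   /* queue link */
    work_handler_t handler;   /* called later by the queue thread */
};

struct fake_socket {
    void *context;
    struct work_item read_work; /* work item embedded in the socket */
};

static struct work_item *pending; /* one-slot "work queue" */

static void submit(struct work_item *work)
{
    assert(work != NULL);
    pending = work;
}

static void read_handler(struct work_item *work)
{
    (void)work;
    puts("read handler ran");
}

/* Returns 1 if the handler ran, 0 if it was NULL, -1 if nothing was queued.
 * This model checks for NULL so that it can report the problem
 * instead of jumping to address 0.
 */
static int run_pending(void)
{
    struct work_item *work = pending;

    if (work == NULL) {
        return -1;
    }
    pending = NULL;
    if (work->handler == NULL) {
        return 0;
    }
    work->handler(work);
    return 1;
}

/* One thread submits work, another closes the socket, then the queue runs. */
static int simulate(int clear_before_run)
{
    struct fake_socket sock = { .context = &sock };

    sock.read_work.handler = read_handler;
    submit(&sock.read_work);            /* item is now queued */
    if (clear_before_run) {
        memset(&sock, 0, sizeof(sock)); /* close path wipes the whole struct */
    }
    return run_pending();
}
```

In this model, `simulate(1)` reports a NULL handler while `simulate(0)` runs the handler normally, which matches the `handler = 0x0` seen in the debugger output.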

I gathered some debugger output that shows this is indeed the case:

Breakpoint 2, work_queue_main (workq_ptr=0x20001c68 <eswifi0+32>, p2=<optimized out>,
p3=<optimized out>) at /home/mzajac/Repo/Internal/zephyr-root/zephyr/kernel/work.c:668
668 handler(work);
(gdb) p handler
$1 = (k_work_handler_t) 0x0
(gdb) p work
$2 = (struct k_work *) 0x200024b0 <eswifi0+2152>
(gdb) p *work
$3 = {node = {next = 0x0}, handler = 0x0, queue = 0x20001c68 <eswifi0+32>, flags = 1}
(gdb) bt
#0 work_queue_main (workq_ptr=0x20001c68 <eswifi0+32>, p2=<optimized out>, p3=<optimized out>)
at /home/mzajac/Repo/Internal/zephyr-root/zephyr/kernel/work.c:668
#1 0x08016c5e in z_thread_entry (entry=0x8002069 <work_queue_main>, p1=<optimized out>,
p2=<optimized out>, p3=<optimized out>)
at /home/mzajac/Repo/Internal/zephyr-root/zephyr/lib/os/thread_entry.c:36
#2 0xaaaaaaaa in ?? ()

The handler is set to NULL but the work item 0x200024b0 was in the work queue. This also explains why the pc/r15 register is set to 0x0.
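
The "Illegal use of the EPSR" message fits this picture. Cortex-M cores only execute Thumb code, so an interworking branch target must have bit 0 set; calling a NULL handler branches to an address with bit 0 clear, which clears the Thumb (T) bit, bit 24 of xPSR, and raises a usage fault. The small sketch below checks the values from the log and the debugger output. The helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Thumb state bit (T) in the EPSR part of xPSR on Cortex-M. */
#define XPSR_T_BIT (UINT32_C(1) << 24)

/* True if a stacked xPSR value has the Thumb bit set. */
static bool xpsr_thumb_bit_set(uint32_t xpsr)
{
    return (xpsr & XPSR_T_BIT) != 0;
}

/* True if an interworking branch to this target selects Thumb state. */
static bool target_selects_thumb(uint32_t target)
{
    return (target & UINT32_C(1)) != 0;
}
```

With the values above, `xpsr_thumb_bit_set(0x60000000)` is false and `target_selects_thumb(0x0)` is false, whereas a valid entry such as 0x8002069 (work_queue_main in the backtrace) has bit 0 set.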

I also checked that removing the memset from https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_offload.c#L400 and https://github.com/zephyrproject-rtos/zephyr/blob/main/drivers/wifi/eswifi/eswifi_socket_offload.c#L430 prevents the crash, but then no more sockets can be created (socket->context probably still needs to be set to NULL in both cases).
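
One way to picture the idea in the parenthesis is a release routine that resets only the fields marking the slot as free and leaves the embedded work item alone. The sketch below is plain C with hypothetical names (`fake_socket`, `fake_socket_release()`, the `state` field); it is not the driver code and not necessarily what the eventual fix does. It also does not address a handler that runs after the socket has been closed, which would still need its own check or cancellation.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for k_work and eswifi_off_socket. */
struct work_item;
typedef void (*work_handler_t)(struct work_item *work);

struct work_item {
    work_handler_t handler;
};

struct fake_socket {
    void *context;              /* non-NULL while the slot is in use */
    int state;
    struct work_item read_work; /* may still be referenced by a work queue */
};

static void noop_handler(struct work_item *work)
{
    (void)work;
}

/* Mark the slot as free without touching the embedded work item. */
static void fake_socket_release(struct fake_socket *sock)
{
    assert(sock != NULL);
    sock->context = NULL;
    sock->state = 0;
    /* read_work is deliberately left intact. */
}

/* Returns 1 if the slot is free and the handler pointer survived release. */
static int release_keeps_handler(void)
{
    struct fake_socket sock = { .context = &sock, .state = 3 };

    sock.read_work.handler = noop_handler;
    fake_socket_release(&sock);
    return sock.context == NULL && sock.read_work.handler == noop_handler;
}
```

Here the slot can be reused because `context` is NULL again, while a work item that is still queued keeps a valid handler pointer.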

Target platform: disco L475 iot1 board

To Reproduce
Using the attached project.
crash.zip

  1. Update the west configuration to point to the west.yml included in the project.
  2. Set the Kconfig variables CONFIG_WIFI_SSID and CONFIG_WIFI_PASS.
  3. Build the project for the disco L475 iot1 board (e.g. west build -b disco_l475_iot1).
  4. Flash the L475 board.
  5. Wait for the error to appear (from the tests I made, it seems that, depending on the Wi-Fi router, the crash can happen almost instantly or only after a short wait).

Impact
Randomly crashes the application during startup.

Environment (please complete the following information):
Probably irrelevant but:

  • Linux
  • Toolchain (e.g Zephyr: 3.1.0, SDK: 0.14.2)
    and the important part:
  • disco L475 iot1 board

Additional context
This seems to be the same problem that was reported in #24283, but the solution proposed there isn't really a fix for the problem.

@mzajac-avs mzajac-avs added the bug The issue is a bug, or the PR is fixing a bug label Sep 6, 2022
@erwango erwango assigned erwango and loicpoulain and unassigned erwango Sep 6, 2022
@erwango erwango added the area: Wi-Fi Wi-Fi label Sep 6, 2022
@erwango
Member

erwango commented Sep 6, 2022

@loicpoulain Would you be able to have a look ?

@mzajac-avs
Contributor Author

I added a proposed solution to fix the problem in this pull request: #50153

@henrikbrixandersen henrikbrixandersen added the platform: STM32 ST Micro STM32 label Nov 2, 2022
@erwango
Member

erwango commented Nov 8, 2022

Fixed in #50153

@erwango erwango closed this as completed Jan 5, 2023