Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AWS provisioned ESP32 randomly crashes (core dump) #447

Open
JamesHepworth opened this issue Aug 27, 2018 · 9 comments
Open

AWS provisioned ESP32 randomly crashes (core dump) #447

JamesHepworth opened this issue Aug 27, 2018 · 9 comments

Comments

@JamesHepworth
Copy link

JamesHepworth commented Aug 27, 2018

Good afternoon,

We have AWS provisioned devices running firmware which seem to show two related crashing behaviours:

  1. On provisioning of a device using the method described here, working firmware crashes on first reboot of the device following provisioning. This behaviour is not constant: during some periods a device may be provisioned successfully, whilst at other times crashing behaviour is evident.

  2. During operation a device may crash and cause a core dump. This seems to occur at completely random intervals - a device performed as expected for four days before crashing without change of system state or firmware.

Both types of crashing behaviour produce the following console output:

[Aug 27 12:50:25.398] assertion "p->tot_len == p->len + q->tot_len" failed: file "/opt/Espressif/esp-idf/components/lwip/core/pbuf.c", line 864, function: pbuf_dechain
[Aug 27 12:50:25.444] abort() was called at PC 0x400d20ff on core 0
[Aug 27 12:50:25.444]
[Aug 27 12:50:25.444] Backtrace: 0x40092a03 0x40092b03 0x400d20ff 0x4019fdea 0x400e7bc8 0x400ec859 0x400ef6a6 0x400e92a2 0x400e75b3 0x40082eb2 0x40082da2
[Aug 27 12:50:25.444] --- BEGIN CORE DUMP ---
[Aug 27 12:50:25.444] mos: catching core dump
[Aug 27 12:50:28.288] ...................
[Aug 27 12:51:19.958] ---- END CORE DUMP ----
[Aug 27 12:51:19.966] mos: wrote to C:\xx\xx\core--ESP32-20180827-125119.601032427 (628283 bytes)
[Aug 27 12:51:19.978] mos: analyzing core dump
Using ELF file at: C:\xx\xx\build\objs\fw.elf
Using Docker image: docker.io/mgos/esp32-build:3.0-r11
Using Mongoose OS souces at: C:\xx\xx\deps\mongoose-os
Running docker run --rm -v C:\xx\xx\build\objs\fw.elf:/fw.elf -v C:\xx\xx\core--ESP32-20180827-125119.601032427:/core -v C:\xx\xx\deps\mongoose-os:/mongoose-os -v C:\xx\xx:/c/xx/xx docker.io/mgos/esp32-build:3.0-r11 bash -c /usr/local/bin/serve_core.py --rom=/opt/Espressif/rom/rom.bin --rom_addr=0x40000000 --xtensa_addr_fixup=true /fw.elf /core & $MGOS_TARGET_GDB /fw.elf -ex 'target remote 127.0.0.1:1234' -ex 'set confirm off' -ex bt -ex quit
GNU gdb (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-build_pc-linux-gnu --target=xtensa-esp32-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /fw.elf...done.
Found core at 23 - 628310
Mapping IRAM: 131072 @ 0x40080000
Mapping DRAM: 335872 @ 0x3ffae000
Mapping /opt/Espressif/rom/rom.bin at 0x40000000
Mapping /fw.elf .iram0.vectors: 1024 @ 0x40080000
Mapping /fw.elf .iram0.text: 86040 @ 0x40080400
Mapping /fw.elf .dram0.data: 13896 @ 0x3ffc0000
Mapping /fw.elf .dram0.bss: 90792 @ 0x3ffc3648
Mapping /fw.elf .flash.rodata: 262860 @ 0x3f400020
Mapping /fw.elf .flash.text: 1068170 @ 0x400d0018
Loaded core dump from last snippet in /core
Remote debugging using 127.0.0.1:1234
0x40092a03 in invoke_abort ()
at /opt/Espressif/esp-idf/components/esp32/panic.c:140
140 *((int *) 0) = 0;
#0 0x40092a03 in invoke_abort ()
at /opt/Espressif/esp-idf/components/esp32/panic.c:140
#1 0x40092b06 in abort ()
at /opt/Espressif/esp-idf/components/esp32/panic.c:149
#2 0x400d2102 in __assert_func (
file=0x3f4362ac "/opt/Espressif/esp-idf/components/lwip/core/pbuf.c",
line=864, func=,
failedexpr=0x3f43646f "p->tot_len == p->len + q->tot_len")
at ../../../.././newlib/libc/stdlib/assert.c:63
#3 0x4019fded in pbuf_dechain (p=)
at /opt/Espressif/esp-idf/components/lwip/core/pbuf.c:881
#4 0x400e7bcb in mg_lwip_if_udp_recv (nc=, buf=0x3ffbceac,
len=1460, sa=0x3ffe3500, sa_len=0x3ffe3510)
at common/platforms/lwip/mg_lwip_net_if.c:309
#5 0x400ec85c in mg_recv_udp (len=,
buf=0x3ffbceac "\240\340\373?", nc=0x3ffb6ee0)
at mongoose/src/mg_net.c:639
#6 mg_do_recv (nc=0x3ffb6ee0) at mongoose/src/mg_net.c:582
#7 0x400ef6a9 in mg_if_can_recv_cb (nc=0x3ffb6ee0)
at mongoose/src/mg_net.c:590
#8 mg_lwip_if_poll (iface=, timeout_ms=)
at common/platforms/lwip/mg_lwip_ev_mgr.c:152
#9 0x400e92a5 in mg_mgr_poll (m=0x3ffd0bc4 <s_mgr>, timeout_ms=0)
at mongoose/src/mg_net.c:290
#10 0x400e75b6 in mongoose_poll (ms=0)
at /c/xxx/xxx/mos-libs/mongoose/src/mgos_mongoose.c:62
#11 0x40082eb5 in mgos_mg_poll_cb (arg=)
at /mongoose-os/fw/src/mgos_hal_freertos.c:106
#12 0x40082da5 in mgos_task (arg=)
at /mongoose-os/fw/src/mgos_hal_freertos.c:220
Detaching from program: /fw.elf, Remote target
Remote doesn't know how to detach
GDB closed the connection

On reboot immediately following this trace, the device sometimes boots correctly and resumes expected operation, whilst at other times, the same crashing behaviour occurs and produces the result above.

We are using v2.4.3 of mongoose-os, mos-libs, and the mos tool. Libraries are the closed source version.

Any assistance would be greatly appreciated.

Kind regards,
James

@cpq
Copy link
Member

cpq commented Aug 28, 2018

Thanks James.
Does it happen on a latest version ?

@JamesHepworth
Copy link
Author

JamesHepworth commented Aug 28, 2018

Hi Sergey,

Same result with latest version of mongoose-os, mos-libs and mos tool.

Note that rebuilding & flashing the firmware without the required certificates and config for AWS results in successful operation without crashing.

@cpq
Copy link
Member

cpq commented Aug 30, 2018

I see this in the log:

#5 0x400ec85c in mg_recv_udp (len=,
buf=0x3ffbceac "\240\340\373?", nc=0x3ffb6ee0)
at mongoose/src/mg_net.c:639
#6 mg_do_recv (nc=0x3ffb6ee0) at mon

mg_recv_udp is something unexpected. Could you show the result of mos config-get for the device that is crashing, please?

@JamesHepworth
Copy link
Author

Hi Sergey, result attached.

Note the device is currently operational.

mos_config-get.txt

@JamesHepworth
Copy link
Author

Hi Sergey,

I was wondering if perhaps you have made any further progress with this?

@JamesHepworth
Copy link
Author

JamesHepworth commented Sep 5, 2018

Update:

The device also shows the behaviour when not provisioned with AWS. Trace and config attached.

[Sep 5 10:59:23.104] mongoose_poll New heap free LWM: 86484
[Sep 5 10:59:23.143] assertion "p->tot_len == p->len + q->tot_len" failed: file "/opt/Espressif/esp-idf/components/lwip /core/pbuf.c", line 864, function: pbuf_dechain
[Sep 5 10:59:23.191] abort() was called at PC 0x400d21a3 on core 0
[Sep 5 10:59:23.191]
[Sep 5 10:59:23.191] Backtrace: 0x400928c7 0x400929c7 0x400d21a3 0x40197ada 0x400e7a54 0x400ec689 0x400ef52a 0x400e9092 0x400e743f 0x40082eb2 0x40082da2
[Sep 5 10:59:23.191] --- BEGIN CORE DUMP ---
[Sep 5 10:59:23.191] mos: catching core dump
[Sep 5 10:59:26.035] ...................
[Sep 5 11:00:17.711] ---- END CORE DUMP ----
[Sep 5 11:00:17.722] mos: wrote to C:\xx\xx\core--ESP32-20180905-110017. 450735467 (628323 bytes)
[Sep 5 11:00:17.726] mos: analyzing core dump
Using ELF file at: C:\xx\xx\build\objs\fw.elf
Using Docker image: docker.io/mgos/esp32-build:3.0-r11
Running docker run --rm -v C:\xx\xx\build\objs\fw.elf:/fw.elf -v C:\xx\xx\core--ESP32-20180905-110017.450735467:/core -v C:\xx\xx:/c/xx/xx docker.io/mgos/esp32-build:3.0-r11 bash -c /usr/local/bin/serve_core.py --rom=/opt/Espressif/rom/rom.bin --rom_addr=0x40000000 --xtensa_addr_fixup=true /fw.elf /core & $MGOS_TARGET_GDB /fw.elf -ex 'target remote 127.0.0.1:1234' -ex 'set confirm off' -ex bt -ex quit
GNU gdb (crosstool-NG crosstool-ng-1.22.0-80-g6c4433a) 7.10
Copyright (C) 2015 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later http://gnu.org/licenses/gpl.html
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law. Type "show copying"
and "show warranty" for details.
This GDB was configured as "--host=x86_64-build_pc-linux-gnu --target=xtensa-esp32-elf".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
http://www.gnu.org/software/gdb/bugs/.
Find the GDB manual and other documentation resources online at:
http://www.gnu.org/software/gdb/documentation/.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /fw.elf...done.
Found core at 23 - 628350
Mapping IRAM: 131072 @ 0x40080000
Mapping DRAM: 335872 @ 0x3ffae000
Mapping /opt/Espressif/rom/rom.bin at 0x40000000
Mapping /fw.elf .iram0.vectors: 1024 @ 0x40080000
Mapping /fw.elf .iram0.text: 85376 @ 0x40080400
Mapping /fw.elf .dram0.data: 13884 @ 0x3ffc0000
Mapping /fw.elf .dram0.bss: 57608 @ 0x3ffc3640
Mapping /fw.elf .flash.rodata: 247076 @ 0x3f400020
Mapping /fw.elf .flash.text: 1010034 @ 0x400d0018
Loaded core dump from last snippet in /core
Remote debugging using 127.0.0.1:1234
0x400928c7 in invoke_abort ()
at /opt/Espressif/esp-idf/components/esp32/panic.c:140
140 *((int *) 0) = 0;
#0 0x400928c7 in invoke_abort ()
at /opt/Espressif/esp-idf/components/esp32/panic.c:140
#1 0x400929ca in abort ()
at /opt/Espressif/esp-idf/components/esp32/panic.c:149
#2 0x400d21a6 in __assert_func (
file=0x3f433230 "/opt/Espressif/esp-idf/components/lwip/core/pbuf.c",
line=864, func=,
failedexpr=0x3f4333f3 "p->tot_len == p->len + q->tot_len")
at ../../../.././newlib/libc/stdlib/assert.c:63
#3 0x40197add in pbuf_dechain (p=)
at /opt/Espressif/esp-idf/components/lwip/core/pbuf.c:881
#4 0x400e7a57 in mg_lwip_if_udp_recv (nc=, buf=0x3ffbf100,
len=1460, sa=0x3ffdad40, sa_len=0x3ffdad50)
at common/platforms/lwip/mg_lwip_net_if.c:309
#5 0x400ec68c in mg_recv_udp (len=,
buf=0x3ffbf100 "(\373\373?", nc=0x3ffe35b8) at mongoose/src/mg_net.c:639
#6 mg_do_recv (nc=0x3ffe35b8) at mongoose/src/mg_net.c:582
#7 0x400ef52d in mg_if_can_recv_cb (nc=0x3ffe35b8)
at mongoose/src/mg_net.c:590
#8 mg_lwip_if_poll (iface=, timeout_ms=)
at common/platforms/lwip/mg_lwip_ev_mgr.c:152
#9 0x400e9095 in mg_mgr_poll (m=0x3ffc8a20 <s_mgr>, timeout_ms=0)
at mongoose/src/mg_net.c:290
#10 0x400e7442 in mongoose_poll (ms=0)
at /c/xxxx/mos-libs/mongoose/src/mgos_mongoose.c:62
#11 0x40082eb5 in mgos_mg_poll_cb (arg=)
at /mongoose-os/fw/src/mgos_hal_freertos.c:105
#12 0x40082da5 in mgos_task (arg=)
at /mongoose-os/fw/src/mgos_hal_freertos.c:219
Detaching from program: /fw.elf, Remote target
Remote doesn't know how to detach
GDB closed the connection

config-get_unprovisioned.txt

@JamesHepworth
Copy link
Author

It looks like the issue is being caused by the dns-sd library. If I remove the library from the project crashing behaviour seems to stop. In its previous form (with the dns-sd library included) the firmware can run for extended periods without a crash so I have not been able to verify this as the definitive cause yet.

I am in the process of setting up a long term test to investigate this and am trying to setup a test case to reproduce the crashing behaviour in a simple example project.

I will comment again with test code and test results once I have them.

@JamesHepworth
Copy link
Author

We discovered that a device on our network was flooding the device running the Mongoose OS firmware with DNS packets. This seemed to cause the device to eventually run out of memory and to crash.

Excerpt from log attached.

Does this seem to make sense? Does Mongoose OS provide any protection from devices that might be malfunctioning and flooding the network with packets?

dns_flood.txt

@cpq
Copy link
Member

cpq commented Sep 27, 2018

Looks interesting.
Well, Mongoose OS is using the LWIP stack - so much of the protection is on the lower layer. However, I do not anticipate the DNS flood to kill the device, since it does not create any long-lived sessions or objects like that. Once the packet gets handled, it is discarded. Said that, we'll take a closer look - thank so much for the report.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants