Skip to content

BLE+WiFi+heap_caps_malloc_extmem_enable crash. #6666

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
1 task done
raphael-bmec-co opened this issue Apr 29, 2022 · 12 comments
Closed
1 task done

BLE+WiFi+heap_caps_malloc_extmem_enable crash. #6666

raphael-bmec-co opened this issue Apr 29, 2022 · 12 comments
Assignees
Labels
Area: BT&Wifi BT & Wifi related issues Type: For reference Common questions & problems Type: Question Only question

Comments

@raphael-bmec-co
Copy link

Board

ESP32 Dev Module

Device Description

Chip is ESP32-D0WD-V3 (revision 3); Auto-detected Flash size: 16MB

Hardware Configuration

None

Version

other

IDE Name

CLion with PIO

Operating System

Windows 10

Flash frequency

80 MHz

PSRAM enabled

yes

Upload speed

921600

Description

Using BLE + WiFi + heap_caps_malloc_extmem_enable(0) causes a spontaneous crash when the AP is not found and also at other random times. The AP not found case is most easily replicated.

PIO version:

platform = https://github.com/tasmota/platform-espressif32/releases/download/v2.0.3rc1/platform-espressif32-2.0.3.zip
platform_packages = framework-arduinoespressif32 @ https://github.com/Jason2866/esp32-arduino-lib-builder/releases/download/826/framework-arduinoespressif32-v4.4_work-b98c8476f2.tar.gz

Sketch

cpp
#include <WiFi.h>
#include <BLEDevice.h>
#include <BLEUtils.h>
#include <BLEServer.h>

// Set to 1 to accelerate the crash. Set to zero to see a typical use case.
#define ACCELERATED_CRASH 1

// BLE defines.
#define SERVICE_UUID        "4fafc201-1fb5-459e-8fcc-c5c9c331914b"
#define CHARACTERISTIC_UUID "beb5483e-36e1-4688-b7f5-ea07361b26a8"

void setup() {
  // Comment this line and the crash does not happen.
  heap_caps_malloc_extmem_enable(0);

  // Serial.
  Serial.begin(115200);
  Serial.println("Starting BLE...");

  // Minimum BLE simplified from example.
  BLEDevice::init("Long name works now");
  BLEServer *pServer = BLEDevice::createServer();
  BLEService *pService = pServer->createService(SERVICE_UUID);
  BLECharacteristic *pCharacteristic = pService->createCharacteristic(
                                         CHARACTERISTIC_UUID,
                                         BLECharacteristic::PROPERTY_READ |
                                         BLECharacteristic::PROPERTY_WRITE
                                       );

  pCharacteristic->setValue("Hello World says Neil");
  pService->start();
  BLEAdvertising *pAdvertising = BLEDevice::getAdvertising();
  pAdvertising->addServiceUUID(SERVICE_UUID);
  BLEDevice::startAdvertising();
  Serial.println("Started BLE");

  // Minimum WiFi.
  Serial.println("Starting WiFi connection attempt...");
  WiFi.begin("missing network", "password");

  Serial.print("Available heap:");
  Serial.println(ESP.getFreeHeap());
}

void loop() {

  // This causes a crash to happen faster and may be useful for debugging.
  // However the crash will happen even without this block.
  if(ACCELERATED_CRASH){
    while(!WiFi.isConnected()){
      log_i("Attempting connection...");
      WiFi.begin("missing network", "password");
      delay(100);
    }
  }

  delay(1000);
  Serial.println("Loop");
}


### Debug Message

```plain
Guru Meditation Error: Core  0 panic'ed (Cache disabled but cached memory region accessed).

Core  0 register dump:
PC      : 0x4008490a  PS      : 0x00060c35  A0      : 0x800849dc  A1      : 0x3ffbea2c
A2      : 0x3f80d868  A3      : 0xbad00bad  A4      : 0x00060a23  A5      : 0x40092a6c
A6      : 0x003fffff  A7      : 0x3ffbc87c  A8      : 0x3ffbbd6c  A9      : 0x3ffbeb1c
A10     : 0x00000010  A11     : 0x3ffc0580  A12     : 0x00000001  A13     : 0x00000002
A14     : 0x3ffc4488  A15     : 0x3f800000  SAR     : 0x00000010  EXCCAUSE: 0x00000007
EXCVADDR: 0x00000000  LBEG    : 0x40090804  LEND    : 0x4009081a  LCOUNT  : 0x00000000


Backtrace:0x40084907:0x3ffbea2c |<-CORRUPTED
  #0  0x40084907:0x3ffbea2c in esp_timer_stop at /home/runner/work/esp32-arduino-lib-builder/esp32-arduino-lib-builder/esp-idf/components/esp_timer/src/esp_timer.c:192

Other Steps to Reproduce

Please note the comments in the code.

I have checked existing issues, online documentation and the Troubleshooting Guide

  • I confirm I have checked existing issues, online documentation and Troubleshooting guide.
@raphael-bmec-co raphael-bmec-co added the Status: Awaiting triage Issue is waiting for triage label Apr 29, 2022
@SuGlider SuGlider self-assigned this Apr 29, 2022
@SuGlider
Copy link
Collaborator

I've tested it with ESP32-WROVER-E Module.
Issue confirmed.

Investigation indicates that the issue is related to PSRAM being activated in the Arduino Core.
I tested it with Arduino IDE with a Partition Scheme = Huge APP (3MB No OTA/1MB SPIFFS).

Using an ESP32-WROVER-E board, when the board selected in Arduino IDE is ESP32 Wrover Module, the issue happens.
BUT when the selected board is a ESP32 Dev Module with PSRAM disabled, it works perfectly fine.
In other hand, with the same ESP32 Dev Module selected, when PSRAM is enabled, the issue occurs.

It seems initially related to Dynamic/Static memory allocation within WiFi and how Arduino has set it up.

Investigation will continue.

@SuGlider SuGlider added Status: In Progress ⚠️ Issue is in progress and removed Status: Awaiting triage Issue is waiting for triage labels Apr 29, 2022
@SuGlider
Copy link
Collaborator

SuGlider commented Apr 29, 2022

Testing the sketch with different Arduino Core versions:

  • 1.0.6, based on IDF 3, OK
  • 2.0.0-Alpha1 onward, based on IDF 4, FAILs

It seems that this issue has been there since the change from IDF 3.3 to IDF 4.4.
Given that it only occurs when PSRAM/Malloc is enabled. There is definitely something wrong with it.

Failure Messages:

Guru Meditation Error: Core  0 panic'ed (Cache disabled but cached memory region accessed). 

Core  0 register dump:
PC      : 0x40085519  PS      : 0x00060134  A0      : 0x80085660  A1      : 0x3ffbe8e0  
A2      : 0x3f813994  A3      : 0xbad00bad  A4      : 0x00060d23  A5      : 0x400860dc  
A6      : 0x3ffc4388  A7      : 0x3ffd41e4  A8      : 0x00000103  A9      : 0xbad00bad  
A10     : 0x00000010  A11     : 0x00000000  A12     : 0x0000186a  A13     : 0x0000f604  
A14     : 0xfffff000  A15     : 0x00000008  SAR     : 0x00000010  EXCCAUSE: 0x00000007  
EXCVADDR: 0x00000000  LBEG    : 0x400912f0  LEND    : 0x4009130c  LCOUNT  : 0x00000000  


Backtrace:0x40085516:0x3ffbe8e00x4008565d:0x3ffbe900 0x4008357d:0x3ffbe920 0x4008ddcb:0x3ffbe940 0x4008d45b:0x3ffbe960 0x4008d670:0x3ffbe980 0x4008d266:0x3ffbe9d0 0x400860e5:0x3ffbe9f0 0x40089b19:0x3ffbea10 0x40086aa6:0x3ffbea50 0x4008c07a:0x3ffbea70 0x4008cb93:0x3ffbea90 0x40085bf1:0x3ffbeab0 0x4009783a:0x3ffdb0a0 0x400975d1:0x3ffdb0c0 0x40099e5d:0x3ffdb100 0x400820a9:0x3ffdb180 0x400dc342:0x3ffdb1c0 0x400de2ae:0x3ffdb1f0 0x400dedd9:0x3ffdb210 0x400df19b:0x3ffdb230 0x400dde61:0x3ffdb2b0 0x400de24f:0x3ffdb320 0x400dc876:0x3ffdb340 0x400fe225:0x3ffdb360 0x400e3d72:0x3ffdb390 0x400e37d2:0x3ffdb3b0 0x400ec65d:0x3ffdb3d0 

Decoding stack:

Decoding stack results

PC: 0x40085519: esp_timer_stop at ../esp-idf/components/esp_timer/src/esp_timer.c line 306
EXCVADDR: 0x00000000


0x40085516: esp_timer_stop at ../esp-idf/components/esp_timer/src/esp_timer.c line 306
0x4008565d: ets_timer_disarm at ../esp-idf/components/esp_timer/src/ets_timer_legacy.c line 109
0x4008357d: timer_disarm_wrapper at ../esp-idf/components/esp_wifi/esp32/esp_adapter.c line 437
0x400860e5: coex_bt_request_wrapper at ../esp-idf/components/bt/controller/esp32/bt.c line 1038
0x4009783a: spi_flash_hal_poll_cmd_done at ../esp-idf/components/hal/spi_flash_hal_common.inc line 37
0x400975d1: spi_flash_hal_read at ../esp-idf/components/hal/spi_flash_hal_common.inc line 188
0x40099e5d: spi_flash_chip_generic_read at ../esp-idf/components/spi_flash/spi_flash_chip_generic.c line 206
0x400820a9: esp_flash_read at ../esp-idf/components/spi_flash/esp_flash_api.c line 688
0x400dc342: esp_partition_read at ../esp-idf/components/spi_flash/partition.c line 347
0x400de2ae: nvs::NVSPartition::read(unsigned int, void*, unsigned int) at ../esp-idf/components/nvs_flash/src/nvs_partition.cpp line 45
0x400dedd9: nvs::Page::readEntry(unsigned int, nvs::Item&) const at ../esp-idf/components/nvs_flash/src/nvs_page.hpp line 209
0x400df19b: nvs::Page::findItem(unsigned char, nvs::ItemType, char const*, unsigned int&, nvs::Item&, unsigned char, nvs::VerOffset) at ../esp-idf/components/nvs_flash/src/nvs_page.cpp line 841
0x400dde61: nvs::Storage::eraseItem(unsigned char, nvs::ItemType, char const*) at ../esp-idf/components/nvs_flash/src/nvs_storage.cpp line 160
0x400de24f: nvs::NVSHandleSimple::erase_item(char const*) at ../esp-idf/components/nvs_flash/src/nvs_storage.hpp line 98
0x400dc876: nvs_erase_key(nvs_handle_t, char const*) at ../esp-idf/components/nvs_flash/src/nvs_api.cpp line 335
0x400fe225: dhcp_ip_addr_erase at ../esp-idf/components/lwip/port/esp32/netif/dhcp_state.c line 74
0x400e3d72: esp_netif_dhcpc_stop_api at ../esp-idf/components/esp_netif/lwip/esp_netif_lwip.c line 992
0x400e37d2: esp_netif_api_cb at ../esp-idf/components/esp_netif/lwip/esp_netif_lwip.c line 107
0x400ec65d: tcpip_thread at ../esp-idf/components/lwip/lwip/src/api/tcpip.c line 208

@SuGlider
Copy link
Collaborator

SuGlider commented Apr 29, 2022

@raphael-bmec-co

Why do you need this?

  // Comment this line and the crash does not happen.
  heap_caps_malloc_extmem_enable(0);

Default value is 4096 and that is the smallest memory block size of PSRAM that can be allocated from PSRAM.
When the block of RAM is smaller than 4096, it will be allocated from HEAP.
Setting it to zero may be a bad idea given that many data structures in functions got from malloc() must be in RAM and not in PSRAM.

@SuGlider
Copy link
Collaborator

So far I see no issue... it sounds like bad usage.

@raphael-bmec-co
Copy link
Author

This does indeed seem to have stopped the crash. But I am seeing:

[ 1400][E][WiFiGeneric.cpp:601] wifiLowLevelInit(): esp_wifi_init 257
[ 1400][E][co_bmec_wifi.cpp:88] init(): could not start WiFi.

Please can you point me to where this default of 4096 can be found? Every example I can find - granted there aren't many - for using heap_caps_malloc_extmem_enable() is with a value of 0. Including the original issue where I first posted.

I will do more testing ASAP my side. The bug above is showing up in my main application I will need to see if I can replicate it in the test code.

@SuGlider
Copy link
Collaborator

Please can you point me to where this default of 4096 can be found? Every example I can find - granted there aren't many - for using heap_caps_malloc_extmem_enable() is with a value of 0. Including the original issue where I first posted.

Arduino sets it here:
https://github.com/espressif/arduino-esp32/blob/master/cores/esp32/esp32-hal-psram.c#L93

CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL is defined for each SoC by IDF.
It can be found at the sdkconfig:
https://github.com/espressif/arduino-esp32/blob/master/tools/sdk/esp32/sdkconfig#L517

Thus, its default value is 4096.

@raphael-bmec-co
Copy link
Author

Is the correct usage then to just set
CONFIG_SPIRAM_USE_MALLOC && !CONFIG_ARDUINO_ISR_IRAM in the build flags?

And to not call heap_caps_malloc_extmem_enable() at all?

@SuGlider
Copy link
Collaborator

SuGlider commented Apr 29, 2022

Is the correct usage then to just set CONFIG_SPIRAM_USE_MALLOC && !CONFIG_ARDUINO_ISR_IRAM in the build flags?

And to not call heap_caps_malloc_extmem_enable() at all?

I think so.... can't tell for sure for your application case.
But you can think that Arduino has a default setup that works fine with the framework.

I'm pretty sure that setting heap_caps_malloc_extmem_enable(0) is causing the crash for the example you posted and it is a very bad idea because it forces all malloc() to go to SPIRAM... which doesn't work for data structures used in ISRs.

@raphael-bmec-co
Copy link
Author

@SuGlider thank you for that insight. I'm still doing some more testing on my side.

Setting:

-DCONFIG_ARDUINO_ISR_IRAM=0
-DCONFIG_SPIRAM_USE_MALLOC=1

Or not setting those and calling:

heap_caps_malloc_extmem_enable(CONFIG_SPIRAM_MALLOC_ALWAYSINTERNAL);

Works fine on the test code I provided above but in my actual project I get:

[ 1400][E][WiFiGeneric.cpp:601] wifiLowLevelInit(): esp_wifi_init 257
[ 1400][E][co_bmec_wifi.cpp:88] init(): could not start WiFi.

However, everything seems to work if I reduce this down to say:

heap_caps_malloc_extmem_enable(64);

I am going to do some more stability testing on this but I would appreciate your thoughts on using 64 instead of 4096.

@SuGlider
Copy link
Collaborator

SuGlider commented Apr 30, 2022

I am going to do some more stability testing on this but I would appreciate your thoughts on using 64 instead of 4096.

64 is a lot better than Zero ;-)

I think that 64 bytes may cover a big part of malloc() done for data struct of data arrays used in critical functions of the Core, WiFi and BLE.
In order to make sure it is the "magical number", it would need an extensive analysis of the whole Arduino Core/Libs code which is not practical, but you can certainly give it a try...

The exception in the original code, apparently, has to do with part of memory that are not accessible from the Arduino Core - which indicates that critical data structures were in PSRAM instead of HEAP because of heap_caps_malloc_extmem_enable(0)

@SuGlider
Copy link
Collaborator

If this is a more professional application, I would go with the 4096, to play safe.
Is there any specific reason that motivates you to change this parameter?

@SuGlider SuGlider added Area: BT&Wifi BT & Wifi related issues Type: Question Only question and removed Status: In Progress ⚠️ Issue is in progress labels Apr 30, 2022
@raphael-bmec-co
Copy link
Author

In order to make sure it is the "magical number", it would need an extensive analysis of the whole Arduino Core/Libs code which is not practical, but you can certainly give it a try...
I figured as much.

It is a more professional application. But WiFi won't start with 4096. I think 4096 just puts too much stuff on the HEAP and there isn't enough place to start the WiFi up. It is looking very stable so far at 64. We will have to go with long term stability testing for now. I am going to close the issue at this point. Thanks for all your help.

@SuGlider SuGlider added the Type: For reference Common questions & problems label Apr 30, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Area: BT&Wifi BT & Wifi related issues Type: For reference Common questions & problems Type: Question Only question
Projects
None yet
Development

No branches or pull requests

2 participants