Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Exception 9 in enableSTA() #7016

Closed
6 tasks done
dalbert2 opened this issue Jan 14, 2020 · 27 comments
Closed
6 tasks done

Exception 9 in enableSTA() #7016

dalbert2 opened this issue Jan 14, 2020 · 27 comments

Comments

@dalbert2
Copy link
Contributor

dalbert2 commented Jan 14, 2020

Basic Infos

  • This issue complies with the issue POLICY doc.
  • I have read the documentation at readthedocs and the issue is not addressed there.
  • I have tested that the issue is present in current master branch (aka latest git).
  • I have searched the issue tracker for a similar issue.
  • If there is a stack dump, I have decoded it.
  • I have filled out all fields below.

Platform

  • Hardware: [ESP-12]
  • Core Version: [Core:2_6_3 SDK:2.2.2-dev(38a443e)] (same issue with 2.6.2)
  • Development Env: [Platformio]
  • Operating System: [Windows]

Settings in IDE

  • Module: [Generic ESP8266 Module]
  • Flash Mode: [qio]
  • Flash Size: [4MB]
  • lwip Variant: [v2 Lower Memory]
  • Reset Method: [nodemcu]
  • Flash Frequency: [40Mhz]
  • CPU Frequency: [160MHz]
  • Upload Using: [SERIAL]
  • Upload Speed: [115200] (serial upload only)

Problem Description

On some fielded units, I am getting an exception 9 after WiFi disconnect. On a few units, it happens very frequently. On most units, the same code is not taking an exception at all and my application connects/disconnects every 15 minutes.

This is a long standing defect. The exception address in the current build (using core 2.6.3) is 0xx40207D00, but it looks like the same defect has been happening using 2.5.2 and earlier. The exception address moves around of course, but always decodes to somewhere in ESP8266WiFiGeneric.cpp, and usually to code related to WiFi disconnect/shutdown (which is what is happening when the exception takes place).

Does anyone know what excvaddr means? It has similar values across all of the exceptions in different Core versions: (0x02070231)

The stack trace is short and does not contain anything useful. Consider the following stack trace shown in the debug output below. It decodes as follows:

$ ./esp_addr2line 40207D00
C:\Users\david.platformio\packages\framework-arduinoespressif8266@2.20602.191128\libraries\ESP8266WiFi\src/ESP8266WiFiGeneric.cpp:727

Which is in ESP8266WiFiGenericClass::shutdown (uint32 sleepUs, WiFiState* state)
exception address is this line: state->crc = shutdownCRC(state);

The message "Disconnected from AP\r\n" (see code below) is the last thing printed before the exception (i.e. my disconnect callback is being called).

The rest of the exception stack trace is short and not useful (nothing decodes to useful text-space addresses).

MCVE Sketch

void ICACHE_RAM_ATTR onSTADisconnected(WiFiEventStationModeDisconnected event_info) {
	if (connect_time) {
		// save disconnect time and reason
		disconnect_time = rtc_get();
		disconnect_reason = event_info.reason;
		disconnect_ssid = event_info.ssid;
		connect_time = 0;
		if (config_getDebug() & config_debug_wifi) {
			Serial.printf("Disconnected from AP\r\n");
		}
	}
}

// Registration done elsewhere, before WiFi.begin() is called
WiFi.onStationModeDisconnected(onSTADisconnected);

// Code that shuts down WiFi:
		wifi_station_dhcpc_stop();
		WiFi.mode(WIFI_OFF);

any thoughts would be much appreciated!

Debug Messages


Disconnected from AP       <<<<<<<< output from my disconnect handler
--
Fatal exception 9(LoadStoreAlignmentCause):
epc1=0x40207d00, epc2=0x00000000, epc3=0x00000000, excvaddr=0x02070231, depc=0x00000000
 
Exception (9):
epc1=0x40207d00 epc2=0x00000000 epc3=0x00000000 excvaddr=0x02070231 depc=0x00000000
 
>>>stack>>>
 
ctx: sys
sp: 3ffffdc0 end: 3fffffb0 offset: 01a0
3fffff60:  401021c5 00000000 00000002 3ffefcd0
3fffff70:  3fffdc10 00000000 3fff3014 40261888
3fffff80:  40000f49 3fffdab0 3fffdab0 40000f49
3fffff90:  40000e19 40001878 00000002 3fffffb0
3fffffa0:  3fffff10 aa55aa55 000000c4 40105bd1
<<<stack<<<


@dalbert2
Copy link
Contributor Author

I've been able to capture better logs from the affected sites and the issue manifests when there is a lot of data accumulated on the device and it is dumped to the server over a fairly slow link (wifi hotspot).

The data are sent via individual HTTP GETs, each with a WiFiClient.connect(), WiFiClient.printf_P(), process response while WiFiClient.available() && WiFiClient.connected(), WiFiClient.flush(), WiFiClient.stop().

So it's possible that this is a resource exhaustion issue related to LWIP use. I'll dig deeper into this, but if anyone has experience with a crash due to repeated connect/send/disconnect to a TCP server in rapid succession, please let me know. Thanks!

@alemori
Copy link

alemori commented Jan 27, 2020

I had the same problem days ago with something that was already working for me. I use the arduino IDE, so I reloaded the ESP8266 board for version 2.4.2, after deleting the arduino 15 directory.
Then it worked again.

@Tech-TX
Copy link
Contributor

Tech-TX commented Feb 4, 2020

If you'll install the exception decoder it will decode the exception and stack trace to give us an idea of what's happened.
https://github.com/me-no-dev/EspExceptionDecoder
It works on your current build, so you may need to recompile to get a usable trace. Copy everything from the fatal exception down and paste into the decoder.

@dalbert2
Copy link
Contributor Author

dalbert2 commented Feb 6, 2020

I had been decoding the exception stack trace manually and it was not giving useful code-space addresses. I've made a few changes and the exception address now decodes to line 238 inside ESP8266WiFiGenericClass::_eventCallback(void *arg) which is:
if (handler->canExpire() && handler.unique())

From the debug output, it looks like that is happening after my disconnect handler has run (the last thing in the disconnect handler is a Serial.printf("Disconnected from AP") and that appears on the logged serial output.

I will try building with DEBUG_WIFI defined to see if that helps narrow the event that is causing the exception. It is an intermittent (but frequent) event, i.e. not every disconnect causes this and it appears to be timing specific (it only happens with hotspots), so it looks like a race condition associated with disconnecting from a WiFi access point.

@TD-er
Copy link
Contributor

TD-er commented Feb 6, 2020

Hmm that looks like a function that may deserve some extra attention:

struct WiFiEventHandlerOpaque;
typedef std::shared_ptr<WiFiEventHandlerOpaque> WiFiEventHandler;

void ESP8266WiFiGenericClass::_eventCallback(void* arg)
{
System_Event_t* event = reinterpret_cast<System_Event_t*>(arg);
DEBUG_WIFI("wifi evt: %d\n", event->event);
if(event->event == EVENT_STAMODE_DISCONNECTED) {
DEBUG_WIFI("STA disconnect: %d\n", event->event_info.disconnected.reason);
WiFiClient::stopAll();
}
for(auto it = std::begin(sCbEventList); it != std::end(sCbEventList); ) {
WiFiEventHandler &handler = *it;
if (handler->canExpire() && handler.unique()) {
it = sCbEventList.erase(it);
}
else {
(*handler)(event);
++it;
}
}
}

It is not increasing the counter to the shared pointer (right?) by just creating a reference.
There is no check to see if the object is a null pointer.
Also it does feel a bit creepy to first dereference the pointer and then check whether it is unique.
By swapping both statements in the if, the crash would be harder to reproduce I think ;)
Maybe also check whether the pointer it is valid before continuing creating a reference.

@dalbert2
Copy link
Contributor Author

I'm not familiar with the WiFi event handling system (the control block event list handling, significance of handler.unique(), etc.); I would be delighted to try anything you suggest. I can easily reproduce the problem (albeit at a remote site so all I get back is a report of the crash and the exception address).

Can you provide suggestions for a modified _eventCallback method? I'll try putting the unique() test first to short circuit the de-referenced canExpire() call; is there anything else I can/should be testing for? Thanks!

@dalbert2
Copy link
Contributor Author

FYI, swapping the conditions in the statement: if (handler->canExpire() && handler.unique()) does not appear to help.

@TD-er
Copy link
Contributor

TD-er commented Feb 12, 2020

Can you change this line:

     for(auto it = std::begin(sCbEventList); it != std::end(sCbEventList); ) { 

into this:

     for(auto it = std::begin(sCbEventList); it != nullptr && it != std::end(sCbEventList); ) { 

If this is "fixing" it, it may mean there is a bug which may delete events that have not yet been handled.
Also I am not sure if it is possible a call to this callback can be interrupted by another callback, or maybe some other piece of code using these events may be interrupted by this callback. (or this list is cleaned up in multiple places?)

Maybe @earlephilhower does know more about this part of the code?

@devyte
Copy link
Collaborator

devyte commented Feb 12, 2020

@dalbert2 What is rtc_get()? what is config_getDebug()? why does your issue not include a MCVE?
The wifi event callbacks execute in SYS and therefore the code is subject to the SYS restrictions, i. e. can't be blocking or have delay/yield etc.

@dalbert2
Copy link
Contributor Author

dalbert2 commented Feb 12, 2020

@TD-er Thank you for the suggestion; unfortunately, I think the original code is correct (at least in that regard) and you can't test an interator against nullptr.

I share your concern about race conditions. I reviewed the ESP8266WiFiGeneric class and the events are only deleted in one place (in the _eventCallback() method); the other manipulation is all just push_back() pushing the event back. However, the _eventCallback() code is not re-entrant so if it an event takes place during iteration through the list and _eventCallback() is called again, bad things can happen (and this seems like the most likely problem to me at the moment.

As I understand it, the default constructor registers _eventCallback() through wifi_set_event_handler_cb(). I don't know whether the callback is called in interrupt time.

Then when I call onStationModeDisconnected() it creates a WiFiEventHandler for WIFI_EVENT_STAMODE_DISCONNECTED and adds it to the sCbEventList. Because I only do this once, that shared object will have a count of 1 and unique() should return true. The constructor for WiFiEventHandlerOpaque (defined as a struct not a class to make member data public?) sets mCanExpire true so canExpire() should also return true. See: WiFiEventHandlerOpaque in ESP8266WiFiGeneric.cpp line 86:
bool mCanExpire = true; /* stopgap solution to handle deprecated void onEvent(cb, evt) case */
Note that onEvent() (referred to in the above comment) explicitly sets the mCanExpire method false. I haven't dug into this yet, but obviously I don't yet understand the meaning/flow of mCanExpire and some comments in this part of the code would be very helpful.

When the OS/Stack calls _eventCallback() and the event type is EVENT_STAMODE_DISCONNECTED, it calls WiFiClient::stopAll() (which is where one potential race condition may exist: WiFiClient.stopAll() may generate more WiFi events). Then it iterates through the sCbEventList and if the handler it gets returns unique() and canExpire() removes it from the list without calling the handler, otherwise it calls the handler and advances the iterator.

A potential problem is that _eventCallback() is not reentrant, so if it is called while iterating through the list of event callbacks, there will be issues. Do you know whether interrupts are blocked during this processing? If not, it would be my first guess that this is the issue.

@devyte The exception is taking place after my callback has returned. The callback is decorated with ICACHE_RAM_ATTR and all of the calls except Serial.printf() simply return a static variable's value, so nothing blocking.

unsigned long ICACHE_RAM_ATTR rtc_get(void) {
   return epoch;
}
uint8_t ICACHE_RAM_ATTR config_getDebug() {
	return configData.debug;
}

I am slightly concerned about the use of Serial.printf(), but the exception takes place whether the debug output is enabled or not. Moreover, I use Serial.printf() it in other callbacks without trouble and even in this callback, the problem only happens at certain sites (and a huge number don't have the problem). In short, although I first suspected my callback handler, I now think the problem is in ESP8266WiFiGeneric.

The problem site are running on hotspots and I believe the issue is specific to the events that are being received (or perhaps not received) when disconnecting from the hotspot. Unfortunately, those sites are remote so it's difficult to get good data. I am going to go and try to capture data with DEBUG_WIFI configured to get more about the incoming events, but the crash is always on the specific line mentioned in ESP8266WiFiGeneric.cpp in _eventCallback(). At the moment, I suspect a race condition that may be causing _eventcallback() to be re-entered; if that happens, the iterator list can become corrupted and would result in the exception I'm seeing.

If you are familiar with it, can you fill me in on the intended use of the unique and canExpire predicates (not what they mean but why how they are being used)? In general, if anyone is familiar with this code, it would be awesome if they'd add some comments. In particular, do we know whether _eventCallback() can be called re-entrantly (it shouldn't be or it can cause the problem I'm seeing).

@devyte
Copy link
Collaborator

devyte commented Feb 12, 2020

@dalbert2 callbacks that execute in SYS, such as the wifi events or Ticker, don't need the ICACHE_RAM_ATTR. That's just wasting IRAM space. Only ISRs need the decorator.
The handler is of type WiFiEventHandler, which is a std::shared_ptr of a WiFiEventHandlerOpaque. A std::shared_ptr is a standard STL type smart pointer, you can find the docs online. The ::unique() method call means that it is checking that there is only one copy of the handler. I'll explain why below.
The handler->canExpire() call is accessing the underlying method from WiFiEventHandlerOpaque, which is just a getter for mCanExpire. The reason for mCanExpire is to handle backwards compat with the old event API onEvent(blah), which you're not supposed to be using since a long while, and if you are I will yell at you.

The way the wifi handler events work is as follows: there is a ´std::list´ that contains WiFiEventHandler objects, i.e.: shared_ptr objects. When you set a handler, an object is inserted into the list, and you get a copy of that object back from the onBlah() method. As long as that copy, or any copy of it, exists, the handler is active (it is not unique()). If you destroy/clear all of your copies of that object, then only the shared_ptr inside the list exists => unique().
So again: what the code does is: if it's unique, destroy the handler, else call the handler.
Calling the handler means: if the current event is the same as this handler's event, then call the user callback.
Why do all this? Because it's essentially a callback handler registry that allows a user to set multiple callbacks for each event, without interfering with each other. E.g.: MDNS makes use of it internally, and your setting of a callback for the same event doesn't step on that, but rather works along side it.

Now, the entirety of the above implementation is using STL objects, and zero manual pointer handling or allocs etc. That means that the chance of it misbehaving due to some mishandling of pointers or whatever is very low. The handler var is a reference, so if it's null, or if calling unique() crashes, then the most likely scenario is that you have corruption before even reaching that point.

Assuming there is nothing wrong with your code, which is a big assumption, my best guess here is:
ETS_UART_INTR_DISABLE() ETS_UART_INTR_ENABLE() inside WiFiGeneric::mode.
I think they suffer from the same nesting problem as ETS_INTR_LOCK/UNLOCK did, and we don't know what happens inside wifi_set_opmode().

@dalbert2
Copy link
Contributor Author

dalbert2 commented Feb 12, 2020

@devyte thank you for the thorough explanation! I hadn't noticed in my own code that I'd saved the pointer returned by onStationModeDisconnected() and even had a comment there to explain why. So fortunately/unfortunately, that's not the problem. Obviously the handler wasn't being removed from the list because it is getting called...I just didn't understand the dynamics of, in particular, the canExpire() method, but I don't think it's relevant.

I suspect we were typing our comments simultaneously. I don't know if you had a chance to look at mine yet, but do you know whether _eventCallback() can be re-entered? The fact that you say it doesn't require ICACHE_RAM_ATTR decoration suggests it cannot be re-entered due to being called in interrupt context, but it if it can be re-entered, that seems like the most probable cause. As I mentioned, this exception takes place even when Serial.printf() is not called and the rest of the code in my callback function is exceedingly innocuous, so although I would normally suspect my callback, it seems hard to see how that could be causing it. Also, this is clearly a race condition because it never happens on most nodes and even on the few nodes where it happens, it is intermittent.

Do you know whether the callback registered with wifi_set_event_handler_cb() can be called from either multiple threads or interrupt context?

@dalbert2
Copy link
Contributor Author

For now I've simply removed the callbacks and registration and the problem has gone away. I've implemented what was needed in other ways. I'll circle back some time to see if I can narrow it down further, but the problem was too damaging to the customers experiencing it to allow it to continue.

A potentially related issue however: on the same sites, a soft watchdog has been tripping when HTTP GET requests take more than 10s. Specifically I make a call to WiFiClient::printf() containing an HTTP GET and the call takes more than 10s to return. I have a 1s ticker that triggers a hard watchdog if 10s elapse (and they do) before the printf returns.

I suspect that either the connection drops while the printf is in progress or TCP backs up and blocks the write. Either way, the hardware watchdog is getting stroked but the system is inactive which is a bad thing (hence the soft watchdog implementation). The sites experiencing this problem are the same ones that had been experiencing the exception mentioned in this thread so there may be a clue there. I'll try to narrow down the specific issue (e.g. disconnected during WiFiClient::printf()).

In the meanwhile, does anyone know how to cause WiFiClient::printf() to timeout and return rather than lock up indefinitely?

@d-a-v
Copy link
Collaborator

d-a-v commented Feb 19, 2020

@dalbert2 I didn't read between every line, did you try:

  • to use scheduled functions in the callbacks (like suggested in another issue) ?
  • to use tcp keepalive to prevent from waiting indefinitely for a dying connection ?

@dalbert2
Copy link
Contributor Author

@d-a-v

  • The callback functions were only reading or setting variables. For now I've removed the callbacks entirely, but I don't think there was anything remotely problematic in them. They work perfectly in the vast majority of deployments. The problem only manifests at a handful of sites that have particularly bad internet connectivity which relates to the next bullet.
  • At these sites, the code that is frequently blocking (but stroking the hardware watchdog) for longer than 10s is here:
    wifi_client.keepAlive(5,5,5); wifi_client.printf_P(http_get, bufP, http_server, config_getHostname());
    Where wifi_client is an instance of WiFiClient. I believe the blocking here and the problem with the callbacks are related and both likely are caused by the poor internet connectivity: either a connection dropping in the middle of a TCP transmission or excessively long latency. Either way, the call to WiFiClient::printf_P() is not returning for more than 10s despite the call to keepAlive(5,5,5)...any ideas would be much appreciated. Unfortunately, this only happens at a remote site (hence the poor internet) so debugging is tough although I can remotely upgrade firmware post some data back to a database.

@TD-er
Copy link
Contributor

TD-er commented Feb 20, 2020

In don't understand the choice for 5,5,5 for the keepAlive.
Shouldn't the first one be longer than the 2nd parameter?
I thought the 2nd parameter is interval to use when the regular interval (1st parameter) is not giving an ack.

@dalbert2
Copy link
Contributor Author

dalbert2 commented Feb 20, 2020

@TD-er Thanks for the reply; I don't recall how those choices came about, but I would be grateful for suggestions for better parameters. I'll read up on the LWIP documentation for it. I think the answer is here: http://lwip.100.n7.nabble.com/TCP-KeepAlive-td10838.html

The function signature is:
void keepAlive (uint16_t idle_sec = TCP_DEFAULT_KEEPALIVE_IDLE_SEC, uint16_t intv_sec = TCP_DEFAULT_KEEPALIVE_INTERVAL_SEC, uint8_t count = TCP_DEFAULT_KEEPALIVE_COUNT);

There is very little documentation about what each of those parameters mean.
The implementation in ClientContext just converts the seconds to ms and stuffs the values into the tcp_pcb (protocol control block). tcb_pcb also has minimal commenting; it indicates that keep_idle is the idle time before KEEPALIVE is sent. LWIP_TCP_KEEPALIVE is defined so keep_intvl and keep_cnt are used, but they have no comments.

Any thoughts on how to do this better so the printf() will timeout and return if not completed within some specified duration?

@TD-er
Copy link
Contributor

TD-er commented Feb 20, 2020

Why not use 6, 1, 4 ?
See Wikipedia for more details on the parameters.

@dalbert2
Copy link
Contributor Author

dalbert2 commented Feb 20, 2020

@TD-er Thanks, I think that's exactly what I needed. The Wikipedia link is perfect.

I had just found the documentation for it (noted in the edited comment above) and am testing a fix now. I'm going to test with 4,1,4; I'll give it a try and report back.

Although it isn't critical for me now, I think the pressing issue is the original one: why the disconnect callback fails (takes an exception) when the TCP connection goes south. I'll circle back to that shortly.

@TD-er
Copy link
Contributor

TD-er commented Feb 20, 2020

Can you have a look at this comment I made a while ago?

@TD-er
Copy link
Contributor

TD-er commented Feb 21, 2020

I also added a keepAlive call to a WiFIclient I use (one for MQTT) and it indeed leads to crashes.

The test I used was mqtt.keepAlive(15, 2, 5); and after 15 seconds it crashed.
Commented out and it is still running.

So this makes me think as the default settings for keepAlive is 75 sec.
This could explain the really hard to reproduce crashes my users still mention.

Edit:
Looked a bit throughout the code and it seems that a lot of LWIP code is calling espconn_keepalive_enable, which sets the keep alive flag.
For example:

	if (espconn_keepalive_disabled(paccept))
		espconn_keepalive_enable(pcb);

So even if we don't set it ourselves, it may still be enabled in a lot of scenarios.

Not saying it is a bug in LWIP, but wherever this bug may be in the code, it may be triggered in a lot of use cases where an connection remains active for over 75 seconds.

@dalbert2
Copy link
Contributor Author

Thanks @TD-er I will look at that as well!

The change to the keepAlive() call helped and now I'm down to one (I hope) source of soft watchdog timeouts: WiFiClient::connect(hostname, port). This should always timeout within 5s if either the DNS lookup fails or if the subsequent connect by ip address fails, but it is taking more than 10s so I've got a bug somewhere.

Also, @devyte while looking into this, I read through the DNS lookup process in ESP8266WiFiGenericClass and it concerns me: DNS uses UDP and I'm not sure there is any guarantee about when a response will be received. So if a DNS query ESP8266WiFiGenericClass::hostByName() takes longer than timeout_ms, the query will timeout. If another query is issued (sets _dns_lookup_pending again) and then the response from the first DNS query is received, wifi_dns_found_callback() will be invoked and will write to the address provided in the first query (which is now some unknown location on the stack - and bad things can happen). Since the DNS query process is really single-threaded, I think the remote_addr variable in WiFiClient::connect() must be static. That could still be problematic (result from prior query returned), but at least it avoids writing to random spots on the stack. Am I mis-reading this?

@TD-er
Copy link
Contributor

TD-er commented Feb 27, 2020

The change to the keepAlive() call helped and now I'm down to one (I hope) source of soft watchdog timeouts: WiFiClient::connect(hostname, port). This should always timeout within 5s if either the DNS lookup fails or if the subsequent connect by ip address fails, but it is taking more than 10s so I've got a bug somewhere.

Then we both have the same bug I guess.
I've also noticed those to take longer than the set timeout in some circumstances.

@dalbert2
Copy link
Contributor Author

dalbert2 commented Mar 2, 2020

Although it doesn't address the original issue in this post, extending the soft watchdog (to 12s) prior to calls to connect() has solved my soft reset problem. I will circle back to the exception on onSTAdisconnect as time permits; for now, I've simply removed it. Given what I saw in the DNS code, there may be some race conditions in the core codebase that need to be cleaned up; this seems likely given the nature of the failures I was seeing (only on networks with exceptionally poor connectivity/long latency).

@devyte
Copy link
Collaborator

devyte commented May 16, 2020

It's been two and half months since the last update, and after all the above discussions I still don't see something actionable to pursue, or a defined way of reproducing a problem. There isn't even a MCVE.
I'm closing this.

@devyte devyte closed this as completed May 16, 2020
@marcosfnsc
Copy link

I'm having the same problem on the esp8266, my project consists of controlling an LED strip via wifi and sending a message via websocket to a javascript client depending on the status of the LEDs.

According to GDB, the crash code on line 253 of the ESP8266WIFIGeneric.cpp file

@mcspr
Copy link
Collaborator

mcspr commented Jun 3, 2024

@marcosfnsc pls open a new issue with the gdb output. exception number is sdk code for 'something has gone wrong, heres where to look for possible issues'. can be for any number of things, but definitely not related to enableSTA discussed here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

8 participants