Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

No recovery after Timed out waiting for SPI char #102

Closed
evanthomas opened this issue Jul 2, 2020 · 31 comments
Closed

No recovery after Timed out waiting for SPI char #102

evanthomas opened this issue Jul 2, 2020 · 31 comments
Assignees
Labels
bug Something isn't working

Comments

@evanthomas
Copy link

I have an application that polls a couple of endpoints, with adafruit_requests, every few seconds and makes a nice display. Every few minutes adafruit_requests fails with: "Timed out waiting for SPI char" in adafruit_esp32spi.py line 271. Once this happens the stack becomes permanently borked with all subsequent calls failing with the same error. The only way to recover that I've found is to reboot.

This happens with 2 pyportals, 2 physically different networks and multiple USB cables.

Firmware: 1.2.2 and 1.6.1 (current stable)
Circuit python: 4.something and 5.3.0 (current stable)
Libraries: 4.something and 5.x (current stable)

The following code reproduces the problem about 90% of the time with an occasional success in the second GET and occasionally a different error.

import time
import board
import busio
from digitalio import DigitalInOut
from adafruit_esp32spi import adafruit_esp32spi
import adafruit_esp32spi.adafruit_esp32spi_socket as socket
import adafruit_requests as requests
from secrets import secrets

# Non existent endpoint
BAD_DATA_SOURCE = "http://1.2.3.4/hello"

# Good data source
# GOOD_DATA_SOURCE = "http://httpbin.org/get"
GOOD_DATA_SOURCE = 'http://54.236.246.173/get'

# Start WiFi
esp32_ready = DigitalInOut(board.ESP_BUSY)
esp32_gpio0 = DigitalInOut(board.ESP_GPIO0)
esp32_reset = DigitalInOut(board.ESP_RESET)
esp32_cs = DigitalInOut(board.ESP_CS)

spi = busio.SPI(board.SCK, board.MOSI, board.MISO)
esp = adafruit_esp32spi.ESP_SPIcontrol(spi, esp32_cs, esp32_ready, esp32_reset, esp32_gpio0)

requests.set_socket(socket, esp)

print("MAC: " + ':'.join('{:02x}'.format(x) for x in reversed(esp.MAC_address)))
print("Connecting to AP...")
while not esp.is_connected:
  try:
    esp.connect_AP(secrets['ssid'], secrets['password'])
  except RuntimeError as e:
    print("could not connect to AP, retrying: ",e)
    continue
print("Connected to", str(esp.ssid, 'utf-8'), "\tRSSI:", esp.rssi)
print("Firmware vers.", esp.firmware_version)
print("My IP address is", esp.pretty_ip(esp.ip_address))


try:
  r = requests.get(BAD_DATA_SOURCE)
except RuntimeError as e:
  # Timed out waiting for SPI char
  # A failure here is to be expected
  print(e)


# This GET always fails
r = requests.get(GOOD_DATA_SOURCE)
print(r.text)
@tannewt tannewt self-assigned this Jul 2, 2020
@tannewt
Copy link
Member

tannewt commented Jul 2, 2020

Thank you for the example! I'm hoping to look at the ESP32SPI stuff today. I'm reworking requests a bit and need to test them together.

@evaherrada evaherrada added the bug Something isn't working label Jul 10, 2020
@tannewt
Copy link
Member

tannewt commented Jul 17, 2020

Note: I'm reworking a lot of these APIs. I talk the changes over on my stream here: https://www.youtube.com/watch?v=jfRjlu3oCDA (Newer streams will show more progress.)

@kevinjwalters
Copy link

kevinjwalters commented Aug 11, 2020

FYI, there's a post which happens to mention timed out waiting for SPI char on a PyPortal Titano in the forums, @gmeader addition to: Adafruit Forums: wget didn't write a complete file.

@kevinjwalters
Copy link

And a question about how to handle/recover in an application: Adafruit Forums: How to gracefully handle "Timed out waiting for SPI char"?.

@flavio-fernandes
Copy link

I'm hitting the same issue while using the adafruit_minimqtt library:

30392.6: DEBUG - Sending PINGREQ
30392.6: DEBUG - Checking PINGRESP
..Sun, 3/Jan/2021 3:54 pm
...Sun, 3/Jan/2021 3:55 pm
.30452.8: DEBUG - KeepAlive period elapsed -                                    requesting a PINGRESP from the server...
30452.8: DEBUG - Sending PINGREQ
30452.9: DEBUG - Checking PINGRESP
..Sun, 3/Jan/2021 3:55 pm
...Sun, 3/Jan/2021 3:56 pm
.30513.1: DEBUG - KeepAlive period elapsed -                                    requesting a PINGRESP from the server...
30513.1: DEBUG - Sending PINGREQ
30513.1: DEBUG - Checking PINGRESP
..Sun, 3/Jan/2021 3:56 pm
..Traceback (most recent call last):
  File "code.py", line 258, in <module>
  File "adafruit_minimqtt/adafruit_minimqtt.py", line 706, in loop
  File "adafruit_minimqtt/adafruit_minimqtt.py", line 712, in _wait_for_msg
  File "adafruit_esp32spi/adafruit_esp32spi_socket.py", line 132, in recv
  File "adafruit_esp32spi/adafruit_esp32spi_socket.py", line 172, in available
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 725, in socket_available
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 324, in _send_command_get_response
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 243, in _send_command
  File "adafruit_esp32spi/adafruit_esp32spi.py", line 238, in _send_command
RuntimeError: ESP32 timed out on SPI select

Press any key to enter the REPL. Use CTRL-D to reload.

Attempting to to catch the exception and doing wifi.reset() followed by mqtt.reconnect() does
not work at all. As mentioned in #102 (comment)

Looking a the code:

times = time.monotonic()
while (time.monotonic() - times) < 1: # wait up to 1000ms
if self._ready.value: # ok ready to send!
break
else:
raise RuntimeError("ESP32 timed out on SPI select")

I wonder how it came up with 1 second being the max time we should wait for. The overall timeout value in mqtt is 30 seconds, so I would think that waiting more than a sec should not be a big deal.

@tannewt : Is there a particular log/info I should try to fetch in order to see how we can fix this issue? I also wonder if the
mqtt code should be made more resilient so it does not go off the rails as it does now once this happens; it may actually be a better way to approach this issue.

Thanks,

-- flaviof

@tannewt
Copy link
Member

tannewt commented Jan 5, 2021

@tannewt : Is there a particular log/info I should try to fetch in order to see how we can fix this issue? I also wonder if the
mqtt code should be made more resilient so it does not go off the rails as it does now once this happens; it may actually be a better way to approach this issue.

Not that I can think of. There is a lot of complexity here between the ESP32SPI driver and the nina firmware that I don't know how to simplify.

I think your idea to be more resilient is a good one. Networking can cause errors at any time and that's tricky to handle well.

@flavio-fernandes
Copy link

@tannewt : Is there a particular log/info I should try to fetch in order to see how we can fix this issue? I also wonder if the
mqtt code should be made more resilient so it does not go off the rails as it does now once this happens; it may actually be a better way to approach this issue.

Not that I can think of. There is a lot of complexity here between the ESP32SPI driver and the nina firmware that I don't know how to simplify.

I think your idea to be more resilient is a good one. Networking can cause errors at any time and that's tricky to handle well.

Ack. I opened an issue in PyPortal : adafruit/Adafruit_CircuitPython_PyPortal#98 . I hope to be able to solve this from that angle. I see you are also very active there, so any help will be much appreciated. :) Thanks!

flavio-fernandes added a commit to flavio-fernandes/Adafruit_CircuitPython_MiniMQTT that referenced this issue Jan 13, 2021
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets as part of _wait_for_msg(),
  together with all other MQTT messages;

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
flavio-fernandes added a commit to flavio-fernandes/Adafruit_CircuitPython_MiniMQTT that referenced this issue Jan 13, 2021
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets as part of _wait_for_msg(),
  together with all other MQTT messages;

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
@flavio-fernandes
Copy link

@kevinjwalters @tannewt ^^ ;)

flavio-fernandes added a commit to flavio-fernandes/Adafruit_CircuitPython_MiniMQTT that referenced this issue Jan 13, 2021
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets as part of _wait_for_msg(),
  together with all other MQTT messages;

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
flavio-fernandes added a commit to flavio-fernandes/Adafruit_CircuitPython_MiniMQTT that referenced this issue Jan 13, 2021
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets as part of _wait_for_msg(),
  together with all other MQTT messages;

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
flavio-fernandes added a commit to flavio-fernandes/Adafruit_CircuitPython_MiniMQTT that referenced this issue Jan 13, 2021
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets as part of _wait_for_msg(),
  together with all other MQTT messages;

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
flavio-fernandes added a commit to flavio-fernandes/Adafruit_CircuitPython_MiniMQTT that referenced this issue Jan 13, 2021
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wrapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets as part of _wait_for_msg(),
  together with all other MQTT messages;

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
flavio-fernandes added a commit to flavio-fernandes/Adafruit_CircuitPython_MiniMQTT that referenced this issue Jan 14, 2021
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wrapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets via _wait_for_msg();

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
@KTibow
Copy link

KTibow commented Feb 21, 2021

What's the status of this issue? Is there anything I can do to move this forward?

@flavio-fernandes
Copy link

What's the status of this issue? Is there anything I can do to move this forward?

AFAIK, this was actually an issue in dafruit_CircuitPython_MiniMQTT, where receive code was expecting
to get no less than the number of bytes parameter:

 https://github.com/adafruit/Adafruit_CircuitPython_MiniMQTT/issues/54

I could be out of context here if this is not related to CircuitPython_MiniMQTT. Apologies, if that it the case.
Otherwise, this resolved in adafruit/Adafruit_CircuitPython_MiniMQTT#55

@KTibow
Copy link

KTibow commented Feb 22, 2021

Got it. I installed from master. Now I just need to find out how to fix RuntimeError: Failed to send 2 bytes (sent 0)

@brentru
Copy link
Member

brentru commented Feb 22, 2021

@KTibow
Copy link

KTibow commented Feb 22, 2021

@brentru that worked! Thanks!

@brentru brentru closed this as completed Feb 22, 2021
@ALEEF02
Copy link

ALEEF02 commented Feb 26, 2021

What's the workaround for this? Currently having this issue as well - sending multiple GET requests often leaves me with RuntimeError: ESP32 timed out on SPI select

@KTibow
Copy link

KTibow commented Feb 26, 2021

Update all of your libraries, install the CircuitPython beta, and update your ESP32 firmware.

@evanthomas
Copy link
Author

Sorry, the symptoms are very slightly different but the test code above still fails in the same way. The only difference I can see is that first timeout on the bad URL produces the message "ESP32 not responding" but the stack is still broken and unrecoverable.

Have changes gone into the beta yet?
I'm using CP 6.2 beta, 6.x libraries and FW 1.7.1

The underlying network stack doesn't seem to be anymore reliable either. My application still needs to reboot every few minutes to deal with this (I'm polling internal data endpoints every 10s).

@KTibow
Copy link

KTibow commented Feb 28, 2021

Are you sure it's unrecoverable? What happens if you try to catch it and retry?

@evanthomas
Copy link
Author

Yes, I am. My production code catches and retries 5 times and then gives up and reboots. Can you please have a look at the example I posted when I raised the issue. Why does the second request fail?

@evanthomas
Copy link
Author

I could be out of context here if this is not related to CircuitPython_MiniMQTT. Apologies, if that it the case.
Otherwise, this resolved in adafruit/Adafruit_CircuitPython_MiniMQTT#55

@flavio-fernandes I am right in understanding that you closed this issue because a defect in the mqtt library was fixed?
This defect has nothing to do with mqtt.

@HerbertUnterberger
Copy link

same with me not using the mqtt

i use PyPortal fetch

@flavio-fernandes
Copy link

I could be out of context here if this is not related to CircuitPython_MiniMQTT. Apologies, if that it the case.
Otherwise, this resolved in adafruit/Adafruit_CircuitPython_MiniMQTT#55

@flavio-fernandes I am right in understanding that you closed this issue because a defect in the mqtt library was fixed?
This defect has nothing to do with mqtt.

Yes, kinda. ;) From my understanding, the changes on how SPI handles receive in PR112 affected CircuitPython_MiniMQTT. Moreover, @tannewt convinced me that the handling due to that change is best if done at a higher layer, which ultimately became the issue 54 in CircuitPython_MiniMQTT that I worked on. I apologize for the confusion this may have caused. It sounds to me that there are still other libraries/code that remains broken due to that change. Please reopen this bug if that is the case. I hope this gets resolved for all cases.

@evanthomas
Copy link
Author

I can't reopen it.

@jepler jepler reopened this Mar 5, 2021
@tannewt tannewt removed their assignment Mar 9, 2021
@mikejc58
Copy link
Contributor

mikejc58 commented Apr 3, 2021

I have been plagued by the 'Timed out waiting for SPI char' exception for a long time, both on a Matrix Portal and PyPortal. To try to understand what was going on, I added code to ESP_SPIcontrol._wait_spi_char to increase the timeout to 1 second and to record the times for each request. Here is what I found:

Running for 8 hours with a simple application on a PyPortal, talking via a socket to another system, in 8 hours there were 2.2 million calls to _wait_spi_char. They were essentially all from my application's calls to socket.connected and socket.available, waiting for messages. Of those 2.2 million calls, 99.5% of them completed in less than 1 ms (average was actually less than 5 us). 10,000 of them took between 1 and 10 ms. But 30 of them took more than 100ms (which is the timeout value coded in _wait_spi_char). All of those would have generated the exception. The longest of those 30 was just 176 ms. Since I had increased the timeout to 1000 ms, none of them failed and the application had no problems. Of interest is that there were none in the range between 10ms and 100ms. Something, (garbage collect perhaps) occasionally causes a relatively long wait.

Running with the timeout at 1 second has eliminated this as a problem for me.

@brentru
Copy link
Member

brentru commented Apr 5, 2021

@mikejc58 Which nina-fw version is running on the ESP32?

@mikejc58
Copy link
Contributor

mikejc58 commented Apr 5, 2021

It is Nina-fw-1.7.1 And circuitpython 6.1.0

BTW, I am doing some more experiments, and will let you know when done (a few days). But, I had an error in my comment of 2 days ago. I said the average time for calls to wait_spi_char that were less than 1ms was less than 5 microseconds. That was completely wrong (silly math mistake), the actual average of those requests seems to be about 115 microseconds.

@mikejc58
Copy link
Contributor

mikejc58 commented Apr 6, 2021

I've concluded my experiments and have attached a file here with details.

In short, my conclusion is that all of the 'Timed out waiting for SPI Char' exceptions that I've gotten are due to garbage collect. I believe the best course of action is to abandon timing these requests and substitute a simple retry counter. For the millions of requests that I've tracked there were none that failed on the first call to _read_byte(spi) but worked on a subsequent call. Any call to _read_byte(spi) that failed the first time, failed for the next 50 tries before finally timing out. So a simple retry loop eliminates the issues with timing (garbage collect and the use of time.monotonic() for short duration timeouts)

conclusions.txt

@ladyada
Copy link
Member

ladyada commented Apr 6, 2021

@mikejc58 good sciencing :) want to submit a PR?

@mikejc58
Copy link
Contributor

mikejc58 commented Apr 6, 2021

Sure. I am going to run with the code that I will push for a while first though.

@ajs256
Copy link

ajs256 commented Jun 20, 2021

I am also running into this error with my project. I have many errors logged, and I will make them available for sciencing if needed. Is there any update on a fix?

Never mind! I wasn't running the latest version of the library. I will send another comment here if I am still experiencing this error, but consider this resolved.

@ajs256
Copy link

ajs256 commented Jun 21, 2021

Should this be closed now that #132 is merged?

@tannewt
Copy link
Member

tannewt commented Jun 22, 2021

Sure!

@tannewt tannewt closed this as completed Jun 22, 2021
rtwfroody pushed a commit to rtwfroody/Adafruit_CircuitPython_MiniMQTT that referenced this issue Sep 18, 2022
This change addresses a few issues in the handling of the MQTT
messages that caused the library to become unstable:

- Add wrapper for socket.recv() so that an exact number of bytes
  are read into the buffer before attempting to parse the MQTT
  message;

- Fix handling of ping response packets via _wait_for_msg();

- Fix disconnect so it can gracefully handle cases when socket writes
  are not possible. Also re-init _subscribed_topics as an empty list
  instead of None.

Related-to adafruit/Adafruit_CircuitPython_ESP32SPI#102
Fixes adafruit/Adafruit_CircuitPython_PyPortal#98
Fixes adafruit#54
Signed-off-by: Flavio Fernandes <flavio@flaviof.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests