Sonoff Dual R2 IP Address not reachable after a while #614

Closed
gd-99 opened this issue Mar 1, 2018 · 45 comments · Fixed by #1877

@gd-99

gd-99 commented Mar 1, 2018

To avoid mixing up what may be two different issues, which I noted at the bottom of issue #572, I thought it better to open a new issue here.

I have managed to do some more testing - having finally managed to lose connection to the web /telnet interface. I think my issue may be related to the closed issue #392.

Briefly my setup is an unmodified Technicolor TG582n 4 port wireless router connected to a couple of devolo "powerline type" wifi extenders. These extenders are ultimately connected to one of the four Ethernet ports. The sonoff Dual r2 is in the same room as the router less than 2m away from it. Signal strength to the router almost maxed out. The nearest devolo is across the hall at the far end of the room - signal strength almost non existent. I use a shared (all use the same SSID) and hidden SSID.

Logging into the router web interface, I can see the devices connected to each of its interfaces (wifi or Ethernet port). When I am able to connect to the Sonoff, it always appears attached to the router's wifi interface. When I can't connect to the Sonoff, it always appears on the Ethernet port. As the Sonoff is wifi-only, I have to assume it shows up on the Ethernet port because it has decided to attach to the devolo AP, which has almost no signal strength, in preference to the router, which has maximum strength! When I next lose connection I will see if I can log onto the devolo and check whether it registers the Sonoff as a device connected to it.

I am not sure why, even if the Sonoff is connected to the devolo AP, I can't access the Sonoff web interface. A tablet connected to the same devolo wifi extender is able to browse the internet, yet is not able to connect to the Sonoff when it goes offline. Conversely, when the Sonoff comes back online (connected to the router wifi), the tablet is able to access the Sonoff web interface while attached to the same devolo AP.

As previously noted leaving Telnet enabled in the ESPurna Web interface, and repeatedly initiating a telnet connection will in most cases bring the Sonoff back online. The last time I had to run nping --arp to get the Sonoff to talk to me.

Wifi on the Sonoff is configured with Scanning turned off. With scanning turned on it is unable to connect to the hidden SSID.

Have I missed something I can do to stop the sonoff hopping onto a different AP? Alternatively, if it can be exposed on the web interface, is there a way to force the sonoff to only bind to a specific BSSID?

@mr-sneezy

mr-sneezy commented Mar 1, 2018

I'm not sure if I have a similar issue, but mine started flapping in and out of 'unavailable' status in Home Assistant every few minutes yesterday. I'd have put it down to new WiFi router issues, but everything else on the AP is fine, and the bridge worked reliably for about a week after the last firmware update to the dev code.
Wondering if I report mine as a new issue or try to reflash the firmware, in the meantime the device will have to be shut down...

@gd-99 gd-99 changed the title from "Esp8266 IP Address not reachable after a while" to "Sonoff Dual R2 IP Address not reachable after a while" Mar 1, 2018
@icevoodoo

I can confirm this problem on a SONOFF TH10 device as well. The device keeps working but the web interface does not respond. I have to power-cycle the device before I can log on to its web page.

@xoseperez xoseperez added this to the 1.13.0 milestone Mar 2, 2018
@xoseperez xoseperez self-assigned this Mar 2, 2018
@gd-99
Author

gd-99 commented Mar 3, 2018

Not sure if this is a red herring, but I thought I would update the issue report in case it is important. When my Sonoff went AWOL yesterday I did some checking in the arp table.

On a PC, when missing, the arp table reports the sonoff as:
ps (192.168.x.xx) at <incomplete> on eth0
On an Android tablet, when missing, as:
ESP8266-1-60-01-94-yy-yy-yy.lan (192.168.x.xx) at <incomplete> on wlan0

The IP address part is correct.

I am not sure why the host(?) name differs between the PC (Linux) arp table entry and the tablet's. In the tablet's arp table, the 60-01-94-yy part of the name looks like a copy of the MAC address; however, the real MAC address of the Sonoff starts off as 60-01-94 but the yy part is different.

When I am able to access the Web interface (Sonoff not AWOL) the arp table entries are:
For the PC:
ps (192.168.x.xx) at 60:01:94:xx:xx:xx [ether] on eth0
For the tablet:
ESP8266-1-60-01-94-yy-yy-yy.lan (192.168.x.xx) at 60:01:94:xx:xx:xx [ether] on wlan0

The difference is that the <incomplete> field is now populated with the correct MAC address of the Sonoff device, and the [ether] field has been added.

I don't know if it was just luck but running nmap against my IP address range made the Sonoff turn up against its IP address and I was able to access the web interface.
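The arp table check described above can be scripted so that a missing device is spotted without reading the table by eye. This is only a minimal sketch, assuming the Linux `arp` text format in which an unresolved entry shows `<incomplete>` in place of the MAC address; the function name and the sample addresses are hypothetical.

```python
import re

# Matches lines in the Linux `arp` style, for example:
#   "ps (192.168.1.20) at 60:01:94:aa:bb:cc [ether] on eth0"
#   "ps (192.168.1.20) at <incomplete> on eth0"
ARP_LINE = re.compile(
    r"^(?P<host>\S+) \((?P<ip>[^)]+)\) at (?P<hw>\S+)(?: \[\w+\])? on (?P<iface>\S+)$"
)
MAC = re.compile(r"^(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$")


def unresolved_hosts(arp_output):
    """Return the IP addresses whose ARP entry has no valid MAC address."""
    missing = []
    for line in arp_output.splitlines():
        match = ARP_LINE.match(line.strip())
        # Anything that is not a well-formed MAC (e.g. "<incomplete>")
        # means the host did not answer the ARP request.
        if match and not MAC.match(match.group("hw")):
            missing.append(match.group("ip"))
    return missing
```

Feeding it the captured output of `arp -a` from the PC and from the tablet would show whether both machines lose the entry at the same moment.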

@Geitde

Geitde commented Mar 4, 2018

Hi, First of all: I am back! :D You tricked me for a while with swapping the hoster for your source. :)

In the meantime I updated my home with additional devices and I experienced the same behavior. Randomly one of my Sonoff Dual (incl. r1 + r2) drops out. It works fine in the way that the firmware is still running, as I can use the wired buttons, but accessing it remotely is no longer possible.

@Geitde

Geitde commented Mar 6, 2018

I just had another non working Sonoff inside my wall.

Pinging the device works fine. So it is not the wifi connection itself failing.

@sashimanu-san

sashimanu-san commented Mar 10, 2018

I can confirm this bug, or even a family of them:

  1. Works ok, but then no WiFi, no ping
  2. Works ok, but then no web interface, ping ok
  3. After a reboot, does not respond to pings and web interface until pinged by the DHCP server/default gateway, afterwards works ok, at least for some time.

In all cases, the firmware keeps running: reacts to the button, the led blinks appropriately, scheduled events occur on time, the embedded AP is not visible.
Hardware is Sonoff Basic.

@Geitde

Geitde commented Mar 11, 2018

The device type does not matter, too. The issue is generic and happens for all Espurna-flashed devices.

I have 10 devices and almost all of them are mounted inside walls (stone walls, not American drywall). It is quite annoying to unplug/replug the fuse to reactivate a device.

@mr-sneezy

mr-sneezy commented Mar 11, 2018

So what sort of bug can take days or weeks to overwrite some important byte or other?
Edit: PS. I have two Tasmota FW loaded Sonoff switch devices also, both running for months fully stable.

@sashimanu-san

Days or weeks? For me it's a matter of hours before the unit goes unreachable.

@Geitde

Geitde commented Mar 12, 2018

As I said, I have 10 devices, and even if I power down the entire flat to give all devices a fresh start, they vanish at random times. Some after a week, some after a few hours. The timing is random and not device specific.

It must be some glitch in the network service handling that prevents getting connected properly. Maybe some timeout or retry counter overflow causing issues with negative values.

I guess there is no memory or hardware resource management? Or is there a micro kernel dealing with memory allocations? In that case it could simply run out of memory due to a leak or fragmentation.

I had no time to deeper look into the code, so just blind guessing here.

@sashimanu-san

The bug does not manifest itself if the unit is being pinged once a second.
Definitely a Heisenbug!
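For anyone who wants to repeat the once-a-second ping experiment from another machine on the LAN, a small keep-alive loop is enough. This is a minimal sketch, assuming a Linux host with the iputils `ping` (its `-c` and `-W` flags); the function names and the address in the usage note are hypothetical, and it is a test aid or workaround, not a fix.

```python
import subprocess
import time


def ping_once(host, run=subprocess.run):
    """Send one ICMP echo request; return True if the host replied."""
    # -c 1: send one packet, -W 1: wait at most one second for the reply
    result = run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def keep_alive(host, interval=1.0, rounds=None, run=subprocess.run, sleep=time.sleep):
    """Ping `host` every `interval` seconds and return the number of missed replies.

    With rounds=None the loop runs until interrupted.
    """
    missed = 0
    count = 0
    while rounds is None or count < rounds:
        if not ping_once(host, run=run):
            missed += 1
        count += 1
        sleep(interval)
    return missed
```

Calling `keep_alive("192.168.1.50", rounds=3600)` would ping the unit for roughly an hour and report how many replies were missed. The `run` and `sleep` parameters exist only so the loop can be exercised without a network.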

@xoseperez
Owner

I'm sorry guys, I'm not able to help you here. Not my field and I don't have a clue where to start debugging this...

@Geitde

Geitde commented Mar 16, 2018

Well, without a fix ESPurna is quite useless: for a 24/7 use case it renders the devices unusable.

I avoided using any of my devices for quite some time and now none of them respond anymore. Basic, Touch, Dual, Dual R2.

To reproduce: plug the device in and simply do not use it in any way. It seems that any network access avoids the bug, so there is clearly some timeout or counter overrun. After a few days it is no longer responsive from the network side. Make sure no device or user performs a network scan, e.g. from an app, as this will probably reset the issue.

@icevoodoo

Geitde, could you please test with the latest firmware and disable the Scan networks option in the WiFi section? I have 9 days' uptime so far and the device is still responding over HTTP...

@Geitde

Geitde commented Mar 16, 2018

Just did that. Let's see whether it still fails here or not.

Just remember whenever you use http or anything else you probably reset the "timing" causing the delayed issue.

@gd-99
Author

gd-99 commented Mar 17, 2018

I have left my Sonoff Dual R2 alone for a while to see if I can get some consistency. I use my device more as a time switch (schedule) with the ability to remotely override the on or off time. Leaving the device to do its thing, I have found the scheduling carries on working faultlessly. This I assume points me at the issue as something specifically network related.

I have found that if I close one browser before opening a browser on a different device more often than not I can get to the web interface in the new browser without a problem.

IMHO I don't think the problem is with ESPurna because I have seen others reporting the same sort of issue such as this one: Esp8266 IP Address not reachable after a while #2330 which goes back to the ESP8266 core for Arduino.

When I do lose the device I have found I can "re-find" it by doing an "ARP ping" using nping or nmap. This means I do not need to physically power cycle the sonoff - just a bit of a pain waiting for nmap to complete. I think this points back to the LwIP, which is part of the ESP8266 core and I think implicated in handling ARP? I see there has been some work in updating LwIP, but the current version (v2) throws up other bugs. Is LwIP v2 updated with the ESP32 in mind?

For my usage it would be very useful if I could "lock" the sonoff to a specific BSSID (AP), to stop it latching, incorrectly, onto nearby APs in my network. Would this be possible in a future version?

@Geitde

Geitde commented Mar 17, 2018

Ok, I did this

sudo nping --arp-type ARP-reply 192.168.0.192

on one of my not yet power cycled devices and yeah. This recovers the web interface. Unfortunately it does not recover the alexa stuff. The device remains dead for that kind of protocol.

Even a reboot using the web interface did not help here.

@Geitde

Geitde commented Mar 18, 2018

Ok, some update. After turning off the wifi scan, my devices seem to be stable. At least I have not experienced a single problem since then. Even Alexa somehow managed to find the devices again after an annoyingly long time.

However, I have some updated theory.

The wifi scan blocks normal data transfer, as the wifi chip is in a different mode because it needs to change frequency. At least the Atheros chips did that when I dealt with the drivers on another project.

What happens when the wifi scan mode is active and another part of ESPurna, e.g. the NTP driver, wants to send out a message itself and hits the scan?

This will likely fail, so my guess is that the error causes the entire network stack to be set into a "failed to connect" mode, even while it is connected. The ARP ping forces an answer, which somehow is not blocked by that "not connected" state as it is so low-level, and since the device replies properly the state gets reset. Just my 50 cents for today.

@morgapa

morgapa commented Mar 19, 2018

I have the same problem

@xoseperez
Owner

Interesting. I have not seen any issues with wifi scan. Other modules should not try to connect if there is no valid connection ready.

@Geitde

Geitde commented Mar 20, 2018

Ok, some update. Heisenbug returns.

Disabling Wifi scan did not help. Three of my devices (Dual + DualR2 + Touch) disconnected over night with that option disabled.

So back to page one. At least we are all on the same one again.

@CrappyTan

Could someone confirm whether their device reverts to AP mode when it vanishes like this? I've got a few instances of this which seem to be runtime related.

@icevoodoo

So far, on my SONOFF TH10 with wifi scan disabled, I have more than 3 days' uptime and I can access the web interface. I have also activated the option on my router to bind IP addresses to MAC addresses. Strangely, this problem did not occur on my device with previous versions of the ESPurna firmware. Maybe Xose should check what differs from one version to another, what did he change???

@Geitde

Geitde commented Mar 21, 2018

I did not experience the issues before the topic came up here, and I used the latest versions to get the most recent features.

However, we can take wifi scan off the list, as multiple devices have already failed here with wifi scan disabled. Currently I am waiting for a device to fail to see if we get the AP mode.

Stupid question. As future versions are a little picky about OTA updates due to space limitations, I probably need the two-step update. Which ESPurna Core needs to be used with which device? 1MB or 4MB for what? Or can I use the 1MB one for all and they only lack some feature(s)?

@CrappyTan

Personally, the core functionality should be limited. I think you are baking too much in to the product.

A use-case should be basic function (LEDs, relay etc) , web interface and perhaps mqtt/api. This allows the device to function as well as be controlled by an external source.

If you want to get fancier then you should get an RPi and do the hard work there in controlling it with sync and timers etc. The ESP is a small devil which is working too hard... :)

Just my 10c....

@Geitde

Geitde commented Mar 21, 2018

I want the core stuff just to perform a two-step flash of the new ESPurna.

Flash the small core, reboot and flash the new full-size ESPurna. With 1.12.4 flashed you cannot update using the web interface anymore, as it takes too much space.

Also I now can answer the AP question. One of my devices just dropped out again and it is not in AP mode as I see no new (espurna) networks.

@xoseperez
Owner

The two-step update is off topic here but... The key point is that the ESPurna Core image should use the same flash layout as your previous image so it can find the configuration section and connect to the same WiFi so you can perform the second step. You could even automate the two-step update (that's something I want to add to the ESPurna OTA manager).

If you flash the espurna-core-1MB.bin image to a 4MB device (like a wemos) it won't find the configuration information and will revert to factory settings.
Of course you can use 1MB images for every device, even those with 4MB flashes, since right now ESPurna is not using the rest of the flash. But mind that prebuilt images are compiled using the flash size of the target device.

@xoseperez
Owner

My feeling, after reading your comments here and in the Arduino Core repo, is that this is not entirely related to ESPurna, though there might have been a change at some point that has somehow made it more obvious (notice all the conditionals and doubts).

@icevoodoo what version of ESPurna were you using when it was stable?

@xoseperez
Owner

@Geitde and the rest. Have you tested the WiFi.setSleepMode(WIFI_NONE_SLEEP); solution?

@icevoodoo

I have reached 4 days' uptime so far with these settings (we will see after another 4 days...). I don't know if they have anything to do with it, but I will list them here. Regarding the question of which version I first saw this problem in, I think 1.12.1 or 1.12.2, I don't remember...

Alexa integration - off
Scan networks - off
No schedule
Domoticz - off
HASS - off
Thingspeak - off

@icevoodoo

Again with the latest firmware. What is interesting is that from my laptop I cannot access the web interface and get no ping response. I checked my arp table and there is no record for the device... From my RPI3, however, I can ping the device and get a response, and its arp table has the MAC address of the device...

What do you think about that ???

@Geitde

Geitde commented Apr 13, 2018

I was forced to use tasmota, so no more reports from me for a while.

@gd-99
Author

gd-99 commented Apr 13, 2018

I too have started using the Tasmota firmware on a second device to see if on my network it suffers the same drop out. Though it is early days I haven't lost contact with the Sonoff Dual R2 yet.

I am keeping the two firmwares in their own environments on my hard disk, for which I downloaded the latest Arduino IDE along with the Arduino Core for ESP8266 via the Board Manager. I notice the Arduino Core for ESP8266 has now moved to version 2.4.1. Using the 2.4.1 version, when selecting the lwIP Variant the option label has changed to read: v1.4 High Bandwidth. This is what I have compiled Tasmota against. Once I have finished checking the connection issue against the two firmwares, I will try to recompile ESPurna using the Arduino Core for ESP8266 version 2.4.1, just to see if the changes make any difference to the Sonoff dropping off the network.

@gd-99
Author

gd-99 commented Apr 22, 2018

Pants! Well, it has taken over 10 days, but finally I have a failure to connect to the Tasmota-flashed Sonoff Dual R2. The ESPurna one is currently fine. However, within the last 10 days I did lose contact with the ESPurna device, but I followed icevoodoo's observation and set up an SSH tunnel from my RPI3 to the ESPurna-flashed Sonoff. I was then able to reach the missing ESPurna device's web interface by connecting locally. I am not sure I can figure out why that should be.

I managed to reconnect to the Tasmota, by running nmap /24 on the RPI. It takes forever to run but at some point in the scan it seems to sort something out. There was nothing in the Tasmota log to indicate where it had gone and just like the ESPurna the background tasks such as timers kept working - even if I lost contact with the web interface.

The only fly in my ointment is that I had a wifi switch failure on my network during the last 10 days, which caused me to mess with the network topology, and that could have upset these devices. I can't help feeling it's a bit like Schrödinger's cat and every time I check the connectivity I'm affecting the outcome.

As the Tasmota dropped off the network, it doesn't appear necessary to recompile my ESPurna with the latest core v2.4.1.

I will keep monitoring both from time to time and see if any pattern becomes apparent.

@darshkpatel
Contributor

Having the same issue.
Mine seems to work alright for a couple of days with wifi scan off.

@RoSulek1

RoSulek1 commented May 1, 2018

Having the same issue (Sonoff Basic/TH10/SC/RF Bridge/T1 ...).
Ping from a PC connected via LAN works. Ping from a PC connected via wifi gets no response. After restarting the router, ping from wifi works for 15-20 minutes and then stops.

@icevoodoo

@xoseperez: Please can we have a proper resolution for this, because it is very annoying. I see you have targeted the 1.13 version for resolving this. Can we at least have an explanation? Is there hope? :)

@xoseperez
Owner

Not really. I'm not experiencing this issue with any of my devices and at the same time I don't have experience with this kind of network issues...

@mcspr
Collaborator

mcspr commented May 16, 2018

The gist of other issues seems to be sleep mode (and maybe some broken wireless hardware causing sleep mode to misbehave) - this comment from upstream does something similar to arp-ping solutions and tries to send periodic keep-alive requests to network gateway while still keeping default WIFI_MODEM_SLEEP (or using WIFI_LIGHT_SLEEP as tasmota does)
esp8266/Arduino#2330 (comment)
https://github.com/d-a-v/PingAlive

Besides firmware issues, I too had similar nonsensical disconnection issues until I replaced an old TEW652 (2009 model with ar71xx, recent LEDE fw). It was OK with core 2.3.0 but stopped working on 2.4.*. I just could not have more than 4 ESPs on the same network: new connections dropped out and existing ones sometimes got lost in an AP -> connect -> AP -> reconnect cycle.

@xoseperez xoseperez modified the milestones: 1.13.0, 1.14.0 Jun 4, 2018
@vtochq
Contributor

vtochq commented Aug 20, 2018

I can confirm that this is an ARP issue.
I can reproduce it only on a Sonoff T1 with ESP8285.
Other ESP devices work fine (one with ESP8285, two with ESP8266).

The default WIFI_SLEEP_MODE is now WIFI_NONE_SLEEP, but that doesn't help.

I can reproduce this ARP problem with all platforms: 1.5.0, 1.6.0, 1.7.3 and the current stable (1.8.0).

nping shows that the device doesn't answer ARP requests:

$ sudo nping --arp-type ARP 192.168.1.41

Starting Nping 0.6.40 ( http://nmap.org/nping ) at 2018-08-20 16:50 +06
SENT (0.0221s) ARP who has 192.168.1.41? Tell 192.168.1.4
SENT (1.0224s) ARP who has 192.168.1.41? Tell 192.168.1.4
SENT (2.0235s) ARP who has 192.168.1.41? Tell 192.168.1.4
SENT (3.0247s) ARP who has 192.168.1.41? Tell 192.168.1.4
SENT (4.0255s) ARP who has 192.168.1.41? Tell 192.168.1.4

If I add an ARP record, everything works fine for a while (uptime is now >5 days).
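The nping transcript above shows only SENT lines, which is how the missing ARP replies were spotted. That check can be automated by counting request and reply lines in the captured output. A minimal sketch, assuming that nping prefixes received replies with `RCVD` in the same way it prefixes requests with `SENT`; the function name is hypothetical.

```python
def arp_reply_summary(nping_output):
    """Count ARP requests sent and ARP replies received in nping's text output."""
    sent = 0
    received = 0
    for line in nping_output.splitlines():
        line = line.strip()
        # Request lines look like: "SENT (0.0221s) ARP who has ...? Tell ..."
        if line.startswith("SENT ") and " ARP " in line:
            sent += 1
        # Assumption: reply lines start with "RCVD" and also contain "ARP".
        elif line.startswith("RCVD ") and " ARP " in line:
            received += 1
    return sent, received
```

A result with `sent > 0` and `received == 0` would correspond to the failing case, where the device stays silent on ARP even though its firmware is still running.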

@mcspr
Collaborator

mcspr commented Aug 20, 2018

@vtochq can you try using the latest platform (1.8.0) and replacing the ...LWIP... define with -DPIO_FRAMEWORK_ARDUINO_LWIP2_HIGHER_BANDWIDTH here? (From the comments above, to use lwip2 for the network stack.)

build_flags = -g -w -DMQTT_MAX_PACKET_SIZE=400 -DNO_GLOBAL_EEPROM ${sysenv.ESPURNA_FLAGS} -DPIO_FRAMEWORK_ARDUINO_LWIP_HIGHER_BANDWIDTH

@vtochq
Contributor

vtochq commented Aug 22, 2018

@mcspr sorry, I missed the comment about LWIPv2.
I have now tried 1.8.0 with PIO_FRAMEWORK_ARDUINO_LWIP2_HIGHER_BANDWIDTH and got the same result: no ARP RCVD when I use nping.

@GiorgioRoma

Hi, I have the same problem, but only with the Sonoff Pow R2. A Sonoff RF with ESPurna 1.13.3 is OK, without problems, while two Sonoff Pow R2 units (ESPurna 1.13.3) lose connection 1-2 minutes after a reboot. Help!!!

@GiorgioRoma

Good evening to all, I would like to provide an update on the problem. The day after my report, I went to work and remotely connected to the Raspberry at home, and magically both Sonoffs were online and their web pages stayed available... all day!! When I got home, I connected to the wifi and after a few minutes the Sonoffs were offline again. So I thought it was a wifi congestion problem (I had 12 devices connected simultaneously). I turned off the phones and the iPad and enabled the 5 GHz band on the router so that devices other than the Sonoffs use that band, and since then I have not had any more problems (I did not change any settings on the Sonoffs).
Try it too and let me know.
Sorry for my English.
Bye

@vtochq
Contributor

vtochq commented Nov 16, 2018

I still have the ARP problem, but only on the Sonoff T1 dual. Maybe it's a chip bug or something like that. I have tried all platform versions, LWIP versions and several recent ESPurna versions.
For now I just create a static ARP record on every device that connects to this Sonoff, and everything works fine.
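For reference, a static ARP record on a Linux client can be created with iproute2's `ip neigh replace ... nud permanent`. The sketch below only builds that command line so it can be reviewed before being run as root; the function name and the example values are hypothetical, and other operating systems use different tools.

```python
import re

MAC = re.compile(r"^(?:[0-9a-fA-F]{2}:){5}[0-9a-fA-F]{2}$")


def static_arp_command(ip, mac, iface):
    """Build the iproute2 command that pins `ip` to `mac` on interface `iface`."""
    if not MAC.match(mac):
        raise ValueError(f"not a MAC address: {mac!r}")
    # "nud permanent" marks the neighbour entry as static, so the kernel
    # never needs an ARP reply from the device to keep it valid.
    return ["ip", "neigh", "replace", ip, "lladdr", mac.lower(),
            "dev", iface, "nud", "permanent"]
```

For example, `static_arp_command("192.168.1.41", "60:01:94:AA:BB:CC", "eth0")` yields an argument list that can be passed to `subprocess.run`. The entry does not survive a reboot of the client unless it is re-applied.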

I haven't tried two things (because my switch is already wall mounted):

  1. Full erase flash and reflash Espurna;
  2. Change MAC-address.
