WebUI blank when accessing espurna over OpenVPN on 1.13.5 (WS sends incomplete data) #1610

Closed
ddf89 opened this issue Mar 6, 2019 · 20 comments · Fixed by #1723

Comments

@ddf89

ddf89 commented Mar 6, 2019

Bug description
WebUI blank when accessing espurna over OpenVPN

Steps to reproduce
Access and login to webui over OpenVPN

Expected behavior
WebUI works same as when on local network

Screenshots
Over VPN: [screenshot: over vpn]
Locally: [screenshot: locally]

Tested on SONOFF POW-R2, T1 (1,2,3 gang), BASIC and S20

Accessing the WebUI no longer works for me on 1.13.5 when I'm connected via OpenVPN; this used to work properly on 1.13.3. Inspecting the WS, I can see that over the VPN the device does not send all the data frames that it sends when connected locally. I doubt this has anything to do with the VPN or the network per se, since even when connected locally all my devices are on another VLAN, so traffic is routed (via a pfSense router).

@ddf89
Author

ddf89 commented Mar 6, 2019

The last data frame over VPN is the relayConfig object, whilst the last data frame locally is the one with all the configuration. This explains why the UI can't load over VPN. Does anything come to mind that could cause this?

@ddf89 ddf89 changed the title WebUI blank when accessing espurna over OpenVPN on 1.13.5 (WS sends less data) WebUI blank when accessing espurna over OpenVPN on 1.13.5 (WS sends incomplete data) Mar 6, 2019
@ColinShorts
Contributor

Maybe the packet is fragmented and somehow lost? Possibly a misconfiguration in MTU size. You might be able to test by specifying packet size in ping, start at 1500 (too big for VPN normally, but the standard for Ethernet) but you might find it works around 1432.

Just a guess though 😀
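
The ping test suggested above can be sized from the header overhead. A minimal sketch, assuming IPv4 without IP options (20-byte IP header plus 8-byte ICMP echo header); the function name is illustrative only:

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
ICMP_HEADER = 8   # bytes, ICMP echo header


def max_ping_payload(mtu: int) -> int:
    """Largest ICMP echo payload that fits in one unfragmented IPv4 packet."""
    payload = mtu - IPV4_HEADER - ICMP_HEADER
    if payload < 0:
        raise ValueError("MTU too small for an IPv4 ICMP echo")
    return payload


if __name__ == "__main__":
    # 1500 is the usual Ethernet MTU; 1432 is the guess from the comment above.
    for mtu in (1500, 1432):
        print(mtu, max_ping_payload(mtu))
```

The result is the payload size to pass to ping with fragmentation disabled, for example `ping -M do -s <size>` on Linux or `ping -f -l <size>` on Windows. If the largest size that gets through is smaller than expected, the path MTU is likely lower than the interface MTU.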

@ddf89
Author

ddf89 commented Mar 6, 2019 via email

@ColinShorts
Contributor

I think that it's too low level to be attributed directly to espurna, possibly network stack related, that is unless the packet now contains a lot more data than it used to.

I might be able to test this tomorrow, but I'd likely need to flash two devices and dust off my VPN server. I normally use a dynamic SOCKS proxy over SSH, maybe you could try that as a workaround in the meantime?

@mcspr
Collaborator

mcspr commented Mar 7, 2019

The difference is that the old version sends several WS data frames, while 1.13.5 sends just one.
Technically, it must work exactly the same, because the underlying library will space out the data.

And, funnily enough, the first thing i tried is reverse ssh tunnel + wscat tool on the remote server (~150ms for good measure), which worked perfectly.

@ddf89
Author

ddf89 commented Mar 7, 2019 via email

@ddf89
Author

ddf89 commented Mar 7, 2019

screenshot

@ColinShorts
Contributor

I still think this problem is caused way deeper than espurna, i.e. by way of inclusion from:
#include <ESP8266WiFi.h> in justwifi.

There is an example of a user with problems using Hyperion and a large string of WS28nn LEDs who runs into problems with fragmentation/dropped packets here

caveat emptor, I do use a lot of espurna devices, and on the odd occasion OpenVPN (just not together). I'm not a developer and tend to get a bit lost in the code at points. @xoseperez or @mcspr will know a lot more than me from the code/library side.

@ddf89
Author

ddf89 commented Mar 7, 2019

I see your point. The thing is that the response for / gets split properly according to the negotiated MSS. It's just the WS datagram that's not, and I suspect this to be true for the whole WS conversation.

@mcspr
Collaborator

mcspr commented Mar 7, 2019

Right. Both responses are handled by ESPAsyncWebServer:
HTTP: https://github.com/me-no-dev/ESPAsyncWebServer/blob/95dedf7a2df5a0d0ab01725baaacb4f982dedcb2/src/WebResponses.cpp#L263
WS: https://github.com/me-no-dev/ESPAsyncWebServer/blob/95dedf7a2df5a0d0ab01725baaacb4f982dedcb2/src/AsyncWebSocket.cpp#L54

Have almost the same results on my test openvpn setup, it fails even with 1.6kb of data :/
Debugging does not really make sense, as the data is reported as sent OK. ACK never arrives from its point of view and it times out the connection.
Using latest Core / lwip2 does "fix" this behaviour.

@ColinShorts
Contributor

It might be a moot point, but I take it you are both (also) using UDP for OpenVPN?

@mcspr
Collaborator

mcspr commented Mar 7, 2019

Yes. But problem is still there with proto tcp configs.

@ReclusiveGeek

The same thing happens if you use NAT port forwarding but not if you use an ssh tunnel. It would appear that the issue is not related to UDP or TCP as it happens with both.

@mcspr
Collaborator

mcspr commented Mar 15, 2019

I think @icoma89 had already outlined the exact problem - MTU/MSS size matters.
For example, on Windows machine:

$ netsh interface ipv4 show subinterfaces

shows default MTU of 1500 - everything works

$ netsh interface ipv4 set subinterface "WLAN" mtu=1453

edited: mtu value from #1614, because mss below is 1413

Causes config ws frame to be dropped by the router. ref tcpdump:

> IP DESKTOP-I8S9V0N.lan.62839 > ESPURNA-35A259.lan.80: Flags [S], seq 3747294428, win 64998, options [mss 1413,nop,wscale 8,nop,nop,sackOK], length 0
> IP ESPURNA-35A259.lan.80 > DESKTOP-I8S9V0N.lan.62839: Flags [S.], seq 6513, ack 3747294429, win 5840, options [mss 1460], length 0
...
> IP truncated-ip - 190 bytes missing! ESPURNA-35A259.lan.80 > DESKTOP-I8S9V0N.lan.62839: Flags [P.], seq 530:2180, ack 227, win 5614, length 1650: HTTP

And the same, but properly fragmented, when using lwip2:

> IP ESPURNA-35A259.lan.80 > DESKTOP-I8S9V0N.lan.61152: Flags [P.], seq 530:1943, ack 227, win 5614, length 1413: HTTP
> IP DESKTOP-I8S9V0N.lan.61152 > ESPURNA-35A259.lan.80: Flags [.], ack 1943, win 64998, length 0
> IP ESPURNA-35A259.lan.80 > DESKTOP-I8S9V0N.lan.61152: Flags [P.], seq 1943:2208, ack 227, win 5614, length 265: HTTP
> IP DESKTOP-I8S9V0N.lan.61152 > ESPURNA-35A259.lan.80: Flags [.], ack 2208, win 64733, length 0
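
The arithmetic behind the two captures can be sketched as follows. This is a simplified model and not lwip's code: it assumes IPv4 and TCP headers of 20 bytes each with no options, and the function names are illustrative. A sender is expected to cap each segment at the smaller of its own MSS and the one the peer announced in its SYN.

```python
from typing import List

TCP_IPV4_OVERHEAD = 40  # 20-byte IPv4 header + 20-byte TCP header, no options


def mss_for_mtu(mtu: int) -> int:
    """MSS a host would announce for a given interface MTU (simplified)."""
    return mtu - TCP_IPV4_OVERHEAD


def segment_lengths(total: int, local_mss: int, peer_mss: int) -> List[int]:
    """Split `total` payload bytes into segments no larger than either MSS."""
    mss = min(local_mss, peer_mss)
    if mss <= 0:
        raise ValueError("MSS must be positive")
    lengths = []
    remaining = total
    while remaining > 0:
        chunk = min(remaining, mss)
        lengths.append(chunk)
        remaining -= chunk
    return lengths


if __name__ == "__main__":
    peer_mss = mss_for_mtu(1453)                   # client side after the netsh change
    print(peer_mss)
    print(segment_lengths(1678, 1460, peer_mss))   # payload size from the lwip2 capture
```

With the MTU of 1453 from the netsh command, the announced MSS is 1413, and a 1678-byte payload splits into the 1413 + 265 bytes seen in the lwip2 capture. The failing capture instead shows a single 1650-byte segment, which exceeds the peer's MSS.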

edit:
assuming that this configuration change is valid, because I can access a different non-esp device
and assuming that MTU mismatch can also happen somewhere on the network path to the device i.e. router sets this value (which would trigger this too?)

@mcspr
Collaborator

mcspr commented Mar 18, 2019

The main question now for me is, why doesn't espurna (or the underlying
lib) split the tcp packet to fit the negotiated mss?

Related lwip bug:
http://savannah.nongnu.org/bugs/?46384
http://git.savannah.gnu.org/cgit/lwip.git/commit/src?id=8e8571da6a6771f9d2d82bbd0b5a6c27474ce0fc

Since we are still suffering with Core 2.3.0, lwip can be built from source:

$ cd ~/.platformio/packages/framework-arduinoespressif8266@1.20300.1/
$ # patch tools/sdk/lwip/src/core/tcp_out.c
$ make -C "tools/sdk/lwip/src" install TOOLS_PATH=~/.platformio/packages/toolchain-xtensa/bin/xtensa-lx106-elf-

Resulting library is tools/sdk/lib/liblwip_src.a. The pre-built one is at tools/sdk/lib/liblwip_gcc.a, so we can just replace the file. Or change LIBS variable of the scons builder (extra_script)

@mcspr
Collaborator

mcspr commented Mar 26, 2019

@xoseperez two options about shipping lwip1 fix:

  1. ide & pio option. pre-build the lwip lib archive and put it somewhere in the repo (like code/lib). prepend the directory path to LIBPATH in extra_script - the linker will pick our lwip variant before the framework's. that is for current releases
    when using the old framework with arduino ide - manually replace the file <framework-dir>/tools/sdk/lib/liblwip_{gcc,src}.a. otherwise, it might be better to advise using the latest Core with lwip2, where this problem is fixed already (and many other things fixed by sdk...)
  2. instead of building beforehand, integrate the builder into the pio script. the only problem is that it only works if the user has make installed, meaning no Windows support. no problem for travis though.
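
The LIBPATH idea from option 1 could look roughly like this in a PlatformIO extra_script. This is only a sketch under assumptions: the helper name and the directory are hypothetical, and the script that actually shipped may do it differently. `Prepend` is the regular SCons environment method for putting values at the front of a construction variable.

```python
import os


def prepend_libpath(env, lib_dir):
    """Put lib_dir first in the linker search path of a SCons environment."""
    # Entries earlier in LIBPATH are searched first, so an archive placed
    # here shadows the framework's copy with the same file name.
    env.Prepend(LIBPATH=[os.path.abspath(lib_dir)])
    return env


# Inside a PlatformIO extra_script this would be called roughly as:
#   Import("env")                        # provided by PlatformIO / SCons
#   prepend_libpath(env, "code/lib")     # hypothetical dir with the pre-built lwip archive
```

The archive in that directory has to keep the name the build already links against, otherwise the linker will not pick it up in place of the framework's.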

@simonedimarzio

Hi all,
Thanks @mcspr for debugging this issue and thanks @xoseperez for your work.
Due to this issue it is not possible to access the espurna web UI through an OpenVPN connection.

I'm a programmer but not very experienced with building C code, could you please provide some more info on how to include the fix in the project?
I'm using Windows, it would be great to have the compiled lib to link/substitute into the project.

Thanks a lot!

@mcspr
Collaborator

mcspr commented May 19, 2019

@simonedimarzio Easiest way is to just build using latest Core versions, where lwip2 is the default.
But, you can use WSL and ignore the Windows "problem" :) Just install platformio there.
And BTW I may be wrong about incompatibility, never tested that.

I think #1723 is a "good enough" solution. If the user has all dependencies installed, this will automatically patch lwip1 in Core only when ESPURNA_PIO_PATCH_ISSUE_1610=something is in the environment. That includes releases here, travisN tests and the nightly builder (https://github.com/mcspr/espurna-nightly-builder/releases)
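
The environment gate described above can be illustrated with a small check. This is not the code from #1723, just a sketch of the idea; the function name is hypothetical, and treating an empty value as "not set" is an assumption.

```python
import os

PATCH_FLAG = "ESPURNA_PIO_PATCH_ISSUE_1610"


def patch_requested(environ=None) -> bool:
    """Return True when the opt-in variable is present with a non-empty value."""
    environ = os.environ if environ is None else environ
    # Assumption: an empty string counts as "not requested".
    return bool(environ.get(PATCH_FLAG))


if __name__ == "__main__":
    print(patch_requested({PATCH_FLAG: "y"}))
    print(patch_requested({}))
```

A build script would call such a check once at start-up and skip the lwip1 patching step entirely when it returns False, so regular builds stay untouched.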

@ddf89
Author

ddf89 commented May 20, 2019

Thanks for this @mcspr. I can confirm issue is resolved with espurna-1.13.6-dev.nightly20190519+gitaac2e1e4-itead-sonoff-pow-r2.bin (Sonoff POW R2).

@simonedimarzio

Thanks @mcspr, also for me the fix worked perfectly!

Steps:

  1. Pulled the latest dev version from the repo
  2. pio run -e <env> -t build-lwip
  3. pio run -e <env>
  4. Flash device
