Anomalous behavior of TCP cloak (also UDP but less important) when using Android Hotspot, unlikely to be TTL / operator filtering #229
Comments
It could be an issue in the Android wifi driver when operating in hotspot mode. Cloak's transmit/receive code is very simple (just a goroutine loop copying both sides), so if there are latency issues or lags, it could also be an issue in, or tuning needed for, the Go scheduler itself.

It can also be a TCP stall/breakdown: in a proxy-like situation there is a phenomenon similar to bufferbloat, where the two TCP connections on either side of Cloak operate under different network conditions and don't "couple" their flow rates correctly. When there's a sudden network lag or throughput drop on one side, the other side doesn't feel it, and the congestion on one side of Cloak is not properly transmitted as back pressure to the other side so that it slows its flow rate accordingly. This happens when the buffers used while copying between the connections (at the TCP level or the Cloak level) are too large.

I tried to tackle this in my fork by adding an extra configuration option or two to set the buffer sizes, but with limited or no effect. I suspect I haven't done something correctly, which is why I haven't made a PR on this repo with these changes yet. Try my fork and see if playing with the extra config options for the receive and send buffers changes anything in the behaviour you're seeing.
Another symptom of this breakdown is that the connection doesn't achieve the maximum download rate: the TCP congestion algorithm depends on sensing packet loss and back pressure to optimize the rate, so when the coupling is desynchronized it can't reach the maximum possible throughput either.
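To make the back-pressure idea concrete, here is a minimal sketch (not Cloak's actual code, names are illustrative) of a relay that copies between the two sides with a deliberately small buffer, so that a stalled write on one side quickly stops reads on the other and the two TCP flows stay coupled:

```go
package relaysketch

import (
	"io"
	"net"
)

// relay shuttles bytes in both directions between a and b.
func relay(a, b net.Conn, bufSize int) {
	pipe := func(dst, src net.Conn) {
		buf := make([]byte, bufSize) // e.g. 16*1024 instead of several MB
		// io.CopyBuffer blocks in dst.Write once dst's kernel send buffer is
		// full; while it is blocked, src is no longer read, src's receive
		// window closes, and the remote sender slows down accordingly.
		_, _ = io.CopyBuffer(dst, src, buf)
		a.Close()
		b.Close()
	}
	go pipe(a, b)
	pipe(b, a)
}
```

With a multi-megabyte buffer the copy loop keeps absorbing data long after the far side has stalled, which is exactly the decoupling described above.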
Well @notsure2, I tried the small-tcp-buffer branch and playing with buffer sizes has not improved anything, alas.

However, I did find a workaround (a completely insane one, but it works!): I installed QubesOS and separated Cloak into its own proxyVM between the VM running OpenVPN and the VM containing the passthrough for the network card. So now it looks like: Laptop with Cloak (QubesOS). And this boondoggle works.

Cloak works very well, both TCP and UDP, almost without flaw (well, the hiccups I described in another issue persist, but IMO they got better). Ping to my test remote also improved by about 10 +/- 2 ms.

I think this suggests that Cloak is somehow interacting poorly with my laptop's network card driver. Weird.

P.S.:
Start it at 4 MB and halve it each time, and see how this changes the behavior. Ideally you want it really, really small, so that if the connection on one side seizes up or stalls, Cloak also stalls the other side by stopping the copy (so that the other side couples its congestion algorithm to the clogged side's congestion algorithm).

Edit: also use the notsure2 branch, as I may have made some bugfixes since merging it. The binaries I have on my fork are built from the notsure2 branch, so you don't need to build it yourself.
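If you want to experiment with this at the socket level yourself, Go also exposes the kernel buffer sizes per connection. A minimal sketch, assuming you can get at the underlying *net.TCPConn (this is not necessarily what the fork's config options do):

```go
package sockbufsketch

import "net"

// shrinkSocketBuffers caps the kernel send/receive buffers for one connection.
func shrinkSocketBuffers(c net.Conn, size int) error {
	tcp, ok := c.(*net.TCPConn)
	if !ok {
		return nil // not a plain TCP connection (e.g. already wrapped in TLS)
	}
	if err := tcp.SetReadBuffer(size); err != nil { // SO_RCVBUF
		return err
	}
	return tcp.SetWriteBuffer(size) // SO_SNDBUF
}
```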
@notsure2 Cool! I'll play with your notsure branch from now on. On an unrelated note, what's up with the support-cloudflare branch: does it work, and how do I use it?

P.S.: UDP still suffers a "hard stall" problem, but TCP appears to be immune (or at least the stall takes a long while to develop), which I used to try and capture something in a log server-side, with what might be success.

Procedure I followed:
- Restarted everything, server included
- Started Cloak clients in their own VM (networking configured appropriately)
- Started an OpenVPN UDP session, and a ping to get an idea of when the connection stalls (also visible in the Cloak client's trace output, the "reads" just stop)
- As soon as the UDP "irrecoverable stall" was detected, I restarted OpenVPN with the TCP config and connected to the Cloak server to grab some logs.

Following logs:

Aug 07 22:03:21 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:21Z" level=info msg="Listening on :443"
Aug 07 22:03:21 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:21Z" level=info msg="Listening on :80"
Aug 07 22:03:23 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:23Z" level=info msg="New session" UID="UDP-CLIENT-UID==" sessionID=1248497831
Aug 07 22:06:39 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:39Z" level=info msg="New session" UID="!!!!!!TCP-CLIENT-UID==" sessionID=3365958504
Aug 07 22:06:58 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:58Z" level=warning msg="decryption/authentication faliure: cipher: message authentication failed" UID= encryptionMethod=0 proxyMethod= remoteAddr="CENSORED:36268" sessionId=0
Aug 07 22:06:59 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:59Z" level=warning msg="decryption/authentication faliure: cipher: message authentication failed" UID= encryptionMethod=0 proxyMethod= remoteAddr="CENSORED:37902" sessionId=0
Aug 07 22:07:07 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:07Z" level=info msg="Session closed" UID="UDP-CLIENT-UID==" reason=timeout sessionID=1248497831
Aug 07 22:07:07 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:07Z" level=info msg="Terminating active user" UID="UDP-CLIENT-UID==" reason="no session left"
Aug 07 22:07:41 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:41Z" level=info msg="New session" UID="UDP-CLIENT-UID==" sessionID=202665511
Aug 07 22:07:54 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:54Z" level=warning msg="error reading first packet: read error after connection is established: EOF" remoteAddr="CENSORED:50530"

P.P.S.:
Okay, it seems like limiting the buffers to exactly the same size (currently all set to the same 5 MB) for both send and receive on all ends, AND raising the nofile limit for client and server to something big (30k or so), massively improves things. I will let the test rig run overnight and see whether Cloak UDP seizes up or nah.
Okay @notsure2, after playing around with buffers a fair lot I have the following to report:

- NOFILE is sadly likely a placebo effect (makes sense: lsof shows very few file descriptors open by the Cloak client).
- Stalls become harder to reproduce but very, very much remain.
- TCP Cloak still stalls less frequently (I'd say I could push through a full workday without a stall in a realistic setting), but unlike with UDP, a peculiar phenomenon emerged with the TCP Cloak stall and the 5 MB buffer: it became possible to semi-reliably trigger a TCP Cloak stall by pushing the connection as hard as possible (two 720p 60 fps videos + a speedof.me test on loop + a ping running in the background do it for mine). After mere minutes of such abuse, TCP Cloak would stall and the connection would not recover until a client restart.

Cloak client logs remain somewhat unenlightening IMHO:

INFO[2023-08-08T14:39:07****] Attempting to start a new session

My non-programmer bellyfeel here is that Cloak (likely the Cloak client, because the server seems to be doing super fine) is silently running out of some resource and doesn't raise an error; it simply stops sending/receiving.
@LindaFerum I can't remember the exact details, as I found it ultimately useless: Cloudflare throttles websocket throughput on the free tier.
@notsure2 thanks! I'll try it.
No. Cloudflare shut off domain fronting, so a hostile ISP can zero in on your domain, sniff the SNI, and TCP-reset it.

I have experienced these stalls also, but I don't know exactly why they happen. My suspicion points to some kind of goroutine or scheduler issue. Cloak logging is not good and it is hard to interpret flows from it; it needs some improvement to become more legible.

From your log there might be some kind of dynamic blocking done by the ISP: the session that opens and immediately closes looks like it got a TCP reset from the ISP. Try to collect a Wireshark capture at the same time as the stalls happen and analyze the TCP flows to see if there's any abnormal behavior. Also try to see, by matching timestamps, whether any error is reported on the server side. When an ISP sends a TCP reset, it sends it to both sides to cut the connection from both ends, so the server will report this in its log. I think a TCP reset shows up as EOF.
I want to find some time to tackle this issue in depth, but sadly I don't have it these days.
I did consider this being an "ISP asshole thing", since I'm currently in a relatively hostile environment. I'm not good with Wireshark, but I may try using a "guaranteed no-asshole" satellite connection to see if Cloak-TCP will stall there.
Try NumConn=0. I implemented this specifically to avoid stalls, by removing the multiplexing and having each connection map to one short-lived session. This works reasonably OK for me.
Further reporting: will try NumConn=0 later.
Well, @notsure2, for what it's worth NumConn=0 does seem to help with stalls a lot, even on a "bad/sus" connection (which I think substantially rules out the previous stalls being ISP meddling). But it also affects performance a lot (compared to the blast I can get out of this connection with multiplexing and 5 MB buffers), so if something race-like is happening here, the improvement may simply be due to a lower probability of triggering the stall event. Overall I think I'll stick with multiplexing, 5 MB buffers and TCP mode for now.
I think I managed to get a reliable reproduction of this issue locally by creating artificial latency with https://github.com/Shopify/toxiproxy and making 10 parallel connections.
@cbeuw Awesome! Glad to know I caught an actual real bug and not just some mystery weirdness :)
For those who are interested, this is the bug:

Cloak has a concept of Streams. Each stream represents a TCP connection between the proxied application client and server (there is only one stream per session in UDP mode). As there are usually far fewer Cloak-client-to-Cloak-server TLS connections than there are Streams, data from each stream are interleaved in the TLS connections. Upon receiving some data, the Cloak server must figure out which stream the data need to go to, using metadata attached by the Cloak client, then send them to the corresponding TCP connection to the application server. This is done in

To prevent blocking, each stream has a buffer (

However, a while ago I made an optimisation by adding a

This issue is hard to reproduce because most of the time, writing to a TCP connection to an application appears non-blocking. Most apps will read data as they come, plus there's also a buffer in the OS in case the app isn't consuming data fast enough. Writing to a TCP socket only blocks when this OS buffer is full, which doesn't happen often.

But a stall will happen when one application has two incoming connections A and B, where data coming from A is only consumed when something is received on B. Data through A and B may be routed through the same TLS connection, with packets for A at the front and packets for B at the very back. Cloak may fill up A's receiving buffer, causing

Simply removing the faulty optimisation should fix the issue. @LindaFerum if you're happy to, could you let me know if the problem is fixed for you? I'll make a new release if you don't see any more problems. Pre-release with the fix here: https://github.com/cbeuw/Cloak/releases/tag/v2.7.1-pre
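A simplified sketch of the failure mode described above (hypothetical names, not Cloak's actual types): writing straight to a destination TCP connection inside the demultiplexing loop can stall every stream sharing that TLS connection, whereas handing data to a per-stream buffer drained by its own goroutine keeps one slow consumer from freezing the others.

```go
package demuxsketch

import "net"

type frame struct {
	streamID uint32
	payload  []byte
}

// demuxDirect is the problematic shape: the single demux goroutine performs
// the Write itself. If the connection for one stream (say A) has a full OS
// send buffer, Write blocks, and frames destined for every other stream on
// this TLS connection (including B) stop being delivered.
func demuxDirect(frames <-chan frame, conns map[uint32]net.Conn) {
	for f := range frames {
		conns[f.streamID].Write(f.payload) // may block the whole loop
	}
}

// demuxBuffered is the safer shape: the demux loop only hands the payload to
// a per-stream queue; a dedicated goroutine per stream does the potentially
// blocking Write. (The channel send still blocks if that one stream's queue
// fills up; Cloak's real per-stream buffer is more elaborate than this.)
func demuxBuffered(frames <-chan frame, queues map[uint32]chan []byte) {
	for f := range frames {
		queues[f.streamID] <- f.payload
	}
}

// drainStream is the per-stream writer goroutine.
func drainStream(conn net.Conn, queue <-chan []byte) {
	for p := range queue {
		conn.Write(p) // blocking here only delays this one stream
	}
}
```

Read this way, the "remove the faulty optimisation" fix mentioned above amounts to going back to the buffered shape instead of the direct-write shortcut.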
@cbeuw awesome news, thanks! I'll try to test it out around next week, or maybe even on the weekend if I get lucky.
I think I'm still seeing this issue but in UDP. Will try to figure it out.
I'm still seeing this issue on TCP also, but it's rarer than before. I think this is not completely solved yet.
Any news on this one? (no rush, just checking back :) )
I wonder if this issue is related to goroutine ordering or scheduling.
Well, @notsure2, I guess it does feel like it.

Observation: when redirecting traffic to Cloak via shadowsocks, I get stalls far more easily when I use encryption:none on the shadowsocks side (shadowsocks-rust allows that out of the box). Switching to aes-128-gcm or whatever reduces the number of Cloak hiccups.

It would appear that shadowsocks spending extra cycles encrypting/decrypting stuff introduces additional latency that makes it harder to induce a scheduling fault somewhere in the goroutines. Meanwhile, with encryption:none and high load, shadowsocks fires a veritable firehose of packets, and that does something bad scheduling-wise on the Cloak/goroutine side. Makes sense?
Yes, seems to make sense to me.
Maybe, given that the thing still happens, the issue should be reopened (for proper recordkeeping and tracking and such)? @cbeuw
Reopened. I can still reproduce this issue locally if I fully load a large number of streams down a small number of TLS connections. I suspect it has something to do with goroutine scheduling when a TCP connection's buffer is full.
Hey @cbeuw @notsure2, sorry to be a useless nagging needy user :-) but any progress on this one?
@cbeuw maybe you need to take the scheduling into your own hands... e.g. have an infinite-loop goroutine and queue functions onto it to control the order of execution.
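As a concrete sketch of that suggestion (purely an assumption about one possible design, not how Cloak is structured today): a single long-lived worker goroutine executes queued functions strictly in submission order, so the order of operations no longer depends on how the Go scheduler interleaves goroutines.

```go
package executorsketch

type executor struct {
	tasks chan func()
}

// newExecutor starts the single worker goroutine.
func newExecutor(queueLen int) *executor {
	e := &executor{tasks: make(chan func(), queueLen)}
	go func() {
		for task := range e.tasks {
			task() // tasks run one at a time, in the order they were queued
		}
	}()
	return e
}

// submit enqueues work; it blocks when the queue is full, which doubles as
// back pressure on whoever is producing work too fast.
func (e *executor) submit(task func()) {
	e.tasks <- task
}
```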
The blocking happens inside IO calls to

Though @LindaFerum, I did figure out this to be part of the problem:

Cloak/internal/multiplex/switchboard.go, lines 82 to 101 (at 5988b43)
I made an update to always spread the load out evenly across all TLS conns instead. From local testing it appears more resilient. I could still break the connection (with an error, not silently) if I open, say, 100 connections at once, though this should be unlikely in real scenarios. This is released in https://github.com/cbeuw/Cloak/releases/tag/v2.9.0
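A rough illustration of what "spread the load out evenly" could look like (type and field names are illustrative, not the actual switchboard code): pick the underlying TLS connection round-robin for each send instead of favouring whichever connection was used last.

```go
package rrsketch

import (
	"net"
	"sync/atomic"
)

type pool struct {
	conns []net.Conn
	next  atomic.Uint64
}

// pick cycles through the connections so traffic from many streams is spread
// evenly across the pool rather than piling onto a single connection.
func (p *pool) pick() net.Conn {
	i := p.next.Add(1) - 1
	return p.conns[int(i%uint64(len(p.conns)))]
}
```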
FWIW I can still trigger it with UDP, it seems, and it stalls pretty heavily when that happens. Modifying the keepalive (as suggested in the mention above) does not seem to help much.

@cbeuw would it perhaps be possible to add some more aggressive detection of the stall event and an "under the hood" stall-clearing process, ideally separate from keepalives proper (not that keepalives are helping much, like I said), so that Cloak could recover somewhat gracefully and automatically from this stall issue?
Symptoms are basically that after a while the connection just "dies" without any ability to reconnect (eventually a "Failed to establish new connections to remote" error) unless the Cloak client is restarted.
It also affects UDP (more obvious there, because you can just see the healthy "read from stream 1 with err" messages stop with no further events), but it's the first time for me that TCP Cloak got completely disabled.
NOTABLE FACT:
Connecting the phone to my laptop over USB solves the problem utterly and completely; both TCP and UDP Cloak start working flawlessly.
I initially suspected TTL-based throttling or operator filtering, but:

- I think it is not TTL-based throttling, because the problem has been replicated by a colleague in a decent EU country on a connection where mobile hotspot is explicitly enabled and billed separately.
- Also, in most cases mobile operators detect USB tethering as well, yet Cloak works perfectly for me when the connection is shared over USB.
- I doubt it is local hostiles somehow detecting Cloak, because of the aforementioned replication by a colleague in the EU, and because the connection from here works over USB.
It has been replicated on three different Android phones (a Pixel 5 with CalyxOS, some no-name brick with LineageOS, and a Xiaomi unit).
The connection looks like this when breaking hard:
Laptop with Cloak (linux)
|
|wifi (android hotspot)
|
Android Phone
|
internet
And it looks like this when working perfectly:
Laptop with Cloak (linux)
|
|USB wire
|
Android Phone
|
internet