Set Receive/Send socket timeouts appropriately #317
Conversation
This is one of the issues I found "by the way" while looking for the real cause of **[this](JustArchiNET/ArchiSteamFarm#318)** bug. This report is not connected with the ASF one in any way.

I've implemented a custom HeartBeat() function that executes `await SteamApps.PICSGetProductInfo(0, null);` every minute, and if the request times out while `SteamClient.IsConnected` is still true, the function executes `SteamClient.Connect();` in order to initiate a reconnect sooner than SK2 would find out on its own. This uncovered some weird behaviour when network stability is not perfect:

```
2016-08-31 09:06:48|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-31 09:06:48|INFO|archi|OnDisconnected() Reconnecting...
2016-08-31 09:06:58|INFO|archi|OnConnected() Connected to Steam!
2016-08-31 09:06:58|INFO|archi|OnConnected() Logging in...
2016-08-31 09:07:33|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-31 09:08:33|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
> (repeat above 15 more times in 1-minute interval)
2016-08-31 09:24:33|WARN|archi|HeartBeat() Connection to Steam Network lost, reconnecting...
2016-08-31 09:24:50|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-31 09:24:50|INFO|archi|OnDisconnected() Disconnected from Steam!
> (repeat above 15 more times)
2016-08-31 09:24:50|INFO|archi|OnDisconnected() Disconnected from Steam!
2016-08-31 09:24:50|INFO|archi|OnConnected() Connected to Steam!
2016-08-31 09:24:50|INFO|archi|OnConnected() Logging in...
2016-08-31 09:24:51|INFO|archi|OnLoggedOn() Successfully logged on!
```

This looked weird to me: we sent the last request to the Steam network around ~09:07, then made around ~15 `SteamClient.Connect()` calls, and finally received all callbacks at the same moment, eventually connecting at ~09:25. It's possible that the network was down for a short while, but definitely not for more than half a minute.

We can also analyze that in nethook messages:

![image](https://cloud.githubusercontent.com/assets/1069029/18123334/b3eac1f8-6f6d-11e6-9379-f63d77a136d4.png)

So what happened: for some reason Steam sent us `ChannelEncryptRequest` without a previous graceful disconnection (you can notice our heartbeats sent by SK2 a while before). We handled that properly and sent `ClientLogon` in 35901 (09:06:58). We didn't get any response for nearly ~20 minutes, and the next `in` packet is again `ChannelEncryptRequest` in 35920 (09:24:50).

Debug log of SK2 from packet 35894 to 35924 can be found below:

```
CMClient | <- Recv'd EMsg: ClientClanState (822) (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
CMClient | Sent -> EMsg: ClientPICSProductInfoRequest (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
CMClient | Sent -> EMsg: ClientHeartBeat (Proto: True)
TcpConnection | Socket exception occurred while reading packet: System.IO.IOException: Unable to read data from the transport connection: Connection timed out. ---> System.Net.Sockets.SocketException: Connection timed out
  at System.Net.Sockets.Socket.Receive (System.Byte[] buffer, System.Int32 offset, System.Int32 size, System.Net.Sockets.SocketFlags socketFlags) <0x40eaad20 + 0x001a3> in <5cf7305791db4d5da2dbb66cb115547f>:0
  at System.Net.Sockets.NetworkStream.Read (System.Byte[] buffer, System.Int32 offset, System.Int32 size) <0x40e8a890 + 0x000c7> in <5cf7305791db4d5da2dbb66cb115547f>:0
  --- End of inner exception stack trace ---
  at System.Net.Sockets.NetworkStream.Read (System.Byte[] buffer, System.Int32 offset, System.Int32 size) <0x40e8a890 + 0x002b7> in <5cf7305791db4d5da2dbb66cb115547f>:0
  at System.IO.BinaryReader.ReadBytes (System.Int32 count) [0x00041] in <f24f53297d1d465396b75be74d37e900>:0
  at SteamKit2.TcpConnection.ReadPacket () <0x40eded20 + 0x000f8> in <22a5bf0f0df5491bb43e46408f01ce2f>:0
  at SteamKit2.TcpConnection.NetLoop () <0x40eddfe0 + 0x0017b> in <22a5bf0f0df5491bb43e46408f01ce2f>:0
ServerList | Next server candidiate: 146.66.152.10:27020
TcpConnection | Connecting to 146.66.152.10:27020...
TcpConnection | Connection to 146.66.152.10:27020 cancelled by user
> (above 3 lines repeat for each of the ~15 Connect() attempts)
ServerList | Next server candidiate: 146.66.152.10:27020
TcpConnection | Connecting to 146.66.152.10:27020...
TcpConnection | Connected to 146.66.152.10:27020
CMClient | <- Recv'd EMsg: ChannelEncryptRequest (1303) (Proto: False)
CMClient | Got encryption request. Universe: Public Protocol ver: 1
CMClient | Sent -> EMsg: ChannelEncryptResponse (Proto: False)
CMClient | <- Recv'd EMsg: ChannelEncryptResult (1305) (Proto: False)
CMClient | Encryption result: OK
CMClient | Sent -> EMsg: ClientLogon (Proto: True)
CMClient | <- Recv'd EMsg: Multi (1) (Proto: True)
CMClient | <- Recv'd EMsg: ClientServersAvailable (5501) (Proto: True)
```

All cancellations are most likely done by `SteamClient.Disconnect()` executed by `SteamClient.Connect()` after the client returned from nearly ~20 minutes of death. (Don't worry about the extra `ClientHeartBeat`s, I have multiple Steam clients running; those are misleading.)

---

**So what happened?**

I think that this incident is caused by the infinite timeout set by default on the TCP socket's Receive/Send functions. A UDP connection is automatically considered "dead" after 60 seconds, but with a TCP one we could in theory keep reading a given packet until something interrupts us, probably the OS, and that's why the whole SK2 communication blocks when such an incident happens: the exception is thrown nearly ~20 minutes after the TCP connection should in fact have been considered dead.

I'm not yet sure what values we should use; for now I stuck with the user-provided timeout. While it works for me, I ask you to test it yourself and review the change, because I'm not that familiar with low-level SK2 internals, and it's quite possible that it might introduce unwanted behaviour. Otherwise, I think that this change is alright.

It's also possible that the culprit is in fact somewhere else and I got misled - in that case I'd appreciate it if somebody else could look into this issue, because I'm trying to help most of the time, but I'm definitely not as experienced with SK2 as you are.
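For context, the change described above boils down to giving the TCP socket finite blocking timeouts instead of the default infinite ones. A minimal sketch of the idea follows; the 60-second value is purely illustrative, since the actual change reuses the user-provided timeout rather than a hard-coded number:

```csharp
using System.Net.Sockets;

// Sketch only, not the exact SK2 change. Socket.ReceiveTimeout/SendTimeout apply to the
// synchronous Receive/Send calls that TcpConnection uses (see the NetworkStream.Read frames
// in the stack trace above); the default of 0 means "block until the OS gives up".
var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
{
    ReceiveTimeout = 60000, // milliseconds - illustrative value only
    SendTimeout = 60000,
};
```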
Oh yeah, Mono 4.6 is broken as well, so don't expect that one to succeed in its current state: #318
What does setting Socket timeouts do in this scenario? I see that TcpConnection did indeed throw an exception and disconnect, because your heartbeat function detected IsConnected is false. The fact that IsConnected is being set to false means that nothing is stalled in Send either, so I don't understand what the timeouts would have changed. This simply appears to be a stall in your callback handler, such as a blocking call.
How can it be a blocking call lasting ~15 minutes if all calls have a typical timeout of 10 seconds? All I'm doing is awaiting (asynchronously) the result of PICS, which should throw an exception within at most 10 seconds of calling. Heartbeats are executed once per minute. My HeartBeat() function can be found declared here. If it's not the TCP timeout, then something else is blocking.
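For reference, the heartbeat described in this thread amounts to something like the following. This is a minimal sketch; the class and field names, handler wiring, and lack of logging are assumptions, not the actual ASF implementation:

```csharp
using System.Threading.Tasks;
using SteamKit2;

// Hypothetical bot class illustrating the HeartBeat() approach discussed above.
class Bot
{
    private readonly SteamClient SteamClient;
    private readonly SteamApps SteamApps;

    public Bot(SteamClient client)
    {
        SteamClient = client;
        SteamApps = client.GetHandler<SteamApps>();
    }

    // Called once per minute by a timer.
    private async Task HeartBeat()
    {
        if (!SteamClient.IsConnected)
        {
            return;
        }

        try
        {
            // Any response proves the connection is alive; a timed-out request throws instead.
            await SteamApps.PICSGetProductInfo(0, null);
        }
        catch
        {
            // The request failed while SteamClient still claims to be connected:
            // force a reconnect rather than waiting for SK2 to notice on its own.
            SteamClient.Connect();
        }
    }
}
```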
Also, don't get me wrong - I'm simply trying to locate the culprit that is preventing me from forcing a reconnect of
What exception are you catching from SteamApps.PICSGetProductInfo to detect that the connection is dead?
Any. The objective of that function is to detect when
It would really be helpful if you could capture a stack trace to confirm the issue, as in, while it's in whatever broken state. What I believe is happening: the exception is thrown correctly, TcpConnection tries to Release, but netLock is held by a Send call. If this is the case, then SendTimeout on the socket is the correct answer. This would also imply that your callback thread is stalled, and therefore you aren't getting any callbacks, the IsConnected state doesn't change, etc.
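To make that hypothesis concrete, here is a deliberately simplified sketch of the suspected interaction. This is not SteamKit2's actual TcpConnection code; the member names merely echo the ones mentioned in this thread:

```csharp
using System.Net.Sockets;

// Simplified illustration of the suspected stall: a Send blocked forever on a dead TCP link
// holds netLock, so the read loop that just caught its exception cannot acquire the lock
// to tear the connection down.
class StalledConnection
{
    private readonly object netLock = new object();
    private readonly Socket socket;

    public StalledConnection(Socket connectedSocket)
    {
        socket = connectedSocket;
        // Socket defaults: SendTimeout == 0 and ReceiveTimeout == 0 mean "block forever".
    }

    public void Send(byte[] data)
    {
        lock (netLock)
        {
            // On a dead link with an infinite SendTimeout this can block for many minutes,
            // keeping netLock held the whole time.
            socket.Send(data);
        }
    }

    public void OnReceiveFailed()
    {
        // The read loop already threw, but cleanup also needs netLock, so the disconnect
        // (and any queued Connect/Disconnect) waits until the OS finally aborts the stuck Send.
        lock (netLock)
        {
            socket.Close();
        }
    }
}
```

Under this reading, a finite SendTimeout turns the stuck Send into a SocketException, which releases netLock and lets the disconnect proceed.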
But even then I'm not totally convinced, since you're able to call Connect/Disconnect, which also acquire the lock - which is why I still suspect that you have a blocking call on your callback thread.
I wouldn't be so sure; it's totally possible that all those calls are waiting for the lock - HeartBeat() is executed by a Timer, and the Timer doesn't care in this case about function completion, so it's totally possible that I have 15
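As an aside on that point: System.Threading.Timer fires on schedule regardless of whether the previous callback has returned, so blocked HeartBeat() calls can indeed pile up. The sketch below illustrates this, plus one hypothetical re-entrancy guard; none of it is the actual ASF code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustration: the timer ticks every minute whether or not the previous tick has finished,
// so if HeartBeat() blocks (e.g. on a lock inside SK2), unfinished calls accumulate.
class HeartBeatScheduler
{
    private Timer timer;
    private int heartBeatInProgress; // 0 = idle, 1 = running

    public void Start()
    {
        timer = new Timer(async _ => await HeartBeat(), null, TimeSpan.Zero, TimeSpan.FromMinutes(1));
    }

    private async Task HeartBeat()
    {
        // Skip this tick if the previous one is still stuck, capping the pile-up at one call.
        if (Interlocked.Exchange(ref heartBeatInProgress, 1) == 1)
        {
            return;
        }

        try
        {
            await Task.Delay(TimeSpan.FromSeconds(10)); // stand-in for the real PICS request
        }
        finally
        {
            Interlocked.Exchange(ref heartBeatInProgress, 0);
        }
    }
}
```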
Possible. There is not much I can do in this case though; I only made sure that I'm running the callback loop as long as I have KeepRunning enabled, and I set it to false only in
I know, I'm trying very hard to catch it in a debugger, but it's such a rare networking issue that I can't reproduce it reliably. For the last 2-3 weeks I've been trying blindly to find out what is wrong, because those are some very short network downtimes/interruptions. I even tested without
In fact, this issue happened for the first time after a week of tests, and I'm running my project 24/7, so this is something like my 3rd try to locate the culprit and fix it. Probably I'll eventually give up and just add proper code that takes a whole process dump when it detects such a thing, but even detecting it programmatically is not easy. I deeply appreciate your help. I'm currently testing SK2 with my own change from this PR to see if it helps or not, and so far it's working, but it'll probably take a month or so to state at least semi-reliably that the issue is gone.
The DebugLog doesn't have any timestamps, so it's difficult to say when things are happening. If I understand correctly, you're saying all the Connect calls happened at the same time. I'm hoping that the actual Exception was thrown much earlier. If the case is that we are stalled on a Send call holding back the actual Release call, then I do agree that the infinite timeout is probably to blame, and that most stacks are probably blocked on netLock.
I'm merging this since there is a reasonable suspicion.
I think so too; indeed, I didn't think of timestamps initially. I added them earlier today after posting this PR, so if it ever happens again, I'll have a more precise log now. For now I'm playing a waiting game for the issue to come back to life. TBH I think that even if setting timeouts doesn't fix the initial issue, it'll for sure make SK2 aware of a dead connection sooner, which will allow me to remove that experimental workaround and handle the connection statuses where they should be handled - in SK2. The whole objective is to minimize downtime caused by network issues - if the timeout value is too low, we can always bump it to a more appropriate value, but leaving it infinite is usually a bad choice.
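For what it's worth, prefixing SK2's DebugLog output with timestamps can look roughly like this. This is a sketch assuming the DebugLog/IDebugListener listener API; the actual wiring in the project may differ:

```csharp
using System;
using SteamKit2;

// Hypothetical listener that prepends a timestamp to every DebugLog entry, so future
// logs of this kind show when each line happened.
class TimestampedDebugListener : IDebugListener
{
    public void WriteLine(string category, string msg)
    {
        Console.WriteLine($"{DateTime.Now:yyyy-MM-dd HH:mm:ss}|{category}|{msg}");
    }
}

// During startup:
// DebugLog.AddListener(new TimestampedDebugListener());
// DebugLog.Enabled = true;
```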
It probably won't interest anybody anymore, but I haven't been able to reproduce this issue since setting up those timeouts - either it's a miracle, or indeed something got fixed/improved in a way that it no longer annoys me. Thanks!