-
Notifications
You must be signed in to change notification settings - Fork 867
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
select sometimes block forever on a ready TCP sockets #419
Comments
@xilun - Here is my read of the problem. Please correct me if it doesn't sound right. You have reasons to believe that read on WSL TCP sockets returns EAGAIN prematurely (or select blocks), while you are expecting read to return more (> 0) data? The other end of the socket is still open. We recently fixed a bug in the socket EPOLL where a select (EPOLLIN) would block on a disconnected TCP socket, if a subsequent read drained all the data (see #43). But, that is only for disconnected sockets, which does not seem to be the case here. |
Yes. However the local end is half-closed with WSL shutdown(sock, SHUT_WR), because this socket is not used in this direction. Symmetrically, the remote end is half-closed with Winsock shutdown(sock, SD_RECEIVE);
In my case I have no subsequent read that drained all the data, however I've got a read just before that might have completely emptied the TCP buffer at this precise moment, or bugged.
Well, like I said I'm half-closed, if that can have any impact. |
This sounds like it might be related to #324 ? |
@aseering I have no idea if it is related, the straces show both similarities and differences (your last read is not a "whole block") |
I just launched my caller process from a Linux VM instead of WSL (privately patching a few INADDR_LOOPBACK to INADDR_ANY and so over just for this test) and can confirm everything works correctly (I can post complete straces if needed). In the same outbash.exe session I launched the same test from WSL and it failed as usual. That's one more clue there is a bug in WSL select. |
Now I just launched my caller process from Cygwin64, still connected to the same outbash.exe session, and everything also works correctly. Just to be sure I reran it from WSL once again and it failed as usual. At this point I thought: why not gratuitously increase the inception level, just because I can; so using only a single console I did:
which, only doing redirections in win32 to cygwin sockets, unsurprisingly worked, while:
failed the usual WSL way. |
@sunilmut I might have previously misread one of your sentence.
To buffer do seem to be emptied on this half-closed connection after it has been half-closed because this is actually performed right after connect/accept, before any data is transmitted. However the traces show the buffer is emptied several times before the bug, and in those case everything continues to work as expected (select unblocks ASA more data is available, that is at max after a few ms given this test) The buffer is not emptied after a complete close, though, because we are several MB away from the end and I also checked that the socket is still in CLOSE_WAIT and FIN_WAIT_2 when it is improperly blocked. |
As a workaround for now I have temporarily disabled half-closure of sockets in cbwin with commit xilun/cbwin@51902d6 My first tests worked, with strace that shows plenty of full read/write of blocks of 16384 bytes, so that workaround seems to work. |
@xilun - Can you give me clear repro steps for this? More and more fingers are pointing towards a bug in WSL. I would love to get to the bottom of this. |
Ok so on build 14342 either build cbwin from source prior to the workaround, for example using https://github.com/xilun/cbwin/tree/5e32363f9230d218358fe90a33af536b8a53177f , or just use my 0.2 binaries https://github.com/xilun/cbwin/releases/tag/v0.2 (I just checked that the bug occurs with them) To build from source you just need VS2015 for outbash.exe and gcc under WSL for wcmd. For outbash.exe I use cmake but there is no special magic, and you can just create a console project (unicode) and add all .cpp and .h if you don't want to use cmake. wcmd is built from wrun.c in caller/ by ./build.sh If you use my pre-built binaries, just unpack, launch outbash.exe and then you should already be in the good directory for:
It should freeze after a few seconds. At least it always did for me. With the workaround xilun/cbwin@51902d6 that does not half-close the TCP redirections sockets it does not freeze anymore. (no pre-built binaries for now for this one) |
note: this problem is still present on W10 build 14352 |
other note: binary release cbwin v0.3 includes the workaround |
There was no reply to Sunil. Calling this fixedcreatorsupdate 15063 until proven otherwise. |
Hi,
Attached are two straces captured, while developing cbwin, with:
select_block.txt
pselect_block.txt
They show that select() (or pselect(), on a dev version of my tool) sometimes block forever (see at the end, I then interrupt the program with CTRL-C) on ready sockets. The peer has no reason to block forever at this point, and both
> dir /s c:\\windows
and$ wcmd dir /s c:\\windows
work correctly until the end. With| cat
the problem occur very far from the expected termination (several MB of data should still be transmitted). The problem does not occur with| wc
(at least the few dozen of times I ran it) and straces with| wc
shows that read() never returns 16384 bytes in this case, implying TCP buffers are always quite empty. OTOH| cat
is probably slowed down by console display, and somehow the end result is this unexpected blocking.Please note that when it blocks, the last read/write syscalls before show the following pattern:
[5] is the socket, connected to the Win32 side, that causes problems.
Blocks of 16384 (my buffer size) are completely filled and output to stdout which is the pipe connected to
cat
, until read() abruptly returns -1 EAGAIN. This in itself seems a bit odd - to transition from a filled TCP buffer to exactly empty by reads of complete blocks of 16384 bytes - but not impossible, although the problem seems to only occur with this pattern and there are various non problematic occurrences of < 16384 bytes reads followed by -1 EAGAIN sooner.My reproducibility is 100%. I'm on build 14342. When it happens both wcmd and cmd.exe processes are not in a suspended state but use exactly 0% of the CPU. When not stracing, gdb can attach to wcmd and confirms it is blocked in select/pselect. At the same time, the stack of cmd.exe is:
I've not yet tried to isolate whether or not Win32 redirection of standard handles to sockets plays a role in this situation, nor reduce how to trigger the bug to smaller programs, however please note that this time the Winsock handle remains open in outbash.exe that called CreateProcess("cmd") until after cmd.exe terminates (contrary to #347 ), and I've plenty of evidence that shows that select/pselect() (or something beneath it) is actually misbehaving.
Although not equivalent at all, a quick test of "cmd /C dir /s c:\windows | cat" under Cygwin64 under the windows console shows no problem -- this let me think that at least it's not just cmd.exe that got upset by any redirections...
The text was updated successfully, but these errors were encountered: