Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

select sometimes block forever on a ready TCP sockets #419

Closed
xilun opened this issue May 23, 2016 · 15 comments
Closed

select sometimes block forever on a ready TCP sockets #419

xilun opened this issue May 23, 2016 · 15 comments
Labels

Comments

@xilun
Copy link

xilun commented May 23, 2016

Hi,

Attached are two straces captured, while developing cbwin, with:

$ strace -tt -o select_block.txt wcmd dir /s c:\\windows | cat

select_block.txt

pselect_block.txt

They show that select() (or pselect(), on a dev version of my tool) sometimes block forever (see at the end, I then interrupt the program with CTRL-C) on ready sockets. The peer has no reason to block forever at this point, and both > dir /s c:\\windows and $ wcmd dir /s c:\\windows work correctly until the end. With | cat the problem occur very far from the expected termination (several MB of data should still be transmitted). The problem does not occur with | wc (at least the few dozen of times I ran it) and straces with | wc shows that read() never returns 16384 bytes in this case, implying TCP buffers are always quite empty. OTOH | cat is probably slowed down by console display, and somehow the end result is this unexpected blocking.

Please note that when it blocks, the last read/write syscalls before show the following pattern:

21:07:23.289414 read(5, "236F_FC70_11D3_A536_0090278A1BB8"..., 16384) = 16384
21:07:23.289414 write(1, "236F_FC70_11D3_A536_0090278A1BB8"..., 16384) = 16384
21:07:23.289414 read(5, "soft_WindowsPhone_CPP_Version_x8"..., 16384) = 16384
21:07:23.289414 write(1, "soft_WindowsPhone_CPP_Version_x8"..., 16384) = 16384
21:07:23.289414 read(5, "       309\302\240326 ProductIcon.ico\r"..., 16384) = 16384
21:07:23.289414 write(1, "       309\302\240326 ProductIcon.ico\r"..., 16384) = 16384
21:07:23.383613 read(5, 0x6081c8, 16384) = -1 EAGAIN (Resource temporarily unavailable)
21:07:23.383613 select(6, [5], [], NULL, NULL) = -1 EINTR (Interrupted system call)
21:09:00.149997 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---

[5] is the socket, connected to the Win32 side, that causes problems.
Blocks of 16384 (my buffer size) are completely filled and output to stdout which is the pipe connected to cat, until read() abruptly returns -1 EAGAIN. This in itself seems a bit odd - to transition from a filled TCP buffer to exactly empty by reads of complete blocks of 16384 bytes - but not impossible, although the problem seems to only occur with this pattern and there are various non problematic occurrences of < 16384 bytes reads followed by -1 EAGAIN sooner.

My reproducibility is 100%. I'm on build 14342. When it happens both wcmd and cmd.exe processes are not in a suspended state but use exactly 0% of the CPU. When not stracing, gdb can attach to wcmd and confirms it is blocked in select/pselect. At the same time, the stack of cmd.exe is:

ntdll.dll!ZwWriteFile+0x14
KERNELBASE.dll!WriteFile+0x88
cmd.exe+0xe3a7
cmd.exe+0x12fb5
cmd.exe+0x2328c
cmd.exe+0x1d310
cmd.exe+0xf277
cmd.exe+0xd569
cmd.exe+0xd827
cmd.exe+0xfab9
cmd.exe+0x11957
cmd.exe+0x119bd
cmd.exe+0xb9d9
cmd.exe+0xb295
cmd.exe+0x53b2
cmd.exe+0x1531d
KERNEL32.DLL!BaseThreadInitThunk+0x14
ntdll.dll!RtlUserThreadStart+0x21

I've not yet tried to isolate whether or not Win32 redirection of standard handles to sockets plays a role in this situation, nor reduce how to trigger the bug to smaller programs, however please note that this time the Winsock handle remains open in outbash.exe that called CreateProcess("cmd") until after cmd.exe terminates (contrary to #347 ), and I've plenty of evidence that shows that select/pselect() (or something beneath it) is actually misbehaving.
Although not equivalent at all, a quick test of "cmd /C dir /s c:\windows | cat" under Cygwin64 under the windows console shows no problem -- this let me think that at least it's not just cmd.exe that got upset by any redirections...

@sunilmut
Copy link
Member

@xilun - Here is my read of the problem. Please correct me if it doesn't sound right. You have reasons to believe that read on WSL TCP sockets returns EAGAIN prematurely (or select blocks), while you are expecting read to return more (> 0) data? The other end of the socket is still open. We recently fixed a bug in the socket EPOLL where a select (EPOLLIN) would block on a disconnected TCP socket, if a subsequent read drained all the data (see #43). But, that is only for disconnected sockets, which does not seem to be the case here.

@xilun
Copy link
Author

xilun commented May 24, 2016

You have reasons to believe that read on WSL TCP sockets returns EAGAIN prematurely (or select blocks), while you are expecting read to return more (> 0) data? The other end of the socket is still open.

Yes. However the local end is half-closed with WSL shutdown(sock, SHUT_WR), because this socket is not used in this direction. Symmetrically, the remote end is half-closed with Winsock shutdown(sock, SD_RECEIVE);

We recently fixed a bug in the socket EPOLL where a select (EPOLLIN) would block on a disconnected TCP socket, if a subsequent read drained all the data

In my case I have no subsequent read that drained all the data, however I've got a read just before that might have completely emptied the TCP buffer at this precise moment, or bugged.

only for disconnected sockets, which does not seem to be the case here.

Well, like I said I'm half-closed, if that can have any impact.

@aseering
Copy link
Contributor

aseering commented May 24, 2016

This sounds like it might be related to #324 ?

@xilun
Copy link
Author

xilun commented May 24, 2016

@aseering I have no idea if it is related, the straces show both similarities and differences (your last read is not a "whole block")

@xilun
Copy link
Author

xilun commented May 24, 2016

I just launched my caller process from a Linux VM instead of WSL (privately patching a few INADDR_LOOPBACK to INADDR_ANY and so over just for this test) and can confirm everything works correctly (I can post complete straces if needed). In the same outbash.exe session I launched the same test from WSL and it failed as usual.

That's one more clue there is a bug in WSL select.

@xilun
Copy link
Author

xilun commented May 24, 2016

Now I just launched my caller process from Cygwin64, still connected to the same outbash.exe session, and everything also works correctly. Just to be sure I reran it from WSL once again and it failed as usual.

At this point I thought: why not gratuitously increase the inception level, just because I can; so using only a single console I did:

xilun@WINVM:/mnt/c/Users/xilun/Documents/Projects/cbwin/caller$ /usr/local/bin/wcmd --env OUTBASH_PORT=7777  'wcmd.exe dir /s c:\windows | C:\cygwin64\bin\cat.exe'

which, only doing redirections in win32 to cygwin sockets, unsurprisingly worked, while:

xilun@WINVM:/mnt/c/Users/xilun/Documents/Projects/cbwin/caller$ /usr/local/bin/wcmd --env OUTBASH_PORT=7777  'wcmd.exe dir /s c:\windows | C:\cygwin64\bin\cat.exe' | cat

failed the usual WSL way.

@xilun
Copy link
Author

xilun commented May 24, 2016

@sunilmut I might have previously misread one of your sentence.

We recently fixed a bug in the socket EPOLL where a select (EPOLLIN) would block on a disconnected TCP socket, if a subsequent read drained all the data.

To buffer do seem to be emptied on this half-closed connection after it has been half-closed because this is actually performed right after connect/accept, before any data is transmitted. However the traces show the buffer is emptied several times before the bug, and in those case everything continues to work as expected (select unblocks ASA more data is available, that is at max after a few ms given this test)

The buffer is not emptied after a complete close, though, because we are several MB away from the end and I also checked that the socket is still in CLOSE_WAIT and FIN_WAIT_2 when it is improperly blocked.

xilun added a commit to xilun/cbwin that referenced this issue May 24, 2016
@xilun
Copy link
Author

xilun commented May 24, 2016

As a workaround for now I have temporarily disabled half-closure of sockets in cbwin with commit xilun/cbwin@51902d6

My first tests worked, with strace that shows plenty of full read/write of blocks of 16384 bytes, so that workaround seems to work.

@sunilmut
Copy link
Member

@xilun - Can you give me clear repro steps for this? More and more fingers are pointing towards a bug in WSL. I would love to get to the bottom of this.

@xilun
Copy link
Author

xilun commented May 25, 2016

Ok so on build 14342 either build cbwin from source prior to the workaround, for example using https://github.com/xilun/cbwin/tree/5e32363f9230d218358fe90a33af536b8a53177f , or just use my 0.2 binaries https://github.com/xilun/cbwin/releases/tag/v0.2 (I just checked that the bug occurs with them)

To build from source you just need VS2015 for outbash.exe and gcc under WSL for wcmd. For outbash.exe I use cmake but there is no special magic, and you can just create a console project (unicode) and add all .cpp and .h if you don't want to use cmake.

wcmd is built from wrun.c in caller/ by ./build.sh

If you use my pre-built binaries, just unpack, launch outbash.exe and then you should already be in the good directory for:

$ ./wcmd dir /s c:\\windows | cat

It should freeze after a few seconds. At least it always did for me.

With the workaround xilun/cbwin@51902d6 that does not half-close the TCP redirections sockets it does not freeze anymore. (no pre-built binaries for now for this one)

@xilun
Copy link
Author

xilun commented May 27, 2016

note: this problem is still present on W10 build 14352

@xilun
Copy link
Author

xilun commented May 27, 2016

other note: binary release cbwin v0.3 includes the workaround

@sunilmut
Copy link
Member

sunilmut commented Dec 4, 2016

With the fix for #610 and #616, we have changed the way TCP sockets are now closed in WSL. Maybe, that resolves the issue here. The fix is not yet available in the Insider build, but, once it is, maybe you can give this one more go. I will ping this thread once the fix is in the Insider build.

@sunilmut
Copy link
Member

@xilun - With #610 and #616 fixed in 15002, can you give your scenario a try? We would like to hear if your issue is resolved in 15002.

@therealkenc
Copy link
Collaborator

There was no reply to Sunil. Calling this fixedcreatorsupdate 15063 until proven otherwise.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants