select sometimes block forever on a ready TCP sockets #419

xilun · 2016-05-23T22:04:51Z

Hi,

Attached are two straces captured, while developing cbwin, with:

$ strace -tt -o select_block.txt wcmd dir /s c:\\windows | cat

select_block.txt

pselect_block.txt

They show that select() (or pselect(), on a dev version of my tool) sometimes block forever (see at the end, I then interrupt the program with CTRL-C) on ready sockets. The peer has no reason to block forever at this point, and both > dir /s c:\\windows and $ wcmd dir /s c:\\windows work correctly until the end. With | cat the problem occur very far from the expected termination (several MB of data should still be transmitted). The problem does not occur with | wc (at least the few dozen of times I ran it) and straces with | wc shows that read() never returns 16384 bytes in this case, implying TCP buffers are always quite empty. OTOH | cat is probably slowed down by console display, and somehow the end result is this unexpected blocking.

Please note that when it blocks, the last read/write syscalls before show the following pattern:

21:07:23.289414 read(5, "236F_FC70_11D3_A536_0090278A1BB8"..., 16384) = 16384
21:07:23.289414 write(1, "236F_FC70_11D3_A536_0090278A1BB8"..., 16384) = 16384
21:07:23.289414 read(5, "soft_WindowsPhone_CPP_Version_x8"..., 16384) = 16384
21:07:23.289414 write(1, "soft_WindowsPhone_CPP_Version_x8"..., 16384) = 16384
21:07:23.289414 read(5, "       309\302\240326 ProductIcon.ico\r"..., 16384) = 16384
21:07:23.289414 write(1, "       309\302\240326 ProductIcon.ico\r"..., 16384) = 16384
21:07:23.383613 read(5, 0x6081c8, 16384) = -1 EAGAIN (Resource temporarily unavailable)
21:07:23.383613 select(6, [5], [], NULL, NULL) = -1 EINTR (Interrupted system call)
21:09:00.149997 --- SIGINT {si_signo=SIGINT, si_code=SI_KERNEL} ---

[5] is the socket, connected to the Win32 side, that causes problems.
Blocks of 16384 (my buffer size) are completely filled and output to stdout which is the pipe connected to cat, until read() abruptly returns -1 EAGAIN. This in itself seems a bit odd - to transition from a filled TCP buffer to exactly empty by reads of complete blocks of 16384 bytes - but not impossible, although the problem seems to only occur with this pattern and there are various non problematic occurrences of < 16384 bytes reads followed by -1 EAGAIN sooner.

My reproducibility is 100%. I'm on build 14342. When it happens both wcmd and cmd.exe processes are not in a suspended state but use exactly 0% of the CPU. When not stracing, gdb can attach to wcmd and confirms it is blocked in select/pselect. At the same time, the stack of cmd.exe is:

ntdll.dll!ZwWriteFile+0x14
KERNELBASE.dll!WriteFile+0x88
cmd.exe+0xe3a7
cmd.exe+0x12fb5
cmd.exe+0x2328c
cmd.exe+0x1d310
cmd.exe+0xf277
cmd.exe+0xd569
cmd.exe+0xd827
cmd.exe+0xfab9
cmd.exe+0x11957
cmd.exe+0x119bd
cmd.exe+0xb9d9
cmd.exe+0xb295
cmd.exe+0x53b2
cmd.exe+0x1531d
KERNEL32.DLL!BaseThreadInitThunk+0x14
ntdll.dll!RtlUserThreadStart+0x21

I've not yet tried to isolate whether or not Win32 redirection of standard handles to sockets plays a role in this situation, nor reduce how to trigger the bug to smaller programs, however please note that this time the Winsock handle remains open in outbash.exe that called CreateProcess("cmd") until after cmd.exe terminates (contrary to #347 ), and I've plenty of evidence that shows that select/pselect() (or something beneath it) is actually misbehaving.
Although not equivalent at all, a quick test of "cmd /C dir /s c:\windows | cat" under Cygwin64 under the windows console shows no problem -- this let me think that at least it's not just cmd.exe that got upset by any redirections...

The text was updated successfully, but these errors were encountered:

sunilmut · 2016-05-24T05:28:06Z

@xilun - Here is my read of the problem. Please correct me if it doesn't sound right. You have reasons to believe that read on WSL TCP sockets returns EAGAIN prematurely (or select blocks), while you are expecting read to return more (> 0) data? The other end of the socket is still open. We recently fixed a bug in the socket EPOLL where a select (EPOLLIN) would block on a disconnected TCP socket, if a subsequent read drained all the data (see #43). But, that is only for disconnected sockets, which does not seem to be the case here.

xilun · 2016-05-24T10:20:34Z

You have reasons to believe that read on WSL TCP sockets returns EAGAIN prematurely (or select blocks), while you are expecting read to return more (> 0) data? The other end of the socket is still open.

Yes. However the local end is half-closed with WSL shutdown(sock, SHUT_WR), because this socket is not used in this direction. Symmetrically, the remote end is half-closed with Winsock shutdown(sock, SD_RECEIVE);

We recently fixed a bug in the socket EPOLL where a select (EPOLLIN) would block on a disconnected TCP socket, if a subsequent read drained all the data

In my case I have no subsequent read that drained all the data, however I've got a read just before that might have completely emptied the TCP buffer at this precise moment, or bugged.

only for disconnected sockets, which does not seem to be the case here.

Well, like I said I'm half-closed, if that can have any impact.

aseering · 2016-05-24T16:52:27Z

This sounds like it might be related to #324 ?

xilun · 2016-05-24T17:15:34Z

@aseering I have no idea if it is related, the straces show both similarities and differences (your last read is not a "whole block")

xilun · 2016-05-24T21:54:56Z

I just launched my caller process from a Linux VM instead of WSL (privately patching a few INADDR_LOOPBACK to INADDR_ANY and so over just for this test) and can confirm everything works correctly (I can post complete straces if needed). In the same outbash.exe session I launched the same test from WSL and it failed as usual.

That's one more clue there is a bug in WSL select.

xilun · 2016-05-24T23:05:33Z

Now I just launched my caller process from Cygwin64, still connected to the same outbash.exe session, and everything also works correctly. Just to be sure I reran it from WSL once again and it failed as usual.

At this point I thought: why not gratuitously increase the inception level, just because I can; so using only a single console I did:

xilun@WINVM:/mnt/c/Users/xilun/Documents/Projects/cbwin/caller$ /usr/local/bin/wcmd --env OUTBASH_PORT=7777  'wcmd.exe dir /s c:\windows | C:\cygwin64\bin\cat.exe'

which, only doing redirections in win32 to cygwin sockets, unsurprisingly worked, while:

xilun@WINVM:/mnt/c/Users/xilun/Documents/Projects/cbwin/caller$ /usr/local/bin/wcmd --env OUTBASH_PORT=7777  'wcmd.exe dir /s c:\windows | C:\cygwin64\bin\cat.exe' | cat

failed the usual WSL way.

xilun · 2016-05-24T23:24:12Z

@sunilmut I might have previously misread one of your sentence.

We recently fixed a bug in the socket EPOLL where a select (EPOLLIN) would block on a disconnected TCP socket, if a subsequent read drained all the data.

To buffer do seem to be emptied on this half-closed connection after it has been half-closed because this is actually performed right after connect/accept, before any data is transmitted. However the traces show the buffer is emptied several times before the bug, and in those case everything continues to work as expected (select unblocks ASA more data is available, that is at max after a few ms given this test)

The buffer is not emptied after a complete close, though, because we are several MB away from the end and I also checked that the socket is still in CLOSE_WAIT and FIN_WAIT_2 when it is improperly blocked.

xilun · 2016-05-24T23:49:39Z

As a workaround for now I have temporarily disabled half-closure of sockets in cbwin with commit xilun/cbwin@51902d6

My first tests worked, with strace that shows plenty of full read/write of blocks of 16384 bytes, so that workaround seems to work.

sunilmut · 2016-05-24T23:50:46Z

@xilun - Can you give me clear repro steps for this? More and more fingers are pointing towards a bug in WSL. I would love to get to the bottom of this.

xilun · 2016-05-25T00:13:27Z

Ok so on build 14342 either build cbwin from source prior to the workaround, for example using https://github.com/xilun/cbwin/tree/5e32363f9230d218358fe90a33af536b8a53177f , or just use my 0.2 binaries https://github.com/xilun/cbwin/releases/tag/v0.2 (I just checked that the bug occurs with them)

To build from source you just need VS2015 for outbash.exe and gcc under WSL for wcmd. For outbash.exe I use cmake but there is no special magic, and you can just create a console project (unicode) and add all .cpp and .h if you don't want to use cmake.

wcmd is built from wrun.c in caller/ by ./build.sh

If you use my pre-built binaries, just unpack, launch outbash.exe and then you should already be in the good directory for:

$ ./wcmd dir /s c:\\windows | cat

It should freeze after a few seconds. At least it always did for me.

With the workaround xilun/cbwin@51902d6 that does not half-close the TCP redirections sockets it does not freeze anymore. (no pre-built binaries for now for this one)

xilun · 2016-05-27T20:24:13Z

note: this problem is still present on W10 build 14352

xilun · 2016-05-27T20:26:06Z

other note: binary release cbwin v0.3 includes the workaround

sunilmut · 2016-12-04T10:30:03Z

With the fix for #610 and #616, we have changed the way TCP sockets are now closed in WSL. Maybe, that resolves the issue here. The fix is not yet available in the Insider build, but, once it is, maybe you can give this one more go. I will ping this thread once the fix is in the Insider build.

sunilmut · 2017-01-16T19:32:23Z

@xilun - With #610 and #616 fixed in 15002, can you give your scenario a try? We would like to hear if your issue is resolved in 15002.

therealkenc · 2018-03-24T21:28:08Z

There was no reply to Sunil. Calling this fixedcreatorsupdate 15063 until proven otherwise.

benhillis added the network label May 24, 2016

xilun added a commit to xilun/cbwin that referenced this issue May 24, 2016

workaround for microsoft/WSL#419

51902d6

therealkenc closed this as completed Mar 24, 2018

therealkenc added the fixedincreatorsupdate label Mar 24, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

select sometimes block forever on a ready TCP sockets #419

select sometimes block forever on a ready TCP sockets #419

xilun commented May 23, 2016

sunilmut commented May 24, 2016

xilun commented May 24, 2016

aseering commented May 24, 2016 •

edited

Loading

xilun commented May 24, 2016

xilun commented May 24, 2016

xilun commented May 24, 2016

xilun commented May 24, 2016 •

edited

Loading

xilun commented May 24, 2016

sunilmut commented May 24, 2016

xilun commented May 25, 2016

xilun commented May 27, 2016

xilun commented May 27, 2016

sunilmut commented Dec 4, 2016

sunilmut commented Jan 16, 2017

therealkenc commented Mar 24, 2018

select sometimes block forever on a ready TCP sockets #419

select sometimes block forever on a ready TCP sockets #419

Comments

xilun commented May 23, 2016

sunilmut commented May 24, 2016

xilun commented May 24, 2016

aseering commented May 24, 2016 • edited Loading

xilun commented May 24, 2016

xilun commented May 24, 2016

xilun commented May 24, 2016

xilun commented May 24, 2016 • edited Loading

xilun commented May 24, 2016

sunilmut commented May 24, 2016

xilun commented May 25, 2016

xilun commented May 27, 2016

xilun commented May 27, 2016

sunilmut commented Dec 4, 2016

sunilmut commented Jan 16, 2017

therealkenc commented Mar 24, 2018

aseering commented May 24, 2016 •

edited

Loading

xilun commented May 24, 2016 •

edited

Loading