Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: server.maxConnections does not trigger error on some platforms #1885

Closed
Trott opened this issue Jun 3, 2015 · 27 comments
Closed

net: server.maxConnections does not trigger error on some platforms #1885

Trott opened this issue Jun 3, 2015 · 27 comments
Labels
confirmed-bug Issues with confirmed bugs. libuv Issues and PRs related to the libuv dependency or the uv binding. net Issues and PRs related to the net subsystem.

Comments

@Trott
Copy link
Member

Trott commented Jun 3, 2015

See #1881 for discussion and some details. Exceeding server.maxConnections on OS X (and probably others) does not trigger ECONNRESET (at least in some situations) but it seemingly should and it does on FreeBSD (and probably others).

@brendanashworth
Copy link
Contributor

FWIW that test also broke ARM, SmartOS and Windows, it isn't just FreeBSD.

@brendanashworth brendanashworth added the test Issues and PRs related to the tests. label Jun 3, 2015
@brendanashworth
Copy link
Contributor

Here is a refined test case that illustrates the bug. For me it always fails on FreeBSD 10.0 amd64, and does not on OSX 10.10.3.

var net = require('net');

var server = net.createServer();
// Hack to reject all connections
server.maxConnections = -1;

// Listen, connect (closed asap), but write in between.
server.listen(9876, function() {
  console.error('connecting');

  var connection = net.connect(9876, function() {
    var buffered = this.write('foo');

    console.error('sent message, data', buffered ? 'flushed' : 'queued');
  });

  connection.on('close', function() {
    server.close();

    console.error('this run did not error and die');
  });
});

Non-error run (OSX @ v2.2.1):

connecting
sent message, data flushed
this run did not error and die

Error run (FreeBSD @ 43a82f8):

connecting
sent message, data flushed
events.js:141
      throw er; // Unhandled 'error' event
            ^
Error: read ECONNRESET
    at exports._errnoException (util.js:846:11)
    at TCP.onread (net.js:540:26)

The error is emitted on the client connection. I think what is happening here is, since maxConnections closes a client after it connects, the client tries to write to the server (at which point the socket is open), but the server closes the socket before the data can be read into the server.

I think this is a bug on the platforms that fail with ECONNRESET. I'm fairly sure there is no way to get whether the data was actually sent to the server; socket.write takes a callback, but it doesn't seem to give an error.

@brendanashworth brendanashworth added confirmed-bug Issues with confirmed bugs. net Issues and PRs related to the net subsystem. and removed test Issues and PRs related to the tests. labels Jun 3, 2015
@Trott
Copy link
Member Author

Trott commented Jun 4, 2015

I installed FreeBSD 10.1 on VirtualBox on my Mac and am unable to replicate this either with the test code above or by reverting test/parallel/test-net-server-max-connections-close-makes-more-available.js. @santigimeno reported the bug on FreeBSD 10.1, so it's not a 10.0 vs. 10.1 issue. Maybe it's because I am running a VM on top of OS X and that somehow squashes the ECONNRESET?

So, flying somewhat blind because I am unable to replicate the problem (any assistance welcome) even though I have no doubt the problem is very real...

Likely relevant for any archeology on this:

Maybe @brendanashworth or @santigimeno (or someone else running a version of FreeBSD that exhibits this issue) can compile with that line changed to return self._destroy(); and see if the problem goes away? Not saying that's necessarily a good solution. Just trying to confirm that that is where the issue comes up.

Not sure if this is a case of "oh well, different operating systems will behave differently" or if there's a good reason to attempt a workaround.

For what it's worth, it looks like ruby may have implemented a workaround specifically for this.

@brendanashworth
Copy link
Contributor

@Trott fwiw I use the FreeBSD in https://github.com/brendanashworth/vms, just run setup.sh, CC=clang CXX=clang++ ./configure and make in /vagrant. I can reproduce there, even on OSX.

I think you should follow up what you're looking at and try your idea. That ruby patch looks like it's related, or at least feels like it. Too bad it doesn't mention any of the other OSs.

cc: @bnoordhuis for the earlier commit

@Trott
Copy link
Member Author

Trott commented Jun 4, 2015

Thanks, @brendanashworth. That worked for replicating the problem. And the change I suggested "solves" it but, you know, might break other tests. (test-child-process-fork-dgram and others hang and then give EADDRINUSE for me with and without the change, so something is up with the vagrant FreeBSD VM, or at least for my setup.)

@Trott
Copy link
Member Author

Trott commented Jun 5, 2015

This may be getting off the trail here, but hey @brendanashworth, when you set up your FreeBSD 10.0 via Vagrant with OS X as the host OS, does this succeed for you?

./iojs  test/parallel/test-child-process-fork-dgram.js

If not, does applying these two changes fix it?

--- a/test/parallel/test-child-process-fork-dgram.js
+++ b/test/parallel/test-child-process-fork-dgram.js
@@ -34,10 +34,12 @@ if (process.argv[2] === 'child') {
       server.on('message', function() {
         process.send('gotMessage');
       });
+      setInterval(function () {}, 1000);

     } else if (msg === 'stop') {
       server.close();
       process.removeListener('message', removeMe);
+      process.exit();
     }
   });

@brendanashworth
Copy link
Contributor

@Trott it seems to work fine via the test suite but not individually (hangs forever). It doesn't seem to be blocking the CI so maybe it is related to #1876.

@Trott
Copy link
Member Author

Trott commented Jun 5, 2015

Doesn't work for me in the suite or individually. And not temp file related. So not #1876-related, I don't think. Either way, if It's Just Me, then maybe I ought to just blow everything away, start over, and see if it goes away...

@Trott
Copy link
Member Author

Trott commented Jun 5, 2015

Bumping my VM from 1 CPU to 2 CPUs and fixed the test-child-process-fork-dgram issue for me. Hooray. Now back to the main issue at hand...

@Trott
Copy link
Member Author

Trott commented Jun 5, 2015

Er...um...OK...bumping myself from 1 CPU to 2 CPUs also made the test case above start passing for me on FreeBSD. Er...@brendanashworth, do you experience that too? Or is the test still failing for you on FreeBSD if you go from (presumably) a single CPU to 2?

@brendanashworth
Copy link
Contributor

@Trott erm. that fixes it. perhaps it is related to how libuv handles writes on various platforms async/sync? this made the bug twice as weird.

@santigimeno
Copy link
Member

I've run the test from @brendanashworth and it always fails in my FreeBSD box.
After running dtruss in the test I'm not sure it's a bug. Here's the relevant output:

51758/100478:   3772485      15     10 socket(0x2, 0x30000001, 0x0)      = 12 0
51758/100478:   3772522     529     26 connect(0xC, 0x7FFFFFFF5908, 0x10)       = 0 0
51758/100478:   3772573       4      0 sigprocmask(0x1, 0x7FFFFFFF6890, 0x0)        = 0x0 0
51758/100478:   3772615       8      4 kevent(0x5, 0x7FFFFFFF6980, 0x1)         = 2 0
51758/100478:   3772616       2      0 sigprocmask(0x2, 0x7FFFFFFF6960, 0x0)        = 0x0 0
51758/100478:   3772651      10      6 accept4(0xB, 0x0, 0x0)       = 13 0
51758/100478:   3772760      17     13 clock_gettime(0x4, 0x7FFFFFFF62F0, 0x0)      = 0 0
51758/100478:   3773074      17     13 clock_gettime(0x4, 0x7FFFFFFF62F0, 0x0)      = 0 0
51758/100478:   3773115     493     19 close(0xD)       = 0 0
51758/100478:   3773135       5      0 getsockopt(0xC, 0xFFFF, 0x1007)      = 0 0
51758/100478:   3776219     500     12 write(0xC, "foo\0", 0x3)         = 3 0
51758/100478:   3779578       2      0 sigprocmask(0x1, 0x7FFFFFFF6890, 0x0)        = 0x0 0
51758/100478:   3779584       7      5 kevent(0x5, 0x7FFFFFFF6980, 0x1)         = 2 0
51758/100478:   3779584       2      0 sigprocmask(0x2, 0x7FFFFFFF6960, 0x0)        = 0x0 0
51758/100478:   3779601       4      2 kevent(0x5, 0x7FFFFFFF6900, 0x1)         = 0 0
51758/100478:   3779653       5      1 read(0xC, "\0", 0x10000)         = -1 Err#54
51758/100478:   3782483       7      4 ioctl(0x2, 0x402C7413, 0x7FFFFFFF5AB0)       = 0 0
51758/100478:   3783853       7      3 ioctl(0x2, 0x402C7413, 0x7FFFFFFF5990)       = 0 0
51758/100478:   3783941    7807     86 open("/dev/tty\0", 0x100002, 0x0)        = 13 0
51758/100478:   3783964       7      3 fcntl(0xD, 0x12, 0x2)        = 2 0
51758/100478:   3784010       7      3 ioctl(0xD, 0x8004667E, 0x7FFFFFFF5A64)       = 0 0
51758/100478:   3784846      10      5 __sysctl(0x7FFFFFFF5228, 0x2, 0x7FFFFFFF5224)        = 0 0
51758/100478:   3784852       7      4 mmap(0x3A5B62873000, 0x200000, 0x0)      = 1653026816 0
51758/100478:   3784857       8      4 munmap(0x3A5B62873000, 0x8D000)      = 0 0
51758/100478:   3784859       3      1 __sysctl(0x7FFFFFFF5228, 0x2, 0x7FFFFFFF5224)        = 0 0
51758/100478:   3784861       3      1 munmap(0x3A5B62A00000, 0x73000)      = 0 0
51758/100478:   3784864       4      3 mmap(0x3A5B62900000, 0x100000, 0x3)      = 1653604352 0
51758/100478:   3786057       7      3 ioctl(0xD, 0x40087468, 0x7FFFFFFF5AE0)       = 0 0
51758/100478:   3786223      16     11 close(0xC)       = 0 0

After the server closes the connection close(0xD) = 0 it seems that the write(0xC, "foo\0", 0x3) = 3 syscall doesn't fail and that the client detects that the connection has been closed later on at the read(0xC, "\0", 0x10000) = -1syscall. I would say a ECONNRESET it's the correct error to return.

@santigimeno
Copy link
Member

Disregard the last comment, I don't think there should be an error emitted.

@santigimeno
Copy link
Member

Perhaps nread should have been uv.UV_EOF in the onread function https://github.com/nodejs/io.js/blob/ca9eb718fbf2cd2c60c7aeb3ca33413e17fcbbf0/lib/net.js#L537 , so the socket closes gracefully.

@brendanashworth
Copy link
Contributor

It feels to me like a libuv issue - but it looks to be the same as libuv/libuv#45, which was closed as "not a libuv issue". Perhaps take a look here too.

@bnoordhuis
Copy link
Member

There is a tiny platform-specific workaround libuv could do here (squelch the ECONNRESET if EV_EOF is set) but that won't work on all platforms.

brendanashworth added a commit to brendanashworth/libuv that referenced this issue Jun 17, 2015
On various platforms, behaviour in return values differs on the read(2)
syscall. On OSX (+ others), read() on a closed-ish socket will return an
EOF, while on FreeBSD (+ others), read() on the same closed-ish socket
will emit an ECONNRESET error.

This commit swallows the ECONNRESET error on the offending platforms to
instead emit an EOF, allowing behaviour to be consistent between
platforms.

Ref: nodejs/node#1885
@brendanashworth
Copy link
Contributor

@bnoordhuis taking a look, I think it could work on all platforms, but please correct me if I'm wrong.

<soliloquy>
The difference in behavior on these platforms is originating from the read(2) call. On OSX, it is returning EOF, while FreeBSD & friends err with an ECONNRESET. Here are the man files on this (man 2 read):

OSX:

[ECONNRESET] The connection is closed by the peer during a read attempt on a socket.

FreeBSD:

[ECONNRESET] The fd argument refers to a socket, and the remote socket end is forcibly closed.

Even though our friend from 2005 wasn't quite clear on his documentation, the behavior seems to show that FreeBSD tends toward the ECONNRESET more than OSX. Looking at the documentation shows that this error is a lot like EOF, except in io.js (and libuv!) it is emitted as an error, rather than gracefully handled as EOF is. I think this is where the bug lies.

If we apply this patch to the io.js tree, we can escape the ECONNRESET on all platforms and instead emit an EOF:

diff --git a/deps/uv/src/unix/stream.c b/deps/uv/src/unix/stream.c
index 7ad1658..73ec0cf 100644
--- a/deps/uv/src/unix/stream.c
+++ b/deps/uv/src/unix/stream.c
@@ -1139,6 +1139,9 @@ static void uv__read(uv_stream_t* stream) {
           uv__stream_osx_interrupt_select(stream);
         }
         stream->read_cb(stream, 0, &buf);
+      } else if (errno == ECONNRESET) {
+        uv__stream_eof(stream, &buf);
       } else {
         /* Error. User should call uv_close(). */
         stream->read_cb(stream, -errno, &buf);

With that patch, all ECONNRESET errors are instead emitted as an EOF, safeguarding the socket from the error on the misbehaving (?) platforms. This allows the test case to exit without error.

You can view the full commit here: brendanashworth/libuv@a302dda.
</soliloquy>

cc @saghul, would libuv be okay in carrying a patch like that?

@saghul
Copy link
Member

saghul commented Jun 17, 2015

@brendanashworth I don;t think I like that approach. EOF means the connection was closed cleanly, as in all data was read. If the application which calls close sets SO_LINGER with timeout 0, for example, ECONNRESET is produced on the other side, and we cannot be sure if they read all the data we sent.

Squelching it where possible, if EOF was already set, as @bnoordhuis pointed out here: #1885 (comment) could help, but I'm -1 on the general case.

@bnoordhuis
Copy link
Member

What @saghul said, it would hide real connection reset errors. I think checking for EV_EOF (or EPOLLRDHUP on Linux) is the best libuv can do here.

@santigimeno
Copy link
Member

I have given this a try: santigimeno/libuv@3e5eeb9 . Does it make any sense? I'm not familiar with libuv internals. It fixes the test case for me in FreeBSD but some libuv tests don't look good.

@santigimeno
Copy link
Member

PR here : libuv/libuv#403 updated with @bnoordhuis comments

@brendanashworth
Copy link
Contributor

The patch has landed: libuv/libuv@05a003a.

@santigimeno
Copy link
Member

Notice that it solves the issue on FreeBSD but not on Windows

@brendanashworth
Copy link
Contributor

Patch first available here: #2310

@Trott Trott added the libuv Issues and PRs related to the libuv dependency or the uv binding. label Mar 11, 2016
@Trott
Copy link
Member Author

Trott commented Jul 7, 2017

I see a libuv patch landed but was then reverted. There hasn't been any comments on this issue in nearly two years. Does anyone know if it's still an issue? Should it be closed? Or perhaps updated?

@Trott
Copy link
Member Author

Trott commented Dec 24, 2017

I'm going to close this dormant issue that I originally opened 2.5 years ago. 😮 It's not clear to me that it's a bug as opposed to a timing issue. No one seems to be actively working on it nor does it seem to be affecting users.

@Trott Trott closed this as completed Dec 24, 2017
@Trott
Copy link
Member Author

Trott commented Dec 24, 2017

(Of course, if you think I'm wrong to close it, re-open if GitHub lets you, or leave a comment indicating that you believe it should be re-opened.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
confirmed-bug Issues with confirmed bugs. libuv Issues and PRs related to the libuv dependency or the uv binding. net Issues and PRs related to the net subsystem.
Projects
None yet
Development

No branches or pull requests

5 participants