
Windows implementation consumes lots of memory (linux doesn't) #439

Closed
sateffen opened this issue Jul 20, 2016 · 19 comments
Labels
enhancement, windows (Related to the Windows OS)
Milestone

Comments

@sateffen
Copy link

Hey,

As mentioned in #415, Windows consumes a huge amount of memory for a simple echo server with lots of connections (5-10k) sending and receiving data.

In this simple project, creating around 10k connections consumes ~700 MB of RAM; on a Linux machine it takes around 7 MB. After all connections are closed, the memory is freed again.

To reproduce:

  • Clone the repo
  • open a shell and execute cargo run
  • open a shell and execute node test.js

Setup:

  • Windows 10
  • Node 6.2.2
  • Rust 1.8.0
  • mio 0.5.1

This will spin up 10k connections to localhost:8888 (the waiting rust echo server).

I'm experiencing this on a Windows 10 laptop, simply observed with the Task Manager. As Linux references I've used a private root server and a Cloud9 machine; neither of them showed this behaviour.

As soon as I've got some time I'll update my compiler and try to create some simpler code to reproduce, but currently my time is somewhat limited.

If any questions are left, just ask :)

Cheers

@alexcrichton
Copy link
Contributor

Windows is implemented differently than Unix due to the IOCP model, so there's bound to be some more overhead. Right now Windows uses a blanket 64KB buffer for all reads on sockets, and if you have 10,000 sockets that's 655 MB right there (close to the 700 MB you're seeing).

The behavior on Windows should likely be smarter than just "always allocate 64KB buffers" along with the ability to tune the buffer size per connection ideally.

@sateffen
Copy link
Author

That sounds pretty much like the behaviour I'm experiencing. The 655 MB + base memory + the bytes copied to my local buffers might add up to ~700 MB (I don't remember the actual number, but let's assume it's around 650-700 MB).

In my case I send around 14 bytes of payload, so 64KB is far too much. Is there a chance to optimize this, or is the IOCP model so broken that it has to be like this?

How does libuv do the job without using so much memory? Maybe we can look at their tricks to learn from them?

@alexcrichton
Copy link
Contributor

Oh there's nothing inherently bad about IOCP here, we just have some heuristics about when to schedule I/O which probably need a large amount of tweaking or ability to configure.

@ghost
Copy link

ghost commented Jul 21, 2016

In libuv the user is fully responsible for memory allocation and for ensuring that buffers are kept alive until IO operations complete, so that is not exactly comparable with what is in mio.

But this makes me wonder: why is overlapped IO used in the Windows implementation in the first place? I understand that it is the preferred way to do things on Windows, but is that still true if you don't expose a completion-based API to the end user and need additional copies anyway? Wouldn't non-blocking IO (without overlapped) be more suitable given the mio interface?

@tailhook
Copy link
Contributor

I think this is another sign that mio implements IOCP support at the wrong level of abstraction. I've tried to explain it here. Let me try to elaborate on that.

The biggest point here is that it's hard to find a good heuristic for buffer size without knowing the application domain.

But if Windows support were implemented at the rotor-stream level, the communication between the application and the IO-handling code would be in terms of the Expectation structure, which looks like this (simplified):

pub enum Expectation {
    Bytes(/* max_bytes: */ usize),
    Delimiter { delimiter: &'static [u8], max_bytes: usize },
    Sleep,
}

Let me explain a little bit how it is used in HTTP protocol (simplified):

  1. On connection accept, the HTTP protocol parser returns Bytes(1) to rotor-stream, which basically means "read into the smallest buffer"
  2. When the read is complete, we check whether the end of the headers is in the buffer(*). If it is not, we expect a few more bytes. But the buffer is kept in rotor-stream, so we don't have to reallocate it if we received only a few bytes; we just move the pointer
  3. If the headers are received, we check Content-Length and return a Bytes(content_length) expectation
  4. When the request is received, we return a Sleep expectation (which means no buffer is needed for the IOCP read as long as the request is being processed)
  5. After sending the response we go back to (1) for a keep-alive connection (i.e. as small a buffer as possible)

Most protocols can be broken down into these kinds of primitives (maybe the current set of expectations is incomplete or wrong; we just need to figure that out), improving control over the size of the buffer. This also allows reusing a buffer for multiple operations on the same connection.

(*) We don't use Delimiter(b"\r\n\r\n") there because some clients expect b"\n\n" to be a valid delimiter for HTTP headers, despite the spec


An alternative like adding a tunable sock.set_iocp_buffer_size(n) is not very useful. Linux users will rarely get it right because they don't need it and can't actually test it. And Windows users will not get it right either, because changing the buffer size doesn't influence correctness, so forgetting to change the value on some state transition (like an inactive connection becoming active, or vice versa) will only be visible after a large amount of profiling.

( @alexcrichton , @carllerche looking forward to discussing these things in person on Rust Belt Rust :) )

@alexcrichton
Copy link
Contributor

@tailhook yeah I think the best solution here would be providing more hooks into the I/O layer to tune buffer size and have more control about when buffer reads/writes are scheduled. It actually shouldn't be too hard to do so as a windows-specific extension trait I think, but the pieces are indeed tricky!

@rrichardson
Copy link
Contributor

If we made it so the writer and reader knew how to chain buffers, we could just have one reusable slab for all connections.


@carllerche carllerche added this to the v0.6 milestone Aug 22, 2016
@carllerche carllerche added the bug and windows (Related to the Windows OS) labels Aug 22, 2016
@carllerche
Copy link
Member

@sateffen At the end of the day, IOCP requires that you pass ownership of the data to the OS, so you will need to have memory allocated for each in-flight operation. The best we can do is to tune things to balance memory / speed.

@alexcrichton Is there a reason you picked 64kb? Could that be reduced a bit?

@carllerche carllerche modified the milestones: v1.0, v0.6 Aug 22, 2016
@alexcrichton
Copy link
Contributor

@carllerche I believe it was because long ago libstd picked 64KB for a buffer size, because long ago libuv picked a 64KB buffer size, because purportedly long ago Linux worked best with 64KB buffers.

In other words, we should be free to change it at will; I see no reason to keep it so high.

@nayato
Copy link

nayato commented Sep 13, 2016

libuv does a read in the following pattern:

  • An async read with a zero-length buffer, which is used only for signaling that there's data to read from the socket's receive buffer (which is OS-managed and can be configured, but isn't recommended to turn off unless you really know what you're doing).
  • Sync non-blocking reads until the socket's receive buffer is drained: either WSAEWOULDBLOCK is returned, or the length of the buffer supplied in the WSARecv call is greater than the count of bytes read.

We do the same in DotNetty (the .NET port of netty). That way you can have lots (200K+) of connections with 0 buffers reserved for any of them. A somewhat interesting optimization is to allow tuning the size of the async buffer, which might be beneficial when people are fine sacrificing some memory if they can estimate the size of the expected messages.

@carllerche
Copy link
Member

I am... amazed that works, but if it does it would be great 👍 Thanks for the tip @nayato I will look more into the two examples you pointed out.

@carllerche
Copy link
Member

@nayato Do you know if this trick works w/ writing buffers too?

@carllerche carllerche modified the milestones: v0.7, v1.0 Sep 13, 2016
alexcrichton added a commit to alexcrichton/mio that referenced this issue Sep 13, 2016
This commit employs a trick [1] for avoiding allocation of an intermediate
buffer for in-flight calls to `TcpStream::read`. Previously this was done to
receive data, but apparently it works out to pass a 0-length buffer to the
kernel and get a notification back when there's internally buffered data.

This in turn allows us to avoid allocating any buffers while getting an
epoll-style readiness notification for TCP streams. Once a 0-byte read has
completed we just call the normal `TcpStream::read` function and wait for
`WouldBlock` to get returned. Note that this involves putting the socket into
nonblocking mode (which we managed to avoid before).

[1]: tokio-rs#439 (comment)

cc tokio-rs#439
@alexcrichton
Copy link
Contributor

Thanks for the info @nayato! I've posted a PR at #471

It looked like libuv didn't employ this trick for TCP writes, nor for UDP sends/recvs. I've left those with the previous strategy of allocating buffers for now.

@nayato
Copy link

nayato commented Sep 16, 2016

No, you can't submit a zero-byte buffer to WSASend, unfortunately.

@nayato
Copy link

nayato commented Sep 16, 2016

@alexcrichton, another optimization you might want to consider is using SetFileCompletionNotificationModes to indicate that if an async operation completes synchronously, an IO completion should not be triggered. There's one bug with that for UDP, though, which I know was fixed in Win 8, but I'm not sure if it was back-ported to Win 7. There's a comment on that in libuv: https://github.com/libuv/libuv/blob/v1.x/src/win/winsock.c#L271.

@alexcrichton
Copy link
Contributor

@nayato indeed yeah! I saw a lot of comments related to that in libuv. I'll open a separate tracking issue for that.

@alexcrichton
Copy link
Contributor

(opened #476)

alexcrichton added a commit that referenced this issue Oct 31, 2016
This commit employs a trick [1] for avoiding allocation of an intermediate
buffer for in-flight calls to `TcpStream::read`. Previously this was done to
receive data, but apparently it works out to pass a 0-length buffer to the
kernel and get a notification back when there's internally buffered data.

This in turn allows us to avoid allocating any buffers while getting an
epoll-style readiness notification for TCP streams. Once a 0-byte read has
completed we just call the normal `TcpStream::read` function and wait for
`WouldBlock` to get returned. Note that this involves putting the socket into
nonblocking mode (which we managed to avoid before).

[1]: #439 (comment)

cc #439
@alexcrichton
Copy link
Contributor

Ok, I've published #476 as 0.6.1 of mio, so in theory the memory usage with lots of TCP reads in flight should be much lower.

@carllerche
Copy link
Member

Thanks all. I believe that we've done as much as we can to resolve this issue. I am going to close it now. If there are further ideas of how to improve the memory situation on windows, a new issue should be opened.
