
Windows implementation consumes lots of memory (linux doesn't) #439

Closed
sateffen opened this issue Jul 20, 2016 · 19 comments
Labels
enhancement, windows (Related to the Windows OS)
Milestone

Comments

@sateffen
Copy link

Hey,

As mentioned in #415, Windows consumes a huge amount of memory for a simple echo server with lots of connections (5-10k) sending and receiving data.

In this simple project, creating around 10k connections consumes ~700 MB of RAM; on a Linux machine it takes around 7 MB. After all connections are closed, the memory is freed again.

To reproduce:

  • Clone the repo
  • open a shell and execute cargo run
  • open a shell and execute node test.js

Setup:

  • Windows 10
  • Node 6.2.2
  • Rust 1.8.0
  • mio 0.5.1

This will spin up 10k connections to localhost:8888 (the waiting rust echo server).

I'm experiencing this on a Windows 10 laptop, simply observed with the Task Manager. As Linux references I've used a private root server and a Cloud9 machine; neither of them showed this behaviour.

As soon as I've got some time I'll update my compiler and try to create some simpler code to reproduce, but currently my time is somewhat limited.

If any questions are left, just ask :)

Cheers

@alexcrichton
Copy link
Contributor

Windows is implemented differently than Unix due to the IOCP model, so there's bound to be some more overhead. Right now Windows uses a blanket 64KB buffer for all reads on sockets, and if you have 10,000 sockets that's 655 MB right there (close to the 700 MB you're seeing).

The behavior on Windows should likely be smarter than just "always allocate 64KB buffers" along with the ability to tune the buffer size per connection ideally.

@sateffen
Copy link
Author

That sounds pretty much like the behaviour I'm experiencing. The 655 MB + base memory + the bytes copied to my local buffers might add up to ~700 MB (I don't remember the actual number, but let's assume it's around 650-700 MB).

In my case I send around 14 bytes of payload, so 64KB is far too much. Is there a chance to optimize this, or is the IOCP model so broken that it has to be like this?

How does libuv do the job without using so much memory? Maybe we can look at their tricks to learn from them?

@alexcrichton
Copy link
Contributor

Oh there's nothing inherently bad about IOCP here, we just have some heuristics about when to schedule I/O which probably need a large amount of tweaking or ability to configure.

@ghost
Copy link

ghost commented Jul 21, 2016

In libuv the user is fully responsible for memory allocation and for ensuring that buffers are kept alive until IO operations complete, so that is not exactly comparable with what is in mio.

But this makes me wonder: why is overlapped IO used in the Windows implementation in the first place? I understand that it is the preferred way to do things on Windows, but is that still true if you don't expose a completion-based API to the end user and need additional copies anyway? Wouldn't non-blocking IO (without overlapped) be more suitable given the mio interface?

@tailhook
Copy link
Contributor

I think this is another sign that mio implements IOCP support at the wrong level of abstraction. I've tried to explain it here. Let me try to elaborate on that.

The biggest point here is that it's hard to find a good heuristic for buffer size without knowing the application domain.

But if Windows support were implemented at the rotor-stream level, the communication between the application and the IO-handling code would be in terms of the Expectation structure, which looks like this (simplified):

pub enum Expectation {
    Bytes(/* max_bytes: */ usize),
    Delimiter { delimiter: &'static [u8], max_bytes: usize },
    Sleep,
}

Let me explain a little bit how it is used in HTTP protocol (simplified):

  1. On connection accept, the HTTP protocol parser returns Bytes(1) to rotor-stream, which basically means "read into the smallest buffer"
  2. When the read is complete, we check whether the end of the headers is in the buffer(*). If it is not, we expect a few more bytes. But the buffer is kept in rotor-stream, so we don't have to reallocate it if we received only a few bytes; we just move the pointer
  3. If the headers are received, we check Content-Length and return a Bytes(content_length) expectation
  4. When the request is received, we return a Sleep expectation (which means no buffer is needed for the IOCP read as long as the request is being processed)
  5. After sending the response we go back to (1) for a keep-alive connection (i.e. as small a buffer as possible)

Most protocols can be broken down into these kinds of primitives (maybe the current set of expectations is incomplete or wrong; we just need to figure that out), improving control over the size of the buffer. This also allows reusing a buffer for multiple operations on the same connection.

(*) We don't use Delimiter(b"\r\n\r\n") there because some clients expect b"\n\n" to be a valid delimiter for HTTP headers, despite the spec


An alternative like adding a tunable sock.set_iocp_buffer_size(n) is not very useful. Linux users will rarely get it right because they don't need it and can't actually test it. And Windows users will not get it right either, because changing the buffer size doesn't influence correctness, so forgetting to change the value on some state transition (like an inactive connection becoming active, or vice versa) will only be visible after a large amount of profiling.

( @alexcrichton , @carllerche looking forward to discussing these things in person on Rust Belt Rust :) )

@alexcrichton
Copy link
Contributor

@tailhook yeah I think the best solution here would be providing more hooks into the I/O layer to tune buffer size and have more control about when buffer reads/writes are scheduled. It actually shouldn't be too hard to do so as a windows-specific extension trait I think, but the pieces are indeed tricky!

@rrichardson
Copy link
Contributor

If we made it so the writer and reader knew how to chain buffers, we could just have one reusable slab for all connections.


@carllerche carllerche added this to the v0.6 milestone Aug 22, 2016
@carllerche carllerche added the bug and windows (Related to the Windows OS) labels Aug 22, 2016
@carllerche
Copy link
Member

@sateffen At the end of the day, IOCP requires that you pass ownership of the data to the OS, so you will need to have memory allocated for each in-flight operation. The best we can do is to tune things to balance memory / speed.

@alexcrichton Is there a reason you picked 64kb? Could that be reduced a bit?

@carllerche carllerche modified the milestones: v1.0, v0.6 Aug 22, 2016
@alexcrichton
Copy link
Contributor

@carllerche I believe it was because long ago libstd picked 64KB for a buffer size, because long ago libuv picked a 64KB buffer size, because purportedly long ago Linux worked best with 64KB buffers.

In other words, we should be free to change it at will; I see no reason to keep it so high.

@nayato
Copy link

nayato commented Sep 13, 2016

libuv does a read in the following pattern:

  • An async read with a zero-length buffer, which is used only for signaling that there's data to read from the socket's receive buffer (which is OS-managed and can be configured, but isn't recommended to turn off unless you really know what you're doing).
  • Sync non-blocking reads until the socket's receive buffer is drained: either WSAEWOULDBLOCK is returned, or the length of the buffer supplied in the WSARecv call is greater than the count of bytes read.

We do the same in DotNetty (the .NET port of netty). That way you can have lots (200K+) of connections with 0 buffers reserved for any of them. A somewhat interesting optimization is to allow tuning the size of the async buffer, which might be beneficial when people are fine sacrificing some memory if they can estimate the size of the expected messages.

@carllerche
Copy link
Member

I am... amazed that works, but if it does it would be great 👍 Thanks for the tip @nayato I will look more into the two examples you pointed out.

@carllerche
Copy link
Member

@nayato Do you know if this trick works w/ writing buffers too?

@carllerche carllerche modified the milestones: v0.7, v1.0 Sep 13, 2016
alexcrichton added a commit to alexcrichton/mio that referenced this issue Sep 13, 2016
This commit employs a trick [1] for avoiding allocation of an intermediate
buffer for in-flight calls to `TcpStream::read`. Previously this was done to
receive data, but apparently it works out to pass a 0-length buffer to the
kernel and get a notification back when there's internally buffered data.

This in turn allows us to avoid allocating any buffers while getting an
epoll-style readiness notification for TCP streams. Once a 0-byte read has
completed we just call the normal `TcpStream::read` function and wait for
`WouldBlock` to get returned. Note that this involves putting the socket into
nonblocking mode (which we managed to avoid before).

[1]: tokio-rs#439 (comment)

cc tokio-rs#439
@alexcrichton
Copy link
Contributor

Thanks for the info @nayato! I've posted a PR at #471

It looked like libuv didn't employ this trick for TCP writes, nor for UDP sends/recvs. I've left those with the previous strategy of allocating buffers for now.

@nayato
Copy link

nayato commented Sep 16, 2016

No, you can't submit a zero-byte buffer to WSASend, unfortunately.

@nayato
Copy link

nayato commented Sep 16, 2016

@alexcrichton, another optimization you might want to consider is using SetFileCompletionNotificationModes to indicate that if an async operation completes synchronously, an IO completion should not be triggered. There's one bug with that for UDP, though, which I know was fixed in Win 8, but I'm not sure if it was back-ported to Win 7. There's a comment on that in libuv: https://github.com/libuv/libuv/blob/v1.x/src/win/winsock.c#L271.

@alexcrichton
Copy link
Contributor

@nayato indeed yeah! I saw a lot of comments related to that in libuv. I'll open a separate tracking issue for that.

@alexcrichton
Copy link
Contributor

(opened #476)

alexcrichton added a commit that referenced this issue Oct 31, 2016
This commit employs a trick [1] for avoiding allocation of an intermediate
buffer for in-flight calls to `TcpStream::read`. Previously this was done to
receive data, but apparently it works out to pass a 0-length buffer to the
kernel and get a notification back when there's internally buffered data.

This in turn allows us to avoid allocating any buffers while getting an
epoll-style readiness notification for TCP streams. Once a 0-byte read has
completed we just call the normal `TcpStream::read` function and wait for
`WouldBlock` to get returned. Note that this involves putting the socket into
nonblocking mode (which we managed to avoid before).

[1]: #439 (comment)

cc #439
@alexcrichton
Copy link
Contributor

Ok, I've published #476 as 0.6.1 of mio, so in theory the memory usage with lots of TCP reads in flight should be much lower.

@carllerche
Copy link
Member

Thanks all. I believe that we've done as much as we can to resolve this issue. I am going to close it now. If there are further ideas of how to improve the memory situation on windows, a new issue should be opened.
