Bad performance after introducing zeroing of UninitSlice #6
Interesting, so it was introduced with #3. We did not run any benchmarks back then, but we might need to investigate it. Thanks for the report!
It would be great if this could be fixed. I'm using tungstenite via warp, which uses tokio and tokio-tungstenite under the hood. Tokio now has proper support for reading into uninitialized memory (https://tokio.rs/blog/2020-10-tokio-0-3), so this performance issue is definitely something that could be avoided IMHO.
Here is my benchmark.

After

Why is it a quadratic zeroing?
Not sure I got the idea behind "quadratic zeroing", but thanks for doing some benchmarks. So it seems that the changes introduced in the above-mentioned PR (while adding some more safety) had a significant performance impact. The change was done for safety purposes:
That's right. However, the problem that we have is that … I see 2 ways to solve this:
Related issues:
CC: @agalakhov @jxs
By quadratic zeroing, I mean we are zeroing the buffer more than necessary. The larger the buffer, the more bytes we zero (the total grows quadratically). If we carefully track where we have zeroed, we can zero every byte exactly once and achieve high performance.
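A toy model of what "zeroing every byte exactly once" buys (the numbers and function names below are hypothetical, not taken from the crate): if every read call re-zeroes the whole unfilled tail of the buffer, total work grows quadratically in the buffer size; tracking the already-zeroed prefix makes it linear.

```rust
/// Zero the whole uninitialized tail before every read call
/// (the "quadratic" strategy: old bytes are zeroed again and again).
fn bytes_zeroed_naive(n: usize, k: usize) -> usize {
    let mut filled = 0;
    let mut total = 0;
    while filled < n {
        total += n - filled; // re-zero everything past the filled region
        filled += k;
    }
    total
}

/// Remember how far we have already zeroed and never zero a byte twice.
fn bytes_zeroed_tracked(n: usize, k: usize) -> usize {
    let mut filled = 0;
    let mut zeroed = 0;
    let mut total = 0;
    while filled < n {
        if zeroed < n {
            total += n - zeroed; // zero the fresh region exactly once
            zeroed = n;
        }
        filled += k;
    }
    total
}

fn main() {
    // Fill a 1 MiB buffer with 4 KiB reads (illustrative sizes).
    let (n, k) = (1 << 20, 4096);
    let naive = bytes_zeroed_naive(n, k);
    let tracked = bytes_zeroed_tracked(n, k);
    println!("naive: {} bytes zeroed, tracked: {}", naive, tracked);
    assert_eq!(tracked, n); // each byte zeroed exactly once
}
```

With these sizes the naive strategy zeroes over a hundred times as many bytes as the tracked one, which matches the "grows quadratically" description above.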
Note that if you need to allocate a large zero-initialized buffer, then calling …
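The truncated note above appears to concern zeroed allocation; a hedged sketch of the point as I understand it: in current Rust, `vec![0u8; n]` is specialized to request already-zeroed memory from the allocator, so it can be much cheaper than reserving capacity and writing the zeros yourself. (Both helper names below are illustrative, not crate APIs.)

```rust
/// Allocate via the `vec!` macro: `vec![0u8; n]` is specialized by std to
/// use a zeroed allocation, so the allocator can hand back pre-zeroed pages.
fn zeroed_via_macro(n: usize) -> Vec<u8> {
    vec![0u8; n]
}

/// Allocate capacity first, then write the zeros explicitly.
fn zeroed_via_resize(n: usize) -> Vec<u8> {
    let mut v = Vec::with_capacity(n);
    v.resize(n, 0u8);
    v
}

fn main() {
    let n = 1 << 20;
    // Both strategies produce identical contents; only the cost differs.
    assert_eq!(zeroed_via_macro(n), zeroed_via_resize(n));
    println!("both strategies produce {} zero bytes", n);
}
```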
Are you sure we zero more than necessary? We only zero the buffer whenever we increase the capacity, i.e. if you read a big chunk of data from the stream, then each call to the …

Ok fellows, I tried to analyze the information that @qiujiangkun provided in the benchmarks to understand what's going on with our performance, and I think I found the culprit.

The first thing that I did was fetch the benchmarks that @qiujiangkun provided and run them locally using the same buffer size (…). And indeed, the difference was quite significant:
I was really surprised by such a huge difference given that we do the same things (algorithmically), i.e. in the case of …

The …

I.e. in both cases we have a reservation of memory to store the final result plus 2 write operations of the same size (considering we're using the same benchmark that @qiujiangkun provided, i.e. reading from a "mock" stream until the end). Such a huge difference in performance meant that the resulting compiled code differs heavily / is optimized differently. I could sort of imagine some performance benefit from having a fixed buffer of a known size on the stack, but could not imagine such a huge difference between them. Checking the reserve step did not make sense as it's identical in both cases, so I checked the writing (…). The function relies on ...
```rust
while let Some(element) = iterator.next() {
    let len = self.len();
    if len == self.capacity() {
        let (lower, _) = iterator.size_hint();
        self.reserve(lower.saturating_add(1));
    }
    unsafe {
        ptr::write(self.as_mut_ptr().add(len), element);
        // NB can't overflow since we would have had to alloc the address space
        self.set_len(len + 1);
    }
}
```
... and the one which is ...
```rust
self.reserve(additional);
unsafe {
    let mut ptr = self.as_mut_ptr().add(self.len());
    let mut local_len = SetLenOnDrop::new(&mut self.len);
    iterator.for_each(move |element| {
        ptr::write(ptr, element);
        ptr = ptr.offset(1);
        // NB can't overflow since we would have had to alloc the address space
        local_len.increment_len(1);
    });
}
```
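The shape difference between the two std paths quoted above can be sketched in isolation (hypothetical helpers, not the crate's code): the first path checks capacity and bumps the length on every element, while the second reserves once, writes through a raw pointer, and sets the length a single time.

```rust
/// Push one element at a time, checking capacity on every iteration
/// (the shape of the general, non-specialized extend path).
fn extend_per_element(v: &mut Vec<u8>, it: impl Iterator<Item = u8>) {
    for b in it {
        v.push(b); // capacity check + length update on every element
    }
}

/// Reserve once, then write through a raw pointer and set the length once
/// (the shape of the specialized path with its single `reserve` up front).
fn extend_reserved(v: &mut Vec<u8>, it: impl ExactSizeIterator<Item = u8>) {
    let n = it.len();
    v.reserve(n);
    let start = v.len();
    unsafe {
        let mut p = v.as_mut_ptr().add(start);
        for b in it {
            std::ptr::write(p, b);
            p = p.add(1);
        }
        // One length update instead of one per element.
        v.set_len(start + n);
    }
}

fn main() {
    let (mut a, mut b) = (Vec::new(), Vec::new());
    extend_per_element(&mut a, (0..255u8).cycle().take(1024));
    extend_reserved(&mut b, (0..1024).map(|_| 0u8));
    assert_eq!(a.len(), 1024);
    assert_eq!(b, vec![0u8; 1024]);
}
```

The point of the comparison: in the second shape the hot loop contains only a pointer write, which the optimizer can vectorize or collapse into a bulk copy; the per-element capacity check in the first shape tends to prevent that.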
... The first big difference that I spotted is that it does not rely on ….

So I prepared a version of the …
I was quite curious about that since ….

Compared to our current implementation, the new implementation that does the zeroing in a slightly different way is much more efficient, reaching almost the same performance as our old unsound implementation that skipped the zeroing step. I created a PR to address this issue. However, there is an important thing that one can spot immediately: our …
So it looks like this simplified version of ….

Because of this, I have the following plan in mind:
What do you guys think?
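For illustration, the "zero every byte at most once" idea described in this thread could look roughly like the following simplified buffer (a sketch under my own assumptions, not the code from the PR):

```rust
use std::mem::MaybeUninit;

/// A simplified read buffer that zeroes its spare capacity lazily: each byte
/// of capacity is zeroed at most once, no matter how many reads happen.
struct ZeroOnceBuf {
    data: Vec<u8>,
    initialized: usize, // prefix of the buffer known to be zeroed or written
}

impl ZeroOnceBuf {
    fn with_capacity(cap: usize) -> Self {
        ZeroOnceBuf { data: Vec::with_capacity(cap), initialized: 0 }
    }

    /// Return an `n`-byte zeroed slice past the filled region,
    /// zeroing only bytes that were never initialized before.
    fn chunk_mut(&mut self, n: usize) -> &mut [u8] {
        let start = self.data.len();
        let end = start + n;
        let old_cap = self.data.capacity();
        self.data.reserve(n);
        if self.data.capacity() != old_cap {
            // A reallocation only copies bytes up to `len`, so the old
            // zeroed-but-unfilled tail is lost; be conservative.
            self.initialized = start;
        }
        if end > self.initialized {
            let spare = self.data.spare_capacity_mut();
            for slot in &mut spare[self.initialized - start..n] {
                *slot = MaybeUninit::new(0);
            }
            self.initialized = end;
        }
        // SAFETY: bytes start..end are initialized (zeroed above or written earlier).
        unsafe { std::slice::from_raw_parts_mut(self.data.as_mut_ptr().add(start), n) }
    }

    /// Mark `n` bytes as filled after a read wrote into the chunk.
    fn advance(&mut self, n: usize) {
        debug_assert!(self.data.len() + n <= self.initialized);
        // SAFETY: `chunk_mut` initialized at least this many bytes.
        unsafe { self.data.set_len(self.data.len() + n) };
    }
}

fn main() {
    let mut buf = ZeroOnceBuf::with_capacity(16);
    let chunk = buf.chunk_mut(8);
    assert!(chunk.iter().all(|&b| b == 0));
    chunk[..5].copy_from_slice(b"hello");
    buf.advance(5);
    // The next chunk reuses already-zeroed capacity: no re-zeroing happens.
    assert!(buf.chunk_mut(3).iter().all(|&b| b == 0));
    println!("zero-once buffer works");
}
```

This is essentially the bookkeeping that tokio's `ReadBuf` performs (tracking filled vs. initialized regions), which is why the new tokio API mentioned earlier in the thread avoids the repeated-zeroing cost.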
That's a thorough investigation, especially the safe, sound, speedy solution. I'm happy to see tungstenite getting faster. Though I'm interested in using the refactored …
We're also deprecating the usage of `input_buffer` crate, see: snapview/input_buffer#6 (comment)
After the following commit, zeroing of UninitSlice completely dominates websocket CPU time:
It's very easy to reproduce by running the server example from tungstenite-rs and generating some websocket traffic (I've used websocat to send a large file via websocket). The profile looks like this:
It seems that the generated code is very suboptimal: it's writing zero bytes into the uninitialized buffer one byte at a time and thus causes a huge performance overhead in a websocket server.
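For comparison, here is a minimal sketch (not the actual generated code) of byte-at-a-time zeroing of uninitialized memory versus a single bulk `write_bytes`, which lowers to a `memset`. A per-byte loop can end up as slow as the profile shows when bounds checks or call boundaries keep the compiler from coalescing the writes.

```rust
use std::mem::MaybeUninit;

/// Zero a region one byte at a time (the pattern the profile points at;
/// whether this collapses into a memset depends on the surrounding code).
fn zero_per_byte(spare: &mut [MaybeUninit<u8>]) {
    for slot in spare.iter_mut() {
        *slot = MaybeUninit::new(0);
    }
}

/// Zero the same region with one bulk call, which compiles to a memset.
fn zero_bulk(spare: &mut [MaybeUninit<u8>]) {
    // SAFETY: writing zero bytes to u8 slots is always valid.
    unsafe { std::ptr::write_bytes(spare.as_mut_ptr(), 0, spare.len()) };
}

fn main() {
    let mut a = vec![MaybeUninit::<u8>::uninit(); 4096];
    let mut b = vec![MaybeUninit::<u8>::uninit(); 4096];
    zero_per_byte(&mut a);
    zero_bulk(&mut b);
    let all_zero =
        |s: &[MaybeUninit<u8>]| s.iter().all(|m| unsafe { m.assume_init_read() } == 0);
    assert!(all_zero(&a) && all_zero(&b));
    println!("both 4 KiB regions fully zeroed");
}
```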