Websocket performance tweaks #73
@moogle19 you may be interested in the further tweak to your mask tweaks from a few months back. Benchee:

```elixir
Mix.install([:benchee])
defmodule Old do
# Note that masking is an involution, so we don't need a separate unmask function
def mask(payload, mask, acc \\ <<>>)
def mask(payload, mask, acc) when is_integer(mask), do: mask(payload, <<mask::32>>, acc)
def mask(<<h::32, rest::binary>>, <<mask::32>>, acc) do
mask(rest, mask, acc <> <<Bitwise.bxor(h, mask)::32>>)
end
def mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
mask(rest, <<mask::24, current::8>>, acc <> <<Bitwise.bxor(h, current)::8>>)
end
def mask(<<>>, _mask, acc), do: acc
end
defmodule New do
# Note that masking is an involution, so we don't need a separate unmask function
def mask(payload, mask) do
payload
|> do_mask(<<mask::32>>, [])
|> IO.iodata_to_binary()
end
defp do_mask(<<h::32, rest::binary>>, <<int_mask::32>> = mask, acc) do
do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
end
defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
end
defp do_mask(<<>>, _mask, acc), do: acc
end
foo = String.duplicate("a", 10_002)
Benchee.run(
%{
"old" => fn -> Old.mask(foo, 1234) end,
"new" => fn -> New.mask(foo, 1234) end
},
time: 10,
memory_time: 2
)
```

```
...
Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.14.1
Erlang 25.1
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 28 s
Benchmarking new ...
Benchmarking old ...
Name ips average deviation median 99th %
new 9.28 K 107.71 μs ±9.44% 106.67 μs 122.88 μs
old 7.11 K 140.74 μs ±20.00% 136.08 μs 277.74 μs
Comparison:
new 9.28 K
old 7.11 K - 1.31x slower +33.03 μs
Memory usage statistics:
Name Memory usage
new 235.49 KB
old 450.59 KB - 1.91x memory usage +215.09 KB
```
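The win here comes from switching the accumulator from binary concatenation to iodata: each `acc <> part` step risks copying the whole accumulator, whereas building a nested list and calling `IO.iodata_to_binary/1` once defers the copy to a single final pass. A quick illustration (runnable after the script above, since it reuses `New`; the involution property means the same function also unmasks):

```elixir
# iodata: arbitrarily nested lists of binaries flatten in one final pass
IO.iodata_to_binary([[[<<1, 2>>], <<3>>], <<4>>])
#=> <<1, 2, 3, 4>>

# masking twice with the same mask round-trips the payload
payload = "hello"
^payload = New.mask(New.mask(payload, 1234), 1234)
```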
If we replicate the mask to e.g. 256 bits and use that to mask larger strides at a time:

```elixir
Mix.install([:benchee])
defmodule NewImproved do
@mask_size 256
# Note that masking is an involution, so we don't need a separate unmask function
def mask(payload, mask) do
payload
|> do_mask(
<<mask::32, mask::32, mask::32, mask::32, mask::32, mask::32, mask::32, mask::32>>,
[]
)
|> IO.iodata_to_binary()
end
# Matching the full mask size
defp do_mask(
<<h::unquote(@mask_size), rest::binary>>,
<<int_mask::unquote(@mask_size)>> = mask,
acc
) do
do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::unquote(@mask_size)>>])
end
# Generate do_mask for each mask size from @mask_size down to 8
for x <- (@mask_size - 8)..8//-8 do
defp do_mask(
<<h::unquote(x), rest::binary>>,
<<current::unquote(x), mask::unquote(@mask_size - x)>>,
acc
) do
do_mask(rest, <<mask::unquote(@mask_size - x), current::unquote(x)>>, [
acc,
<<Bitwise.bxor(h, current)::unquote(x)>>
])
end
end
defp do_mask(<<>>, _mask, acc), do: acc
end
defmodule New do
# Note that masking is an involution, so we don't need a separate unmask function
def mask(payload, mask) do
payload
|> do_mask(<<mask::32>>, [])
|> IO.iodata_to_binary()
end
defp do_mask(<<h::32, rest::binary>>, <<int_mask::32>> = mask, acc) do
do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
end
defp do_mask(<<h::24, rest::binary>>, <<current::24, mask::8>>, acc) do
do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::24>>])
end
defp do_mask(<<h::16, rest::binary>>, <<current::16, mask::16>>, acc) do
do_mask(rest, <<mask::16, current::16>>, [acc, <<Bitwise.bxor(h, current)::16>>])
end
defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
end
defp do_mask(<<>>, _mask, acc), do: acc
end
Benchee.run(
%{
"new_improved" => fn input -> NewImproved.mask(input, 1234) end,
"new" => fn input -> New.mask(input, 1234) end
},
time: 10,
memory_time: 2,
inputs: %{
"tiny" => String.duplicate("a", 102),
"small" => String.duplicate("a", 1_002),
"medium" => String.duplicate("a", 10_002),
"large" => String.duplicate("a", 100_002),
"huge" => String.duplicate("a", 1_000_002)
}
)
```

```
...
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.14.3
Erlang 25.2
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, small, tiny
Estimated total run time: 2.33 min
Benchmarking new with input huge ...
Benchmarking new with input large ...
Benchmarking new with input medium ...
Benchmarking new with input small ...
Benchmarking new with input tiny ...
Benchmarking new_improved with input huge ...
Benchmarking new_improved with input large ...
Benchmarking new_improved with input medium ...
Benchmarking new_improved with input small ...
Benchmarking new_improved with input tiny ...
##### With input huge #####
Name ips average deviation median 99th %
new_improved 234.32 4.27 ms ±37.19% 3.96 ms 13.60 ms
new 44.03 22.71 ms ±17.54% 21.60 ms 42.17 ms
Comparison:
new_improved 234.32
new 44.03 - 5.32x slower +18.45 ms
Memory usage statistics:
Name Memory usage
new_improved 5.96 MB
new 22.89 MB - 3.84x memory usage +16.93 MB
**All measurements for memory usage were the same**
##### With input large #####
Name ips average deviation median 99th %
new_improved 2.15 K 0.47 ms ±222.87% 0.41 ms 1.41 ms
new 0.58 K 1.72 ms ±63.47% 1.51 ms 5.08 ms
Comparison:
new_improved 2.15 K
new 0.58 K - 3.70x slower +1.26 ms
Memory usage statistics:
Name Memory usage
new_improved 0.60 MB
new 2.29 MB - 3.84x memory usage +1.69 MB
**All measurements for memory usage were the same**
##### With input medium #####
Name ips average deviation median 99th %
new_improved 18.71 K 53.46 μs ±339.33% 30.04 μs 546.86 μs
new 5.27 K 189.86 μs ±115.64% 165.92 μs 386.85 μs
Comparison:
new_improved 18.71 K
new 5.27 K - 3.55x slower +136.40 μs
Memory usage statistics:
Name Memory usage
new_improved 61.30 KB
new 234.63 KB - 3.83x memory usage +173.32 KB
**All measurements for memory usage were the same**
##### With input small #####
Name ips average deviation median 99th %
new_improved 123.01 K 8.13 μs ±1516.11% 3.25 μs 8.21 μs
new 40.91 K 24.44 μs ±619.30% 9.54 μs 488.95 μs
Comparison:
new_improved 123.01 K
new 40.91 K - 3.01x slower +16.31 μs
Memory usage statistics:
Name Memory usage
new_improved 6.39 KB
new 23.66 KB - 3.70x memory usage +17.27 KB
**All measurements for memory usage were the same**
##### With input tiny #####
Name ips average deviation median 99th %
new_improved 835.71 K 1.20 μs ±5243.19% 0.63 μs 1.71 μs
new 286.17 K 3.49 μs ±3343.86% 1.08 μs 2.58 μs
Comparison:
new_improved 835.71 K
new 286.17 K - 2.92x slower +2.30 μs
Memory usage statistics:
Name Memory usage
new_improved 0.90 KB
new 2.57 KB - 2.86x memory usage +1.67 KB
**All measurements for memory usage were the same**
```
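The `for x <- (@mask_size - 8)..8//-8` block above uses Elixir's unquote fragments: `unquote/1` inside a `def`/`defp` at module-definition time splices in compile-time values, generating one `do_mask/3` clause per stride width with zero runtime dispatch cost. A minimal standalone example of the same technique (module and function names invented here):

```elixir
defmodule Strides do
  # One clause per stride width, widest first, generated at compile time.
  for bits <- [16, 8] do
    def take(<<chunk::unquote(bits), rest::binary>>), do: {unquote(bits), chunk, rest}
  end

  def take(<<>>), do: :done
end

Strides.take(<<1, 2, 3>>)
#=> {16, 258, <<3>>}
```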
I also benchmarked it at 512 and 1024 bits:

```
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.14.3
Erlang 25.2
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, small, tiny
Estimated total run time: 4.67 min
Benchmarking new with input huge ...
Benchmarking new with input large ...
Benchmarking new with input medium ...
Benchmarking new with input small ...
Benchmarking new with input tiny ...
Benchmarking new_improved1024 with input huge ...
Benchmarking new_improved1024 with input large ...
Benchmarking new_improved1024 with input medium ...
Benchmarking new_improved1024 with input small ...
Benchmarking new_improved1024 with input tiny ...
Benchmarking new_improved256 with input huge ...
Benchmarking new_improved256 with input large ...
Benchmarking new_improved256 with input medium ...
Benchmarking new_improved256 with input small ...
Benchmarking new_improved256 with input tiny ...
Benchmarking new_improved512 with input huge ...
Benchmarking new_improved512 with input large ...
Benchmarking new_improved512 with input medium ...
Benchmarking new_improved512 with input small ...
Benchmarking new_improved512 with input tiny ...
##### With input huge #####
Name ips average deviation median 99th %
new_improved512 350.04 2.86 ms ±35.95% 2.61 ms 7.39 ms
new_improved1024 256.25 3.90 ms ±28.90% 3.67 ms 9.54 ms
new_improved256 237.23 4.22 ms ±31.26% 3.97 ms 9.99 ms
new 45.86 21.81 ms ±11.82% 21.07 ms 35.19 ms
Comparison:
new_improved512 350.04
new_improved1024 256.25 - 1.37x slower +1.05 ms
new_improved256 237.23 - 1.48x slower +1.36 ms
new 45.86 - 7.63x slower +18.95 ms
Memory usage statistics:
Name Memory usage
new_improved512 4.41 MB
new_improved1024 2.92 MB - 0.66x memory usage -1.48987 MB
new_improved256 5.96 MB - 1.35x memory usage +1.55 MB
new 22.89 MB - 5.19x memory usage +18.48 MB
**All measurements for memory usage were the same**
##### With input large #####
Name ips average deviation median 99th %
new_improved512 3.18 K 314.86 μs ±121.44% 287.25 μs 879.69 μs
new_improved1024 2.88 K 347.39 μs ±83.77% 332.63 μs 535.17 μs
new_improved256 2.26 K 442.44 μs ±72.49% 414.88 μs 1012.27 μs
new 0.60 K 1657.07 μs ±40.17% 1522.42 μs 4845.66 μs
Comparison:
new_improved512 3.18 K
new_improved1024 2.88 K - 1.10x slower +32.54 μs
new_improved256 2.26 K - 1.41x slower +127.58 μs
new 0.60 K - 5.26x slower +1342.21 μs
Memory usage statistics:
Name Memory usage
new_improved512 452.01 KB
new_improved1024 299.47 KB - 0.66x memory usage -152.53906 KB
new_improved256 610.66 KB - 1.35x memory usage +158.66 KB
new 2344.01 KB - 5.19x memory usage +1892 KB
**All measurements for memory usage were the same**
##### With input medium #####
Name ips average deviation median 99th %
new_improved1024 32.22 K 31.04 μs ±217.66% 23.88 μs 171.09 μs
new_improved512 25.95 K 38.53 μs ±314.93% 20.04 μs 613.90 μs
new_improved256 19.37 K 51.61 μs ±260.61% 29.50 μs 549.00 μs
new 4.88 K 205.02 μs ±186.19% 167.13 μs 616.62 μs
Comparison:
new_improved1024 32.22 K
new_improved512 25.95 K - 1.24x slower +7.49 μs
new_improved256 19.37 K - 1.66x slower +20.57 μs
new 4.88 K - 6.60x slower +173.98 μs
Memory usage statistics:
Name Memory usage
new_improved1024 30.32 KB
new_improved512 45.55 KB - 1.50x memory usage +15.23 KB
new_improved256 61.30 KB - 2.02x memory usage +30.98 KB
new 234.74 KB - 7.74x memory usage +204.42 KB
**All measurements for memory usage were the same**
##### With input small #####
Name ips average deviation median 99th %
new_improved1024 278.14 K 3.60 μs ±465.30% 3.08 μs 4.54 μs
new_improved512 186.00 K 5.38 μs ±2006.55% 2.38 μs 7.13 μs
new_improved256 133.49 K 7.49 μs ±1409.02% 3.25 μs 6.67 μs
new 39.07 K 25.60 μs ±764.58% 9.67 μs 506.61 μs
Comparison:
new_improved1024 278.14 K
new_improved512 186.00 K - 1.50x slower +1.78 μs
new_improved256 133.49 K - 2.08x slower +3.90 μs
new 39.07 K - 7.12x slower +22.00 μs
Memory usage statistics:
Name Memory usage
new_improved1024 3.23 KB
new_improved512 4.84 KB - 1.50x memory usage +1.61 KB
new_improved256 6.39 KB - 1.98x memory usage +3.16 KB
new 23.78 KB - 7.35x memory usage +20.55 KB
**All measurements for memory usage were the same**
##### With input tiny #####
Name ips average deviation median 99th %
new_improved512 898.10 K 1.11 μs ±4578.12% 0.67 μs 1.63 μs
new_improved256 877.63 K 1.14 μs ±5178.60% 0.63 μs 1.58 μs
new_improved1024 832.47 K 1.20 μs ±2101.38% 1 μs 1.29 μs
new 275.72 K 3.63 μs ±3196.55% 1.13 μs 2.96 μs
Comparison:
new_improved512 898.10 K
new_improved256 877.63 K - 1.02x slower +0.0260 μs
new_improved1024 832.47 K - 1.08x slower +0.0878 μs
new 275.72 K - 3.26x slower +2.51 μs
Memory usage statistics:
Name Memory usage
new_improved512 808 B
new_improved256 920 B - 1.14x memory usage +112 B
new_improved1024 568 B - 0.70x memory usage -240 B
new 2752 B - 3.41x memory usage +1944 B
**All measurements for memory usage were the same**
```
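Worth noting: the wider variants just repeat the 32-bit mask. Rather than writing out sixteen `mask::32` segments by hand for the 512-bit version, the replication can be built with a bitstring comprehension (a hypothetical helper, not code from the PR):

```elixir
defmodule MaskHelper do
  # Hypothetical helper: widen a 32-bit mask to `bits` by repetition.
  def replicate(mask, bits) when rem(bits, 32) == 0 do
    for _ <- 1..div(bits, 32), into: <<>>, do: <<mask::32>>
  end
end

<<_::512>> = MaskHelper.replicate(1234, 512)
```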
Some real https://www.theonion.com/fuck-everything-were-doing-five-blades-1819584036 energy here. I love it.
All joking aside, this looks great. Lemme run a quick benchmark on an x86 machine and this should be good to go.
512 is the sweet spot for x86 as well. I'll get this done up as a commit on this branch (I have a couple of small changes) and we can do a review pass on it.
I ran the Limits/Performance parts of the Autobahn suite with the current implementation vs. 512 bit masking. The fragmented binary/text message results look quite nice (left: current, right: 512 bit masking), but the non-fragmented binary/text message performance is equal to or worse than the current implementation.
```diff
-def mask(<<h::32, rest::binary>>, <<mask::32>>, acc) do
-  mask(rest, mask, acc <> <<Bitwise.bxor(h, mask)::32>>)
+defp do_mask(<<h::32, rest::binary>>, <<int_mask::32, _mask_rest::binary>> = mask, acc) do
+  do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
 end

-def mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
-  mask(rest, <<mask::24, current::8>>, acc <> <<Bitwise.bxor(h, current)::8>>)
+defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24, _mask_rest::binary>>, acc) do
+  do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
```
@moogle19 I didn't bring over all the macro'd-in matches here: as implemented here it just adds in an extra 512 bit stride (so we'll chew off 512 bits at a time until we have less than that left, then 32 bits at a time until we have less than that left, then 8 bits at a time). The difference in performance is negligible (as far as I can benchmark, anyway), and it's a bit easier for readers to grok.
Looking at the code for case 9.6.6 (https://github.com/crossbario/autobahn-testsuite/blob/master/autobahntestsuite/autobahntestsuite/case/case9_6_6.py), I don't see why there should be any systematic change. Is the difference reproducible, or could this just be test noise? (FWIW, I don't generally put any value in differences of ±10% or so in the CI benchmarker; it's just noise at that point.)
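Assembled from the diff above and that description, the clause ordering reads roughly like this (a sketch assuming a 512-bit replicated mask; the committed code may differ in details):

```elixir
defmodule StrideSketch do
  def mask(payload, mask) do
    wide = for _ <- 1..16, into: <<>>, do: <<mask::32>>
    payload |> do_mask(wide, []) |> IO.iodata_to_binary()
  end

  # 512 bits per step while at least 64 bytes of payload remain...
  defp do_mask(<<h::512, rest::binary>>, <<int_mask::512>> = mask, acc),
    do: do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::512>>])

  # ...then 32 bits at a time, using the leading word of the replicated mask...
  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32, _::binary>> = mask, acc),
    do: do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])

  # ...then byte-by-byte, rotating the mask exactly as the original code did.
  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24, _::binary>>, acc),
    do: do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])

  defp do_mask(<<>>, _mask, acc), do: acc
end
```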
More golf:
I noticed that the 512 bit approach only pays off on larger strings; its overhead is actually a pretty significant penalty on smaller strings (which I suspect is what most real-world websocket frames are). This adaptive approach uses the previous 32 bit approach from this PR for smaller frames and the 512 bit approach for larger frames, providing the best of both worlds. I want to cover this with proper tests for larger and smaller frames; I'll get this coded up tomorrow if it makes sense to y'all.
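Concretely, the adaptive idea only has to change the entry point: a `byte_size/1` guard picks which mask width to hand to the stride-fallback `do_mask/3` sketched earlier. With a plain 32-bit mask the 512-bit clause simply never matches, so small frames skip straight to the cheap path. The threshold below is invented for illustration; picking it is exactly what the frame-size benchmarks are for:

```elixir
# Illustrative threshold, not from the PR; these heads would replace
# the existing mask/2 in the masking module.
@large_threshold 1_024

def mask(payload, mask) when byte_size(payload) >= @large_threshold do
  wide = for _ <- 1..16, into: <<>>, do: <<mask::32>>
  payload |> do_mask(wide, []) |> IO.iodata_to_binary()
end

def mask(payload, mask) do
  payload |> do_mask(<<mask::32>>, []) |> IO.iodata_to_binary()
end
```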
If we need even more performance, we could also take a look at Rustler. Benchmark:

```
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.14.3
Erlang 25.2
Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, micro, small, tiny
Estimated total run time: 2.80 min
Benchmarking new_512 with input huge ...
Benchmarking new_512 with input large ...
Benchmarking new_512 with input medium ...
Benchmarking new_512 with input micro ...
Benchmarking new_512 with input small ...
Benchmarking new_512 with input tiny ...
Benchmarking rustler with input huge ...
Benchmarking rustler with input large ...
Benchmarking rustler with input medium ...
Benchmarking rustler with input micro ...
Benchmarking rustler with input small ...
Benchmarking rustler with input tiny ...
##### With input huge #####
Name ips average deviation median 99th %
rustler 1.31 K 0.76 ms ±8.56% 0.77 ms 0.90 ms
new_512 0.29 K 3.45 ms ±64.91% 2.71 ms 13.55 ms
Comparison:
rustler 1.31 K
new_512 0.29 K - 4.51x slower +2.69 ms
Memory usage statistics:
Name Memory usage
rustler 0.00005 MB
new_512 4.41 MB - 96346.00x memory usage +4.41 MB
**All measurements for memory usage were the same**
##### With input large #####
Name ips average deviation median 99th %
rustler 13.29 K 75.27 μs ±8.88% 76.75 μs 99.67 μs
new_512 3.07 K 325.91 μs ±128.06% 286.67 μs 1224.17 μs
Comparison:
rustler 13.29 K
new_512 3.07 K - 4.33x slower +250.64 μs
Memory usage statistics:
Name Memory usage
rustler 0.0469 KB
new_512 452.69 KB - 9657.33x memory usage +452.64 KB
**All measurements for memory usage were the same**
##### With input medium #####
Name ips average deviation median 99th %
rustler 130.14 K 7.68 μs ±147.18% 7.50 μs 9.54 μs
new_512 22.31 K 44.83 μs ±527.87% 20.29 μs 600.35 μs
Comparison:
rustler 130.14 K
new_512 22.31 K - 5.83x slower +37.14 μs
Memory usage statistics:
Name Memory usage
rustler 0.0469 KB
new_512 45.89 KB - 979.00x memory usage +45.84 KB
**All measurements for memory usage were the same**
##### With input micro #####
Name ips average deviation median 99th %
rustler 3.39 M 294.88 ns ±20545.13% 125 ns 250 ns
new_512 1.05 M 952.93 ns ±10177.57% 292 ns 1500 ns
Comparison:
rustler 3.39 M
new_512 1.05 M - 3.23x slower +658.05 ns
Memory usage statistics:
Name Memory usage
rustler 80 B
new_512 608 B - 7.60x memory usage +528 B
**All measurements for memory usage were the same**
##### With input small #####
Name ips average deviation median 99th %
rustler 1.11 M 0.90 μs ±3611.37% 0.79 μs 0.96 μs
new_512 0.133 M 7.53 μs ±1871.88% 2.54 μs 7.79 μs
Comparison:
rustler 1.11 M
new_512 0.133 M - 8.35x slower +6.63 μs
Memory usage statistics:
Name Memory usage
rustler 0.0469 KB
new_512 5.70 KB - 121.50x memory usage +5.65 KB
**All measurements for memory usage were the same**
##### With input tiny #####
Name ips average deviation median 99th %
rustler 3.01 M 0.33 μs ±12263.70% 0.21 μs 0.42 μs
new_512 0.38 M 2.62 μs ±4929.71% 0.71 μs 2.17 μs
Comparison:
rustler 3.01 M
new_512 0.38 M - 7.88x slower +2.28 μs
Memory usage statistics:
Name Memory usage
rustler 0.0469 KB
new_512 1.55 KB - 33.17x memory usage +1.51 KB
**All measurements for memory usage were the same**
```

Maybe an approach like jason_native could be interesting in the future.
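For reference, the Elixir half of such a Rustler-backed mask is tiny (module, app, and crate names below are invented for illustration; `use Rustler` plus an `:erlang.nif_error/1` stub is the standard idiom, with the stub replaced when the compiled crate loads):

```elixir
defmodule Bench.MaskNif do
  use Rustler, otp_app: :bench, crate: "mask_nif"

  # Replaced by the native implementation at load time; only reached
  # if the NIF library fails to load.
  def mask(_payload, _mask), do: :erlang.nif_error(:nif_not_loaded)
end
```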
Improved 512 bit strides for masking are in and green. I'm now looking at a similar approach for the largest component of our profiling, UTF-8 string validation:
So, quite a bit faster, in exchange for higher memory usage. I'm going to tinker with this a bit more and see if I can improve the memory numbers.
I really liked this approach.
Really nice! I don't think we want to land native dependencies just yet (I'm not at all opposed to them, but I think we should wait until post 1.0 and maybe do them as a pluggable library, especially for ones outside the usual gcc-based idiom). But it's great to know the gains are there for the taking (there's a whole bunch more golf to do on the Rust side too!). Doing e.g. HTTP/1 parsing and other really hot paths in Rust is very interesting.
@mtrudel I respect your opinion but would like to disagree a little.
Don't get me wrong: I really like @moogle19's approach here, a LOT. I can foresee a time when native code like this replaces (or more correctly, 'optionally replaces') the existing Elixir implementation within Bandit. I'm just saying that that time isn't here yet. Some rationale:
Hopefully that clears up what my position on this is. I really can't emphasize enough how much I'm not saying 'no', but rather 'not right now'.
It is also worth noting that the NIF theme on the BEAM is subtle... many stories of successes and also of troubles :)
I think that for web servers, especially on the BEAM, there are few optimal options in terms of raw performance. We all know the qualities of the BEAM, and it is fantastic, but for certain use cases those qualities are not all that is needed. That said, if we could give the user the option to make the choice, then why not? NIFs will always be tricky because the BEAM VM is preemptive, so there are compromises to be made when you want performance over safety (the user's choice). In addition, there are ways to provide the necessary safety, and we are not talking about C++ code here; we are talking about Rust, which is much safer than anything else out there.
For example, in my current project we already use Bandit, and we would benefit a lot if something like HTTP header parsing were done via Rust: being a sidecar-based solution, every microsecond of additional latency matters a lot to us. But I agree that we should exhaust all options in Elixir code first. I liked @moogle19's suggestion precisely because it makes this optional for the user.
I ported Cowboy's UTF-8 validation to Elixir and benched it, and it's actually the slowest of the bunch by quite a bit:
I'm not sure that the improvements to runtime are worth the tradeoff on memory; I think I'm going to leave UTF-8 detection alone for now.
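For a complete (unfragmented) frame, the simplest clause-based check looks like this (a sketch, not Cowboy's algorithm: the `utf8` binary modifier validates one codepoint per step, and the standard library's `String.valid?/1` performs the same whole-binary check). Cowboy's real validator is an incremental state machine that also copes with codepoints split across fragmented frames:

```elixir
defmodule Utf8Check do
  # The utf8 modifier only matches a well-formed codepoint, so any
  # invalid byte sequence falls through to the final clause.
  def valid?(<<_::utf8, rest::binary>>), do: valid?(rest)
  def valid?(<<>>), do: true
  def valid?(_), do: false
end
```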
In the medium case it was faster, wasn't it? Could you use this only for cases where the gain is certain?
It's much faster (about 3x). Unfortunately it's also about 20x more memory intensive.
Suggestion: again, it might be a case of letting the user choose via configuration how they want this to behave.