Websocket performance tweaks #73

Merged
merged 6 commits into main on Jan 17, 2023
Conversation

@mtrudel (Owner) commented Jan 3, 2023

No description provided.

@mtrudel added the benchmark label ("Assign this to a PR to have the benchmark CI suite run") Jan 3, 2023
@mtrudel marked this pull request as ready for review January 13, 2023 22:20
@mtrudel (Owner, Author) commented Jan 13, 2023

@moogle19 you may be interested in this further tweak to your mask tweaks from a few months back. Benchee:

Mix.install([:benchee])

defmodule Old do
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask, acc \\ <<>>)

  def mask(payload, mask, acc) when is_integer(mask), do: mask(payload, <<mask::32>>, acc)

  def mask(<<h::32, rest::binary>>, <<mask::32>>, acc) do
    mask(rest, mask, acc <> <<Bitwise.bxor(h, mask)::32>>)
  end

  def mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
    mask(rest, <<mask::24, current::8>>, acc <> <<Bitwise.bxor(h, current)::8>>)
  end

  def mask(<<>>, _mask, acc), do: acc
end

defmodule New do
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(<<mask::32>>, [])
    |> IO.iodata_to_binary()
  end

  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32>> = mask, acc) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
    do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

foo = String.duplicate("a", 10_002)

Benchee.run(
  %{
    "old" => fn -> Old.mask(foo, 1234) end,
    "new" => fn -> New.mask(foo, 1234) end
  },
  time: 10,
  memory_time: 2
)
...
Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.14.1
Erlang 25.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: none specified
Estimated total run time: 28 s

Benchmarking new ...
Benchmarking old ...

Name           ips        average  deviation         median         99th %
new         9.28 K      107.71 μs     ±9.44%      106.67 μs      122.88 μs
old         7.11 K      140.74 μs    ±20.00%      136.08 μs      277.74 μs

Comparison:
new         9.28 K
old         7.11 K - 1.31x slower +33.03 μs

Memory usage statistics:

Name    Memory usage
new        235.49 KB
old        450.59 KB - 1.91x memory usage +215.09 KB
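As an aside for readers following along, here's a toy walk-through of how the byte-at-a-time clause rotates the mask (a standalone illustration, not code from the PR; `New` is the module from the script above):

# Masking 5 bytes with a 4-byte mask: the 32-bit clause handles the first 4
# bytes, then the 8-bit clause XORs the 5th byte with the first mask byte and
# rotates the mask so any later bytes would continue the 4-byte cycle.
New.mask(<<1, 2, 3, 4, 5>>, 0x01020304)
#=> <<0, 0, 0, 0, 4>>  (bxor(1,1), bxor(2,2), bxor(3,3), bxor(4,4), bxor(5,1))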

@moogle19 (Contributor)

If we replicate the mask to e.g. 256 bits and use that to bxor we can make it even faster:

Mix.install([:benchee])

defmodule NewImproved do
  @mask_size 256
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(
      <<mask::32, mask::32, mask::32, mask::32, mask::32, mask::32, mask::32, mask::32>>,
      []
    )
    |> IO.iodata_to_binary()
  end
  
  # Matching the full mask size
  defp do_mask(
         <<h::unquote(@mask_size), rest::binary>>,
         <<int_mask::unquote(@mask_size)>> = mask,
         acc
       ) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::unquote(@mask_size)>>])
  end

  # Generate a do_mask clause for each chunk size from @mask_size - 8 down to 8
  for x <- (@mask_size - 8)..8//-8 do
    defp do_mask(
           <<h::unquote(x), rest::binary>>,
           <<current::unquote(x), mask::unquote(@mask_size - x)>>,
           acc
         ) do
      do_mask(rest, <<mask::unquote(@mask_size - x), current::unquote(x)>>, [
        acc,
        <<Bitwise.bxor(h, current)::unquote(x)>>
      ])
    end
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

defmodule New do
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(<<mask::32>>, [])
    |> IO.iodata_to_binary()
  end

  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32>> = mask, acc) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

  defp do_mask(<<h::24, rest::binary>>, <<current::24, mask::8>>, acc) do
    do_mask(rest, <<mask::8, current::24>>, [acc, <<Bitwise.bxor(h, current)::24>>])
  end

  defp do_mask(<<h::16, rest::binary>>, <<current::16, mask::16>>, acc) do
    do_mask(rest, <<mask::16, current::16>>, [acc, <<Bitwise.bxor(h, current)::16>>])
  end

  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
    do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

Benchee.run(
  %{
    "new_improved" => fn input -> NewImproved.mask(input, 1234) end,
    "new" => fn input -> New.mask(input, 1234) end
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "tiny" => String.duplicate("a", 102),
    "small" => String.duplicate("a", 1_002),
    "medium" => String.duplicate("a", 10_002),
    "large" => String.duplicate("a", 100_002),
    "huge" => String.duplicate("a", 1_000_002)
  }
)
...
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.14.3
Erlang 25.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, small, tiny
Estimated total run time: 2.33 min

Benchmarking new with input huge ...
Benchmarking new with input large ...
Benchmarking new with input medium ...
Benchmarking new with input small ...
Benchmarking new with input tiny ...
Benchmarking new_improved with input huge ...
Benchmarking new_improved with input large ...
Benchmarking new_improved with input medium ...
Benchmarking new_improved with input small ...
Benchmarking new_improved with input tiny ...

##### With input huge #####
Name                   ips        average  deviation         median         99th %
new_improved        234.32        4.27 ms    ±37.19%        3.96 ms       13.60 ms
new                  44.03       22.71 ms    ±17.54%       21.60 ms       42.17 ms

Comparison: 
new_improved        234.32
new                  44.03 - 5.32x slower +18.45 ms

Memory usage statistics:

Name            Memory usage
new_improved         5.96 MB
new                 22.89 MB - 3.84x memory usage +16.93 MB

**All measurements for memory usage were the same**

##### With input large #####
Name                   ips        average  deviation         median         99th %
new_improved        2.15 K        0.47 ms   ±222.87%        0.41 ms        1.41 ms
new                 0.58 K        1.72 ms    ±63.47%        1.51 ms        5.08 ms

Comparison: 
new_improved        2.15 K
new                 0.58 K - 3.70x slower +1.26 ms

Memory usage statistics:

Name            Memory usage
new_improved         0.60 MB
new                  2.29 MB - 3.84x memory usage +1.69 MB

**All measurements for memory usage were the same**

##### With input medium #####
Name                   ips        average  deviation         median         99th %
new_improved       18.71 K       53.46 μs   ±339.33%       30.04 μs      546.86 μs
new                 5.27 K      189.86 μs   ±115.64%      165.92 μs      386.85 μs

Comparison: 
new_improved       18.71 K
new                 5.27 K - 3.55x slower +136.40 μs

Memory usage statistics:

Name            Memory usage
new_improved        61.30 KB
new                234.63 KB - 3.83x memory usage +173.32 KB

**All measurements for memory usage were the same**

##### With input small #####
Name                   ips        average  deviation         median         99th %
new_improved      123.01 K        8.13 μs  ±1516.11%        3.25 μs        8.21 μs
new                40.91 K       24.44 μs   ±619.30%        9.54 μs      488.95 μs

Comparison: 
new_improved      123.01 K
new                40.91 K - 3.01x slower +16.31 μs

Memory usage statistics:

Name            Memory usage
new_improved         6.39 KB
new                 23.66 KB - 3.70x memory usage +17.27 KB

**All measurements for memory usage were the same**

##### With input tiny #####
Name                   ips        average  deviation         median         99th %
new_improved      835.71 K        1.20 μs  ±5243.19%        0.63 μs        1.71 μs
new               286.17 K        3.49 μs  ±3343.86%        1.08 μs        2.58 μs

Comparison: 
new_improved      835.71 K
new               286.17 K - 2.92x slower +2.30 μs

Memory usage statistics:

Name            Memory usage
new_improved         0.90 KB
new                  2.57 KB - 2.86x memory usage +1.67 KB

**All measurements for memory usage were the same**
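Incidentally, the replicated mask doesn't have to be written out longhand; since `String.duplicate/2` works on any binary, the same 256-bit mask can be built programmatically (a standalone equivalence check, not code from the PR):

mask = 0x01020304
longhand = <<mask::32, mask::32, mask::32, mask::32, mask::32, mask::32, mask::32, mask::32>>

# String.duplicate/2 repeats the 4-byte mask binary 8 times, yielding 32 bytes
longhand == String.duplicate(<<mask::32>>, 8)
#=> true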

@moogle19 (Contributor)

I also benchmarked it with 512- and 1024-bit masks:
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.14.3
Erlang 25.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, small, tiny
Estimated total run time: 4.67 min

Benchmarking new with input huge ...
Benchmarking new with input large ...
Benchmarking new with input medium ...
Benchmarking new with input small ...
Benchmarking new with input tiny ...
Benchmarking new_improved1024 with input huge ...
Benchmarking new_improved1024 with input large ...
Benchmarking new_improved1024 with input medium ...
Benchmarking new_improved1024 with input small ...
Benchmarking new_improved1024 with input tiny ...
Benchmarking new_improved256 with input huge ...
Benchmarking new_improved256 with input large ...
Benchmarking new_improved256 with input medium ...
Benchmarking new_improved256 with input small ...
Benchmarking new_improved256 with input tiny ...
Benchmarking new_improved512 with input huge ...
Benchmarking new_improved512 with input large ...
Benchmarking new_improved512 with input medium ...
Benchmarking new_improved512 with input small ...
Benchmarking new_improved512 with input tiny ...

##### With input huge #####
Name                       ips        average  deviation         median         99th %
new_improved512         350.04        2.86 ms    ±35.95%        2.61 ms        7.39 ms
new_improved1024        256.25        3.90 ms    ±28.90%        3.67 ms        9.54 ms
new_improved256         237.23        4.22 ms    ±31.26%        3.97 ms        9.99 ms
new                      45.86       21.81 ms    ±11.82%       21.07 ms       35.19 ms

Comparison: 
new_improved512         350.04
new_improved1024        256.25 - 1.37x slower +1.05 ms
new_improved256         237.23 - 1.48x slower +1.36 ms
new                      45.86 - 7.63x slower +18.95 ms

Memory usage statistics:

Name                Memory usage
new_improved512          4.41 MB
new_improved1024         2.92 MB - 0.66x memory usage -1.48987 MB
new_improved256          5.96 MB - 1.35x memory usage +1.55 MB
new                     22.89 MB - 5.19x memory usage +18.48 MB

**All measurements for memory usage were the same**

##### With input large #####
Name                       ips        average  deviation         median         99th %
new_improved512         3.18 K      314.86 μs   ±121.44%      287.25 μs      879.69 μs
new_improved1024        2.88 K      347.39 μs    ±83.77%      332.63 μs      535.17 μs
new_improved256         2.26 K      442.44 μs    ±72.49%      414.88 μs     1012.27 μs
new                     0.60 K     1657.07 μs    ±40.17%     1522.42 μs     4845.66 μs

Comparison: 
new_improved512         3.18 K
new_improved1024        2.88 K - 1.10x slower +32.54 μs
new_improved256         2.26 K - 1.41x slower +127.58 μs
new                     0.60 K - 5.26x slower +1342.21 μs

Memory usage statistics:

Name                Memory usage
new_improved512        452.01 KB
new_improved1024       299.47 KB - 0.66x memory usage -152.53906 KB
new_improved256        610.66 KB - 1.35x memory usage +158.66 KB
new                   2344.01 KB - 5.19x memory usage +1892 KB

**All measurements for memory usage were the same**

##### With input medium #####
Name                       ips        average  deviation         median         99th %
new_improved1024       32.22 K       31.04 μs   ±217.66%       23.88 μs      171.09 μs
new_improved512        25.95 K       38.53 μs   ±314.93%       20.04 μs      613.90 μs
new_improved256        19.37 K       51.61 μs   ±260.61%       29.50 μs      549.00 μs
new                     4.88 K      205.02 μs   ±186.19%      167.13 μs      616.62 μs

Comparison: 
new_improved1024       32.22 K
new_improved512        25.95 K - 1.24x slower +7.49 μs
new_improved256        19.37 K - 1.66x slower +20.57 μs
new                     4.88 K - 6.60x slower +173.98 μs

Memory usage statistics:

Name                Memory usage
new_improved1024        30.32 KB
new_improved512         45.55 KB - 1.50x memory usage +15.23 KB
new_improved256         61.30 KB - 2.02x memory usage +30.98 KB
new                    234.74 KB - 7.74x memory usage +204.42 KB

**All measurements for memory usage were the same**

##### With input small #####
Name                       ips        average  deviation         median         99th %
new_improved1024      278.14 K        3.60 μs   ±465.30%        3.08 μs        4.54 μs
new_improved512       186.00 K        5.38 μs  ±2006.55%        2.38 μs        7.13 μs
new_improved256       133.49 K        7.49 μs  ±1409.02%        3.25 μs        6.67 μs
new                    39.07 K       25.60 μs   ±764.58%        9.67 μs      506.61 μs

Comparison: 
new_improved1024      278.14 K
new_improved512       186.00 K - 1.50x slower +1.78 μs
new_improved256       133.49 K - 2.08x slower +3.90 μs
new                    39.07 K - 7.12x slower +22.00 μs

Memory usage statistics:

Name                Memory usage
new_improved1024         3.23 KB
new_improved512          4.84 KB - 1.50x memory usage +1.61 KB
new_improved256          6.39 KB - 1.98x memory usage +3.16 KB
new                     23.78 KB - 7.35x memory usage +20.55 KB

**All measurements for memory usage were the same**

##### With input tiny #####
Name                       ips        average  deviation         median         99th %
new_improved512       898.10 K        1.11 μs  ±4578.12%        0.67 μs        1.63 μs
new_improved256       877.63 K        1.14 μs  ±5178.60%        0.63 μs        1.58 μs
new_improved1024      832.47 K        1.20 μs  ±2101.38%           1 μs        1.29 μs
new                   275.72 K        3.63 μs  ±3196.55%        1.13 μs        2.96 μs

Comparison: 
new_improved512       898.10 K
new_improved256       877.63 K - 1.02x slower +0.0260 μs
new_improved1024      832.47 K - 1.08x slower +0.0878 μs
new                   275.72 K - 3.26x slower +2.51 μs

Memory usage statistics:

Name                Memory usage
new_improved512            808 B
new_improved256            920 B - 1.14x memory usage +112 B
new_improved1024           568 B - 0.70x memory usage -240 B
new                       2752 B - 3.41x memory usage +1944 B

**All measurements for memory usage were the same**

@mtrudel (Owner, Author) commented Jan 16, 2023

Some real https://www.theonion.com/fuck-everything-were-doing-five-blades-1819584036 energy here. I love it.

@mtrudel (Owner, Author) commented Jan 16, 2023

All joking aside, this looks great. Lemme run a quick benchmark on an x86 machine and this should be good to go.

@mtrudel (Owner, Author) commented Jan 16, 2023

512 is the sweet spot for x86 as well. I'll get this done up as a commit on this branch (I have a couple of small changes) and we can do a review pass on it.

@moogle19 (Contributor)

I ran the Limits/Performance parts of the Autobahn suite with the current implementation vs. 512-bit masking.

While the fragmented binary / text message results look quite nice (left: current, right: 512-bit masking):

[Screenshot: Autobahn fragmented message results, 2023-01-16]

the non-fragmented binary / text message performance is equal to or worse than the current implementation:

[Screenshot: Autobahn non-fragmented message results, 2023-01-16]

Comment on lines -103 to +120
- def mask(<<h::32, rest::binary>>, <<mask::32>>, acc) do
-   mask(rest, mask, acc <> <<Bitwise.bxor(h, mask)::32>>)
+ defp do_mask(<<h::32, rest::binary>>, <<int_mask::32, _mask_rest::binary>> = mask, acc) do
+   do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

- def mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
-   mask(rest, <<mask::24, current::8>>, acc <> <<Bitwise.bxor(h, current)::8>>)
+ defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24, _mask_rest::binary>>, acc) do
+   do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
@mtrudel (Owner, Author) commented:

@moogle19 I didn't bring over all the macro'd-in matches here - as implemented, this just adds an extra 512-bit stride (so we'll chew off 512 bits at a time until we have less than that left, then 32 bits at a time until we have less than that left, then 8 bits at a time; a 1,000-byte payload, for example, is consumed as fifteen 512-bit chunks followed by ten 32-bit chunks). The difference in performance is negligible (as far as I can benchmark, anyway), and it's a bit easier for readers to grok.

@mtrudel (Owner, Author) commented Jan 16, 2023

Looking at the code for case 9.6.6 (https://github.com/crossbario/autobahn-testsuite/blob/master/autobahntestsuite/autobahntestsuite/case/case9_6_6.py), I don't see why there should be any systematic change. Is the difference reproducible, or could this just be test noise? (FWIW, I don't generally put any value in differences of +/- 10% or so in the CI benchmarker; it's just noise at that point.)

@mtrudel (Owner, Author) commented Jan 17, 2023

More golf:

Mix.install([:benchee])

defmodule Old do
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(<<mask::32>>, [])
    |> IO.iodata_to_binary()
  end

  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32>> = mask, acc) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24>>, acc) do
    do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

defmodule New512 do
  @mask_size 512
  # Note that masking is an involution, so we don't need a separate unmask function
  def mask(payload, mask) do
    payload
    |> do_mask(String.duplicate(<<mask::32>>, div(@mask_size, 32)), [])
    |> IO.iodata_to_binary()
  end

  # Matching the full mask size
  defp do_mask(
         <<h::unquote(@mask_size), rest::binary>>,
         <<int_mask::unquote(@mask_size)>> = mask,
         acc
       ) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::unquote(@mask_size)>>])
  end

  defp do_mask(<<h::32, rest::binary>>, <<int_mask::32, _mask_rest::binary>> = mask, acc) do
    do_mask(rest, mask, [acc, <<Bitwise.bxor(h, int_mask)::32>>])
  end

  defp do_mask(<<h::8, rest::binary>>, <<current::8, mask::24, _mask_rest::binary>>, acc) do
    do_mask(rest, <<mask::24, current::8>>, [acc, <<Bitwise.bxor(h, current)::8>>])
  end

  defp do_mask(<<>>, _mask, acc), do: acc
end

defmodule Adaptive512 do
  @mask_size 512

  def mask(payload, mask) when bit_size(payload) >= @mask_size do
    New512.mask(payload, mask)
  end

  def mask(payload, mask) do
    Old.mask(payload, mask)
  end
end

Benchee.run(
  %{
    "old" => fn input -> Old.mask(input, 1234) end,
    "new_512" => fn input -> New512.mask(input, 1234) end,
    "adaptive_512" => fn input -> Adaptive512.mask(input, 1234) end,
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "micro" => String.duplicate("a", 10),
    "tiny" => String.duplicate("a", 102),
    "small" => String.duplicate("a", 1_002),
    "medium" => String.duplicate("a", 10_002),
    "large" => String.duplicate("a", 100_002),
    "huge" => String.duplicate("a", 1_000_002)
  }
)

I noticed that the overhead of the 512-bit approach was of benefit only on larger strings, and was actually a pretty significant penalty on smaller strings (which is what I suspect most real-world websocket frames are). This adaptive approach uses the previous 32-bit approach from this PR for smaller frames and the 512-bit approach for larger frames, providing the best of both worlds.

I want to cover this with proper tests for larger and smaller frames; I'll get this coded up tomorrow if it makes sense to y'all.
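For concreteness, such tests might look roughly like this (a sketch only; `Adaptive512` is the module from the benchmark script above, not Bandit's actual masking module, and the sizes are illustrative):

ExUnit.start()

defmodule MaskRoundTripTest do
  use ExUnit.Case, async: true

  # Masking is an involution, so applying the same mask twice must round-trip.
  # Sizes chosen to exercise the 512-bit, 32-bit, and 8-bit strides on both
  # sides of the adaptive cutoff.
  for size <- [0, 1, 3, 4, 63, 64, 65, 1_000, 100_002] do
    @size size
    test "mask/2 round-trips a #{size}-byte payload" do
      payload = :crypto.strong_rand_bytes(@size)
      mask = :rand.uniform(0xFFFFFFFF)
      assert payload |> Adaptive512.mask(mask) |> Adaptive512.mask(mask) == payload
    end
  end
end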

@moogle19 (Contributor)

If we need even more performance, we can also take a look at rustler:

Rustler benchmark
Operating System: macOS
CPU Information: Apple M1 Pro
Number of Available Cores: 10
Available memory: 32 GB
Elixir 1.14.3
Erlang 25.2

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, micro, small, tiny
Estimated total run time: 2.80 min

Benchmarking new_512 with input huge ...
Benchmarking new_512 with input large ...
Benchmarking new_512 with input medium ...
Benchmarking new_512 with input micro ...
Benchmarking new_512 with input small ...
Benchmarking new_512 with input tiny ...
Benchmarking rustler with input huge ...
Benchmarking rustler with input large ...
Benchmarking rustler with input medium ...
Benchmarking rustler with input micro ...
Benchmarking rustler with input small ...
Benchmarking rustler with input tiny ...

##### With input huge #####
Name              ips        average  deviation         median         99th %
rustler        1.31 K        0.76 ms     ±8.56%        0.77 ms        0.90 ms
new_512        0.29 K        3.45 ms    ±64.91%        2.71 ms       13.55 ms

Comparison:
rustler        1.31 K
new_512        0.29 K - 4.51x slower +2.69 ms

Memory usage statistics:

Name       Memory usage
rustler      0.00005 MB
new_512         4.41 MB - 96346.00x memory usage +4.41 MB

**All measurements for memory usage were the same**

##### With input large #####
Name              ips        average  deviation         median         99th %
rustler       13.29 K       75.27 μs     ±8.88%       76.75 μs       99.67 μs
new_512        3.07 K      325.91 μs   ±128.06%      286.67 μs     1224.17 μs

Comparison:
rustler       13.29 K
new_512        3.07 K - 4.33x slower +250.64 μs

Memory usage statistics:

Name       Memory usage
rustler       0.0469 KB
new_512       452.69 KB - 9657.33x memory usage +452.64 KB

**All measurements for memory usage were the same**

##### With input medium #####
Name              ips        average  deviation         median         99th %
rustler      130.14 K        7.68 μs   ±147.18%        7.50 μs        9.54 μs
new_512       22.31 K       44.83 μs   ±527.87%       20.29 μs      600.35 μs

Comparison:
rustler      130.14 K
new_512       22.31 K - 5.83x slower +37.14 μs

Memory usage statistics:

Name       Memory usage
rustler       0.0469 KB
new_512        45.89 KB - 979.00x memory usage +45.84 KB

**All measurements for memory usage were the same**

##### With input micro #####
Name              ips        average  deviation         median         99th %
rustler        3.39 M      294.88 ns ±20545.13%         125 ns         250 ns
new_512        1.05 M      952.93 ns ±10177.57%         292 ns        1500 ns

Comparison:
rustler        3.39 M
new_512        1.05 M - 3.23x slower +658.05 ns

Memory usage statistics:

Name       Memory usage
rustler            80 B
new_512           608 B - 7.60x memory usage +528 B

**All measurements for memory usage were the same**

##### With input small #####
Name              ips        average  deviation         median         99th %
rustler        1.11 M        0.90 μs  ±3611.37%        0.79 μs        0.96 μs
new_512       0.133 M        7.53 μs  ±1871.88%        2.54 μs        7.79 μs

Comparison:
rustler        1.11 M
new_512       0.133 M - 8.35x slower +6.63 μs

Memory usage statistics:

Name       Memory usage
rustler       0.0469 KB
new_512         5.70 KB - 121.50x memory usage +5.65 KB

**All measurements for memory usage were the same**

##### With input tiny #####
Name              ips        average  deviation         median         99th %
rustler        3.01 M        0.33 μs ±12263.70%        0.21 μs        0.42 μs
new_512        0.38 M        2.62 μs  ±4929.71%        0.71 μs        2.17 μs

Comparison:
rustler        3.01 M
new_512        0.38 M - 7.88x slower +2.28 μs

Memory usage statistics:

Name       Memory usage
rustler       0.0469 KB
new_512         1.55 KB - 33.17x memory usage +1.51 KB

**All measurements for memory usage were the same**

Maybe an approach like jason_native could be interesting in the future.
I played around with it a little here and here.
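For reference, the Elixir half of such a rustler binding is tiny (a hypothetical sketch; the module, OTP app, and crate names are illustrative, and the real experiment lives in the branches linked above):

defmodule Bandit.Native do
  # Hypothetical binding: compiles and loads the Rust crate as a NIF via rustler
  use Rustler, otp_app: :bandit, crate: "bandit_native"

  # Stub replaced by the NIF at load time; raising keeps an explicit failure
  # mode if the shared library can't be loaded
  def mask(_payload, _mask), do: :erlang.nif_error(:nif_not_loaded)
end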

@mtrudel (Owner, Author) commented Jan 17, 2023

Improved 512 bit strides for masking are in and green.

I'm now looking at a similar approach for the largest component of our profiling, UTF-8 string validation:

Mix.install([:benchee])

defmodule BulkMatch do
  defguardp is_inner_byte(c) when c >= 128 and c < 192

  def valid?(str), do: do_valid?(str)
  defp do_valid?(<<a::512, rest::binary>>) when Bitwise.band(a, 0x80808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080) == 0, do: do_valid?(rest)
  defp do_valid?(<<a::8, rest::binary>>) when a < 128, do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, rest::binary>>) when a >= 192 and a < 224 and is_inner_byte(b), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, rest::binary>>) when a >= 224 and a < 240 and is_inner_byte(b) and is_inner_byte(c), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, d::8, rest::binary>>) when a >= 240 and is_inner_byte(b) and is_inner_byte(c) and is_inner_byte(d), do: do_valid?(rest)
  defp do_valid?(<<>>), do: true
  defp do_valid?(_str), do: false
end


Benchee.run(
  %{
    "old" => fn input -> String.valid?(input) end,
    "new" => fn input -> BulkMatch.valid?(input) end,
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "micro" => String.duplicate("a", 10),
    "tiny" => String.duplicate("a", 102),
    "small" => String.duplicate("a", 1_002),
    "medium" => String.duplicate("a", 10_002),
    "large" => String.duplicate("a", 100_002),
    "huge" => String.duplicate("a", 1_000_002)
  }
)

Operating System: macOS
CPU Information: Apple M1
Number of Available Cores: 8
Available memory: 16 GB
Elixir 1.14.1
Erlang 25.1

Benchmark suite executing with the following configuration:
warmup: 2 s
time: 10 s
memory time: 2 s
reduction time: 0 ns
parallel: 1
inputs: huge, large, medium, micro, small, tiny
Estimated total run time: 2.80 min

Benchmarking new with input huge ...
Benchmarking new with input large ...
Benchmarking new with input medium ...
Benchmarking new with input micro ...
Benchmarking new with input small ...
Benchmarking new with input tiny ...
Benchmarking old with input huge ...
Benchmarking old with input large ...
Benchmarking old with input medium ...
Benchmarking old with input micro ...
Benchmarking old with input small ...
Benchmarking old with input tiny ...

##### With input huge #####
Name           ips        average  deviation         median         99th %
new         729.28        1.37 ms    ±21.53%        1.34 ms        2.01 ms
old         343.63        2.91 ms     ±0.79%        2.90 ms        2.97 ms

Comparison:
new         729.28
old         343.63 - 2.12x slower +1.54 ms

Memory usage statistics:

Name    Memory usage
new          1.07 MB
old       0.00063 MB - 0.00x memory usage -1.07288 MB

**All measurements for memory usage were the same**

##### With input large #####
Name           ips        average  deviation         median         99th %
new         6.69 K      149.45 μs    ±45.53%      117.21 μs      347.64 μs
old         3.42 K      292.27 μs    ±13.25%      290.61 μs      312.70 μs

Comparison:
new         6.69 K
old         3.42 K - 1.96x slower +142.82 μs

Memory usage statistics:

Name    Memory usage
new        110.47 KB
old          0.64 KB - 0.01x memory usage -109.82813 KB

**All measurements for memory usage were the same**

##### With input medium #####
Name           ips        average  deviation         median         99th %
new        57.01 K       17.54 μs   ±525.67%        9.79 μs       83.89 μs
old        33.36 K       29.98 μs    ±59.90%       29.25 μs       36.58 μs

Comparison:
new        57.01 K
old        33.36 K - 1.71x slower +12.44 μs

Memory usage statistics:

Name    Memory usage
new         11.61 KB
old          0.64 KB - 0.06x memory usage -10.96875 KB

**All measurements for memory usage were the same**

##### With input micro #####
Name           ips        average  deviation         median         99th %
old       954.66 K        1.05 μs ±10898.38%        0.25 μs        1.38 μs
new       920.50 K        1.09 μs ±10298.99%        0.29 μs        1.42 μs

Comparison:
old       954.66 K
new       920.50 K - 1.04x slower +0.0389 μs

Memory usage statistics:

Name    Memory usage
old            656 B
new            656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

##### With input small #####
Name           ips        average  deviation         median         99th %
new       324.15 K        3.09 μs  ±3288.18%        1.42 μs        2.79 μs
old       261.03 K        3.83 μs  ±1508.17%        3.08 μs        4.46 μs

Comparison:
new       324.15 K
old       261.03 K - 1.24x slower +0.75 μs

Memory usage statistics:

Name    Memory usage
new          1.70 KB
old          0.64 KB - 0.38x memory usage -1.05469 KB

**All measurements for memory usage were the same**

##### With input tiny #####
Name           ips        average  deviation         median         99th %
old       768.04 K        1.30 μs  ±7823.95%        0.50 μs        1.67 μs
new       766.80 K        1.30 μs  ±7496.43%        0.54 μs        1.71 μs

Comparison:
old       768.04 K
new       766.80 K - 1.00x slower +0.00210 μs

Memory usage statistics:

Name    Memory usage
old            656 B
new            728 B - 1.11x memory usage +72 B

**All measurements for memory usage were the same**

So, quite a bit faster, in exchange for higher memory usage. I'm going to tinker with this a bit more and see if I can improve the memory numbers.
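For anyone puzzling over the big hex constant in the `Bitwise.band/2` guard: 0x80 is the high bit of each byte, so a zero result means every byte in the 512-bit chunk is below 128 (pure ASCII), and the whole chunk can be accepted without decoding any UTF-8 sequences. A standalone illustration on a 64-bit chunk:

<<ascii::64>> = "abcdefgh"
Bitwise.band(ascii, 0x8080808080808080)
#=> 0 (all eight bytes are < 128, so the fast path matches)

<<non_ascii::64>> = "abcdefg" <> <<0xC3>>
Bitwise.band(non_ascii, 0x8080808080808080)
#=> 128 (the last byte has its high bit set, so the clause falls through)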

@sleipnir commented Jan 17, 2023

> If we need even more performance, we can also take a look at rustler:

I really liked this approach.

@mtrudel (Owner, Author) commented Jan 17, 2023

Really nice! I don't think we want to land native dependencies just yet (I'm not at all opposed to them, but I think we should wait until post 1.0 and maybe do them as a pluggable library, especially for ones outside the usual gcc-based idiom). But it's great to know that these gains are there for the taking (there's a whole bunch more golf to do on the Rust side too!). Doing e.g. HTTP/1 parsing and other really hot paths in Rust is very interesting.

@sleipnir commented Jan 17, 2023

> Really nice! I don't think we want to land native dependencies just yet (I'm not at all opposed to them, but I think we should wait until post 1.0 and maybe do them as a pluggable library, especially for ones outside the usual gcc-based idiom). But it's great to know that these gains are there for the taking (there's a whole bunch more golf to do on the Rust side too!). Doing e.g. HTTP/1 parsing and other really hot paths in Rust is very interesting.

@mtrudel I respect your opinion but would like to disagree a little.
I think the minimalist approach presented by @moogle19 could be an excellent way forward here, even more so since the proposal is optional: you could follow both paths, making users who need brute force happy as well as those who don't. It's a very small piece of effort for a very large gain.

@mtrudel (Owner, Author) commented Jan 17, 2023

Don't get me wrong - I really like @moogle19's approach here, a LOT. I can foresee a time when native code like this replaces (or more correctly, 'optionally replaces') the pure Elixir implementation within Bandit. I'm just saying that that time isn't here yet. Some rationale:

  • We're currently in the 0.6 release series, with just one more series (0.7, covering aspects of configurability) before prepping for a 1.0. It's been a long road to get here, and one that I'd like to see completed sooner rather than later. The perf work in the 0.6 series is intended to tweak the broad implementation we have today, not to introduce new paradigms that may slow progress or diminish the high quality bar we've set for ourselves.

  • We're still very much in the 'adoption' phase of the project, and as such should be doing everything we can to ensure that Bandit is as much of a 'drop-in' replacement as possible. While rustler does a great job of streamlining things, the fact remains that NIFs aren't universally straightforward to get running, and we run the very real risk of jeopardizing Bandit adoption by making the upgrade process even slightly more error-prone, which isn't a tradeoff I want to make at this point.

  • There are a lot of hot paths that would benefit greatly from native execution beyond this one. Off the top of my head: header parsing in HTTP/1, iolist splitting in (mostly) HTTP/2, and UTF-8 validation and masking in WebSockets are all prime candidates for hot paths that would be ideal for 'NIF sprinkling'. I'd like us to tackle these in a consistent way as an intentional, standalone workup, not have it as a secondary tag-along in an otherwise pretty unremarkable PR.

  • From the Bandit README:

    It is written with correctness, clarity & performance as fundamental goals.

    I should note that, even though it isn't called out explicitly, I consider this an ordered list. Specifically, it is fundamentally more important to me for the project's code to be clear than it is for the code to be fast. I'm not yet sure of the best way to handle native tooling in this respect, and I'm loath to rush into it.

Hopefully that clears up what my position on this is. I really can't emphasize enough how much I'm not saying 'no', but rather 'not right now'.

@victorolinasc

It is also worth noting that :jason_native, used here as the example, does not use rustler. There is an explanation on the project README: https://github.com/spawnfest/json_native#why-not-rust

The NIF theme on the BEAM is subtle... many stories of successes and also of troubles :)

@sleipnir commented Jan 17, 2023

> It is also worth noting that :jason_native, used here as the example, does not use rustler. There is an explanation on the project README: https://github.com/spawnfest/json_native#why-not-rust
>
> The NIF theme on the BEAM is subtle... many stories of successes and also of troubles :)

I think that for web servers, especially on the BEAM, there are few optimal options in terms of raw performance. We know all the qualities of the BEAM, and it is fantastic, but for certain use cases those qualities are not all that is needed. That said, if we can give the user the choice, then why not? NIFs will always be a tricky case because the BEAM VM is preemptive, so there are compromises to be made when you want performance over safety (the user's choice). In addition, there are ways to provide the necessary safety, and we are not talking about C++ code here; we are talking about Rust, which is far safer than anything else out there.
So regarding the project's roadmap, I think it's worth waiting, but I would consider it a wasted opportunity to leave this behind indefinitely.

@sleipnir commented Jan 17, 2023

For example, in my current project we already use Bandit, and we would benefit a lot if something like HTTP header parsing were done via Rust: as a sidecar-based solution, every microsecond of additional latency matters a lot to us. But I agree that we should exhaust all options in Elixir code first. I liked @moogle19's suggestion precisely because it allows this to be optional for the user.

@mtrudel (Owner, Author) commented Jan 17, 2023

I ported Cowboy's UTF-8 validation to Elixir and benched it, and it's actually the slowest of the bunch by quite a bit:



Mix.install([:benchee])

defmodule BulkMatch1024 do
  defguardp is_inner_byte(c) when c >= 128 and c < 192

  def valid?(str), do: do_valid?(str)
  defp do_valid?(<<a::1024, rest::binary>>) when Bitwise.band(a, 0x8080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080) == 0, do: do_valid?(rest)
  defp do_valid?(<<a::8, rest::binary>>) when a < 128, do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, rest::binary>>) when a >= 192 and a < 224 and is_inner_byte(b), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, rest::binary>>) when a >= 224 and a < 240 and is_inner_byte(b) and is_inner_byte(c), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, d::8, rest::binary>>) when a >= 240 and is_inner_byte(b) and is_inner_byte(c) and is_inner_byte(d), do: do_valid?(rest)
  defp do_valid?(<<>>), do: true
  defp do_valid?(_str), do: false
end

defmodule BulkMatch512 do
  defguardp is_inner_byte(c) when c >= 128 and c < 192

  def valid?(str), do: do_valid?(str)
  defp do_valid?(<<a::512, rest::binary>>) when Bitwise.band(a, 0x80808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080808080) == 0, do: do_valid?(rest)
  defp do_valid?(<<a::8, rest::binary>>) when a < 128, do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, rest::binary>>) when a >= 192 and a < 224 and is_inner_byte(b), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, rest::binary>>) when a >= 224 and a < 240 and is_inner_byte(b) and is_inner_byte(c), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, d::8, rest::binary>>) when a >= 240 and is_inner_byte(b) and is_inner_byte(c) and is_inner_byte(d), do: do_valid?(rest)
  defp do_valid?(<<>>), do: true
  defp do_valid?(_str), do: false
end

defmodule BulkMatch64 do
  defguardp is_inner_byte(c) when c >= 128 and c < 192

  def valid?(str), do: do_valid?(str)
  defp do_valid?(<<a::64, rest::binary>>) when Bitwise.band(a, 0x8080808080808080) == 0, do: do_valid?(rest)
  defp do_valid?(<<a::8, rest::binary>>) when a < 128, do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, rest::binary>>) when a >= 192 and a < 224 and is_inner_byte(b), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, rest::binary>>) when a >= 224 and a < 240 and is_inner_byte(b) and is_inner_byte(c), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, d::8, rest::binary>>) when a >= 240 and is_inner_byte(b) and is_inner_byte(c) and is_inner_byte(d), do: do_valid?(rest)
  defp do_valid?(<<>>), do: true
  defp do_valid?(_str), do: false
end

defmodule BulkMatch32 do
  defguardp is_inner_byte(c) when c >= 128 and c < 192

  def valid?(str), do: do_valid?(str)
  defp do_valid?(<<a::32, rest::binary>>) when Bitwise.band(a, 0x80808080) == 0, do: do_valid?(rest)
  defp do_valid?(<<a::8, rest::binary>>) when a < 128, do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, rest::binary>>) when a >= 192 and a < 224 and is_inner_byte(b), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, rest::binary>>) when a >= 224 and a < 240 and is_inner_byte(b) and is_inner_byte(c), do: do_valid?(rest)
  defp do_valid?(<<a::8, b::8, c::8, d::8, rest::binary>>) when a >= 240 and is_inner_byte(b) and is_inner_byte(c) and is_inner_byte(d), do: do_valid?(rest)
  defp do_valid?(<<>>), do: true
  defp do_valid?(_str), do: false
end

defmodule Cowboy do
  def valid?(str), do: validate_utf8(str, 0) == 0

  defp validate_utf8(<<>>, state), do: state
  defp validate_utf8(<<c::8, rest::bits>>, 0) when c < 128, do: validate_utf8(rest, 0)
  defp validate_utf8(<<c::8, rest::bits>>, 2) when c >= 128 and c < 144, do: validate_utf8(rest, 0)
  defp validate_utf8(<<c::8, rest::bits>>, 3) when c >= 128 and c < 144, do: validate_utf8(rest, 2)
  defp validate_utf8(<<c::8, rest::bits>>, 5) when c >= 128 and c < 144, do: validate_utf8(rest, 2)
  defp validate_utf8(<<c::8, rest::bits>>, 7) when c >= 128 and c < 144, do: validate_utf8(rest, 3)
  defp validate_utf8(<<c::8, rest::bits>>, 8) when c >= 128 and c < 144, do: validate_utf8(rest, 3)
  defp validate_utf8(<<c::8, rest::bits>>, 2) when c >= 144 and c < 160, do: validate_utf8(rest, 0)
  defp validate_utf8(<<c::8, rest::bits>>, 3) when c >= 144 and c < 160, do: validate_utf8(rest, 2)
  defp validate_utf8(<<c::8, rest::bits>>, 5) when c >= 144 and c < 160, do: validate_utf8(rest, 2)
  defp validate_utf8(<<c::8, rest::bits>>, 6) when c >= 144 and c < 160, do: validate_utf8(rest, 3)
  defp validate_utf8(<<c::8, rest::bits>>, 7) when c >= 144 and c < 160, do: validate_utf8(rest, 3)
  defp validate_utf8(<<c::8, rest::bits>>, 2) when c >= 160 and c < 192, do: validate_utf8(rest, 0)
  defp validate_utf8(<<c::8, rest::bits>>, 3) when c >= 160 and c < 192, do: validate_utf8(rest, 2)
  defp validate_utf8(<<c::8, rest::bits>>, 4) when c >= 160 and c < 192, do: validate_utf8(rest, 2)
  defp validate_utf8(<<c::8, rest::bits>>, 6) when c >= 160 and c < 192, do: validate_utf8(rest, 3)
  defp validate_utf8(<<c::8, rest::bits>>, 7) when c >= 160 and c < 192, do: validate_utf8(rest, 3)
  defp validate_utf8(<<c::8, rest::bits>>, 0) when c >= 194 and c < 224, do: validate_utf8(rest, 2)
  defp validate_utf8(<<224::8, rest::bits>>, 0), do: validate_utf8(rest, 4)
  defp validate_utf8(<<c::8, rest::bits>>, 0) when c >= 225 and c < 237, do: validate_utf8(rest, 3)
  defp validate_utf8(<<237::8, rest::bits>>, 0), do: validate_utf8(rest, 5)
  defp validate_utf8(<<c::8, rest::bits>>, 0) when c == 238 or c == 239, do: validate_utf8(rest, 3)
  defp validate_utf8(<<240::8, rest::bits>>, 0), do: validate_utf8(rest, 6)
  defp validate_utf8(<<c::8, rest::bits>>, 0) when c >= 241 and c <= 243, do: validate_utf8(rest, 7)
  defp validate_utf8(<<244::8, rest::bits>>, 0), do: validate_utf8(rest, 8)
  defp validate_utf8(_, _), do: 1
end

Benchee.run(
  %{
    "old" => fn input -> String.valid?(input) end,
    "cowboy" => fn input -> Cowboy.valid?(input) end,
    "32" => fn input -> BulkMatch32.valid?(input) end,
    "1024" => fn input -> BulkMatch1024.valid?(input) end,
    "512" => fn input -> BulkMatch512.valid?(input) end,
  },
  time: 10,
  memory_time: 2,
  inputs: %{
    "micro" => String.duplicate("a", 10),
    "medium" => String.duplicate("a", 10_002),
  }
)


##### With input medium #####
Name             ips        average  deviation         median         99th %
1024         99.56 K       10.04 μs   ±100.61%        9.58 μs       21.96 μs
512          99.11 K       10.09 μs    ±91.80%        9.75 μs       14.17 μs
old          34.32 K       29.14 μs    ±20.36%       28.75 μs       35.04 μs
32           32.79 K       30.50 μs    ±14.86%       30.17 μs       35.22 μs
cowboy       17.79 K       56.22 μs    ±10.55%       55.75 μs       73.99 μs

Comparison:
1024         99.56 K
512          99.11 K - 1.00x slower +0.0461 μs
old          34.32 K - 2.90x slower +19.09 μs
32           32.79 K - 3.04x slower +20.45 μs
cowboy       17.79 K - 5.60x slower +46.18 μs

Memory usage statistics:

Name      Memory usage
1024           11264 B
512            11888 B - 1.06x memory usage +624 B
old              656 B - 0.06x memory usage -10608 B
32               656 B - 0.06x memory usage -10608 B
cowboy           656 B - 0.06x memory usage -10608 B

**All measurements for memory usage were the same**

##### With input micro #####
Name             ips        average  deviation         median         99th %
old           2.80 M      356.65 ns ±11605.38%         250 ns         417 ns
32            2.78 M      360.32 ns ±11340.30%         250 ns         417 ns
cowboy        2.57 M      388.62 ns ±10957.31%         291 ns         458 ns
512           2.47 M      405.04 ns ±10510.44%         292 ns         458 ns
1024          2.45 M      407.58 ns ±10235.15%         292 ns         458 ns

Comparison:
old           2.80 M
32            2.78 M - 1.01x slower +3.67 ns
cowboy        2.57 M - 1.09x slower +31.98 ns
512           2.47 M - 1.14x slower +48.39 ns
1024          2.45 M - 1.14x slower +50.93 ns

Memory usage statistics:

Name      Memory usage
old              656 B
32               656 B - 1.00x memory usage +0 B
cowboy           656 B - 1.00x memory usage +0 B
512              656 B - 1.00x memory usage +0 B
1024             656 B - 1.00x memory usage +0 B

**All measurements for memory usage were the same**

I'm not sure that the improvements to runtime are worth the tradeoff on memory; I think I'm going to leave UTF-8 detection alone for now.

@mtrudel merged commit 6ab40a1 into main Jan 17, 2023
@mtrudel deleted the websocket_perf branch January 17, 2023 20:51
@sleipnir

> I'm not sure that the improvements to runtime are worth the tradeoff on memory; I think I'm going to leave UTF-8 detection alone for now.

In the medium case it was faster, wasn't it? Could you use this only for cases where the gain is certain?

@mtrudel (Owner, Author) commented Jan 17, 2023

> > I'm not sure that the improvements to runtime are worth the tradeoff on memory; I think I'm going to leave UTF-8 detection alone for now.
>
> In the medium case it was faster, wasn't it? Could you use this only for cases where the gain is certain?

It's much faster (about 3x). Unfortunately it's also about 20x more memory intensive.

@sleipnir

> It's much faster (about 3x). Unfortunately it's also about 20x more memory intensive.

Suggestion: again, it might be a case of letting the user choose via configuration how they want this to behave.
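A minimal sketch of what that gating could look like (hypothetical; `BulkMatch512` is the module from the benchmark above, and the cutoff is illustrative, not measured):

defmodule AdaptiveValid do
  # Below the cutoff the bulk matcher showed little or no speed win in the
  # benchmarks above, so the extra memory isn't justified there
  @cutoff 1_024

  def valid?(str) when byte_size(str) >= @cutoff, do: BulkMatch512.valid?(str)
  def valid?(str), do: String.valid?(str)
end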
