Continuation for problems under high load #431

Closed · atavistock opened this issue Nov 25, 2024 · 20 comments

@atavistock commented Nov 25, 2024

Followup from #412

ab is definitely now passing, but unfortunately the benchmarks are not. ;(

They're actually using wrk to do the benchmarking. I was able to reproduce the issue easily with ab, but it's still breaking under load. After seeing it pass with ab and not pass in the benchmarks, I reproduced the problem with wrk.

To reproduce:

  • Make pipeline.lua with:
init = function(args)
  local r = {}
  local depth = tonumber(args[1]) or 1
  for i=1,depth do
    r[i] = wrk.format()
  end
  req = table.concat(r)
end

request = function()
  return req
end
  • Run:
    wrk -H 'Host: localhost' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 1024 --timeout 8 -t 10 http://localhost:4000/plaintext -s ./pipeline.lua -- 16


There is a flood of [error] ** (Bandit.HTTPError) Request line HTTP error: "\n" before the server stops responding.

For one, I'm not really sure what that error means. Did it get an empty line as a request?

Two, is it possible that when these occur it's not closing the connection?

@robredmond

@atavistock - curious if you were seeing either of the following as well?

  • [error] ** (Bandit.HTTPError) Header read timeout
  • [error] ** (Bandit.HTTPError) Excess body read

@atavistock (Author)

Not that I've seen.

It seems to be mostly [error] ** (Bandit.HTTPError) Request line HTTP error: "\n"

@atavistock (Author) commented Nov 28, 2024

There's also a flood of

[notice] Handler :default switched from :sync to :drop mode
[notice] Handler :default switched from :drop to :sync mode

before the other errors, but that looks like it's coming from Erlang logger code:
https://github.com/erlang/otp/blob/187a6be313a6cf36f92ae75fd2f30c3a6938c249/lib/kernel/src/logger_proxy.erl#L159

@mtrudel (Owner) commented Nov 29, 2024

[notice] Handler :default switched from :sync to :drop mode
[notice] Handler :default switched from :drop to :sync mode

Those definitely aren't coming from Bandit; they're likely from the Erlang logger, as you suspect.

In terms of [error] ** (Bandit.HTTPError) Request line HTTP error: "\n" errors, those are likely happening because Bandit is seeing an extra newline in between requests on a keep-alive connection. A single complete keep-alive client request looks like so:

GET /plaintext HTTP/1.1\r\n
Host: localhost:4000\r\n
Connection: keep-alive\r\n
\r\n

Subsequent requests from the client should look similar without any extraneous characters in between (other than perhaps a \r\n per RFC9112§2.2). The error you're seeing suggests otherwise, likely something like:

GET /plaintext HTTP/1.1\r\n
Host: localhost:4000\r\n
Connection: keep-alive\r\n
\r\n
\n

(Note that this parsing isn't done by Bandit directly, but delegated to :erlang.decode_packet.) It's possible that we're doing something wrong here; a minimal repro case would be immensely helpful.
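
For concreteness, a minimal sketch (not from Bandit's source) of how a stray leading newline surfaces from :erlang.decode_packet; the {:http_error, line} tuple is presumably what Bandit reports as the error above:

# a bare "\n" is not a valid request line, so decode_packet should hand it
# back as an {:http_error, line} tuple rather than as a parsed request
:erlang.decode_packet(:http_bin, "\nGET /plaintext HTTP/1.1\r\n", [])
#=> {:ok, {:http_error, "\n"}, "GET /plaintext HTTP/1.1\r\n"}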

@atavistock (Author)

So I've tried getting a minimal repro case, and it only seems to happen in the full environment. I've also just verified that these errors do not happen with Cowboy.

Here are the steps to see what's happening:

  1. Clone or fork the full benchmark suite from https://github.com/TechEmpower/FrameworkBenchmarks
  2. Most logging is disabled, so edit frameworks/Elixir/phoenix/config/prod.exs and comment out log_protocol_errors: false and config :phoenix, :logger, false
  3. Only the plaintext check is failing, so from the root of the git directory run ./tfb --test phoenix --type plaintext

@atavistock (Author) commented Dec 2, 2024

Some random thoughts:

I've switched out the configs to use Cowboy and this error does not happen, so any shared Erlang code is likely not the problem.

This is only happening with the plaintext test case. A very similar case returning json seems to work fine. They're in the same controller, with the entire pipeline stripped down, and both are one-line functions:

Working

  # avoid namespace collision
  def _json(conn, _params) do
    json(conn, %{message: "Hello, World!"})
  end

Fails

  def plaintext(conn, _params) do
    text(conn, "Hello, World!")
  end

Digging a little deeper, the Phoenix controller code is nearly identical for these two functions:

def json(conn, data) do
  response = Phoenix.json_library().encode_to_iodata!(data)
  send_resp(conn, conn.status || 200, "application/json", response)
end

...

def text(conn, data) do
  send_resp(conn, conn.status || 200, "text/plain", to_string(data))
end

The only differences I can see are the content type and that the json version is iodata (a list).
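
As a sanity check on the iodata difference (my addition, not from the thread), iodata flattens to the same bytes as the equivalent binary before hitting the socket:

# nested iodata and a flat binary are byte-identical on the wire
IO.iodata_to_binary(["Hello", [", ", "World!"]]) == "Hello, World!"
#=> true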

@atavistock (Author)

Okay, so I'm still collecting data, hoping something catches.

I've forked Bandit to be able to see internal state. Requests are coming through as:

GET /plaintext HTTP/1.1\r\nHost: tfb-server\r\nConnection: keep-alive\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\n\r\n

If I look at the socket.buffer that is being passed to :erlang.decode_packet, the requests look fine for the first few thousand, but at some point the buffer gets weird and I start seeing

GET /plaintext HTTP/1.1\r
GET /plaintext HTTP/1.1\r
GET /plaintext HTTP/1.1\r
GET /plaintext HTTP/1.1\r
\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r

Once that starts I no longer see regular requests in the buffer.

@atavistock (Author)

Okay. This all might come down to a part of the HTTP/1.1 spec that was "official" but barely used. So wrk supports using HTTP1.1 pipelining https://en.wikipedia.org/wiki/HTTP_pipelining .

Basically 5 http requests are getting packed together into a single request body. In the first few thousand of these Bandit is discarding the last 4 and only processing the first one, up until this :erlang.decode_packet is not returning {:more, _}. For some reason :erlang.decode_packet sends a {:more, _} which starts parsing those parts differently which is where the broken requests start coming in.

You can see the first case (dropping all but the first) by just using this

echo -en "GET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\n" | nc localhost 4000

You should see all 5 responses in the same response body, but only the first comes back.

I don't have an easy way to trigger the chunk behavior.
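
One way to try forcing the split-buffer behavior by hand is to write the pipelined requests across two TCP sends. A rough sketch (not from the thread, assuming a server on localhost:4000; the TCP stack may still coalesce the two writes):

request = "GET /test HTTP/1.1\r\nHost: localhost\r\n\r\n"
payload = String.duplicate(request, 5)
# split partway through the first request so the server sees a partial line
{first, rest} = String.split_at(payload, 25)

{:ok, sock} = :gen_tcp.connect(~c"localhost", 4000, [:binary, active: false])
:ok = :gen_tcp.send(sock, first)
Process.sleep(50)  # give the server a chance to do a partial read
:ok = :gen_tcp.send(sock, rest)
{:ok, reply} = :gen_tcp.recv(sock, 0, 5_000)
IO.puts(reply)
:gen_tcp.close(sock)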

@atavistock (Author)

All that said, this could very easily be :erlang.decode_packet not supporting HTTP pipelining.

@atavistock (Author)

So I wrote a little script to verify that :erlang.decode_packet DOES work as expected with HTTP pipelining:

defmodule Parser do
  def process(packet) do
    case :erlang.decode_packet(:http_bin, packet, []) do
      # decode_packet returns {:more, length | :undefined} for an incomplete
      # packet, so check the buffer itself rather than the reported length
      {:more, _length} ->
        if byte_size(packet) > 0, do: IO.puts("Remaining content '#{inspect(packet)}'")

      {:ok, {:http_request, method, request_target, version}, rest} ->
        {headers, remainder} = process_headers(rest)

        IO.puts("""
        Headers #{inspect(headers)}
        Request: #{inspect(method)} #{inspect(request_target)} #{inspect(version)}
        """)

        process(remainder)

      {:error, reason} ->
        IO.puts("Request line unknown error: #{inspect(reason)}")
        nil
    end
  end

  defp process_headers(packet, headers \\ []) do
    case :erlang.decode_packet(:httph_bin, packet, []) do
      # return a tuple here too, so truncated headers don't crash process/1
      {:more, _length} ->
        if byte_size(packet) > 0, do: IO.puts("Header skipped? '#{inspect(packet)}'")
        {headers, ""}

      {:ok, {:http_header, _, header, _, value}, rest} ->
        header_name = header |> to_string() |> String.downcase(:ascii)
        process_headers(rest, [{header_name, value} | headers])

      {:ok, :http_eoh, rest} ->
        {headers, rest}

      {:error, reason} ->
        IO.puts("Unknown header issue: #{inspect(reason)}")
        {headers, ""}
    end
  end
end

request = "GET /index HTTP/1.1\r\nHost: localhost\r\n\r\n"
packet = String.duplicate(request, 10)
Parser.process(packet)
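
Run as a script (for example elixir parser.exs, filename assumed), this should print all ten parsed requests with nothing left over, which supports the conclusion that :erlang.decode_packet copes with back-to-back requests on its own.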

@atavistock (Author)

@mtrudel As I've dug deeper I've found this is not a load-related issue at all. This specific issue has a lot of things going on. I'm thinking I might close this and open a different issue targeted at what we now know. Thoughts?

@mtrudel (Owner) commented Dec 4, 2024

Sorry just catching up on this now. Pipelining is rife with corner cases and (as you say) only spottily supported. I have no plan or intention to support it; if wrk is using it then I'm happy to call that unsupported.

That being said, tighter repro cases are always good! If you feel like you can get repros for other things that we SHOULD be supporting but aren't, I'd love to see the reports! Smaller / more focused / reproducible strongly preferred.

Thanks again for all your work on this!

@mtrudel (Owner) commented Dec 4, 2024

@atavistock looking closer at your repro case here, that's actually something Bandit should support. Let me dig in a bit and see what I'm missing.

@mtrudel (Owner) commented Dec 4, 2024

Yep, I see the issue; it's on Bandit's side. I've characterized it and should have a fix up soon.

@mtrudel (Owner) commented Dec 4, 2024

@atavistock are you able to check if the work in #437 resolves this?

@atavistock (Author) commented Dec 4, 2024

Woohoo! That definitely fixes it. It does expose a different issue, where roughly 1 in 1,000 requests hits a timeout (yay, only like 0.1%), but this solves the real problem! Thanks so much for getting this one.

@mtrudel (Owner) commented Dec 5, 2024

Great to hear! I assume that 0.1% case is the one you just filed on a separate ticket?

@atavistock (Author) commented Dec 5, 2024 via email

@atavistock (Author)

Thanks so much for chasing this down with me and getting it resolved.

@mtrudel (Owner) commented Dec 6, 2024

Fixed in 1.6.1, just released
