Continuation for problems under high load #431

Closed · atavistock opened this issue Nov 25, 2024 · 20 comments

@atavistock commented Nov 25, 2024

Followup from #412

ab is definitely now passing, but unfortunately the benchmarks are not. ;(

They're actually using wrk to do the benchmarking. I was able to reproduce the issue easily with ab, but it's still breaking under load. After seeing it pass with ab and not pass in the benchmarks, I reproduced the problem with wrk.

To reproduce:

  • Make pipeline.lua with:
init = function(args)
  local r = {}
  local depth = tonumber(args[1]) or 1
  for i=1,depth do
    r[i] = wrk.format()
  end
  req = table.concat(r)
end

request = function()
  return req
end
  • Run:
    wrk -H 'Host: localhost' -H 'Accept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7' -H 'Connection: keep-alive' --latency -d 15 -c 1024 --timeout 8 -t 10 http://localhost:4000/plaintext -s ./pipeline.lua -- 16


There is a flood of [error] ** (Bandit.HTTPError) Request line HTTP error: "\n" before the server stops responding.

For one, I'm not really sure what that error means. Did it get an empty line as a request?

Two, is it possible that when these occur it's not closing the connection?

@robredmond

@atavistock - curious if you were seeing either of the following as well?

  • [error] ** (Bandit.HTTPError) Header read timeout
  • [error] ** (Bandit.HTTPError) Excess body read

@atavistock (Author)

Not that I've seen.

It seems to be mostly [error] ** (Bandit.HTTPError) Request line HTTP error: "\n"

@atavistock (Author) commented Nov 28, 2024

There's also a flood of

[notice] Handler :default switched from :sync to :drop mode
[notice] Handler :default switched from :drop to :sync mode

before the other errors, but that looks like it's coming from Erlang logger code:
https://github.com/erlang/otp/blob/187a6be313a6cf36f92ae75fd2f30c3a6938c249/lib/kernel/src/logger_proxy.erl#L159

@mtrudel (Owner) commented Nov 29, 2024

[notice] Handler :default switched from :sync to :drop mode
[notice] Handler :default switched from :drop to :sync mode

Those definitely aren't coming from Bandit; they're likely from the Erlang logger, as you suspect.

In terms of [error] ** (Bandit.HTTPError) Request line HTTP error: "\n" errors, those are likely happening because Bandit is seeing an extra newline in between requests on a keep-alive connection. A single complete keep-alive client request looks like so:

GET /plaintext HTTP/1.1\r\n
Host: localhost:4000\r\n
Connection: keep-alive\r\n
\r\n

Subsequent requests from the client should look similar without any extraneous characters in between (other than perhaps a \r\n per RFC9112§2.2). The error you're seeing suggests otherwise, likely something like:

GET /plaintext HTTP/1.1\r\n
Host: localhost:4000\r\n
Connection: keep-alive\r\n
\r\n
\n

(Note that this parsing isn't done by Bandit directly, but delegated to :erlang.decode_packet.) It's possible that we're doing something wrong here; a minimal repro case would be immensely helpful.
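
For concreteness, a minimal sketch (not from Bandit's source) of how a stray leading newline surfaces from :erlang.decode_packet; the {:http_error, line} tuple is presumably what Bandit reports as the error above:

# a bare "\n" is not a valid request line, so decode_packet should hand it
# back as an {:http_error, line} tuple rather than as a parsed request
:erlang.decode_packet(:http_bin, "\nGET /plaintext HTTP/1.1\r\n", [])
#=> {:ok, {:http_error, "\n"}, "GET /plaintext HTTP/1.1\r\n"}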

@atavistock (Author)

So I've tried getting a minimal repro case, and it only seems to happen in the full environment. I've also just verified that these errors do not happen with Cowboy.

Here are the steps to see what's happening:

  1. Clone or fork the full benchmark suite from https://github.com/TechEmpower/FrameworkBenchmarks
  2. Most logging is disabled, so edit frameworks/Elixir/phoenix/config/prod.exs and comment out log_protocol_errors: false and config :phoenix, :logger, false
  3. Only the plaintext check is failing, so from the root of the git directory run ./tfb --test phoenix --type plaintext

@atavistock (Author) commented Dec 2, 2024

Some random thoughts:

I've switched out the configs to use Cowboy and this error does not happen, so any shared Erlang code is likely not the problem.

This is only happening with the plaintext test case. A very similar case returning json seems to work fine. They're in the same controller, with the entire pipeline stripped down, and both are one-line functions:

Working

  # avoid namespace collision
  def _json(conn, _params) do
    json(conn, %{message: "Hello, World!"})
  end

Fails

  def plaintext(conn, _params) do
    text(conn, "Hello, World!")
  end

Digging a little deeper, the Phoenix controller code is nearly identical for these two functions:

def json(conn, data) do
  response = Phoenix.json_library().encode_to_iodata!(data)
  send_resp(conn, conn.status || 200, "application/json", response)
end

...

def text(conn, data) do
  send_resp(conn, conn.status || 200, "text/plain", to_string(data))
end

The only differences I can see are the content type and that the json version is iodata (a list).
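
As a sanity check on the iodata difference (my addition, not from the thread), iodata flattens to the same bytes as the equivalent binary before hitting the socket:

# nested iodata and a flat binary are byte-identical on the wire
IO.iodata_to_binary(["Hello", [", ", "World!"]]) == "Hello, World!"
#=> true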

@atavistock (Author)

Okay, so I'm still collecting data, hoping something catches.

I've forked Bandit to be able to see internal state. Requests are coming through as:

GET /plaintext HTTP/1.1\r\nHost: tfb-server\r\nConnection: keep-alive\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\n\r\n

If I look at the socket.buffer that is being passed to :erlang.decode_packet, the requests look fine for the first few thousand, but at some point the buffer gets weird and I start seeing

GET /plaintext HTTP/1.1\r
GET /plaintext HTTP/1.1\r
GET /plaintext HTTP/1.1\r
GET /plaintext HTTP/1.1\r
\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r\nGET /plaintext HTTP/1.1\r\nAccept: text/plain,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\nHost: tfb-server\r\nConnection: keep-alive\r\n\r

Once that starts I no longer see regular requests in the buffer.

@atavistock (Author)

Okay. This all might come down to a part of the HTTP/1.1 spec that was "official" but barely used. So wrk supports using HTTP1.1 pipelining https://en.wikipedia.org/wiki/HTTP_pipelining .

Basically 5 http requests are getting packed together into a single request body. In the first few thousand of these Bandit is discarding the last 4 and only processing the first one, up until this :erlang.decode_packet is not returning {:more, _}. For some reason :erlang.decode_packet sends a {:more, _} which starts parsing those parts differently which is where the broken requests start coming in.

You can see the first case (dropping all but the first) by just using this

echo -en "GET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\nGET /test HTTP/1.1\r\nHost: localhost\r\n\r\n" | nc localhost 4000

You should see all 5 responses in the same response body, but only the first comes back.

I don't have an easy way to trigger the chunk behavior.
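
One way to try forcing the split-buffer behavior by hand is to write the pipelined requests across two TCP sends. A rough sketch (not from the thread, assuming a server on localhost:4000; the TCP stack may still coalesce the two writes):

request = "GET /test HTTP/1.1\r\nHost: localhost\r\n\r\n"
payload = String.duplicate(request, 5)
# split partway through the first request so the server sees a partial line
{first, rest} = String.split_at(payload, 25)

{:ok, sock} = :gen_tcp.connect(~c"localhost", 4000, [:binary, active: false])
:ok = :gen_tcp.send(sock, first)
Process.sleep(50)  # give the server a chance to do a partial read
:ok = :gen_tcp.send(sock, rest)
{:ok, reply} = :gen_tcp.recv(sock, 0, 5_000)
IO.puts(reply)
:gen_tcp.close(sock)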

@atavistock (Author)

All that said, this could very easily be :erlang.decode_packet not supporting HTTP pipelining.

@atavistock (Author)

So I wrote a little script to verify that :erlang.decode_packet DOES work as expected with HTTP pipelining:

defmodule Parser do
  def process(packet) do
    case :erlang.decode_packet(:http_bin, packet, []) do
      # decode_packet returns {:more, length | :undefined} for an incomplete
      # packet, so check the buffer itself rather than the reported length
      {:more, _length} ->
        if byte_size(packet) > 0, do: IO.puts("Remaining content '#{inspect(packet)}'")

      {:ok, {:http_request, method, request_target, version}, rest} ->
        {headers, remainder} = process_headers(rest)

        IO.puts("""
        Headers #{inspect(headers)}
        Request: #{inspect(method)} #{inspect(request_target)} #{inspect(version)}
        """)

        process(remainder)

      {:error, reason} ->
        IO.puts("Request line unknown error: #{inspect(reason)}")
        nil
    end
  end

  defp process_headers(packet, headers \\ []) do
    case :erlang.decode_packet(:httph_bin, packet, []) do
      # return a tuple here too, so truncated headers don't crash process/1
      {:more, _length} ->
        if byte_size(packet) > 0, do: IO.puts("Header skipped? '#{inspect(packet)}'")
        {headers, ""}

      {:ok, {:http_header, _, header, _, value}, rest} ->
        header_name = header |> to_string() |> String.downcase(:ascii)
        process_headers(rest, [{header_name, value} | headers])

      {:ok, :http_eoh, rest} ->
        {headers, rest}

      {:error, reason} ->
        IO.puts("Unknown header issue: #{inspect(reason)}")
        {headers, ""}
    end
  end
end

request = "GET /index HTTP/1.1\r\nHost: localhost\r\n\r\n"
packet = String.duplicate(request, 10)
Parser.process(packet)
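
Run as a script (for example elixir parser.exs, filename assumed), this should print all ten parsed requests with nothing left over, which supports the conclusion that :erlang.decode_packet copes with back-to-back requests on its own.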

@atavistock (Author)

@mtrudel As I've dug deeper I've found this is not a load-related issue at all. This specific issue has a lot of things going on. I'm thinking I might close this and open a different issue targeted at what we now know. Thoughts?

@mtrudel (Owner) commented Dec 4, 2024

Sorry just catching up on this now. Pipelining is rife with corner cases and (as you say) only spottily supported. I have no plan or intention to support it; if wrk is using it then I'm happy to call that unsupported.

That being said, tighter repro cases are always good! If you feel like you can get repros for other things that we SHOULD be supporting but aren't, I'd love to see the reports! Smaller / more focused / reproducible strongly preferred.

Thanks again for all your work on this!

@mtrudel (Owner) commented Dec 4, 2024

@atavistock looking closer at your repro case here, that's actually something Bandit should support. Let me dig in a bit and see what I'm missing.

@mtrudel (Owner) commented Dec 4, 2024

Yep, I see the issue; it's on Bandit's side. I've characterized it and should have a fix up soon.

@mtrudel (Owner) commented Dec 4, 2024

@atavistock are you able to check if the work in #437 resolves this?

@atavistock (Author) commented Dec 4, 2024

Woohoo! That definitely fixes it. It does expose a different issue, where roughly 1 in 1,000 requests hits a timeout (yay, only like 0.1%), but this solves the real problem! Thanks so much for getting this one.

@mtrudel (Owner) commented Dec 5, 2024

Great to hear! I assume that 0.1% case is the one you just filed on a separate ticket?

@atavistock (Author) commented Dec 5, 2024 via email

@atavistock (Author)

Thanks so much for chasing this down with me and getting it resolved.

@mtrudel (Owner) commented Dec 6, 2024

Fixed in 1.6.1, just released
