Performance issues/expectations #92
How difficult would it be to test it using the C# client to have a primary baseline of ESDB itself (to some extent)?
Thanks for the suggestion, @yordis!
This seems to point to Spear being the bottleneck here. Any advice on how we can proceed?
I'm a bit busy at the moment so I haven't given this a proper look, but 20-40ms does seem quite high. I'm also surprised it's cheaper to write than to read. In general I would recommend tuning the read size option.

I took a look at a network capture of a similar read: https://github.com/NFIBrokerage/spear/files/12797716/spear.tgz (tar + gzipped pcapng).

We send the HTTP/2 HEADERS from the ReadReq, but it looks like we're waiting for the TCP ACK from EventStoreDB before sending the body of the request. Excluding the time it takes for ACKs, the request should be very fast (single-digit milliseconds), so I wonder if this is something that can be fixed by tuning TCP options. I'm not quite sure what is blocking on the ACK (Mint or gen_tcp/ssl), so this needs some further debugging.
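(For reference, socket-level options can be passed through to the transport via Spear's `mint_opts`; a sketch where the particular values are only examples to experiment with, not recommendations:)

```elixir
# pass gen_tcp/ssl socket options through Spear -> Mint -> the underlying transport
{:ok, conn} =
  Spear.Connection.start_link(
    connection_string: "esdb://localhost:2113?tls=false",
    mint_opts: [
      transport_opts: [
        nodelay: true,   # disable Nagle's algorithm (don't batch small writes)
        sndbuf: 65_536,  # larger kernel send buffer
        recbuf: 65_536   # larger kernel receive buffer
      ]
    ]
  )
```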
We've captured the network traffic too, to compare the C# implementation with Spear. The C# client does seem to be less chatty.
A few other (useful?) observations from our production setup:
Here's a bit more info from our field testing:
@the-mikedavis, do you reckon it is possible to have similar performance characteristics between `stream!` and `read_stream`?
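(To make the comparison concrete, here's a rough sketch of timing both calls back to back; the stream name is a placeholder and `read_stream`'s `:max_count` option is my reading of the Spear docs, so double-check it:)

```elixir
Mix.install([{:spear, "~> 1.0"}])

{:ok, conn} = Spear.Connection.start_link(connection_string: "esdb://localhost:2113?tls=false")

time_ms = fn fun ->
  {micros, result} = :timer.tc(fun)
  IO.puts("#{micros / 1000} ms")
  result
end

# Spear.stream!/3: lazy read that pages through the whole stream
time_ms.(fn ->
  conn |> Spear.stream!("Document:134", from: :start) |> Enum.to_list()
end)

# Spear.read_stream/3: a single bounded read
time_ms.(fn ->
  Spear.read_stream(conn, "Document:134", from: :start, max_count: 300)
end)
```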
@the-mikedavis Assuming that both the C# and Elixir clients are using the same endpoint (https://developers.eventstore.com/clients/grpc/reading-events.html#reading-from-a-stream), I would expect them to have the same latency. Do you think the C# client reads in a bigger batch by default?
I believe the C# client reads with effectively no limit and it's set rather low in Spear. You can try tuning that option, but I don't think it should have an effect, since individual `read_stream` calls are slow. I suspect this might be solved by tuning some options in gen_tcp/ssl or changing the …
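(A sketch of what tuning those read sizes could look like, reusing `conn` from the scripts in this thread; `:chunk_size` and `:max_count` are my reading of Spear's options, so double-check them against the docs for your version:)

```elixir
# ask Spear.stream! for more events per ReadReq while paging lazily
events =
  conn
  |> Spear.stream!("Document:134", from: :start, chunk_size: 1_000)
  |> Enum.to_list()

# or raise the cap on a single eager Spear.read_stream call
{:ok, batch} = Spear.read_stream(conn, "Document:134", from: :start, max_count: 1_000)
```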
Sorry for the long delay! Things have been hectic lately. I took a look at those packet captures, looking with a filter of …
In the C# case the read took twice as long on the server, but I see that occasionally happening locally too: it looks like the server has some sort of caching to speed up repeated reads, so some take half as much time. It still looks to me like this latency is coming from the server and isn't really dependent on the client. Could you post a small reproduction repository or gist of the C# code you used in your test? I don't do much dotnet development, so that would make it much easier for me to compare.
Sorry for the late reply, @the-mikedavis. It basically does something as simple as this:

```csharp
using System;
using System.Diagnostics;
using EventStore.Client;

var stream_name = "Document:134";
var settings = EventStoreClientSettings.Create("esdb://localhost:2113?tls=false");
var client = new EventStoreClient(settings);

var timer = new Stopwatch();
timer.Start();

var result = client.ReadStreamAsync(
    Direction.Forwards,
    stream_name,
    StreamPosition.Start
);

await foreach (var @event in result) {
    Console.WriteLine(@event.Event.EventType);
}

timer.Stop();
```

In the attached project this is actually done twice: the first run seems to always be quite slow (cold-start?).
I can reproduce the timing differences. With a similar Elixir script:

```elixir
Mix.install [{:spear, path: "."}, {:jason, "~> 1.0"}]

{:ok, conn} = Spear.Connection.start_link(connection_string: "esdb://localhost:2113?tls=false")

stream_name = "Document:134"

start1 = :erlang.monotonic_time()

conn
|> Spear.stream!(stream_name, from: :start, direction: :forwards)
|> Enum.each(fn event ->
  IO.puts event.type
end)

end1 = :erlang.monotonic_time()

start2 = :erlang.monotonic_time()

conn
|> Spear.stream!(stream_name, from: :start, direction: :forwards)
|> Enum.each(fn event ->
  IO.puts event.type
end)

end2 = :erlang.monotonic_time()

IO.inspect(:erlang.convert_time_unit(end1 - start1, :native, :microsecond) / 1000, label: "read no. 1")
IO.inspect(:erlang.convert_time_unit(end2 - start2, :native, :microsecond) / 1000, label: "read no. 2")
```

I see ~90ms for the first read with dotnet and ~2-3ms for the subsequent read. With Spear I see ~130ms for the first read and ~80ms for the second. I'll push up the change I mentioned earlier, which improves the numbers a little: 80ms for the first read and ~25ms for the second.

I took some new packet captures comparing these, and the HTTP/2 requests themselves seem very fast: for both Spear and dotnet I see the network request taking no more than 2ms for the second read. So the slow-down must be somewhere in either Mint, gpb or Spear.
I was curious about the chatty WINDOW_UPDATE frames, so I took a look into Mint. Mint sends a WINDOW_UPDATE both for the request stream (i.e. the ReadReq) and for the whole connection. That's fine and fits the spec, but it sends the WINDOW_UPDATE frames eagerly any time you handle a DATA frame.

I changed Mint to collect all of the window updates and send them in a single batch per call instead.
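(If someone wants to watch the flow-control windows while debugging this, Mint exposes them via `Mint.HTTP2.get_window_size/2`. A rough, non-gRPC sketch; the plain-HTTP `/info` path against the server is an assumption, and the interesting part is just the window-size inspection:)

```elixir
Mix.install([{:mint, "~> 1.0"}])

# HTTP/2 over plain TCP (prior knowledge), printing window sizes as frames arrive
{:ok, conn} = Mint.HTTP.connect(:http, "localhost", 2113, protocols: [:http2])
{:ok, conn, ref} = Mint.HTTP.request(conn, "GET", "/info", [], nil)

loop = fn loop, conn ->
  receive do
    message ->
      case Mint.HTTP.stream(conn, message) do
        {:ok, conn, responses} ->
          unless Enum.any?(responses, &match?({:done, ^ref}, &1)) do
            IO.inspect(Mint.HTTP2.get_window_size(conn, :connection), label: "connection window")
            IO.inspect(Mint.HTTP2.get_window_size(conn, {:request, ref}), label: "request window")
            loop.(loop, conn)
          end

        {:error, _conn, reason, _responses} ->
          IO.inspect(reason, label: "error")

        :unknown ->
          loop.(loop, conn)
      end
  end
end

loop.(loop, conn)
```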
Feel free to give #93 a try. This is not a very high priority problem for me, so I will probably not look at this again for a little while. If you'd like to dig deeper, I can recommend checking out packet captures with Wireshark and creating flamegraphs. Nothing stuck out to me as odd, but you can look for yourself. Set up a script that reads from a reasonably long stream and have it print out its PID so you can attach a profiler to it.

It would also be interesting to try this with another HTTP/2/gRPC client (for example gun, or maybe https://github.com/elixir-grpc/grpc); the protobuf definitions are here: https://github.com/EventStore/EventStore/blob/0af9906b39c9cd9fc6e301d5c4690b60b2fddcb1/src/Protos/Grpc/streams.proto. Or eliminate protobuf (de)serialization altogether and try to just get the network request to be as fast as possible, to narrow down what exactly is slow.
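(Not a flamegraph, but as a concrete starting point, here's a rough sketch of that kind of script using OTP's built-in :fprof as a stand-in; the stream name and connection string are placeholders:)

```elixir
Mix.install([{:spear, "~> 1.0"}])

{:ok, conn} = Spear.Connection.start_link(connection_string: "esdb://localhost:2113?tls=false")

# print the PID so you can also attach :observer or a tracer from another shell
IO.inspect(self(), label: "script pid")

# profile a single read of a reasonably long stream
:fprof.apply(
  fn ->
    conn
    |> Spear.stream!("Document:134", from: :start)
    |> Enum.to_list()
  end,
  []
)

:fprof.profile()
:fprof.analyse(dest: ~c"fprof.analysis")
```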
Despite what I said above about putting this down for a while, I gave this another look today 😄 Here's a script of what I was thinking about above:
@the-mikedavis, thank you for looking into that. |
I took a look into the change above. I also looked at splitting the timing up for the reads:
Ah, I think I know what the issue is. Could you try setting TCP nodelay and see how that fares?

```elixir
Spear.Connection.start_link(
  connection_string: "esdb://localhost:2113?tls=false",
  mint_opts: [transport_opts: [nodelay: true]]
)
```

We're sending pretty small packets for ReadReq and friends, and the TCP socket might be buffering those for a while. I see this reduce the time of the second read (in the script above) to ~1ms. If you can reproduce that, I would definitely consider setting this by default (within Spear), since we're sending somewhat small packets.
@the-mikedavis I think that might be it. I just saw https://brooker.co.za/blog/2024/05/09/nagle.html a couple of days ago and thought about Nagle's algorithm and TCP_NODELAY.
Hi!
We're using Spear in production and are noticing that performance might be less than ideal. We're not sure where exactly the problem resides - Spear, EventStoreDB or the network.
Our flow is as follows:

- read a stream (`SpearClient.stream!`)
- append new events (`SpearClient.append`)

Our `SpearClient` module is literally as simple as this:
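(As a rough idea of what such a thin wrapper can look like; the connection name and delegation below are assumptions, not the actual module:)

```elixir
# hypothetical sketch of a thin wrapper around Spear, not the module from this issue
defmodule SpearClient do
  # assumes a Spear.Connection is started elsewhere under this name, e.g.
  # {Spear.Connection, name: __MODULE__, connection_string: "esdb://..."}
  def stream!(stream_name, opts \\ []),
    do: Spear.stream!(__MODULE__, stream_name, opts)

  def append(events, stream_name, opts \\ []),
    do: Spear.append(events, __MODULE__, stream_name, opts)
end
```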
All this takes consistently approx. 100-120 ms in production, and almost as much locally, which seems rather slow.

When isolated to just the Spear calls (no logic of our own), we see the following on localhost:

- `SpearClient.stream!("stream_with_300_events") |> Enum.into([])`: between 40 and 60 ms
- `SpearClient.stream!("stream_with_20_events") |> Enum.into([])`: between 20 and 40 ms

In production, `append` typically takes a bit less than `stream!` + `Enum.into([])`, but still hovers around 50ms in the majority of cases.

As a side note, relatively simple queries to PostgreSQL (also deployed on Google Cloud) only take around 3-5 ms on average.
Would it be possible to confirm that these timings are higher than expected? Anything else we could debug or try to speed it up?
Thanks in advance for any advice!