Library does not scale with multiple cores #531
A couple of things to try:
A bigger effort would be to replace the single massive lock with per-stream locks. There might still need to be a large lock, but the goal would be to avoid keeping it locked for long or for most operations. It's not exactly the same situation, but grpc-go made a similar change a couple of years ago to reduce contention: grpc/grpc-go#1962
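A minimal sketch of that layering, using hypothetical `Store` and `StreamState` types rather than h2's actual internals: the store-wide lock is held only long enough to look a stream up, and the longer per-stream work happens under that stream's own lock.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical per-stream state; in h2 this would be the send/recv
// state that currently lives behind the one connection-wide lock.
struct StreamState {
    window: i32, // e.g. a flow-control window
}

struct Store {
    // The store itself keeps one lock, but it is only taken briefly:
    // to insert, remove, or look up a stream.
    streams: Mutex<HashMap<u32, Arc<Mutex<StreamState>>>>,
}

impl Store {
    fn with_stream<R>(&self, id: u32, f: impl FnOnce(&mut StreamState) -> R) -> Option<R> {
        // Short critical section: clone the Arc, then drop the store lock.
        let stream = self.streams.lock().unwrap().get(&id).cloned()?;
        // Longer critical section: only this one stream is blocked.
        let mut state = stream.lock().unwrap();
        Some(f(&mut state))
    }
}
```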
Some initial measurements from replacing the lock in `streams.rs` with one from `parking_lot`: https://gist.github.com/bIgBV/4d6d76773a948734ebef1367ef5221d5
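For context, the swap itself is mostly mechanical. A minimal sketch with an illustrative `SendBuffer` type (not h2's real one); `parking_lot` locks are not poisoned, so the `.unwrap()` calls disappear:

```rust
use parking_lot::Mutex; // swapped in for std::sync::Mutex

// Illustrative buffer, not h2's real SendBuffer.
struct SendBuffer<T> {
    inner: Mutex<Vec<T>>,
}

impl<T> SendBuffer<T> {
    fn push(&self, frame: T) {
        // parking_lot mutexes are not poisoned, so lock() returns the
        // guard directly instead of a Result.
        self.inner.lock().push(frame);
    }

    fn pop(&self) -> Option<T> {
        self.inner.lock().pop()
    }
}
```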
@bIgBV It seems that the comparison results of `parking_lot` and the original implementation are similar?
The libstd `Mutex` was recently replaced with a new implementation that is both much smaller and significantly faster. There is much less to lose now with per-stream locking.
@jeffutter thanks for the excellent write-up! A way forward would be to do what I suggested: make per-stream locks so we only need to lock the stream store infrequently, i.e. when adding or removing a stream.
@seanmonstar Yeah. I think that would help my specific use case greatly, since I create all of the streams up-front and re-use them for many requests, so the global locks wouldn't occur mid-work. I might try to take a stab at making that change in my free time, although it'll probably take me a while to get up to speed on h2 internals.
@seanmonstar I’ve been reading through the h2 internals.

My understanding is that ultimately only one `Frame` can be written to the underlying IO at one time. So there needs to be either a single buffer of `Frame`s to send, or I suppose a set of buffers and some mechanism to choose which one to take a frame from next. Currently all of the `Frame`s get put in the `SendBuffer` on the connection.

Let me know if that's making any sense 🙃 or if you have any other suggestions as to how you'd go about implementing this. Also, if you have any general resources for understanding HTTP/2 streams and flow control beyond the spec, I'd love to read up more there too. Thanks again for any help here. Hopefully with a bit of guidance I can help find a solution.
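One way to picture the "set of buffers and some mechanism to choose" idea is per-stream queues plus a round-robin list of ready streams. This is purely an illustrative sketch, not h2's actual design:

```rust
use std::collections::{HashMap, VecDeque};

type StreamId = u32;
struct Frame; // stand-in for a real HTTP/2 frame

#[derive(Default)]
struct Sender {
    // One queue per stream instead of a single shared send buffer.
    per_stream: HashMap<StreamId, VecDeque<Frame>>,
    // Streams that have at least one queued frame, in round-robin order.
    ready: VecDeque<StreamId>,
}

impl Sender {
    fn enqueue(&mut self, id: StreamId, frame: Frame) {
        let buf = self.per_stream.entry(id).or_default();
        if buf.is_empty() {
            // First frame for this stream: mark it ready.
            self.ready.push_back(id);
        }
        buf.push_back(frame);
    }

    // Choosing the next frame is the only step that must be serialized,
    // since a single frame is written to the underlying IO at a time.
    fn next_frame(&mut self) -> Option<Frame> {
        let id = self.ready.pop_front()?;
        let buf = self.per_stream.get_mut(&id)?;
        let frame = buf.pop_front();
        if !buf.is_empty() {
            // Rotate the stream to the back so others get a turn.
            self.ready.push_back(id);
        }
        frame
    }
}
```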
This PR adds a simple benchmark to measure the impact of performance changes, e.g. a potential fix for this issue: #531

The benchmark is simple: have a client send `100_000` requests to a server and wait for a response. Output:

```
cargo bench

H2 running in current-thread runtime at 127.0.0.1:5928:
Overall: 353ms.
Fastest: 91ms
Slowest: 315ms
Avg    : 249ms

H2 running in multi-thread runtime at 127.0.0.1:5929:
Overall: 533ms.
Fastest: 88ms
Slowest: 511ms
Avg    : 456ms
```
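For reference, the shape of such a measurement might look like the sketch below (hypothetical; the PR's actual harness may differ, e.g. by running clients concurrently):

```rust
use std::time::{Duration, Instant};

// Hypothetical measurement loop: run `n` requests (n > 0), record each
// round-trip, then report overall/fastest/slowest/average times.
fn measure(n: usize, mut do_request: impl FnMut()) {
    let start = Instant::now();
    let mut times = Vec::with_capacity(n);
    for _ in 0..n {
        let t = Instant::now();
        do_request();
        times.push(t.elapsed());
    }
    let overall = start.elapsed();
    let fastest = times.iter().min().copied().unwrap_or_default();
    let slowest = times.iter().max().copied().unwrap_or_default();
    let avg = times.iter().sum::<Duration>() / n as u32;
    println!("Overall: {overall:?}. Fastest: {fastest:?} Slowest: {slowest:?} Avg: {avg:?}");
}
```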
@seanmonstar FYI I'm working on this now
As demonstrated by the benchmarks in this reddit post, you can see the `rust_tonic_mt` benchmark falling behind in performance as the number of threads is increased. The likely cause for this could be that a big portion of the shared state is behind this `Mutex`.