
Conversation

ggerganov (Member)

I think this would make the stream slicing discussed in #14924 (comment) less prominent.
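
For context, this is roughly the idea behind least-recently-used slot selection. A minimal sketch with illustrative struct and field names, not the actual server code in this branch: an incoming request prefers the idle slot that has been unused the longest.

```cpp
// Hypothetical sketch of LRU slot selection (illustrative names, not the real llama-server types).
#include <cstdint>
#include <vector>

struct server_slot {
    bool    is_processing = false;
    int64_t t_last_used   = -1; // timestamp of the last task handled by this slot
};

// Pick an idle slot, preferring the one that has sat unused the longest.
server_slot * select_slot_lru(std::vector<server_slot> & slots) {
    server_slot * best = nullptr;
    for (auto & slot : slots) {
        if (slot.is_processing) {
            continue;
        }
        if (best == nullptr || slot.t_last_used < best->t_last_used) {
            best = &slot;
        }
    }
    return best; // nullptr if every slot is busy
}
```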

ggerganov (Member Author) commented Jul 29, 2025

@JohannesGaessler It would be useful to rerun the 128-slot benchmark with this branch and see if this change makes a positive impact.

JohannesGaessler (Collaborator) commented Jul 29, 2025

It doesn't seem to make a difference:

| --parallel | Runtime PR [s] | Runtime PR + LRU patch [s] |
| --- | --- | --- |
| 8 | 349.66 | 349.38 |
| 16 | 254.05 | 253.76 |
| 32 | 208.52 | 208.77 |
| 64 | 184.03 | 183.99 |
| 128 | 176.16 | 174.51 |

ggerganov closed this Jul 30, 2025
JohannesGaessler (Collaborator) commented Jul 30, 2025

Sorry, I did the benchmark wrong. The default of vllm bench is to front-load all of the requests (infinite request rate) rather than to schedule them with a Poisson distribution. These are the results with a Poisson distribution at a rate of 5 requests/s, which spreads the requests over roughly 200 seconds:

| --parallel | Request rate | Runtime PR [s] | Runtime PR + LRU patch [s] |
| --- | --- | --- | --- |
| 8 | inf | 349.66 | 349.38 |
| 16 | inf | 254.05 | 253.76 |
| 32 | inf | 208.52 | 208.77 |
| 64 | inf | 184.03 | 183.99 |
| 128 | inf | 176.16 | 174.51 |
| 8 | 5.0 | 351.84 | 352.18 |
| 16 | 5.0 | 249.04 | 248.38 |
| 32 | 5.0 | 207.06 | 206.49 |
| 64 | 5.0 | 214.43 | 208.21 |
| 128 | 5.0 | 224.43 | 210.03 |

With the scheduling logic on master there is a regression as the number of slots increases; with the patch in this PR it is greatly reduced.
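
To make the difference between the two arrival patterns concrete, here is a minimal sketch of the process being modelled (not the vllm bench code itself): with a finite rate the inter-arrival gaps are exponentially distributed, so ~1000 requests at 5 requests/s arrive over roughly 200 s, while an infinite rate front-loads everything at t = 0.

```cpp
// Sketch of the two request-arrival patterns compared above.
#include <cstdio>
#include <random>
#include <vector>

// Generate arrival timestamps for n requests.
// rate <= 0 is treated as "infinite rate": everything arrives at t = 0 (front-loaded).
// Otherwise inter-arrival gaps are drawn from Exp(rate), i.e. a Poisson process.
static std::vector<double> arrival_times(int n, double rate, std::mt19937 & rng) {
    std::vector<double> t(n, 0.0);
    if (rate <= 0.0) {
        return t; // front-loaded: all requests at t = 0
    }
    std::exponential_distribution<double> gap(rate);
    double cur = 0.0;
    for (int i = 0; i < n; ++i) {
        cur += gap(rng);
        t[i] = cur;
    }
    return t;
}

int main() {
    std::mt19937 rng(1234);

    const auto front = arrival_times(1000, 0.0, rng); // "inf" request rate
    const auto pois  = arrival_times(1000, 5.0, rng); // 5 requests/s on average

    printf("front-loaded: last arrival at %.1f s\n", front.back()); // 0.0 s
    printf("poisson(5):   last arrival at %.1f s\n", pois.back());  // ~200 s
    return 0;
}
```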

ggerganov (Member Author)

This is in line with my expectations. Thanks.

So the results that we were discussing in the other thread were all obtained with all of the requests front-loaded? In that case the stream slicing issue is not relevant at all, because all of the slots are running throughout the entire benchmark.

Curious how the vllm bench looks with a request rate of 5?

JohannesGaessler (Collaborator) commented Jul 30, 2025

> So the results that we were discussing in the other thread were all obtained with all of the requests front-loaded?

Yes.

> Curious how the vllm bench looks with a request rate of 5?

You mean a benchmark of vllm using the vllm tool? I haven't tested it so far, but I think it's not going to be very interesting. As long as the throughput is high enough that a server is basically idle when the last request arrives at ~200 s, all servers are going to finish very close to each other unless they're stalling for some reason.
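
A back-of-the-envelope way to state this, assuming $N$ requests arriving at rate $r$ and a server that keeps up with that rate:

$$
T_\text{total} \approx \frac{N}{r} + T_\text{drain}, \qquad T_\text{drain} \ll \frac{N}{r}
$$

so the total runtime is dominated by the ~200 s arrival window, and any server that doesn't stall finishes close to it.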

ggerganov (Member Author)

Ah yes - makes sense.
