
Conversation

ggerganov (Member)

I think this would make the stream slicing discussed in #14924 (comment) less prominent.
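
For context, this is roughly the idea behind least-recently-used slot selection. A minimal sketch with illustrative struct and field names, not the actual server code in this branch: an incoming request prefers the idle slot that has been unused the longest.

```cpp
// Hypothetical sketch of LRU slot selection (illustrative names, not the real llama-server types).
#include <cstdint>
#include <vector>

struct server_slot {
    bool    is_processing = false;
    int64_t t_last_used   = -1; // timestamp of the last task handled by this slot
};

// Pick an idle slot, preferring the one that has sat unused the longest.
server_slot * select_slot_lru(std::vector<server_slot> & slots) {
    server_slot * best = nullptr;
    for (auto & slot : slots) {
        if (slot.is_processing) {
            continue;
        }
        if (best == nullptr || slot.t_last_used < best->t_last_used) {
            best = &slot;
        }
    }
    return best; // nullptr if every slot is busy
}
```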

ggerganov (Member Author) commented Jul 29, 2025

@JohannesGaessler It would be useful to rerun the 128-slot benchmark with this branch and see if this change makes a positive impact.

JohannesGaessler (Collaborator) commented Jul 29, 2025

It doesn't seem to make a difference:

| --parallel | Runtime PR [s] | Runtime PR + LRU patch [s] |
| --- | --- | --- |
| 8 | 349.66 | 349.38 |
| 16 | 254.05 | 253.76 |
| 32 | 208.52 | 208.77 |
| 64 | 184.03 | 183.99 |
| 128 | 176.16 | 174.51 |

ggerganov closed this Jul 30, 2025
JohannesGaessler (Collaborator) commented Jul 30, 2025

Sorry, I did the benchmark wrong. The default of vllm bench is to front-load all of the requests (infinite request rate) rather than to schedule them with a Poisson distribution. These are the results with a Poisson distribution at a rate of 5 requests/s, which spreads the requests over roughly 200 seconds:

| --parallel | Request rate | Runtime PR [s] | Runtime PR + LRU patch [s] |
| --- | --- | --- | --- |
| 8 | inf | 349.66 | 349.38 |
| 16 | inf | 254.05 | 253.76 |
| 32 | inf | 208.52 | 208.77 |
| 64 | inf | 184.03 | 183.99 |
| 128 | inf | 176.16 | 174.51 |
| 8 | 5.0 | 351.84 | 352.18 |
| 16 | 5.0 | 249.04 | 248.38 |
| 32 | 5.0 | 207.06 | 206.49 |
| 64 | 5.0 | 214.43 | 208.21 |
| 128 | 5.0 | 224.43 | 210.03 |

With the scheduling logic on master there is a regression as the number of slots increases; with the patch in this PR it is greatly reduced.
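
To make the difference between the two arrival patterns concrete, here is a minimal sketch of the process being modelled (not the vllm bench code itself): with a finite rate the inter-arrival gaps are exponentially distributed, so ~1000 requests at 5 requests/s arrive over roughly 200 s, while an infinite rate front-loads everything at t = 0.

```cpp
// Sketch of the two request-arrival patterns compared above.
#include <cstdio>
#include <random>
#include <vector>

// Generate arrival timestamps for n requests.
// rate <= 0 is treated as "infinite rate": everything arrives at t = 0 (front-loaded).
// Otherwise inter-arrival gaps are drawn from Exp(rate), i.e. a Poisson process.
static std::vector<double> arrival_times(int n, double rate, std::mt19937 & rng) {
    std::vector<double> t(n, 0.0);
    if (rate <= 0.0) {
        return t; // front-loaded: all requests at t = 0
    }
    std::exponential_distribution<double> gap(rate);
    double cur = 0.0;
    for (int i = 0; i < n; ++i) {
        cur += gap(rng);
        t[i] = cur;
    }
    return t;
}

int main() {
    std::mt19937 rng(1234);

    const auto front = arrival_times(1000, 0.0, rng); // "inf" request rate
    const auto pois  = arrival_times(1000, 5.0, rng); // 5 requests/s on average

    printf("front-loaded: last arrival at %.1f s\n", front.back()); // 0.0 s
    printf("poisson(5):   last arrival at %.1f s\n", pois.back());  // ~200 s
    return 0;
}
```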

ggerganov (Member Author)

This is in line with my expectations. Thanks.

So the results that we were discussing in the other thread were all obtained with all of the requests front-loaded? In that case the stream slicing issue is not relevant at all, because all of the slots are running throughout the entire benchmark.

Curious how the vllm bench looks with a request rate of 5?

JohannesGaessler (Collaborator) commented Jul 30, 2025

> So the results that we were discussing in the other thread were all obtained with all of the requests front-loaded?

Yes.

> Curious how the vllm bench looks with a request rate of 5?

You mean a benchmark of vllm using the vllm tool? I haven't tested it so far, but I think it's not going to be very interesting. As long as the throughput is high enough that a server is basically idle when the last request arrives at ~200 s, all servers are going to finish very close to each other unless they're stalling for some reason.
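
A back-of-the-envelope way to state this, assuming $N$ requests arriving at rate $r$ and a server that keeps up with that rate:

$$
T_\text{total} \approx \frac{N}{r} + T_\text{drain}, \qquad T_\text{drain} \ll \frac{N}{r}
$$

so the total runtime is dominated by the ~200 s arrival window, and any server that doesn't stall finishes close to it.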

ggerganov (Member Author)

Ah yes - makes sense.
