
[Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion #7209

Merged: 14 commits merged into vllm-project:main from feat/optimize-evict on Dec 14, 2024

Conversation

@llsj14 (Contributor) commented Aug 6, 2024

FIX #6923

Summary

  • I discovered that the eviction logic built around the OrderedDict free_table in Evictor V1 and V2 slows down overall performance (especially TTFT) when prefix caching is enabled.
  • In some scenarios, enabling prefix caching actually makes the system slower than running without it.
  • The evict function is called frequently when allocating a new block, because in prefix caching mode no block is evicted until the block space is full.
  • The eviction logic was slow because free_table is declared as an OrderedDict (a hash map threaded with a doubly linked list), and evict() scans it to locate the block to evict by content hash (Evictor V1) or block ID (Evictor V2).
  • Using a priority queue with lazy deletion finds the eviction candidate much faster; a condensed sketch of the approach follows below.
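For readers skimming the thread, here is a minimal, self-contained sketch of the approach, assuming the field and tuple layout used in the diff snippets quoted later in this conversation. `LazyLRUEvictor` is a hypothetical stand-in for vLLM's `LRUEvictor`, condensing the PR rather than reproducing it:

```python
import heapq
from typing import Dict, List, Tuple


class BlockMetaData:
    """Per-block record kept in the free table (fields as used in this PR)."""

    def __init__(self, content_hash: int, num_hashed_tokens: int,
                 last_accessed: float):
        self.content_hash = content_hash
        self.num_hashed_tokens = num_hashed_tokens
        self.last_accessed = last_accessed


class LazyLRUEvictor:
    # Rebuild the heap once it grows this many times larger than the free
    # table; stale entries accumulate under lazy deletion.
    CLEANUP_THRESHOLD = 50

    def __init__(self):
        self.free_table: Dict[int, BlockMetaData] = {}
        self.priority_queue: List[Tuple[float, int, int, int]] = []

    def add(self, block_id: int, content_hash: int, num_hashed_tokens: int,
            last_accessed: float):
        self.free_table[block_id] = BlockMetaData(content_hash,
                                                  num_hashed_tokens,
                                                  last_accessed)
        # Min-heap entry: oldest access time first; -num_hashed_tokens
        # prefers blocks with more hashed tokens among equally old blocks.
        heapq.heappush(
            self.priority_queue,
            (last_accessed, -num_hashed_tokens, block_id, content_hash))
        self._cleanup_if_necessary()

    def update(self, block_id: int, last_accessed: float):
        # Lazy deletion: only the free table is updated here; the now-stale
        # heap entry is detected and skipped at eviction time.
        self.free_table[block_id].last_accessed = last_accessed

    def evict(self) -> Tuple[int, int]:
        while self.priority_queue:
            last_accessed, _, block_id, content_hash = heapq.heappop(
                self.priority_queue)
            # The entry is valid only if the block is still free and its
            # recorded access time matches the table (i.e., it is not stale).
            if (block_id in self.free_table and
                    self.free_table[block_id].last_accessed == last_accessed):
                self.free_table.pop(block_id)
                return block_id, content_hash
        raise ValueError("No usable cache memory left")

    def _cleanup_if_necessary(self):
        if len(self.priority_queue) > self.CLEANUP_THRESHOLD * len(
                self.free_table):
            self._cleanup()

    def _cleanup(self):
        # O(n) rebuild of the heap from the authoritative free table.
        self.priority_queue = [
            (meta.last_accessed, -meta.num_hashed_tokens, bid,
             meta.content_hash) for bid, meta in self.free_table.items()
        ]
        heapq.heapify(self.priority_queue)
```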

Result Verification

  • As shown in the following output, the block ID and content hash match between the as-is and to-be implementations, as expected.
  • With this change, the evict function itself runs several hundred times faster.
===============================
evicted_block_id compare:  12010   12010
content_hash_compare:  -7334740008364413937   -7334740008364413937
as-is evict duration:  7.0807114243507385 ms
to-be evict duration:  0.012848526239395142 ms
===============================
evicted_block_id compare:  12038   12038
content_hash_compare:  -7008894356950570757   -7008894356950570757
as-is evict duration:  7.1028973907232285 ms
to-be evict duration:  0.008581206202507019 ms
===============================

Performance

  • I measured TTFT using llmperf with the Llama3-8B model on an A100 GPU.
  • The benchmark used a 1536-token input (a 512-token shared prefix plus 1024 random tokens) and a 512-token output.
  • With this commit, the system is faster while prefix caching is enabled.
  • The speed-up column is computed relative to the run without prefix caching at the same client count (e.g., 841 ms / 541 ms ≈ 1.55 for 16 clients).

as-is

| Model     | Num Clients | Block Manager | Prefix Caching | TTFT (mean) | Speed Up           |
|-----------|-------------|---------------|----------------|-------------|--------------------|
| Llama3-8B | 16          | v2            | X              | 841 ms      |                    |
| Llama3-8B | 32          | v2            | X              | 1441 ms     |                    |
| Llama3-8B | 64          | v2            | X              | 2619 ms     |                    |
| Llama3-8B | 128         | v2            | X              | 4729 ms     |                    |
| Llama3-8B | 16          | v2            | O              | 1962 ms     | 0.43 (slowed down) |
| Llama3-8B | 32          | v2            | O              | 8382 ms     | 0.17 (slowed down) |
| Llama3-8B | 64          | v2            | O              | 12665 ms    | 0.21 (slowed down) |
| Llama3-8B | 128         | v2            | O              | 22439 ms    | 0.21 (slowed down) |

to-be

| Model     | Num Clients | Block Manager | Prefix Caching | TTFT (mean) | Speed Up |
|-----------|-------------|---------------|----------------|-------------|----------|
| Llama3-8B | 16          | v2            | O              | 541 ms      | 1.55     |
| Llama3-8B | 32          | v2            | O              | 901 ms      | 1.60     |
| Llama3-8B | 64          | v2            | O              | 1563 ms     | 1.68     |
| Llama3-8B | 128         | v2            | O              | 2947 ms     | 1.60     |


github-actions bot commented Aug 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@youkaichao (Member)

thanks for the contribution!

cc @alexm-neuralmagic @cadedaniel for block manager related optimization.


```python
def update(self, block_id: int, last_accessed: float):
    self.free_table[block_id].last_accessed = last_accessed

def _cleanup_if_necessary(self):
    if len(self.priority_queue) > 50 * len(self.free_table):
```
@Yard1 (Collaborator)

that 50 constant should be a defined global.

@llsj14 (Contributor Author)

@Yard1, thank you for your comments. I have fixed the issue and rebased my code.

@Yard1 (Collaborator) commented Aug 6, 2024

FYI, this PR seems to be optimizing the same path as #7193.

@cadedaniel (Collaborator)

At a high level these fixes look great; we will need the evictor folks to review in more detail (sorry for the second ping @robertgshaw2-neuralmagic).

@robertgshaw2-redhat (Collaborator)

> At a high level these fixes look great; we will need the evictor folks to review in more detail (sorry for the second ping @robertgshaw2-neuralmagic).

Thanks. Alex is going to take a look from our side, since he has most recently been in this codepath optimizing BMv2.

@llsj14 llsj14 force-pushed the feat/optimize-evict branch from 8071838 to 95495a7 Compare August 7, 2024 00:05
@alexm-redhat alexm-redhat left a comment

Thanks for revealing this bottleneck and fixing it! It is a good idea to use a heap + dict to quickly access an LRU item. Left some minor comments.


```python
def add(self, block_id: int, content_hash: int, num_hashed_tokens: int,
        last_accessed: float):
    self.free_table[block_id] = BlockMetaData(content_hash,
                                              num_hashed_tokens,
                                              last_accessed)
    heapq.heappush(
        self.priority_queue,
        (last_accessed, -num_hashed_tokens, content_hash, block_id))
```
Collaborator
Nice trick with the -num_hashed_tokens to provide heap sorting.
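For readers unfamiliar with the trick: Python's heapq is a min-heap, so negating num_hashed_tokens makes the block with the most hashed tokens sort first among blocks with the same last_accessed value. A tiny illustration with made-up values:

```python
import heapq

heap = []
# (last_accessed, -num_hashed_tokens, block_id): the access times tie, so
# the second tuple element decides; -48 < -16 puts block 3 at the top.
heapq.heappush(heap, (100.0, -16, 7))
heapq.heappush(heap, (100.0, -48, 3))
assert heapq.heappop(heap) == (100.0, -48, 3)
```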

```python
    heapq.heappush(
        self.priority_queue,
        (last_accessed, -num_hashed_tokens, content_hash, block_id))
    self._cleanup_if_necessary()
```
Collaborator

Why was it necessary to delay the cleanup? Did you find it to be too slow?

@llsj14 (Contributor Author) Aug 7, 2024

The reason I applied lazy deletion and event-triggered cleanup is that eagerly deleting an outdated block from the heap would first require searching for it, which is O(n), on top of the O(log n) removal itself. Instead, I skip and pop outdated blocks by checking them against the free_table during the eviction operation, and only clean up the priority queue when it consumes too much memory with outdated blocks.

Since the cleanup itself is O(n log n), calling the cleanup function every time would make the system too slow.

@llsj14 (Contributor Author) Aug 7, 2024

The ideal scenario is when the cleanup function is not needed, as outdated blocks are naturally popped out during the eviction operation.

@llsj14 (Contributor Author)

@alexm-neuralmagic, thanks to your comment, I fixed the data-type mistake and optimized the performance of the cleanup operation.

I now build the new priority queue directly from the free_table using heapify, achieving O(n) complexity; a sketch of the difference follows.
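As a rough sketch of why the heapify-based rebuild is faster (hypothetical helper names; the push-based variant is shown only for contrast, using the same BlockMetaData fields as in the snippets above):

```python
import heapq

def rebuild_with_pushes(free_table):
    # Push-based rebuild: n pushes at O(log n) each, i.e., O(n log n) total.
    pq = []
    for block_id, meta in free_table.items():
        heapq.heappush(pq, (meta.last_accessed, -meta.num_hashed_tokens,
                            block_id, meta.content_hash))
    return pq

def rebuild_with_heapify(free_table):
    # Heapify-based rebuild: build the list once, then heapify it in O(n).
    pq = [(meta.last_accessed, -meta.num_hashed_tokens, block_id,
           meta.content_hash) for block_id, meta in free_table.items()]
    heapq.heapify(pq)
    return pq
```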

```diff
@@ -76,7 +79,8 @@ class LRUEvictor(Evictor):
     """

     def __init__(self):
-        self.free_table: OrderedDict[int, BlockMetaData] = OrderedDict()
+        self.free_table: Dict[int, BlockMetaData] = {}
```
Collaborator

Dict is definitely faster here

```diff
-from typing import OrderedDict, Tuple
+from typing import Dict, List, Tuple

+CLEANUP_THRESHOLD = 50
```
Collaborator

I would make this a static class member, since it is used only inside the scope of the class below.

@llsj14 (Contributor Author)

Thank you, I fixed this. The threshold is now a class-level constant, as sketched below.
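For illustration, the constant moved into the class could look like the following sketch, which reuses the _cleanup_if_necessary shape quoted earlier (not necessarily the exact merged code):

```python
class LRUEvictor(Evictor):
    # Class-level constant: the heap is rebuilt once it holds more than
    # CLEANUP_THRESHOLD times as many entries as the free table.
    CLEANUP_THRESHOLD: int = 50

    def _cleanup_if_necessary(self):
        if len(self.priority_queue) > self.CLEANUP_THRESHOLD * len(
                self.free_table):
            self._cleanup()
```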

@alexm-redhat (Collaborator)

By the way, I would rename the title of the PR to "[Performance] ....", since it is not a bugfix.

@llsj14 llsj14 changed the title [Bugfix][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion [Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion Aug 7, 2024
@llsj14 (Contributor Author) commented Aug 9, 2024

/ready

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 9, 2024
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from fd520b2 to 273da1d Compare August 26, 2024 02:41
@llsj14 (Contributor Author) commented Aug 26, 2024

I rebased the code to resolve the conflict.


mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @llsj14.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 26, 2024
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from 273da1d to 5d2bbcc Compare November 29, 2024 03:55
@mergify mergify bot removed the needs-rebase label Nov 29, 2024
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from 5d2bbcc to a7ee9c4 Compare November 29, 2024 04:24
@llsj14 (Contributor Author) commented Nov 29, 2024

@alexm-neuralmagic @Yard1
I rebased and tested my code again. I would appreciate your reviews.

@llsj14 llsj14 force-pushed the feat/optimize-evict branch from e5eb212 to 7e6b71c Compare December 11, 2024 14:56
@llsj14 (Contributor Author) commented Dec 11, 2024

In my local test, the test_eviction_alloc_mixed sometimes passes and sometimes fails.

tests/core/block/test_prefix_caching_block.py ................. [  6%]
............................................................... [ 29%]
............................................................... [ 53%]
............................................................... [ 76%]
............................................................... [100%]
=================== 269 passed, 2 warnings in 6.49s ===================

I believe the assertion in this part is not strictly necessary, because all blocks can be candidates for eviction if they have the same last-accessed time. The key difference is that the previous code searches for blocks from the beginning of the free table, while my implementation does not.

@leiwen83 @cadedaniel @comaniac
Could you check whether it would be fine to remove the assertion mentioned above and review my PR please?
-> Update: I changed my code to make the test pass. Ties are now broken by block_id, so the earlier block is selected under otherwise equal conditions (see the illustration below).
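The tie-break falls out of plain tuple comparison: placing block_id before content_hash in the heap entry means that when last_accessed and num_hashed_tokens are equal, the smallest (earliest) block ID is popped first. A small illustration with made-up values:

```python
import heapq

heap = []
# (last_accessed, -num_hashed_tokens, block_id, content_hash): the first
# two elements tie, so block_id decides and block 5 is evicted first.
heapq.heappush(heap, (100.0, -16, 12, 1111))
heapq.heappush(heap, (100.0, -16, 5, 2222))
assert heapq.heappop(heap)[2] == 5
```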

@llsj14 llsj14 force-pushed the feat/optimize-evict branch from e82e821 to 0038286 Compare December 13, 2024 09:13
@llsj14 (Contributor Author) commented Dec 13, 2024

@comaniac
Could you review this PR, please?
This PR was previously reviewed, and I have been testing its stability by running it locally for several months. It has also successfully passed unit tests and CI checks.

Comment on lines 92 to 106
```python
while self.priority_queue:
    # Lazy deletion algorithm is applied.
    last_accessed, _, block_id, content_hash = heapq.heappop(
        self.priority_queue)
    if (block_id in self.free_table and
            self.free_table[block_id].last_accessed == last_accessed):
        self.free_table.pop(block_id)
        return block_id, content_hash
```
@comaniac (Collaborator)

I'm a bit worried about this lazy deletion algorithm, as it is pretty hard for others to understand and it is easy to introduce bugs in corner cases. Here are some questions people may ask when reading this code:

  1. How can a block be in the heap but not in the free table? A related question is why we need to clean up the heap.
  2. How can a block in the heap and the free table have different last-accessed times?

@llsj14 (Contributor Author) Dec 14, 2024

@comaniac Thank you for the valuable feedback.
I've added comments explaining the lazy deletion process.

I understand your concerns about the lazy deletion algorithm, as the cleanup shows O(n log n) time complexity when triggered. However, since outdated entries are also removed through ordinary heap pops, I believe cleanup is not an operation that happens frequently.

In fact, I also considered using a doubly linked list plus a dictionary for this optimization. While those structures are generally O(1), if the sort key involves more than the last-accessed time (which always increases), such as num_hashed_tokens in this code, adding an entry could take O(n) time to keep the doubly linked list sorted. That is why I opted for a priority queue. Nevertheless, I acknowledge the concern about lazy deletion holding outdated entries. (A sketch of the doubly-linked-list alternative follows for comparison.)
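For comparison, the O(1) alternative mentioned here is the classic dictionary-plus-doubly-linked-list LRU, which Python's OrderedDict provides out of the box. This is a minimal hypothetical sketch (not vLLM code), and it only works while recency alone determines eviction order:

```python
from collections import OrderedDict
from typing import Optional


class SimpleLRU:
    def __init__(self):
        self._table: "OrderedDict[int, object]" = OrderedDict()

    def add(self, block_id: int, meta: object) -> None:
        self._table[block_id] = meta  # inserted at the MRU end

    def touch(self, block_id: int) -> None:
        # O(1): relink the node to the most-recently-used end.
        self._table.move_to_end(block_id)

    def evict(self) -> Optional[int]:
        if not self._table:
            return None
        # O(1): pop from the least-recently-used end.
        block_id, _ = self._table.popitem(last=False)
        return block_id
```

As soon as a secondary key such as num_hashed_tokens participates in the ordering, keeping the list sorted on insert degrades to O(n), which matches the reasoning above.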

@comaniac (Collaborator)

Yes, I used a doubly linked list in v1 prefix caching and it works well, but it would be tedious for v0.

@llsj14 (Contributor Author)

Oh I see. I'll check the v1 implementation later as well.

@comaniac comaniac left a comment

Otherwise LGTM

llsj14 and others added 14 commits December 14, 2024 01:59
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from dd3165c to 46798ad Compare December 14, 2024 01:59
@comaniac comaniac merged commit 8869368 into vllm-project:main Dec 14, 2024
51 checks passed
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
joennlae pushed a commit to 44ai-labs/vllm that referenced this pull request Jan 19, 2025
abmfy pushed a commit to abmfy/vllm-flashinfer that referenced this pull request Jan 24, 2025
abmfy pushed a commit to abmfy/vllm-flashinfer that referenced this pull request Jan 24, 2025