
[Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion #7209

Merged: 14 commits merged into vllm-project:main from feat/optimize-evict on Dec 14, 2024

Conversation

@llsj14 (Contributor) commented Aug 6, 2024

FIX #6923

Summary

  • I discovered that the eviction logic built around the OrderedDict free_table in Evictor V1 and V2 slows down overall performance (especially TTFT) when prefix caching is enabled.
  • In some scenarios, enabling prefix caching actually makes the system slower than running without it.
  • The evict function is called frequently when allocating a new block, because in prefix caching mode no block is evicted until the block space is full.
  • The eviction logic was slow because free_table is declared as an OrderedDict (a hash map threaded with a doubly linked list), and evict() scans it to locate the block to evict by content hash (Evictor V1) or block ID (Evictor V2).
  • Using a priority queue with lazy deletion finds the eviction candidate much faster; a condensed sketch of the approach follows below.
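For readers skimming the thread, here is a minimal, self-contained sketch of the approach, assuming the field and tuple layout used in the diff snippets quoted later in this conversation. `LazyLRUEvictor` is a hypothetical stand-in for vLLM's `LRUEvictor`, condensing the PR rather than reproducing it:

```python
import heapq
from typing import Dict, List, Tuple


class BlockMetaData:
    """Per-block record kept in the free table (fields as used in this PR)."""

    def __init__(self, content_hash: int, num_hashed_tokens: int,
                 last_accessed: float):
        self.content_hash = content_hash
        self.num_hashed_tokens = num_hashed_tokens
        self.last_accessed = last_accessed


class LazyLRUEvictor:
    # Rebuild the heap once it grows this many times larger than the free
    # table; stale entries accumulate under lazy deletion.
    CLEANUP_THRESHOLD = 50

    def __init__(self):
        self.free_table: Dict[int, BlockMetaData] = {}
        self.priority_queue: List[Tuple[float, int, int, int]] = []

    def add(self, block_id: int, content_hash: int, num_hashed_tokens: int,
            last_accessed: float):
        self.free_table[block_id] = BlockMetaData(content_hash,
                                                  num_hashed_tokens,
                                                  last_accessed)
        # Min-heap entry: oldest access time first; -num_hashed_tokens
        # prefers blocks with more hashed tokens among equally old blocks.
        heapq.heappush(
            self.priority_queue,
            (last_accessed, -num_hashed_tokens, block_id, content_hash))
        self._cleanup_if_necessary()

    def update(self, block_id: int, last_accessed: float):
        # Lazy deletion: only the free table is updated here; the now-stale
        # heap entry is detected and skipped at eviction time.
        self.free_table[block_id].last_accessed = last_accessed

    def evict(self) -> Tuple[int, int]:
        while self.priority_queue:
            last_accessed, _, block_id, content_hash = heapq.heappop(
                self.priority_queue)
            # The entry is valid only if the block is still free and its
            # recorded access time matches the table (i.e., it is not stale).
            if (block_id in self.free_table and
                    self.free_table[block_id].last_accessed == last_accessed):
                self.free_table.pop(block_id)
                return block_id, content_hash
        raise ValueError("No usable cache memory left")

    def _cleanup_if_necessary(self):
        if len(self.priority_queue) > self.CLEANUP_THRESHOLD * len(
                self.free_table):
            self._cleanup()

    def _cleanup(self):
        # O(n) rebuild of the heap from the authoritative free table.
        self.priority_queue = [
            (meta.last_accessed, -meta.num_hashed_tokens, bid,
             meta.content_hash) for bid, meta in self.free_table.items()
        ]
        heapq.heapify(self.priority_queue)
```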

Result Verification

  • As shown in the following output, the block ID and content hash match between the as-is and to-be implementations, as expected.
  • With this change, the evict function itself runs several hundred times faster.
===============================
evicted_block_id compare:  12010   12010
content_hash_compare:  -7334740008364413937   -7334740008364413937
as-is evict duration:  7.0807114243507385 ms
to-be evict duration:  0.012848526239395142 ms
===============================
evicted_block_id compare:  12038   12038
content_hash_compare:  -7008894356950570757   -7008894356950570757
as-is evict duration:  7.1028973907232285 ms
to-be evict duration:  0.008581206202507019 ms
===============================

Performance

  • I measured TTFT using llmperf with the Llama3-8B model on an A100 GPU.
  • The benchmark used a 1536-token input (a 512-token shared prefix plus 1024 random tokens) and a 512-token output.
  • With this commit, the system is faster while prefix caching is enabled.
  • The speed-up column is computed relative to the run without prefix caching at the same client count (e.g., 841 ms / 541 ms ≈ 1.55 for 16 clients).

as-is

| Model     | Num Clients | Block Manager | Prefix Caching | TTFT (mean) | Speed Up           |
|-----------|-------------|---------------|----------------|-------------|--------------------|
| Llama3-8B | 16          | v2            | X              | 841 ms      |                    |
| Llama3-8B | 32          | v2            | X              | 1441 ms     |                    |
| Llama3-8B | 64          | v2            | X              | 2619 ms     |                    |
| Llama3-8B | 128         | v2            | X              | 4729 ms     |                    |
| Llama3-8B | 16          | v2            | O              | 1962 ms     | 0.43 (slowed down) |
| Llama3-8B | 32          | v2            | O              | 8382 ms     | 0.17 (slowed down) |
| Llama3-8B | 64          | v2            | O              | 12665 ms    | 0.21 (slowed down) |
| Llama3-8B | 128         | v2            | O              | 22439 ms    | 0.21 (slowed down) |

to-be

| Model     | Num Clients | Block Manager | Prefix Caching | TTFT (mean) | Speed Up |
|-----------|-------------|---------------|----------------|-------------|----------|
| Llama3-8B | 16          | v2            | O              | 541 ms      | 1.55     |
| Llama3-8B | 32          | v2            | O              | 901 ms      | 1.60     |
| Llama3-8B | 64          | v2            | O              | 1563 ms     | 1.68     |
| Llama3-8B | 128         | v2            | O              | 2947 ms     | 1.60     |


github-actions bot commented Aug 6, 2024

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build in the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@youkaichao (Member)

thanks for the contribution!

cc @alexm-neuralmagic @cadedaniel for block manager related optimization.


```python
def update(self, block_id: int, last_accessed: float):
    self.free_table[block_id].last_accessed = last_accessed

def _cleanup_if_necessary(self):
    if len(self.priority_queue) > 50 * len(self.free_table):
```
@Yard1 (Collaborator)

that 50 constant should be a defined global.

@llsj14 (Contributor Author)

@Yard1, thank you for your comments. I have fixed the issue and rebased my code.

@Yard1 (Collaborator) commented Aug 6, 2024

FYI, this PR seems to be optimizing the same path as #7193.

@cadedaniel (Collaborator)

At a high level these fixes look great; we will need the evictor folks to review in more detail (sorry for the second ping @robertgshaw2-neuralmagic).

@robertgshaw2-redhat (Collaborator)

> At a high level these fixes look great; we will need the evictor folks to review in more detail (sorry for the second ping @robertgshaw2-neuralmagic).

Thanks. Alex is going to take a look from our side, since he has most recently been in this codepath optimizing BMv2.

@llsj14 llsj14 force-pushed the feat/optimize-evict branch from 8071838 to 95495a7 Compare August 7, 2024 00:05
@alexm-redhat alexm-redhat left a comment

Thanks for revealing this bottleneck and fixing it! It is a good idea to use a heap + dict to quickly access an LRU item. Left some minor comments.


```python
def add(self, block_id: int, content_hash: int, num_hashed_tokens: int,
        last_accessed: float):
    self.free_table[block_id] = BlockMetaData(content_hash,
                                              num_hashed_tokens,
                                              last_accessed)
    heapq.heappush(
        self.priority_queue,
        (last_accessed, -num_hashed_tokens, content_hash, block_id))
```
Collaborator
Nice trick with the -num_hashed_tokens to provide heap sorting.
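For readers unfamiliar with the trick: Python's heapq is a min-heap, so negating num_hashed_tokens makes the block with the most hashed tokens sort first among blocks with the same last_accessed value. A tiny illustration with made-up values:

```python
import heapq

heap = []
# (last_accessed, -num_hashed_tokens, block_id): the access times tie, so
# the second tuple element decides; -48 < -16 puts block 3 at the top.
heapq.heappush(heap, (100.0, -16, 7))
heapq.heappush(heap, (100.0, -48, 3))
assert heapq.heappop(heap) == (100.0, -48, 3)
```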

```python
    heapq.heappush(
        self.priority_queue,
        (last_accessed, -num_hashed_tokens, content_hash, block_id))
    self._cleanup_if_necessary()
```
Collaborator

Why was it necessary to delay the cleanup? Did you find it to be too slow?

@llsj14 (Contributor Author) Aug 7, 2024

The reason I applied lazy deletion and event-triggered cleanup is that eagerly deleting an outdated block from the heap would first require searching for it, which is O(n), on top of the O(log n) removal itself. Instead, I skip and pop outdated blocks by checking them against the free_table during the eviction operation, and only clean up the priority queue when it consumes too much memory with outdated blocks.

Since the cleanup itself is O(n log n), calling the cleanup function every time would make the system too slow.

@llsj14 (Contributor Author) Aug 7, 2024

The ideal scenario is when the cleanup function is not needed, as outdated blocks are naturally popped out during the eviction operation.

@llsj14 (Contributor Author)

@alexm-neuralmagic, thanks to your comment, I fixed the data-type mistake and optimized the performance of the cleanup operation.

I now build the new priority queue directly from the free_table using heapify, achieving O(n) complexity; a sketch of the difference follows.
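As a rough sketch of why the heapify-based rebuild is faster (hypothetical helper names; the push-based variant is shown only for contrast, using the same BlockMetaData fields as in the snippets above):

```python
import heapq

def rebuild_with_pushes(free_table):
    # Push-based rebuild: n pushes at O(log n) each, i.e., O(n log n) total.
    pq = []
    for block_id, meta in free_table.items():
        heapq.heappush(pq, (meta.last_accessed, -meta.num_hashed_tokens,
                            block_id, meta.content_hash))
    return pq

def rebuild_with_heapify(free_table):
    # Heapify-based rebuild: build the list once, then heapify it in O(n).
    pq = [(meta.last_accessed, -meta.num_hashed_tokens, block_id,
           meta.content_hash) for block_id, meta in free_table.items()]
    heapq.heapify(pq)
    return pq
```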

```diff
@@ -76,7 +79,8 @@ class LRUEvictor(Evictor):
     """

     def __init__(self):
-        self.free_table: OrderedDict[int, BlockMetaData] = OrderedDict()
+        self.free_table: Dict[int, BlockMetaData] = {}
```
Collaborator

Dict is definitely faster here

```diff
-from typing import OrderedDict, Tuple
+from typing import Dict, List, Tuple

+CLEANUP_THRESHOLD = 50
```
Collaborator

I would make this a static class member, since it is used only inside the scope of the class below.

@llsj14 (Contributor Author)

Thank you, I fixed this. The threshold is now a class-level constant, as sketched below.
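For illustration, the constant moved into the class could look like the following sketch, which reuses the _cleanup_if_necessary shape quoted earlier (not necessarily the exact merged code):

```python
class LRUEvictor(Evictor):
    # Class-level constant: the heap is rebuilt once it holds more than
    # CLEANUP_THRESHOLD times as many entries as the free table.
    CLEANUP_THRESHOLD: int = 50

    def _cleanup_if_necessary(self):
        if len(self.priority_queue) > self.CLEANUP_THRESHOLD * len(
                self.free_table):
            self._cleanup()
```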

@alexm-redhat (Collaborator)

By the way, I would rename the title of the PR to "[Performance] ....", since it is not a bugfix.

@llsj14 llsj14 changed the title [Bugfix][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion [Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion Aug 7, 2024
@llsj14 (Contributor Author) commented Aug 9, 2024

/ready

@github-actions github-actions bot added the ready ONLY add when PR is ready to merge/full CI is needed label Aug 9, 2024
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from fd520b2 to 273da1d Compare August 26, 2024 02:41
@llsj14 (Contributor Author) commented Aug 26, 2024

I rebased the code to resolve the conflict.


mergify bot commented Nov 26, 2024

This pull request has merge conflicts that must be resolved before it can be merged. Please rebase the PR, @llsj14.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Nov 26, 2024
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from 273da1d to 5d2bbcc Compare November 29, 2024 03:55
@mergify mergify bot removed the needs-rebase label Nov 29, 2024
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from 5d2bbcc to a7ee9c4 Compare November 29, 2024 04:24
@llsj14 (Contributor Author) commented Nov 29, 2024

@alexm-neuralmagic @Yard1
I rebased and tested my code again. I would appreciate your reviews.

@llsj14 llsj14 force-pushed the feat/optimize-evict branch from e5eb212 to 7e6b71c Compare December 11, 2024 14:56
@llsj14 (Contributor Author) commented Dec 11, 2024

In my local test, the test_eviction_alloc_mixed sometimes passes and sometimes fails.

tests/core/block/test_prefix_caching_block.py ................. [  6%]
............................................................... [ 29%]
............................................................... [ 53%]
............................................................... [ 76%]
............................................................... [100%]
=================== 269 passed, 2 warnings in 6.49s ===================

I believe the assertion in this part is not strictly necessary, because all blocks can be candidates for eviction if they have the same last-accessed time. The key difference is that the previous code searches for blocks from the beginning of the free table, while my implementation does not.

@leiwen83 @cadedaniel @comaniac
Could you check whether it would be fine to remove the assertion mentioned above and review my PR please?
-> Update: I changed my code to make the test pass. Ties are now broken by block_id, so the earlier block is selected under otherwise equal conditions (see the illustration below).
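The tie-break falls out of plain tuple comparison: placing block_id before content_hash in the heap entry means that when last_accessed and num_hashed_tokens are equal, the smallest (earliest) block ID is popped first. A small illustration with made-up values:

```python
import heapq

heap = []
# (last_accessed, -num_hashed_tokens, block_id, content_hash): the first
# two elements tie, so block_id decides and block 5 is evicted first.
heapq.heappush(heap, (100.0, -16, 12, 1111))
heapq.heappush(heap, (100.0, -16, 5, 2222))
assert heapq.heappop(heap)[2] == 5
```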

@llsj14 llsj14 force-pushed the feat/optimize-evict branch from e82e821 to 0038286 Compare December 13, 2024 09:13
@llsj14 (Contributor Author) commented Dec 13, 2024

@comaniac
Could you review this PR, please?
This PR was previously reviewed, and I have been testing its stability by running it locally for several months. It has also successfully passed unit tests and CI checks.

Comment on lines 92 to 106
```python
while self.priority_queue:
    # Lazy deletion algorithm is applied.
    last_accessed, _, block_id, content_hash = heapq.heappop(
        self.priority_queue)
    if (block_id in self.free_table and
            self.free_table[block_id].last_accessed == last_accessed):
        self.free_table.pop(block_id)
        return block_id, content_hash
```
@comaniac (Collaborator)

I'm a bit worried about this lazy deletion algorithm, as it is pretty hard for others to understand and it is easy to introduce bugs in corner cases. Here are some questions people may ask when reading this code:

  1. How can a block be in the heap but not in the free table? A related question is why we need to clean up the heap.
  2. How can a block in the heap and the free table have different last-accessed times?

@llsj14 (Contributor Author) Dec 14, 2024

@comaniac Thank you for the valuable feedback.
I've added comments explaining the lazy deletion process.

I understand your concerns about the lazy deletion algorithm, as the cleanup shows O(n log n) time complexity when triggered. However, since outdated entries are also removed through ordinary heap pops, I believe cleanup is not an operation that happens frequently.

In fact, I also considered using a doubly linked list plus a dictionary for this optimization. While those structures are generally O(1), if the sort key involves more than the last-accessed time (which always increases), such as num_hashed_tokens in this code, adding an entry could take O(n) time to keep the doubly linked list sorted. That is why I opted for a priority queue. Nevertheless, I acknowledge the concern about lazy deletion holding outdated entries. (A sketch of the doubly-linked-list alternative follows for comparison.)
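For comparison, the O(1) alternative mentioned here is the classic dictionary-plus-doubly-linked-list LRU, which Python's OrderedDict provides out of the box. This is a minimal hypothetical sketch (not vLLM code), and it only works while recency alone determines eviction order:

```python
from collections import OrderedDict
from typing import Optional


class SimpleLRU:
    def __init__(self):
        self._table: "OrderedDict[int, object]" = OrderedDict()

    def add(self, block_id: int, meta: object) -> None:
        self._table[block_id] = meta  # inserted at the MRU end

    def touch(self, block_id: int) -> None:
        # O(1): relink the node to the most-recently-used end.
        self._table.move_to_end(block_id)

    def evict(self) -> Optional[int]:
        if not self._table:
            return None
        # O(1): pop from the least-recently-used end.
        block_id, _ = self._table.popitem(last=False)
        return block_id
```

As soon as a secondary key such as num_hashed_tokens participates in the ordering, keeping the list sorted on insert degrades to O(n), which matches the reasoning above.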

@comaniac (Collaborator)

Yes, I used a doubly linked list in v1 prefix caching and it works well, but it would be tedious for v0.

@llsj14 (Contributor Author)

Oh I see. I'll check the v1 implementation later as well.

@comaniac comaniac left a comment

Otherwise LGTM

llsj14 and others added 14 commits December 14, 2024 01:59
@llsj14 llsj14 force-pushed the feat/optimize-evict branch from dd3165c to 46798ad Compare December 14, 2024 01:59
@comaniac comaniac merged commit 8869368 into vllm-project:main Dec 14, 2024
51 checks passed
BKitor pushed a commit to BKitor/vllm that referenced this pull request Dec 30, 2024
joennlae pushed a commit to 44ai-labs/vllm that referenced this pull request Jan 19, 2025
abmfy pushed a commit to abmfy/vllm-flashinfer that referenced this pull request Jan 24, 2025
abmfy pushed a commit to abmfy/vllm-flashinfer that referenced this pull request Jan 24, 2025