[Performance][Core] Optimize the performance of evictor v1 and v2 by applying a priority queue and lazy deletion #7209
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).
thanks for the contribution! cc @alexm-neuralmagic @cadedaniel for block manager related optimization.
vllm/core/evictor_v2.py
Outdated
    def update(self, block_id: int, last_accessed: float):
        self.free_table[block_id].last_accessed = last_accessed

    def _cleanup_if_necessary(self):
        if len(self.priority_queue) > 50 * len(self.free_table):
that 50 constant should be a defined global.
@Yard1, thank you for your comments. I have fixed the issue and rebased my code.
FYI this PR seems to be optimizing the same path as #7193
At a high level these fixes look great, will need the evictor folks to review in more detail (sorry for the second ping @robertgshaw2-neuralmagic)
Thanks, Alex is going to take a look from our side, since he most recently has been in this codepath optimizing BMv2
Force-pushed from 8071838 to 95495a7
Thanks for revealing this bottleneck and fixing it! It is a good idea to use a heap + dict to quickly access an LRU item. Left some minor comments.
vllm/core/evictor_v2.py
Outdated
    def add(self, block_id: int, content_hash: int, num_hashed_tokens: int,
            last_accessed: float):
        self.free_table[block_id] = BlockMetaData(content_hash,
                                                  num_hashed_tokens,
                                                  last_accessed)
        heapq.heappush(
            self.priority_queue,
            (last_accessed, -num_hashed_tokens, content_hash, block_id))
Nice trick with the -num_hashed_tokens to provide heap sorting.
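For readers unfamiliar with the trick: heapq only builds min-heaps, so negating num_hashed_tokens makes ties on last_accessed resolve toward the block with the most hashed tokens. A tiny illustrative sketch (the concrete tuple values are made up; only the ordering matters):

```python
import heapq

# Entries mirror the (last_accessed, -num_hashed_tokens, content_hash, block_id)
# key used above; the numbers here are purely illustrative.
heap = []
heapq.heappush(heap, (1.0, -2, 111, 0))  # accessed at t=1.0, 2 hashed tokens
heapq.heappush(heap, (1.0, -5, 222, 1))  # accessed at t=1.0, 5 hashed tokens
heapq.heappush(heap, (0.5, -1, 333, 2))  # accessed earlier, at t=0.5

print(heapq.heappop(heap))  # (0.5, -1, 333, 2): oldest last_accessed goes first
print(heapq.heappop(heap))  # (1.0, -5, 222, 1): tie broken toward more hashed tokens
```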
vllm/core/evictor_v2.py
Outdated
        heapq.heappush(
            self.priority_queue,
            (last_accessed, -num_hashed_tokens, content_hash, block_id))
        self._cleanup_if_necessary()
Why was it necessary to delay the cleanup? Did you find it to be too slow?
The reason I applied lazy deletion and event-triggered cleanup is that searching for a specific block and deleting outdated blocks from the heap is O(log n). Thus, I skip and pop outdated blocks by checking the free_table during the eviction operation, and only clean up the priority queue when it consumes too much memory with outdated blocks. Since the cleanup itself is O(n log n), calling the cleanup function every time would make the system too slow.
The ideal scenario is when the cleanup function is not needed, as outdated blocks are naturally popped out during the eviction operation.
@alexm-neuralmagic, thanks to your comment, I fixed the data type mistake and optimized the performance of the cleanup operation. I used only the free_table and heapify to create a new priority queue, achieving O(n) complexity.
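For context, here is a hedged sketch of what such an O(n) rebuild can look like; the field names follow the diffs in this thread, but the class and method bodies are illustrative rather than the PR's exact code:

```python
import heapq
from dataclasses import dataclass
from typing import Dict, List, Tuple

CLEANUP_THRESHOLD = 50  # rebuild threshold discussed in this thread


@dataclass
class BlockMetaData:
    content_hash: int
    num_hashed_tokens: int
    last_accessed: float


class LazyHeapEvictorSketch:
    """Illustrative stand-in for the evictor's heap bookkeeping."""

    def __init__(self) -> None:
        self.free_table: Dict[int, BlockMetaData] = {}
        self.priority_queue: List[Tuple[float, int, int, int]] = []

    def _cleanup_if_necessary(self) -> None:
        # Trigger a rebuild only when stale entries dominate the heap.
        if len(self.priority_queue) > CLEANUP_THRESHOLD * len(self.free_table):
            self._cleanup()

    def _cleanup(self) -> None:
        # Rebuild the heap from the authoritative free_table.
        new_pq = [(meta.last_accessed, -meta.num_hashed_tokens,
                   meta.content_hash, block_id)
                  for block_id, meta in self.free_table.items()]
        heapq.heapify(new_pq)  # O(n), vs. O(n log n) for repeated heappush
        self.priority_queue = new_pq
```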
vllm/core/evictor_v2.py
Outdated
@@ -76,7 +79,8 @@ class LRUEvictor(Evictor):
     """

     def __init__(self):
-        self.free_table: OrderedDict[int, BlockMetaData] = OrderedDict()
+        self.free_table: Dict[int, BlockMetaData] = {}
Dict is definitely faster here
vllm/core/evictor_v2.py
Outdated
-from typing import OrderedDict, Tuple
+from typing import Dict, List, Tuple

 CLEANUP_THRESHOLD = 50
I would make this a static class member, since it is used only inside the scope of the class below.
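A minimal sketch of that suggestion (only the constant's placement is the point; the surrounding class is abbreviated):

```python
class LRUEvictor:  # base class and other methods omitted in this sketch
    # Scoped to the only class that uses it, per the review suggestion.
    CLEANUP_THRESHOLD = 50

    def _cleanup_if_necessary(self):
        if len(self.priority_queue) > self.CLEANUP_THRESHOLD * len(
                self.free_table):
            self._cleanup()
```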
Thank you, I fixed this
btw, I would rename the topic of the PR to "[Performance] ....", since it is not a bugfix
/ready
Force-pushed from fd520b2 to 273da1d
I rebased the code to resolve the conflict.
This pull request has merge conflicts that must be resolved before it can be merged.
Force-pushed from 273da1d to 5d2bbcc
Force-pushed from 5d2bbcc to a7ee9c4
@alexm-neuralmagic @Yard1
Force-pushed from e5eb212 to 7e6b71c
In my local test, the
I believe the assertion in this part is not strictly necessary, because all blocks can be candidates for eviction if they have the same last accessed time. The key difference is that the previous code searches blocks from the beginning of the free table, while my implementation does not. @leiwen83 @cadedaniel @comaniac
Force-pushed from e82e821 to 0038286
@comaniac
vllm/core/evictor.py
Outdated
        while self.priority_queue:
            # Lazy deletion algorithm is applied.
            last_accessed, _, block_id, content_hash = heapq.heappop(
                self.priority_queue)
            if (block_id in self.free_table and
                    self.free_table[block_id].last_accessed == last_accessed):
                self.free_table.pop(block_id)
                return block_id, content_hash
I'm a bit worried about this lazy deletion algorithm, as it is pretty hard for others to understand and it is easy to introduce bugs in corner cases. Here are some possible questions people may ask when reading this code:
- How can a block be in the heap but not in the free table? A related question is why we need to clean up the heap.
- How could a block in the heap and the free table have different last access times?
@comaniac Thank you for the valuable feedback.
I've added comments regarding the lazy deletion process.
I understand your concerns about the lazy deletion algorithm, as it shows O(n log n) time complexity when the cleanup is triggered. However, since outdated entries are also removed through heap pops, I believe cleanup is not an operation that happens frequently.
In fact, I also considered using a doubly linked list and a dictionary for this optimization. While these structures are generally O(1), if the sort key changes from being based solely on the last accessed time (which always increases) to also include values like num_hashed_tokens in this code, adding entries could take O(n) time to keep the doubly linked list sorted. That's why I opted for a priority queue... Nevertheless, I acknowledge the concerns about lazy deletion holding outdated entries.
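To make the two questions above concrete: add() (shown earlier) pushes a fresh tuple every time a block re-enters the evictor, so a block that is freed, reused, and freed again leaves an older tuple with a stale last_accessed behind in the heap; and removing a block only drops the free_table entry without touching the heap. A hedged sketch of that remove path (names follow the diffs in this thread; the body is illustrative, not the exact PR code):

```python
def remove(self, block_id: int):
    # Only the authoritative free_table entry is dropped. Any heap tuples that
    # still reference this block stay behind as stale entries; evict() later
    # discards them because either block_id is no longer in free_table or the
    # popped last_accessed no longer matches the table's value.
    if block_id not in self.free_table:
        raise ValueError("Attempting to remove a block that is not being tracked")
    self.free_table.pop(block_id)
```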
Yes, I used a doubly linked list in v1 prefix caching and it works well, but it would be tedious for v0.
Oh I see. I'll check the v1 implementation later as well.
Otherwise LGTM
Force-pushed from dd3165c to 46798ad
FIX #6923

Summary
- The OrderedDict free_table in Evictor V1 and V2 slows down overall performance (especially TTFT) when using prefix caching mode.
- Blocks are tracked by content hash (Evictor V1) or block ID (Evictor V2) in this free_table.

Result Verification
(verification results omitted)

Performance
(benchmark charts comparing as-is vs. to-be omitted)
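For contrast, here is a simplified illustration of the O(n) eviction pattern this PR replaces (a sketch of the described bottleneck, not the exact pre-PR code): every eviction walks the whole free_table to find the least-recently-used block, which is what hurts TTFT as the pool of cached blocks grows.

```python
def evict_linear_scan(free_table):
    # Sketch of the old behavior: scan every free block to pick the victim with
    # the smallest last_accessed, breaking ties toward the block with more
    # hashed tokens (the same policy the heap key encodes).
    if not free_table:
        raise ValueError("No usable cache memory left")
    victim_id, victim = None, None
    for block_id, meta in free_table.items():
        if victim is None or (
                meta.last_accessed, -meta.num_hashed_tokens) < (
                victim.last_accessed, -victim.num_hashed_tokens):
            victim_id, victim = block_id, meta
    free_table.pop(victim_id)
    return victim_id, victim.content_hash
```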