[Optimization] use a pool to reuse LogicalTokenBlock.token_ids #5584

Merged: 5 commits, Jun 17, 2024
41 changes: 39 additions & 2 deletions vllm/block.py
@@ -1,12 +1,43 @@
 """Token blocks."""
-from typing import List
+import weakref
+from collections import defaultdict
+from typing import Dict, List

 from vllm.utils import Device

 _BLANK_TOKEN_ID = -1

 DEFAULT_LAST_ACCESSED_TIME = -1

+TokensBlock = List[int]
+
+
+class BlockPool:
+    """A pool of physical blocks.
+    When requests come, we create a lot of logical blocks;
+    when requests are done, we destroy a lot of logical blocks.
+    It turns out that creating and destroying logical blocks can be expensive,
+    especially for the `token_ids` field, which is a list of integers.
+    To avoid this overhead, we use a pool to manage the logical blocks.
+    When an old request is done and a new request comes, we can reuse the
+    logical blocks from the old request to feed the new request.
+    """
+
+    def __init__(self) -> None:
+        # block size to list of token blocks
+        self.pool: Dict[int, List[TokensBlock]] = defaultdict(list)
+
+    def alloc_block(self, block_size: int) -> TokensBlock:
+        if block_size in self.pool and self.pool[block_size]:
+            return self.pool[block_size].pop()
+        return [_BLANK_TOKEN_ID] * block_size
+
+    def del_block(self, block: TokensBlock) -> None:
+        self.pool[len(block)].append(block)
+
+
+_BLOCK_POOL = BlockPool()
+

 class LogicalTokenBlock:
     """A block that stores a contiguous chunk of tokens from left to right.
@@ -23,7 +54,13 @@ def __init__(
         self.block_number = block_number
         self.block_size = block_size

-        self.token_ids = [_BLANK_TOKEN_ID] * block_size
+        self.token_ids = _BLOCK_POOL.alloc_block(block_size)
+        # this finalizer is used to return the block to the pool when the object is deleted # noqa
+        # NOTE: don't use __del__ because it cannot guarantee the order of finalization, # noqa
+        # i.e. `self.token_ids` may be deleted before `self`, and we lose
+        # the opportunity to return the block to the pool
+        self._finalizer = weakref.finalize(self, _BLOCK_POOL.del_block,
Member

Although not as "automatic", it may be more efficient to call a free method explicitly when a LogicalTokenBlock is finished with. As you say here, there is no guarantee about when finalization happens. In my experience, it's better to avoid relying on finalization wherever possible.

Member Author

> it may be more efficient to call a free method explicitly when a LogicalTokenBlock is finished with

Adding a LogicalTokenBlock.release() method is easy, but it would be difficult to find all the places to call it. Essentially, we would need to mimic and track the gc information. Given the complexity, I prefer the finalizer approach: it guarantees that self is finalized before self.token_ids is destroyed.
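
The finalizer behaviour described in this comment can be sketched in isolation. This is a minimal illustration, not the PR's code: `Owner` and `free_lists` are hypothetical stand-ins for `LogicalTokenBlock` and the pool. It relies on the documented property that `weakref.finalize` keeps strong references to its callback arguments until the callback runs, so the list is still alive when it is handed back.

```python
import gc
import weakref
from typing import List

# Hypothetical stand-in for the pool: a plain list of recycled buffers.
free_lists: List[List[int]] = []


class Owner:
    """Hypothetical stand-in for LogicalTokenBlock."""

    def __init__(self, size: int) -> None:
        self.token_ids = [-1] * size
        # The finalizer keeps its own strong reference to the list, so the
        # list outlives `self`. The callback and its arguments must not
        # reference `self`, or the owner could never be collected.
        self._finalizer = weakref.finalize(self, free_lists.append,
                                           self.token_ids)


owner = Owner(4)
buffer_id = id(owner.token_ids)
del owner
gc.collect()  # not needed on CPython (reference counting), harmless elsewhere

# The very same list object is now sitting in the free list.
returned = len(free_lists) == 1 and id(free_lists[0]) == buffer_id
```

On CPython the callback fires as soon as the last reference to the owner goes away; other interpreters may delay it until a garbage collection pass.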

Member

I haven't checked, but maybe there aren't that many places? In any case, it doesn't matter if some call site is missed: the block can either simply be dropped in that case, or the existing finalizer logic can be kept as a fallback to reclaim it...
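
The combination suggested here, an explicit release with the finalizer kept as a fallback, could look roughly like the sketch below. It is an assumption about how such a method might be written, not something the PR contains: `Block`, `release` and `free_lists` are hypothetical names. It uses the fact that calling a `weakref.finalize` object runs its callback at most once, so an explicit release and a later garbage collection cannot both return the same list.

```python
import gc
import weakref
from typing import List

# Hypothetical stand-in for the pool: a plain list of recycled buffers.
free_lists: List[List[int]] = []


class Block:
    """Hypothetical block with an explicit release() and a finalizer fallback."""

    def __init__(self, size: int) -> None:
        self.token_ids = [-1] * size
        self._finalizer = weakref.finalize(self, free_lists.append,
                                           self.token_ids)

    def release(self) -> None:
        # Calling the finalize object runs the callback if it is still
        # alive; any later call (or garbage collection) does nothing.
        self._finalizer()
        # Drop our reference so the recycled list is not aliased.
        self.token_ids = []


explicit = Block(4)
explicit.release()
explicit.release()                      # second call is a no-op
count_after_release = len(free_lists)

del explicit                            # finalizer is already dead
gc.collect()
count_after_del = len(free_lists)

forgotten = Block(8)                    # never released explicitly
del forgotten                           # the fallback finalizer reclaims it
gc.collect()
count_after_fallback = len(free_lists)
```

One design point worth noting: once a list has been returned explicitly, the block must stop using it, because the pool may hand it to another owner. The sketch handles that by replacing `token_ids` with an empty list.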

Member Author

I don't have the bandwidth to check this yet. Feel free to add it later if you figure it out. My intuition is that this would need control over the gc system, which is difficult in Python.

+                                           self.token_ids)
         self.num_tokens = 0

     def is_empty(self) -> bool:
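
The reuse behaviour of the `BlockPool` added in this diff can be tried out on its own. The class below repeats the pool logic from the diff (with the `vllm` import left out so the snippet is self-contained); the driver code underneath it is illustrative only. It shows that a returned list is handed out again for the same block size, and that its old contents are not reset to `_BLANK_TOKEN_ID`.

```python
from collections import defaultdict
from typing import Dict, List

_BLANK_TOKEN_ID = -1
TokensBlock = List[int]


class BlockPool:
    """Same pool logic as in the diff above."""

    def __init__(self) -> None:
        # block size to list of token blocks
        self.pool: Dict[int, List[TokensBlock]] = defaultdict(list)

    def alloc_block(self, block_size: int) -> TokensBlock:
        if block_size in self.pool and self.pool[block_size]:
            return self.pool[block_size].pop()
        return [_BLANK_TOKEN_ID] * block_size

    def del_block(self, block: TokensBlock) -> None:
        self.pool[len(block)].append(block)


pool = BlockPool()
first = pool.alloc_block(4)     # pool is empty, so a fresh list is built
first[0] = 42                   # simulate writing a token id
pool.del_block(first)           # hand the list back, keyed by its length

second = pool.alloc_block(4)    # same size, so the pooled list is reused
reused = second is first
stale_value = second[0]         # old contents survive the round trip

other = pool.alloc_block(8)     # no pooled list of this size: fresh list
```

Because recycled lists keep their previous values, a consumer has to track how many entries are valid separately, which is presumably the role of `num_tokens` being reset to 0 in `LogicalTokenBlock.__init__`.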