This PR refactors code, primarily in spyre_model_runner. The change
tries to reduce code duplication between static batching and
continuous batching. However, this work will not be complete until a
follow-up PR removes the kv cache manager from the spyre model
runner.
Summary of changes:
- Reduce code duplication in the spyre model runner: common methods
live in the `SpyreModelRunner` class, while
`StaticBatchingSpyreModelRunner` and
`ContinuousBatchingSpyreModelRunner` override a few of them with their
specific logic
- Changed the `ContinuousBatchingFmsModel` class to get the attention
metadata via the forward context, and changed the model runner to use
`with set_forward_context` to pass the attention metadata. This
is how vLLM supports multiple attention backends
[[REF](vllm-project/vllm#10558)]
- Moved the left pads to the CachedRequestState.
- Bugfix: `execute_model` in the CB model runner was inconsistent with
the input batch data when it output the result in
`CBSpyreModelRunnerOutput`. Changed it, along with prepare_prompt, to
use the input batch data.
- Misc: a few renamed variables, more comments, and TODOs
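
For readers unfamiliar with the forward-context pattern mentioned in the
summary, here is a minimal sketch of the idea. It is a simplified
stand-in, not vLLM's or this plugin's implementation: `ForwardContext`,
`ToyModel`, and `ToyModelRunner` are hypothetical names, and the
`set_forward_context` defined here takes only the attention metadata,
which is an assumption made to keep the example self-contained.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class ForwardContext:
    # Hypothetical stand-in: holds only the attention metadata.
    attn_metadata: Any


# Module-level slot holding the context of the forward pass in progress.
_forward_context: Optional[ForwardContext] = None


@contextmanager
def set_forward_context(attn_metadata: Any) -> Iterator[None]:
    """Publish attn_metadata for the duration of one forward pass."""
    global _forward_context
    previous = _forward_context
    _forward_context = ForwardContext(attn_metadata=attn_metadata)
    try:
        yield
    finally:
        # Restore the previous context, even if the forward pass raised.
        _forward_context = previous


def get_forward_context() -> ForwardContext:
    if _forward_context is None:
        raise RuntimeError("forward context is not set")
    return _forward_context


class ToyModel:
    def forward(self, input_ids: list) -> dict:
        # The model reads the metadata from the context instead of
        # receiving it as an explicit argument.
        attn_metadata = get_forward_context().attn_metadata
        return {"num_tokens": len(input_ids), "attn_metadata": attn_metadata}


class ToyModelRunner:
    def __init__(self, model: ToyModel) -> None:
        self.model = model

    def execute_model(self, input_ids: list, attn_metadata: Any) -> dict:
        # The runner owns the metadata and scopes it to this call.
        with set_forward_context(attn_metadata):
            return self.model.forward(input_ids)
```

In this sketch the model's `forward` signature does not mention attention
metadata, so a different backend could carry a different metadata type
without changing the call site in the runner.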
---------
Signed-off-by: Wallas Santos <wallashss@ibm.com>