Closed
Labels
RFC (Request For Comments), documentation (Improvements or additions to documentation)
Description
Doc: https://docs.google.com/document/d/1F4mnGa8XDmj37vCbNS6Zso6sg4TzXr4RTkihTbhjLYw/
Motivation
To achieve the best possible performance on vllm-ascend v0.7.3 with mindie-turbo 2.0rc1, we have been working to optimize our code, configs, etc.
Performance test:
- Qwen 2.5 7B
Separate single item optimizations
- v0.7.3 base image (vllm-ascend) + Ubuntu: m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
- Each improvement
- Performance test
- Qwen2.5-7B-Instruct
- Doc
Overall optimizations
- e2e
1. Compiler Optimization (@MaskerPRC)
- python
- pytorch
- torch-npu
Gains depend on the specific model:
- small improvement (LTO)
- about 27% improvement (PGO, for a specific model)
- (today) Step 1: Doc (must be ready)
- Step 2: Can be reproduced in a Dockerfile
2. OS Optimization (@celestialli)
- Memory allocator, etc. (performance)
- (0514) Step 1: Doc (must be ready)
- Step 2: Can be reproduced in a Dockerfile (host / container)
3. torch-npu Optimization (@Potabk)
- Memory
- Scheduler
4. CANN Optimization (@Potabk)
- HCCL ENV
- mindie-turbo ENV
5. vllm-ascend Optimization (@shen-shanshan)
V1 Ascend Scheduler:
- Implementation: [V1] Add v0 style schedule into v1 engine. #512
- Usage: [Guide]: Usage on AscendScheduler in vLLM Ascend #788
Offline inference test using Qwen2.5-7B-Instruct:
- V1 without Ascend Scheduler: speed input: 8.05 toks/s, output: 146.39 toks/s
- V1 with Ascend Scheduler: speed input: 8.86 toks/s, output: 161.15 toks/s, but there is an accuracy problem that still needs to be fixed.
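To make the comparison easier to read, here is a minimal sketch that turns the two measurements above into relative gains. The numbers are copied from this issue; the helper name `relative_gain` and the dictionary layout are only illustrative and are not part of vllm-ascend.

```python
# Throughput figures (tokens/s) copied from the offline inference test above.
baseline = {"input": 8.05, "output": 146.39}   # V1 without Ascend Scheduler
optimized = {"input": 8.86, "output": 161.15}  # V1 with Ascend Scheduler


def relative_gain(before: float, after: float) -> float:
    """Return the percentage change from `before` to `after` (hypothetical helper)."""
    return (after - before) / before * 100.0


# Percentage gain per metric, keyed the same way as the inputs.
gains = {key: relative_gain(baseline[key], optimized[key]) for key in baseline}

for key, gain in gains.items():
    print(f"{key}: {gain:.1f}% faster")
```

Both input and output throughput come out at roughly 10% higher with the Ascend Scheduler enabled, based on this single run.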
6. Dockerfile (@MaskerPRC)
...