
[Performance]: vllm-ascend + mindie-turbo Performance Optimization #815

@shen-shanshan


Doc: https://docs.google.com/document/d/1F4mnGa8XDmj37vCbNS6Zso6sg4TzXr4RTkihTbhjLYw/

Motivation

To get the best possible performance from vllm-ascend v0.7.3 with mindie-turbo 2.0rc1, we are working on optimizing our code, configuration, etc.

Performance test:

  • Qwen 2.5 7B

Individual optimizations (each measured separately)

  • v0.7.3 base image (vllm-ascend) + Ubuntu: m.daocloud.io/quay.io/ascend/vllm-ascend:v0.7.3
  • Each improvement
  • Performance test
    • Qwen2.5-7B-Instruct
  • Doc

Overall optimizations

  • e2e

1. Compiler Optimization (@MaskerPRC)

  • python
  • pytorch
  • torch-npu

Depends on specific model:

  • tiny enhancement (LTO)
  • about 27% enhancement (PGO for specific model)
  • (today) Step 1: Doc (must be ready)
  • Step 2: Can be reproduced in a Dockerfile
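
To check whether the Python interpreter in a given image was already built with PGO or LTO, the configure flags can be read back from the interpreter itself. This is a minimal sketch that covers only the `python` item above, not pytorch or torch-npu. It assumes a POSIX CPython build that records `CONFIG_ARGS`. The helper name `python_build_optimizations` is hypothetical, and `--enable-optimizations` (PGO) and `--with-lto` are CPython's own configure flags, not flags named in this issue.

```python
import sysconfig


def python_build_optimizations(config_args=None):
    """Report whether CPython's configure line requested PGO and LTO.

    config_args: the configure argument string to inspect. When omitted,
    it is read from the running interpreter (assumption: POSIX build).
    """
    if config_args is None:
        # CONFIG_ARGS holds the ./configure arguments; it can be None on
        # platforms that do not record it, so fall back to an empty string.
        config_args = sysconfig.get_config_var("CONFIG_ARGS") or ""
    return {
        "pgo": "--enable-optimizations" in config_args,
        "lto": "--with-lto" in config_args,
    }


if __name__ == "__main__":
    print(python_build_optimizations())
```

Running this inside the base image and inside the optimized image gives a quick way to confirm that the Dockerfile step actually changed the interpreter build.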

2. OS Optimization (@celestialli)

  • Memory allocator, etc. (performance)
  • (0514) Step 1: Doc (must be ready)
  • Step 2: Can be reproduced in a Dockerfile (host / container)
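
When comparing memory allocators, it helps to confirm which allocator library is actually mapped into the inference process. A minimal sketch, assuming Linux (`/proc/self/maps`). The allocator names `jemalloc` and `tcmalloc` are examples chosen for illustration, since this issue does not say which allocator is being evaluated. The helper name `loaded_allocators` is hypothetical.

```python
from pathlib import Path

# Assumption: these names are illustrative; the issue does not name an allocator.
KNOWN_ALLOCATORS = ("jemalloc", "tcmalloc")


def loaded_allocators(maps_text):
    """Return the known allocator names that appear in a /proc/<pid>/maps dump."""
    lines = maps_text.splitlines()
    return [
        name
        for name in KNOWN_ALLOCATORS
        if any(name in line for line in lines)
    ]


if __name__ == "__main__":
    maps = Path("/proc/self/maps")
    if maps.exists():
        found = loaded_allocators(maps.read_text())
        print(found or "no known replacement allocator mapped")
```

Logging this at startup of each benchmark run makes it clear whether a result was measured with the replacement allocator or without it.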

3. torch-npu Optimization (@Potabk)

  • Memory
  • Scheduler

4. CANN Optimization (@Potabk)

  • HCCL ENV
  • mindie-turbo ENV
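
Because these optimizations are driven by environment variables, recording the relevant variables next to each benchmark result makes runs comparable. A minimal sketch: the helper name `snapshot_env` is hypothetical, and the `HCCL_` prefix is an assumption based on the "HCCL ENV" item above. This issue does not give a prefix for mindie-turbo variables, so any such prefix would need to be passed in explicitly.

```python
import os


def snapshot_env(prefixes=("HCCL_",), environ=None):
    """Return the environment variables whose names start with one of the prefixes.

    prefixes: name prefixes to keep (assumption: HCCL variables start with "HCCL_").
    environ:  mapping to read from; defaults to the current process environment.
    """
    env = os.environ if environ is None else environ
    wanted = tuple(prefixes)
    # Sort by name so that snapshots from different runs can be diffed line by line.
    return {name: env[name] for name in sorted(env) if name.startswith(wanted)}


if __name__ == "__main__":
    for name, value in snapshot_env().items():
        print(f"{name}={value}")
```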

5. vllm-ascend Optimization (@shen-shanshan)

V1 Ascend Scheduler:

Offline inference test using Qwen2.5-7B-Instruct:

  • V1 without Ascend Scheduler: speed input: 8.05 toks/s, output: 146.39 toks/s
  • V1 with Ascend Scheduler: speed input: 8.86 toks/s, output: 161.15 toks/s; however, there is an accuracy problem that still needs to be fixed.
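
The relative gain from the two measurements above can be worked out as follows. This is a minimal sketch; the helper names `parse_speed` and `speedup_percent` are hypothetical, and the regular expression assumes the result lines keep the "input: X toks/s, output: Y toks/s" shape quoted above.

```python
import re

# Matches the "input: X toks/s, output: Y toks/s" part of a reported speed line.
SPEED_RE = re.compile(r"input:\s*([\d.]+)\s*toks/s,\s*output:\s*([\d.]+)\s*toks/s")


def parse_speed(line):
    """Extract (input_toks_per_s, output_toks_per_s) from a reported speed line."""
    match = SPEED_RE.search(line)
    if match is None:
        raise ValueError(f"no speed figures found in: {line!r}")
    return float(match.group(1)), float(match.group(2))


def speedup_percent(baseline, candidate):
    """Relative improvement of candidate over baseline, in percent."""
    return (candidate / baseline - 1.0) * 100.0


if __name__ == "__main__":
    base_in, base_out = parse_speed("speed input: 8.05 toks/s, output: 146.39 toks/s")
    new_in, new_out = parse_speed("speed input: 8.86 toks/s, output: 161.15 toks/s")
    print(f"input:  {speedup_percent(base_in, new_in):+.2f}%")
    print(f"output: {speedup_percent(base_out, new_out):+.2f}%")
```

For the figures quoted above, both input and output throughput improve by roughly 10% with the Ascend Scheduler enabled.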

6. Dockerfile (@MaskerPRC)

...


Labels: RFC (Request For Comments), documentation (Improvements or additions to documentation)
