
Releases: noooop/light-vllm

Support prefill only models

10 Oct 09:19

Please head over to [RFC]: Support encode only models by Workflow Defined Engine

Best wishes
light-vllm v0.2.2

Warning

Not rigorously tested.
For research and experimentation only.

Use vllm in production environments.

Support encode only models

13 Sep 08:52

xlm-roberta-base
xlm-roberta-large
bge-m3

Added support for xlm-roberta and bge-m3.

Flash Attention is quite fast.
Double buffering does not yield a noticeable speedup.
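
For reference, double buffering here means copying the next batch to the GPU while the current batch computes. A minimal sketch of the idea, assuming PyTorch CUDA streams (model and batches are placeholders, not light-vllm APIs):

import torch

copy_stream = torch.cuda.Stream()

def run_double_buffered(model, batches):
    # Overlap the H2D copy of batch i+1 with the forward pass of batch i.
    # Host tensors should be pinned for non_blocking=True to actually overlap.
    outputs, pending = [], None
    for batch in batches:
        with torch.cuda.stream(copy_stream):
            nxt = batch.to("cuda", non_blocking=True)
        if pending is not None:
            outputs.append(model(pending))  # compute on the default stream
        torch.cuda.current_stream().wait_stream(copy_stream)
        pending = nxt
    if pending is not None:
        outputs.append(model(pending))
    return outputs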

Interesting.
light-vllm v0.2.1

Warning

Not rigorously tested.
For research and experimentation only.

Use vllm in production environments.

light-vllm v0.2 Modularization + Workflow

23 Aug 10:13

The project has been split into plug-and-play modules, composed via a Workflow configuration.

The Workflow abstraction:

Input(request_id, prompt, params, arrival_time) -> InputProcessor -> Request
scheduler.add_request(request: Request)

engine.step
    Request -> RequestProcessor -> SequenceGroup (lazy RequestProcessor)
    seq_group_metadata_list, scheduler_outputs = scheduler.schedule()

    List[SequenceGroupMetadata], SchedulerOutputs -> ModelPreProcessor -> ExecuteInput

    ExecuteInput -> Executor -> List[ExecuteOutput]

    List[ExecuteOutput] -> OutputProcessor -> RequestOutput
    RequestOutput -> return to downstream
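
A minimal sketch of an engine built around this dataflow, assuming the Workflow's components have already been resolved to classes. The names mirror the diagram above; none of this is light-vllm's actual implementation:

import time
from dataclasses import dataclass, field

@dataclass
class Input:
    request_id: str
    prompt: str
    params: dict
    arrival_time: float = field(default_factory=time.time)

class Engine:
    def __init__(self, workflow):
        # Every stage is pluggable and comes from the Workflow config.
        self.input_processor = workflow.InputProcessor()
        self.model_pre_processor = workflow.ModelPreProcessor()
        self.executor = workflow.Executor()
        self.output_processor = workflow.OutputProcessor()
        self.scheduler = workflow.Scheduler()

    def add_request(self, inp: Input):
        request = self.input_processor(inp)  # Input -> Request
        self.scheduler.add_request(request)

    def step(self):
        # Request -> SequenceGroup conversion happens lazily inside the
        # scheduler (the "lazy RequestProcessor" above).
        seq_group_metadata_list, scheduler_outputs = self.scheduler.schedule()
        execute_input = self.model_pre_processor(
            seq_group_metadata_list, scheduler_outputs)
        execute_outputs = self.executor(execute_input)  # model forward pass
        return self.output_processor(execute_outputs)   # -> RequestOutput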

The ChatWorkflow defined for chat models:

class ChatWorkflow(Workflow):
    InputProcessor: str = "light_vllm.task.chat.processor.input_processor:ChatModelInputProcessor"
    RequestProcessor: str = "light_vllm.task.chat.processor.input_processor:ChatModelRequestProcessor"
    OutputProcessor: str = "light_vllm.task.chat.processor.output_processor:ChatModelOutputProcessor"
    ModelPreProcessor: str = "light_vllm.task.chat.processor.model_pre_processor:ChatModelPreProcessor"
    Worker: str = "light_vllm.task.chat.worker.gpu_worker:Worker"
    
    Executor: str = "light_vllm.task.base.executor.gpu_executor:GPUExecutor"
    Scheduler: str = "light_vllm.core.scheduler:Scheduler"
    Tokenizer: str = "light_vllm.inputs.tokenizer:Tokenizer"
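
The component values are "module:ClassName" strings, so nothing is imported until it is needed. A sketch of how such strings can be resolved, assuming a hypothetical lazy_import helper (not necessarily the one light-vllm ships):

import importlib

def lazy_import(path: str):
    module_name, _, class_name = path.partition(":")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)

# e.g. Executor = lazy_import(ChatWorkflow.Executor)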

Since the project is split into modules, each stage can be benchmarked separately; a timing sketch follows.
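
A rough way to attribute step time to each stage, wrapping time.perf_counter around the hypothetical Engine above (this is not light-vllm's built-in instrumentation):

import time
from collections import defaultdict

timings = defaultdict(float)

def timed(name, fn, *args):
    # For GPU stages, call torch.cuda.synchronize() before reading the clock.
    start = time.perf_counter()
    result = fn(*args)
    timings[name] += time.perf_counter() - start
    return result

# Inside step():
#   metadata, outputs = timed("scheduler", self.scheduler.schedule)
#   execute_input = timed("pre_processor", self.model_pre_processor, metadata, outputs)
#   execute_outputs = timed("executor", self.executor, execute_input)
# timings["executor"] / sum(timings.values()) then gives the GPUExecutor share.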

Qwen/Qwen2-7B-Instruct: the GPUExecutor accounts for 90+% of total step time.

Qwen/Qwen2-1.5B-Instruct: the GPUExecutor accounts for 70-80% of total step time.

Interesting.
light-vllm v0.2

Warning

Not rigorously tested.
For research and experimentation only.

Use vllm in production environments.

light-vllm v0.1 Baseline

19 Aug 09:46

This is a personal experimental project.

  1. First, forked from vllm v0.5.4
  2. Removed the following modules:
  • distributed, ray
  • adapter_commons, prompt_adapter, lora
  • multimodal
  • spec_decode, guided_decoding
  • async
  • usage, metrics, tracing, observability

Performance baseline


It looks like a good base for trying out some of my own ideas.

Best wishes
light-vllm v0.1

Warning

Not rigorously tested.
For research and experimentation only.

Use vllm in production environments.