Releases: noooop/light-vllm
Support prefill only models
Please see [RFC]: Support encode only models by Workflow Defined Engine.
Best regards
light-vllm v0.2.2
Warning
Not rigorously tested.
For research and experimentation only.
Use vllm for production environments.
Support encode only models
Supports xlm-roberta and bge-m3.
Flash Attention is noticeably faster.
Double buffering does not give a noticeable speedup (see the sketch below).
Interesting.
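For context, the double buffering tried here presumably overlaps moving the next batch to the GPU with compute on the current one. Below is a minimal PyTorch sketch of that idea; all names are illustrative, not light-vllm's actual implementation.

```python
import torch

def run_double_buffered(batches, model):
    # Overlap the H2D copy of batch i+1 with compute on batch i,
    # using a dedicated copy stream and two alternating slots.
    copy_stream = torch.cuda.Stream()
    slots = [None, None]

    def prefetch(i):
        # Order the copy after compute that last touched this slot,
        # so its memory is not reused while still in flight.
        copy_stream.wait_stream(torch.cuda.current_stream())
        with torch.cuda.stream(copy_stream):
            # batches should be in pinned host memory for the copy
            # to actually run asynchronously.
            slots[i % 2] = batches[i].to("cuda", non_blocking=True)

    prefetch(0)
    outputs = []
    for i in range(len(batches)):
        # Compute must not start before batch i has landed on the GPU.
        torch.cuda.current_stream().wait_stream(copy_stream)
        if i + 1 < len(batches):
            prefetch(i + 1)  # copy the next batch while this one computes
        outputs.append(model(slots[i % 2]))
    return outputs
```

If the forward pass already dominates the transfer time, the overlap saves little, which would be consistent with the modest speedup observed here.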
light-vllm v0.2.1
Warning
Not rigorously tested.
For research and experimentation only.
Use vllm for production environments.
light-vllm v0.2 Modularization + Workflow
The project is split into plug-and-play modules, wired together through a Workflow configuration.
The Workflow abstraction:
```
Input(request_id, prompt, params, arrival_time) -> InputProcessor -> Request
scheduler.add_request(request: Request)

engine.step():
    Request -> RequestProcessor -> SequenceGroup  (lazy RequestProcessor)
    seq_group_metadata_list, scheduler_outputs = scheduler.schedule()
    List[SequenceGroupMetadata], SchedulerOutputs -> ModelPreProcessor -> ExecuteInput
    ExecuteInput -> Executor -> List[ExecuteOutput]
    List[ExecuteOutput] -> OutputProcessor -> RequestOutput
    RequestOutput -> return to downstream
```
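Read as code, the step loop implied by this pipeline might look roughly like the sketch below; the attribute and method names are assumptions for readability, not light-vllm's exact API.

```python
class LLMEngine:
    def __init__(self, input_processor, scheduler, model_pre_processor,
                 executor, output_processor):
        # Each component is pluggable, as configured by a Workflow.
        self.input_processor = input_processor
        self.scheduler = scheduler
        self.model_pre_processor = model_pre_processor
        self.executor = executor
        self.output_processor = output_processor

    def add_request(self, request_id, prompt, params, arrival_time):
        # Input -> InputProcessor -> Request
        request = self.input_processor(request_id, prompt, params, arrival_time)
        self.scheduler.add_request(request)

    def step(self):
        # Request -> RequestProcessor -> SequenceGroup happens lazily,
        # inside schedule(), only for requests that actually get scheduled.
        seq_group_metadata_list, scheduler_outputs = self.scheduler.schedule()
        # List[SequenceGroupMetadata], SchedulerOutputs -> ModelPreProcessor -> ExecuteInput
        execute_input = self.model_pre_processor(seq_group_metadata_list,
                                                 scheduler_outputs)
        # ExecuteInput -> Executor -> List[ExecuteOutput]
        execute_outputs = self.executor.execute_model(execute_input)
        # List[ExecuteOutput] -> OutputProcessor -> RequestOutput, returned downstream
        return self.output_processor(execute_outputs)
```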
A ChatWorkflow is defined for chat models:
```python
class ChatWorkflow(Workflow):
    InputProcessor: str = "light_vllm.task.chat.processor.input_processor:ChatModelInputProcessor"
    RequestProcessor: str = "light_vllm.task.chat.processor.input_processor:ChatModelRequestProcessor"
    OutputProcessor: str = "light_vllm.task.chat.processor.output_processor:ChatModelOutputProcessor"
    ModelPreProcessor: str = "light_vllm.task.chat.processor.model_pre_processor:ChatModelPreProcessor"
    Worker: str = "light_vllm.task.chat.worker.gpu_worker:Worker"
    Executor: str = "light_vllm.task.base.executor.gpu_executor:GPUExecutor"
    Scheduler: str = "light_vllm.core.scheduler:Scheduler"
    Tokenizer: str = "light_vllm.inputs.tokenizer:Tokenizer"
```
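Since each Workflow field is a "module:ClassName" string, every component can be imported lazily. A minimal sketch of such resolution follows; lazy_load is a hypothetical helper, not necessarily how light-vllm loads components.

```python
import importlib

def lazy_load(path: str):
    # "pkg.module:ClassName" -> the ClassName object from pkg.module
    module_name, _, class_name = path.partition(":")
    return getattr(importlib.import_module(module_name), class_name)

# e.g. resolve and instantiate the executor configured by ChatWorkflow:
# executor_cls = lazy_load(ChatWorkflow.Executor)
# executor = executor_cls(...)
```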
Since the engine is now split into modules, each can be benchmarked separately (see the timing sketch below):
On Qwen/Qwen2-7B-Instruct, the GPUExecutor accounts for 90+% of total time.
On Qwen/Qwen2-1.5B-Instruct, the GPUExecutor accounts for 70-80% of total time.
Interesting.
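A hypothetical way to collect such per-stage shares, assuming the step loop is instrumented around each Workflow component; the helper below is illustrative, not light-vllm code.

```python
import contextlib
import time
from collections import defaultdict

stage_times = defaultdict(float)

@contextlib.contextmanager
def timed(stage: str):
    # Accumulate wall-clock time per Workflow stage. For GPU stages,
    # synchronize before and after the block so the number is meaningful.
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_times[stage] += time.perf_counter() - start

def report():
    total = sum(stage_times.values()) or 1.0
    for stage, t in sorted(stage_times.items(), key=lambda kv: -kv[1]):
        print(f"{stage:>18}: {t / total:6.1%}")

# Usage inside the step loop, e.g.:
#   with timed("Scheduler"):         ... = scheduler.schedule()
#   with timed("ModelPreProcessor"): ... = model_pre_processor(...)
#   with timed("GPUExecutor"):       ... = executor.execute_model(...)
#   with timed("OutputProcessor"):   ... = output_processor(...)
```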
light-vllm v0.2
Warning
Not rigorously tested.
For research and experimentation only.
Use vllm for production environments.
light-vllm v0.1 Baseline
This is a personal, experimental project.
- First, forked from vllm v0.5.4
- Removed the following modules:
  - distributed ray
  - adapter_commons prompt_adapter lora
  - multimodal
  - spec_decode guided_decoding
  - async
  - usage metrics tracing observability
Performance baseline
As a base for trying out some ideas of my own, it looks good.
Best regards
light-vllm v0.1
Warning
Not rigorously tested.
For research and experimentation only.
Use vllm for production environments.