[WIP] Spec infer with EAGLE2 #1498
base: main
Conversation
Hello, does this code support speculative decoding for multiple concurrent requests?
Hi @yukavio, is there any recent progress or plan for this? Do you plan to support DeepSeek-V2?
Yes, I will support it.
I have implemented the draft and verify stages and tested them on a single request. I am migrating my code to the main branch because it contains some significant changes to the controller and worker that are very important for my implementation. My plan:
Thanks, yukavio. I have some suggestions for this PR: 1. To support more models, I think we should pop the EAGLE head from draft_extend_input_queue so we don't modify the original llama model file. 2. I don't understand why we need so many SpecInfoPipline queues; speculation only happens in the decoding stage, so at the very least we shouldn't need draft_extend_input_queue.
I also have another question: in this PR, model_runner.py initializes the KV cache twice in different TP workers, which results in OOM on the GPU. Could we merge the draft and target KV caches to increase GPU utilization?
I have migrated the code to another branch, https://github.com/yukavio/sglang/tree/new_spec_infer, and I will update this PR shortly. In the new implementation, I run the draft worker and the target model worker in one process instead of using the many queues in SpecInfoPipline to communicate between a draft worker process and a target process. For memory management, I've fixed this bug in the new branch to ensure it won't raise an error during testing. It may not be very efficient yet; I will improve it after I finish the remaining work in the plan.
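For readers unfamiliar with the draft and verify stages mentioned above, here is a conceptual toy of draft-then-verify speculative decoding in plain Python. It is not this PR's code: both "models" are stand-in functions, and EAGLE2 drafts a top-k token tree per step rather than the single chain shown here, but the accept/reject idea is the same.

def draft_model(prefix, num_draft_tokens):
    # Stand-in draft: guesses that the sequence keeps counting upward.
    return [prefix[-1] + 1 + i for i in range(num_draft_tokens)]

def target_model_greedy(prefix):
    # Stand-in target: also counts upward, but resets to 0 after 5.
    nxt = prefix[-1] + 1
    return nxt if nxt <= 5 else 0

def speculative_step(prefix, num_draft_tokens=4):
    draft = draft_model(prefix, num_draft_tokens)
    accepted = []
    context = list(prefix)
    for token in draft:
        # One target "verification" per position; in a real engine all draft
        # positions are scored in a single batched target forward pass.
        target_token = target_model_greedy(context)
        if token == target_token:
            accepted.append(token)
            context.append(token)
        else:
            accepted.append(target_token)   # mismatch: take the target's token and stop
            break
    else:
        accepted.append(target_model_greedy(context))  # all accepted: bonus token
    return accepted

print(speculative_step([1, 2, 3]))  # accepts 4 and 5, then the target corrects with 0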
@yukavio Hi, SGLang has recently undergone some refactoring. You need to merge the latest main to resolve the corresponding conflicts. Thanks!
@yukavio Hi, when is this PR expected to be merged? I've trained a draft model and am eager to try it out.
OK, I am fixing some bugs in batch inference now and will update the code to the main branch after fixing them. Personally, I think the updated code can serve as the first version, which the community could then review.
If all goes well, I will finish the first version of development this week. When it gets merged into the main branch depends on community review and opinions.
@yukavio Is CLI startup not supported currently? I encountered this error:
File "/opt/tiger/sglang/python/sglang/srt/server_args.py", line 613, in <dictcomp>
return cls(**{attr: getattr(args, attr) for attr in attrs})
^^^^^^^^^^^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'draft_runner_cache_size'
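For context, a generic sketch of why this AttributeError appears (not sglang's actual server_args.py): the server-args dataclass is rebuilt from argparse's Namespace with getattr for every field, so a new field such as draft_runner_cache_size raises AttributeError on CLI startup until a matching parser.add_argument is added. The names below are illustrative only.

import argparse
from dataclasses import dataclass, fields

@dataclass
class ServerArgsSketch:
    model_path: str
    draft_runner_cache_size: int = 0   # hypothetical new field added for spec infer

parser = argparse.ArgumentParser()
parser.add_argument("--model-path")
# Missing: parser.add_argument("--draft-runner-cache-size", type=int, default=0)

args = parser.parse_args(["--model-path", "dummy"])
attrs = [f.name for f in fields(ServerArgsSketch)]
try:
    ServerArgsSketch(**{attr: getattr(args, attr) for attr in attrs})
except AttributeError as e:
    print(e)   # 'Namespace' object has no attribute 'draft_runner_cache_size'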
Sorry, I haven't tested it as a service before. I found that there seem to be some problems with the return value when running it that way; I will fix this error later. If you want to test, please run it through offline_batch_inference.py for now. This PR is not ready yet. I'm doing more testing and fixing the bugs I find.
@yukavio When using offline_batch_inference.py, I encountered the following error:
[23:23:54 TP7] Traceback (most recent call last):
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1085, in run_scheduler_process
scheduler.event_loop()
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 272, in event_loop
self.run_step()
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 459, in run_step
result = self.run_batch(new_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 669, in run_batch
logits_output, next_token_ids = self.draft_worker.forward_batch_speculative_generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 67, in forward_batch_speculative_generate
self.forward_draft_extend(batch)
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 42, in forward_draft_extend
logits_output = self.model_runner.forward(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 536, in forward
return self.forward_extend(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 520, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 328, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 289, in forward
torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 44 but got size 1 for tensor number 1 in the list.
Sorry, I'm fixing some bugs, which may make the code temporarily unavailable. I will let you know here once I have fixed and tested it.
May I ask whether the build_tree function has a Triton implementation?
This kernel is difficult to implement in Triton because Triton doesn't support controlling the behavior of individual CUDA threads.
@fengyang95 You can pull the new code and run offline_batch_inference.py again.
You can tune the num_speculative_steps / eagle_topk / num_draft_tokens arguments to get better performance.
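A hedged launch sketch with these three knobs, assuming an Engine-style offline entry point. The speculative argument names (num_speculative_steps, eagle_topk, num_draft_tokens, speculative_draft_model_path) are taken from this thread and may not match the merged API exactly; the model paths are placeholders.

import sglang as sgl

llm = sgl.Engine(
    model_path="path/to/target-model",                  # placeholder target model
    speculative_draft_model_path="path/to/eagle-head",  # placeholder EAGLE draft head
    num_speculative_steps=5,    # depth of the draft tree per decode step
    eagle_topk=8,               # candidate branches kept at each draft step
    num_draft_tokens=64,        # total draft tokens verified per target forward
)
print(llm.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 32}))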
@yukavio Thanks, I will try it as soon as possible. Additionally, is FP8 supported? From the code, it seems that FP8 is supported, although I'm uncertain how to properly quantize the draft model.
I haven't considered quantization in my implementation and testing yet, but I don't think it should be a difficult problem to solve. I'm currently busy merging the master branch and resolving conflicts, so I may have to look at this issue later. Or you can implement it yourself if you want 😄.
My target model is quantized. Is there a way to keep the draft model in bf16 while the target model remains quantized?
It is not supported currently, but I think it would not be very difficult to add.
@yukavio I tested DeepSeek-V2 + EAGLE (a draft head I trained myself, using the llama architecture for the AR layer) on 8 H20 cards. Here are the errors when using CUDA graph:
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 143, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 141, in __init__
self.init_cuda_graphs()
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 504, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 143, in __init__
self.model_runner.attn_backend.init_cuda_graph_state(self.max_num_token)
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 72, in init_cuda_graph_state
self.cuda_graph_attn_logits = torch.empty(
^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 GiB. GPU 1 has a total capacity of 95.00 GiB of which 16.12 GiB is free. Process 1714898 has 78.87 GiB memory in use. Of the allocated memory 76.60 GiB is allocated by PyTorch, and 42.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

When not using CUDA graph, the errors are as follows:
[17:28:42 TP6] Traceback (most recent call last):
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1092, in run_scheduler_process
scheduler.event_loop()
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 274, in event_loop
self.run_step()
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 461, in run_step
result = self.run_batch(new_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 674, in run_batch
logits_output, next_token_ids = self.draft_worker.forward_batch_speculative_generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 69, in forward_batch_speculative_generate
self.forward_draft_extend(batch)
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 44, in forward_draft_extend
logits_output = self.model_runner.forward(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in forward
return self.forward_extend(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 518, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 328, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 289, in forward
torch.cat(
TypeError: expected Tensor as element 1 in argument 0, but got NoneType
For the first case, you should try using FlashInfer as the attention backend to fix it. For the second case, it seems your target model fails to capture the hidden states. You can add code to your target model (DeepSeek-V2 or another) like this: https://github.com/sgl-project/sglang/pull/1498/files#diff-f6f7943965e41f2d4081018071c87bc1e9f806d5d639579688eb5f6c02f250cdR320
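A toy illustration (plain PyTorch, not the linked diff) of what "capturing the hidden states" means here: the target model returns its last-layer hidden states alongside the logits, because the EAGLE draft head conditions on those hidden states in addition to the token ids. The module below is a stand-in, not a real SGLang model.

import torch
import torch.nn as nn

class TinyTargetModel(nn.Module):
    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.backbone = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    def forward(self, input_ids):
        hidden_states, _ = self.backbone(self.embed(input_ids))
        logits = self.lm_head(hidden_states)
        # Return the last-layer hidden states alongside the logits so a
        # draft (EAGLE) head can condition on them.
        return logits, hidden_states

logits, hidden = TinyTargetModel()(torch.randint(0, 100, (1, 8)))
print(logits.shape, hidden.shape)   # torch.Size([1, 8, 100]) torch.Size([1, 8, 16])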
I am using the latest code for offline testing.
Thanks for the report. I will check and fix it later.
@yukavio I ran into another issue:
[19:16:21 TP5] Traceback (most recent call last):
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1090, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 143, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 141, in __init__
self.init_cuda_graphs()
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 504, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 167, in __init__
self.capture()
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 196, in capture
) = self.capture_one_batch_size(bs, num_token, forward)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 255, in capture_one_batch_size
run_once(self.capture_forward_mode)
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 249, in run_once
return forward(input_ids, forward_batch.positions, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 662, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 631, in forward
hidden_states, residual = layer(
^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 578, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 473, in forward
attn_output = self.attn(q_input, k_input, v_input, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/layers/radix_attention.py", line 60, in forward
return forward_batch.attn_backend.forward(q, k, v, self, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/__init__.py", line 39, in forward
return self.forward_extend(q, k, v, layer, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 112, in forward_extend
self.extend_attention_fwd(
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_ops/extend_attention.py", line 308, in extend_attention_fwd
grid = (batch_size, head_num, triton.cdiv(max_len_extend, BLOCK_M))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/__init__.py", line 60, in cdiv
return (x + y - 1) // y
~~^~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
FYI: due to the large number of modifications to the attention backend in the main branch, I need some time to fix the conflicts caused by the code migration.
Motivation
Accelerate model inference with speculative decoding (EAGLE2).
Modifications
It will be provided soon.
Checklist