[WIP] Spec infer with EAGLE2 #1498
base: main
Conversation
Hello, does this code support speculative decoding for multiple concurrent requests?
Hi @yukavio, is there any recent progress or plan for this? Do you plan to support DeepSeek-V2?
Yes, I will support it.
I have implemented the draft and verify stages and tested them on a single request. I am migrating my code to the main branch because it contains some significant changes to the controller and worker that are very important for my implementation. My plan:
Thanks, yukavio. I have some suggestions for this PR: 1. To support more models, I think we should pop the EAGLE head from draft_extend_input_queue so we don't modify the original llama model file. 2. I don't understand why we need so many SpecInfoPipline queues; speculation only happens in the decoding stage, so at the very least we shouldn't need draft_extend_input_queue.
I also have another question: in this PR, model_runner.py initializes the KV cache twice in different TP workers, which results in OOM on the GPU. Could we merge the draft and target KV caches to increase GPU utilization?
I have migrated the code to another branch, https://github.com/yukavio/sglang/tree/new_spec_infer, and I will update this PR shortly. In the new implementation, I run the draft worker and the target model worker in one process instead of using the many queues in SpecInfoPipline to communicate between a draft worker process and a target process. For memory management, I've fixed this bug in the new branch to ensure it won't raise an error during testing. It may not be very efficient yet; I will improve it after I finish the remaining work in the plan.
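For readers unfamiliar with the draft and verify stages mentioned above, here is a conceptual toy of draft-then-verify speculative decoding in plain Python. It is not this PR's code: both "models" are stand-in functions, and EAGLE2 drafts a top-k token tree per step rather than the single chain shown here, but the accept/reject idea is the same.

def draft_model(prefix, num_draft_tokens):
    # Stand-in draft: guesses that the sequence keeps counting upward.
    return [prefix[-1] + 1 + i for i in range(num_draft_tokens)]

def target_model_greedy(prefix):
    # Stand-in target: also counts upward, but resets to 0 after 5.
    nxt = prefix[-1] + 1
    return nxt if nxt <= 5 else 0

def speculative_step(prefix, num_draft_tokens=4):
    draft = draft_model(prefix, num_draft_tokens)
    accepted = []
    context = list(prefix)
    for token in draft:
        # One target "verification" per position; in a real engine all draft
        # positions are scored in a single batched target forward pass.
        target_token = target_model_greedy(context)
        if token == target_token:
            accepted.append(token)
            context.append(token)
        else:
            accepted.append(target_token)   # mismatch: take the target's token and stop
            break
    else:
        accepted.append(target_model_greedy(context))  # all accepted: bonus token
    return accepted

print(speculative_step([1, 2, 3]))  # accepts 4 and 5, then the target corrects with 0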
@yukavio Hi, SGLang has recently undergone some refactoring. You need to merge the latest main to resolve the corresponding conflicts. Thanks!
@yukavio Hi, when is this PR expected to be merged? I've trained a draft model and am eager to try it out.
OK, I am fixing some bugs in batch inference now and will update the code to the main branch after fixing them. Personally, I think the updated code can serve as the first version, which the community could then review.
If all goes well, I will finish the first version of development this week. When it gets merged into the main branch depends on community review and opinions.
@yukavio Is CLI startup not supported currently? I encountered this error:
File "/opt/tiger/sglang/python/sglang/srt/server_args.py", line 613, in <dictcomp>
return cls(**{attr: getattr(args, attr) for attr in attrs})
^^^^^^^^^^^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'draft_runner_cache_size'
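For context, a generic sketch of why this AttributeError appears (not sglang's actual server_args.py): the server-args dataclass is rebuilt from argparse's Namespace with getattr for every field, so a new field such as draft_runner_cache_size raises AttributeError on CLI startup until a matching parser.add_argument is added. The names below are illustrative only.

import argparse
from dataclasses import dataclass, fields

@dataclass
class ServerArgsSketch:
    model_path: str
    draft_runner_cache_size: int = 0   # hypothetical new field added for spec infer

parser = argparse.ArgumentParser()
parser.add_argument("--model-path")
# Missing: parser.add_argument("--draft-runner-cache-size", type=int, default=0)

args = parser.parse_args(["--model-path", "dummy"])
attrs = [f.name for f in fields(ServerArgsSketch)]
try:
    ServerArgsSketch(**{attr: getattr(args, attr) for attr in attrs})
except AttributeError as e:
    print(e)   # 'Namespace' object has no attribute 'draft_runner_cache_size'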
Sorry, I haven't tested it as a service before. I found that there seem to be some problems with the return value when running it that way; I will fix this error later. If you want to test, please run it through offline_batch_inference.py for now. This PR is not ready yet. I'm doing more testing and fixing the bugs I find.
@yukavio When using offline_batch_inference.py, I encountered the following error:
[23:23:54 TP7] Traceback (most recent call last):
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1085, in run_scheduler_process
scheduler.event_loop()
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 272, in event_loop
self.run_step()
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 459, in run_step
result = self.run_batch(new_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 669, in run_batch
logits_output, next_token_ids = self.draft_worker.forward_batch_speculative_generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 67, in forward_batch_speculative_generate
self.forward_draft_extend(batch)
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 42, in forward_draft_extend
logits_output = self.model_runner.forward(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 536, in forward
return self.forward_extend(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 520, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 328, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 289, in forward
torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 44 but got size 1 for tensor number 1 in the list.
Sorry, I'm fixing some bugs, which may make the code temporarily unavailable. I will let you know here once I have fixed and tested it.
May I ask whether the build_tree function has a Triton implementation?
This kernel is difficult to implement in Triton because Triton doesn't support controlling the behavior of individual CUDA threads.
@fengyang95 You can pull the new code and run offline_batch_inference.py again.
You can tune the num_speculative_steps / eagle_topk / num_draft_tokens arguments to get better performance.
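A hedged launch sketch with these three knobs, assuming an Engine-style offline entry point. The speculative argument names (num_speculative_steps, eagle_topk, num_draft_tokens, speculative_draft_model_path) are taken from this thread and may not match the merged API exactly; the model paths are placeholders.

import sglang as sgl

llm = sgl.Engine(
    model_path="path/to/target-model",                  # placeholder target model
    speculative_draft_model_path="path/to/eagle-head",  # placeholder EAGLE draft head
    num_speculative_steps=5,    # depth of the draft tree per decode step
    eagle_topk=8,               # candidate branches kept at each draft step
    num_draft_tokens=64,        # total draft tokens verified per target forward
)
print(llm.generate("The capital of France is", {"temperature": 0, "max_new_tokens": 32}))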
@yukavio Thanks, I will try it as soon as possible. Additionally, is FP8 supported? From the code, it seems that FP8 is supported, although I'm uncertain how to properly quantize the draft model.
I haven't considered quantization in my implementation and testing yet, but I don't think it should be a difficult problem to solve. I'm currently busy merging the master branch and resolving conflicts, so I may have to look at this issue later. Or you can implement it yourself if you want 😄.
My target model is quantized. Is there a way to keep the draft model in bf16 while the target model remains quantized?
It is not supported currently, but I think it would not be very difficult to add.
@yukavio I tested DeepSeek-V2 + EAGLE (a draft head I trained myself, using the llama architecture for the AR layer) on 8 H20 cards. Here are the errors when using CUDA graph:
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 143, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 141, in __init__
self.init_cuda_graphs()
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 504, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 143, in __init__
self.model_runner.attn_backend.init_cuda_graph_state(self.max_num_token)
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 72, in init_cuda_graph_state
self.cuda_graph_attn_logits = torch.empty(
^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 GiB. GPU 1 has a total capacity of 95.00 GiB of which 16.12 GiB is free. Process 1714898 has 78.87 GiB memory in use. Of the allocated memory 76.60 GiB is allocated by PyTorch, and 42.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

When not using CUDA graph, the errors are as follows:
[17:28:42 TP6] Traceback (most recent call last):
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1092, in run_scheduler_process
scheduler.event_loop()
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 274, in event_loop
self.run_step()
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 461, in run_step
result = self.run_batch(new_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 674, in run_batch
logits_output, next_token_ids = self.draft_worker.forward_batch_speculative_generate(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 69, in forward_batch_speculative_generate
self.forward_draft_extend(batch)
File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 44, in forward_draft_extend
logits_output = self.model_runner.forward(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in forward
return self.forward_extend(forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 518, in forward_extend
return self.model.forward(
^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 328, in forward
hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 289, in forward
torch.cat(
TypeError: expected Tensor as element 1 in argument 0, but got NoneType
For the first case, you should try using FlashInfer as the attention backend to fix it. For the second case, it seems your target model fails to capture the hidden states. You can add code to your target model (DeepSeek-V2 or another) like this: https://github.com/sgl-project/sglang/pull/1498/files#diff-f6f7943965e41f2d4081018071c87bc1e9f806d5d639579688eb5f6c02f250cdR320
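A toy illustration (plain PyTorch, not the linked diff) of what "capturing the hidden states" means here: the target model returns its last-layer hidden states alongside the logits, because the EAGLE draft head conditions on those hidden states in addition to the token ids. The module below is a stand-in, not a real SGLang model.

import torch
import torch.nn as nn

class TinyTargetModel(nn.Module):
    def __init__(self, vocab=100, dim=16):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.backbone = nn.GRU(dim, dim, batch_first=True)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    def forward(self, input_ids):
        hidden_states, _ = self.backbone(self.embed(input_ids))
        logits = self.lm_head(hidden_states)
        # Return the last-layer hidden states alongside the logits so a
        # draft (EAGLE) head can condition on them.
        return logits, hidden_states

logits, hidden = TinyTargetModel()(torch.randint(0, 100, (1, 8)))
print(logits.shape, hidden.shape)   # torch.Size([1, 8, 100]) torch.Size([1, 8, 16])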
I am using the latest code for offline testing.
Thanks for the report. I will check and fix it later.
@yukavio I ran into another issue:
[19:16:21 TP5] Traceback (most recent call last):
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1090, in run_scheduler_process
scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 143, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
self.model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 141, in __init__
self.init_cuda_graphs()
File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 504, in init_cuda_graphs
self.cuda_graph_runner = CudaGraphRunner(self)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 167, in __init__
self.capture()
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 196, in capture
) = self.capture_one_batch_size(bs, num_token, forward)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 255, in capture_one_batch_size
run_once(self.capture_forward_mode)
File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 249, in run_once
return forward(input_ids, forward_batch.positions, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 662, in forward
hidden_states = self.model(input_ids, positions, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 631, in forward
hidden_states, residual = layer(
^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 578, in forward
hidden_states = self.self_attn(
^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 473, in forward
attn_output = self.attn(q_input, k_input, v_input, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/layers/radix_attention.py", line 60, in forward
return forward_batch.attn_backend.forward(q, k, v, self, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/__init__.py", line 39, in forward
return self.forward_extend(q, k, v, layer, forward_batch)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 112, in forward_extend
self.extend_attention_fwd(
File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_ops/extend_attention.py", line 308, in extend_attention_fwd
grid = (batch_size, head_num, triton.cdiv(max_len_extend, BLOCK_M))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/__init__.py", line 60, in cdiv
return (x + y - 1) // y
~~^~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'
FYI: due to the large number of modifications to the attention backend in the main branch, I need some time to fix the conflicts caused by the code migration.
Motivation
Accelerate model inference with speculative decoding (EAGLE2).
Modifications
It will be provided soon.
Checklist