
[WIP] Spec infer with EAGLE2 #1498

Open · yukavio wants to merge 28 commits into main

Conversation

@yukavio commented Sep 24, 2024

Motivation

Accelerate model inference via speculative decoding (EAGLE2).

Modifications

Details will be provided soon.

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@Qiubo1 commented Sep 26, 2024

Hello, does this code support speculative decoding for multiple requests?

@merrymercy mentioned this pull request on Oct 6, 2024
@fengyang95 commented Oct 9, 2024

Hi @yukavio, is there any recent progress or a plan for this? Do you plan to support DeepSeek-V2?

@yukavio (Author) commented Oct 11, 2024

> Hello, does this code support speculative decoding for multiple requests?

Yes, I will support it.

@yukavio (Author) commented Oct 11, 2024

> Hi @yukavio, is there any recent progress or a plan for this? Do you plan to support DeepSeek-V2?

I have implemented the draft and verify stages and tested them on a single request. I am now migrating my code to the main branch, since the main branch has made some significant changes to the controller and worker that are important for my implementation.
I do not plan to support DeepSeek-V2, because there is no open-source EAGLE2 draft model for DeepSeek-V2 to test with.
For now, I plan to implement this feature based on Llama.

My plan:
Migrate the code and test it: 1-2 days.
Implement the remaining code for single-request speculative decoding: half a week to one week.
Implement the remaining code for batched speculative decoding: one to two weeks.

@Qiubo1 commented Oct 16, 2024

Thanks, yukavio. I have some suggestions for this PR: 1. To support more models, I think we should pop the EAGLE head out of draft_extend_input_queue so we don't have to modify the original llama model file. 2. I don't understand why we need so many SpecInfoPipline queues; speculation only happens in the decoding stage, so at the very least we shouldn't need draft_extend_input_queue.

@Qiubo1 commented Oct 16, 2024

I also have another question: in this PR, model_runner.py initializes the KV cache twice in different TP workers, which results in GPU OOM. Could we merge the draft and target KV caches to increase GPU utilization?

@yukavio (Author) commented Oct 16, 2024

> I also have another question: in this PR, model_runner.py initializes the KV cache twice in different TP workers, which results in GPU OOM. Could we merge the draft and target KV caches to increase GPU utilization?

I have migrated the code to another branch: https://github.com/yukavio/sglang/tree/new_spec_infer and I will update this PR with it later. In the new implementation, the draft worker and the target model worker run in one process, instead of using many queues in SpecInfoPipline to communicate between the draft worker process and the target process.

For memory management, I've fixed this bug in the new branch to ensure it won't raise an error during testing. It may not be very efficient yet, and I will improve it after I finish the remaining work in the plan.

@zhyncs (Member) commented Oct 21, 2024

@yukavio Hi yukavio, SGLang has recently undergone some refactoring work. You need to merge the latest main to resolve the corresponding conflicts. Thanks!

@fengyang95

@yukavio Hi, when is this PR expected to be merged? I've trained a draft model and am eager to try it out.

@yukavio (Author) commented Oct 21, 2024

> @yukavio Hi yukavio, SGLang has recently undergone some refactoring work. You need to merge the latest main to resolve the corresponding conflicts. Thanks!

OK. I am fixing some bugs in batch inference now and will update the code against the main branch after fixing them. Personally, I think the updated code can serve as the first version; the community could review this version of the implementation.

@yukavio (Author) commented Oct 21, 2024

> @yukavio Hi, when is this PR expected to be merged? I've trained a draft model and am eager to try it out.

If all goes well, I will finish the first version of development this week. When it is merged into the main branch depends on community review and opinions.

@fengyang95

@yukavio Is CLI startup not supported currently? I encountered this error:

File "/opt/tiger/sglang/python/sglang/srt/server_args.py", line 613, in <dictcomp>
return cls(**{attr: getattr(args, attr) for attr in attrs})
^^^^^^^^^^^^^^^^^^^
AttributeError: 'Namespace' object has no attribute 'draft_runner_cache_size'

@yukavio (Author) commented Oct 25, 2024

python3 -m sglang.launch_server --model $LOCAL_PATH --stream-interval 1 --max-prefill-tokens 16384  --trust-remote-code  --mem-frac $MEM_FRAC --tp $TP_SIZE --dp $DP_SIZE --kv-cache-dtype fp8_e5m2 --port $PORT0  --mem-fraction-static $MEM_FRACTION_STATIC --schedule-conservativeness $SCHEDULE_CONSERVATIVENESS --context-length $MODEL_LEN  --chunked-prefill-size $CHUNKED_PREFILL_SIZE \
        --draft-model-path /opt/tiger/eagle --num-speculative-steps 5 --eagle-topk 8 --num-draft-tokens 64 --speculative-algorithm EAGLE >> server.log 2>&1

Sorry, I haven't tested it as a service before. You can download the draft model (https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B) first and edit its config (change architectures from "LlamaForCausalLM" to "LlamaForCausalLMEagle").
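For reference, here is a minimal sketch of that config edit (not part of this PR; the path is an assumption based on where you downloaded the draft model):

import json

# Hypothetical local path: adjust to wherever the draft model was downloaded.
config_path = "EAGLE-llama2-chat-7B/config.json"

with open(config_path) as f:
    config = json.load(f)

# Swap the architecture name so the EAGLE draft head implementation is used.
config["architectures"] = ["LlamaForCausalLMEagle"]

with open(config_path, "w") as f:
    json.dump(config, f, indent=2)
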
After that, you can start the service with this command:
python3 -m sglang.launch_server --model Llama-2-7b-chat-hf --stream-interval 1 --max-prefill-tokens 16384 --trust-remote-code --mem-frac 0.8 --tp 1 --dp 1 --mem-fraction-static 0.8 --draft-model-path EAGLE-llama2-chat-7B --num-speculative-steps 5 --eagle-topk 8 --num-draft-tokens 64 --speculative-algorithm EAGLE

However, I found that there seem to be some problems with the return value when running it this way. I will fix this error later. If you want to test, please run it first through the form in offline_batch_inference.py.

This PR is not ready now. I'm doing more testing and fixing the bugs I find.

@fengyang95 commented Oct 27, 2024

> Sorry, I haven't tested it as a service before. […] If you want to test, please run it first through the form in offline_batch_inference.py.

@yukavio When using offline_batch_inference.py, I encountered the following error:

[23:23:54 TP7] Traceback (most recent call last):
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1085, in run_scheduler_process
    scheduler.event_loop()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 272, in event_loop
    self.run_step()
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 459, in run_step
    result = self.run_batch(new_batch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 669, in run_batch
    logits_output, next_token_ids = self.draft_worker.forward_batch_speculative_generate(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 67, in forward_batch_speculative_generate
    self.forward_draft_extend(batch)
  File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 42, in forward_draft_extend
    logits_output = self.model_runner.forward(forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 536, in forward
    return self.forward_extend(forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 520, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 328, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 289, in forward
    torch.cat(
RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 44 but got size 1 for tensor number 1 in the list.

@yukavio (Author) commented Oct 28, 2024

> @yukavio When using offline_batch_inference.py, I encountered the following error: […] RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 44 but got size 1 for tensor number 1 in the list.

Sorry, I'm fixing some bugs, which may make the code temporarily unavailable. I will let you know here once I have fixed and tested it.

@coolhok commented Oct 28, 2024

Excuse me, may I ask whether the build_tree function lacks a Triton implementation?

@yukavio (Author) commented Oct 28, 2024

> Excuse me, may I ask whether the build_tree function lacks a Triton implementation?

This kernel is difficult to implement in Triton because Triton does not let you control the behavior of individual CUDA threads.

@yukavio (Author) commented Oct 29, 2024

@fengyang95 You can pull the new code and run offline_batch_inference.py again.
This is my code for creating an Engine with EAGLE:

sgl.Engine(model_path="Llama-2-7b-chat-hf", draft_model_path='EAGLE-llama2-chat-7B', disable_cuda_graph=False, 
                     num_speculative_steps=2, eagle_topk=2, num_draft_tokens=2,  speculative_algorithm='EAGLE', mem_fraction_static=0.75, 
                     dtype='bfloat16')

You can tune num_speculative_steps / eagle_topk / num_draft_tokens to get better performance.
Both eagle_topk and num_draft_tokens should be powers of 2; eagle_topk should be less than 10, and num_draft_tokens should be less than or equal to 64.
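
As a quick illustration of these constraints, a hypothetical helper (not part of this PR) could sanity-check the values before constructing the engine:

# Hypothetical helper, not part of this PR: checks the constraints described above.
def check_eagle_args(eagle_topk: int, num_draft_tokens: int) -> None:
    def is_power_of_two(x: int) -> bool:
        return x > 0 and (x & (x - 1)) == 0

    assert is_power_of_two(eagle_topk) and eagle_topk < 10, \
        "eagle_topk must be a power of 2 and less than 10"
    assert is_power_of_two(num_draft_tokens) and num_draft_tokens <= 64, \
        "num_draft_tokens must be a power of 2 and at most 64"

check_eagle_args(eagle_topk=2, num_draft_tokens=2)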

@fengyang95 commented Oct 29, 2024

> @fengyang95 You can pull the new code and run offline_batch_inference.py again. […] You can tune num_speculative_steps / eagle_topk / num_draft_tokens to get better performance.

@yukavio Thanks, I will try it ASAP. Additionally, is FP8 supported? From the code, it seems that FP8 is supported, although I'm uncertain how to properly quantize the draft model.

@yukavio (Author) commented Oct 29, 2024

> @yukavio Thanks, I will try it ASAP. Additionally, is FP8 supported? From the code, it seems that FP8 is supported, although I'm uncertain how to properly quantize the draft model.

I haven't considered quantization yet in my implementation and testing, but I don't think it will be a difficult problem to solve. I'm currently busy merging the master branch and resolving conflicts, so I may have to look at this later, or you can implement it yourself if you want 😄.
P.S. The draft model accounts for a very small share of the inference time, so the benefit of quantizing it may be modest.

@fengyang95

> I haven't considered quantization yet in my implementation and testing. […] The draft model accounts for a very small share of the inference time, so the benefit of quantizing it may be modest.

My target model is quantized. Is there a way to keep the draft model in bf16 while the target model remains quantized?

@yukavio (Author) commented Oct 29, 2024

> My target model is quantized. Is there a way to keep the draft model in bf16 while the target model remains quantized?

Currently, it is not supported, but I don't think it would be very difficult to add. Personally, I hope to implement these features step by step after this PR is reviewed and merged; this PR is already very large, which makes it difficult to maintain.

@fengyang95

@yukavio I conducted tests on DeepSeek-V2 + EAGLE (which I trained myself, using the llama architecture for the AR layer) on 8 H20 cards.
My draft model produces normal results with vLLM, which indicates there should be no issues with the model structure.

Here are the errors when using cuda graph:

 File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 143, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 141, in __init__
    self.init_cuda_graphs()
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 504, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 143, in __init__
    self.model_runner.attn_backend.init_cuda_graph_state(self.max_num_token)
  File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 72, in init_cuda_graph_state
    self.cuda_graph_attn_logits = torch.empty(
                                  ^^^^^^^^^^^^
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 40.00 GiB. GPU 1 has a total capacity of 95.00 GiB of which 16.12 GiB is free. Process 1714898 has 78.87 GiB memory in use. Of the allocated memory 76.60 GiB is allocated by PyTorch, and 42.97 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to avoid fragmentation.  See documentation for Memory Management  (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)

When not using cuda graph, the errors are as follows:

[17:28:42 TP6] Traceback (most recent call last):
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1092, in run_scheduler_process
    scheduler.event_loop()
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 274, in event_loop
    self.run_step()
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 461, in run_step
    result = self.run_batch(new_batch)
             ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 674, in run_batch
    logits_output, next_token_ids = self.draft_worker.forward_batch_speculative_generate(
                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 69, in forward_batch_speculative_generate
    self.forward_draft_extend(batch)
  File "/opt/tiger/sglang/python/sglang/srt/speculative/eagle_worker.py", line 44, in forward_draft_extend
    logits_output = self.model_runner.forward(forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 534, in forward
    return self.forward_extend(forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 518, in forward_extend
    return self.model.forward(
           ^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 328, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/llama_eagle.py", line 289, in forward
    torch.cat(
TypeError: expected Tensor as element 1 in argument 0, but got NoneType

@yukavio (Author) commented Oct 30, 2024

> @yukavio I conducted tests on DeepSeek-V2 + EAGLE (which I trained myself, using the llama architecture for the AR layer) on 8 H20 cards. […] Here are the errors when using cuda graph: torch.OutOfMemoryError: CUDA out of memory. […] When not using cuda graph: TypeError: expected Tensor as element 1 in argument 0, but got NoneType

For the first case, you should try using flashinfer as the attention backend to fix it.

For the second case, it seems your target model fails to capture the hidden states. You can add code to your target model (DeepSeek-V2 or another) similar to: https://github.com/sgl-project/sglang/pull/1498/files#diff-f6f7943965e41f2d4081018071c87bc1e9f806d5d639579688eb5f6c02f250cdR320.
I will refine this implementation later so that editing the model implementation is not required.
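
As a very rough sketch of that second point (the spec_info attribute and its hidden_states field are illustrative assumptions for this sketch, not necessarily the PR's actual API), the idea is that the target model's forward pass stashes its final hidden states somewhere the EAGLE draft worker can read them:

# Illustrative sketch only: a forward method on the target model (e.g. DeepSeek-V2)
# that exposes its last hidden states for the draft model. Attribute names are
# assumptions; see the linked llama change in this PR for the real edit.
def forward(self, input_ids, positions, forward_batch):
    hidden_states = self.model(input_ids, positions, forward_batch)
    spec_info = getattr(forward_batch, "spec_info", None)
    if spec_info is not None:
        # Hand the hidden states to the speculative-decoding machinery.
        spec_info.hidden_states = hidden_states
    return self.logits_processor(
        input_ids, hidden_states, self.lm_head.weight, forward_batch
    )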

@coolhok commented Oct 30, 2024

I am using the latest code for offline testing. At the start, tps = 180 tokens/s, but after a few iterations the speed drops significantly: 180 tokens/s -> 57 tokens/s -> 38 tokens/s.

import sglang as sgl
import time
import json

def main():
    # Sample prompts.
    prompts = [
        # "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: Where is the capital city of France? ASSISTANT:",
        "[INST] <<SYS>>\\nYou are a helpful assistant.\\n<</SYS>>\\n如何赚取 10000w? [/INST]"
        # "A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: 你是谁呢? ASSISTANT:"
    ]
    # Create a sampling params object.
    sampling_params = {"temperature": 0, "max_new_tokens": 256,}


    draft_model_path = "/mnt/data/model_hub/EAGLE-llama2-chat-7B"
    model_path = "/mnt/data/model_hub/Llama-2-7b-chat-hf"
    # Create an LLM.
    llm = sgl.Engine(model_path=model_path, draft_model_path=draft_model_path, disable_cuda_graph=False, num_speculative_steps=4, eagle_topk=4, num_draft_tokens=32, speculative_algorithm='EAGLE', mem_fraction_static=0.60)
    # llm = sgl.Engine(model_path=model_path, disable_cuda_graph=False, mem_fraction_static=0.60)
    #outputs = llm.generate(prompts, sampling_params)

    for _ in range(100):
        start = time.time()
        outputs = llm.generate(prompts, sampling_params)
        cos = time.time()-start
        # print(f"!!!! {json.dumps(outputs)}")
        completion_tokens = outputs[0]["meta_info"]["completion_tokens"]
        # Print the outputs.
        for prompt, output in zip(prompts, outputs):
            print(f"!!!!!!!!! tps =: {completion_tokens/cos}")

# The __main__ condition is necessary here because we use "spawn" to create subprocesses
# Spawn starts a fresh program every time, if there is no __main__, it will run into infinite loop to keep spawning processes from sgl.Engine
if __name__ == "__main__":
    main()

@yukavio (Author) commented Oct 30, 2024

> I am using the latest code for offline testing. At the start, tps = 180 tokens/s, but after a few iterations the speed drops significantly: 180 tokens/s -> 57 tokens/s -> 38 tokens/s. […]

Thanks for the report. I will check and fix it later.

@fengyang95

> For the first case, you should try using flashinfer as the attention backend to fix it. For the second case, it seems your target model fails to capture the hidden states. […] I will refine this implementation later so that editing the model implementation is not required.

@yukavio got another issue:

[19:16:21 TP5] Traceback (most recent call last):
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 1090, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/scheduler.py", line 143, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/managers/tp_worker.py", line 62, in __init__
    self.model_runner = ModelRunner(
                        ^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 141, in __init__
    self.init_cuda_graphs()
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/model_runner.py", line 504, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
                             ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 167, in __init__
    self.capture()
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 196, in capture
    ) = self.capture_one_batch_size(bs, num_token, forward)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 255, in capture_one_batch_size
    run_once(self.capture_forward_mode)
  File "/opt/tiger/sglang/python/sglang/srt/model_executor/cuda_graph_runner.py", line 249, in run_once
    return forward(input_ids, forward_batch.positions, forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 662, in forward
    hidden_states = self.model(input_ids, positions, forward_batch)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 631, in forward
    hidden_states, residual = layer(
                              ^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 578, in forward
    hidden_states = self.self_attn(
                    ^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/models/deepseek_v2.py", line 473, in forward
    attn_output = self.attn(q_input, k_input, v_input, forward_batch)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/layers/radix_attention.py", line 60, in forward
    return forward_batch.attn_backend.forward(q, k, v, self, forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/layers/attention/__init__.py", line 39, in forward
    return self.forward_extend(q, k, v, layer, forward_batch)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_backend.py", line 112, in forward_extend
    self.extend_attention_fwd(
  File "/opt/tiger/sglang/python/sglang/srt/layers/attention/triton_ops/extend_attention.py", line 308, in extend_attention_fwd
    grid = (batch_size, head_num, triton.cdiv(max_len_extend, BLOCK_M))
                                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/tiger/.pyenv/versions/3.11.2/lib/python3.11/site-packages/triton/__init__.py", line 60, in cdiv
    return (x + y - 1) // y
            ~~^~~
TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'

@yukavio (Author) commented Oct 31, 2024

FYI: due to the large number of modifications to the attention backend on the main branch, I need some time to fix the conflicts caused by the code migration.
