
minor: support flashinfer nightly #2295

Merged
merged 2 commits into main from zhyncs/nightly on Dec 1, 2024
Conversation

zhyncs
Member

@zhyncs zhyncs commented Dec 1, 2024

Motivation

For now, the nightly flashinfer build can only be triggered manually; the default remains the release version.

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs zhyncs requested a review from merrymercy December 1, 2024 09:18
@zhyncs zhyncs marked this pull request as draft December 1, 2024 09:26
@zhyncs
Member Author

zhyncs commented Dec 1, 2024

ref #2179

@zhyncs zhyncs marked this pull request as ready for review December 1, 2024 10:09
@zhyncs
Member Author

zhyncs commented Dec 1, 2024

@zhyncs zhyncs requested a review from merrymercy December 1, 2024 10:38
"""
Install the dependency in CI.
"""
# Install the dependency in CI.

Comments in bash use #


./killall_sglang.sh
# Use repo from environment variable, passed from GitHub Actions
FLASHINFER_REPO="${FLASHINFER_REPO:-https://flashinfer.ai/whl/cu121/torch2.4}"

Index URL does not need to include the flashinfer directory

# Use repo from environment variable, passed from GitHub Actions
FLASHINFER_REPO="${FLASHINFER_REPO:-https://flashinfer.ai/whl/cu121/torch2.4}"

SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

Resolve the issue of not finding the execution script
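The two bash review comments above fit together roughly as follows. This is a sketch: the final `pip install` line (package name and flags) is an assumption, shown via `echo` rather than executed.

```shell
#!/bin/bash
# Default the flashinfer wheel index from an environment variable so that
# GitHub Actions can override it (e.g. to point at the nightly index).
FLASHINFER_REPO="${FLASHINFER_REPO:-https://flashinfer.ai/whl/cu121/torch2.4}"

# Resolve the script's own directory so companion scripts (killall_sglang.sh)
# are found no matter where the script is invoked from.
SCRIPT_DIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"
cd "$SCRIPT_DIR"

# The index URL is the wheel index root; it does not include a trailing
# flashinfer/ directory. Shown with echo here instead of running pip:
echo "pip install flashinfer -i $FLASHINFER_REPO"
```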

required: true
type: choice
default: 'release'
options:

Use the choice input type for the workflow option
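The choice fields above would sit in a workflow_dispatch trigger along the lines of the sketch below; the input name, description, and option values are assumptions added to make the fragment self-contained.

```yaml
on:
  workflow_dispatch:
    inputs:
      # Hypothetical input name; lets the person triggering the run pick
      # between the release wheel (default) and the nightly build.
      version:
        description: "flashinfer build to install"
        required: true
        type: choice
        default: 'release'
        options:
          - 'release'
          - 'nightly'
```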

@zhyncs
Member Author

zhyncs commented Dec 1, 2024

/usr/local/lib/python3.10/dist-packages/websockets/legacy/__init__.py:6: DeprecationWarning: websockets.legacy is deprecated; see https://websockets.readthedocs.io/en/stable/howto/upgrade.html for upgrade instructions
  warnings.warn(  # deprecated in 14.0 - 2024-11-09
/usr/local/lib/python3.10/dist-packages/uvicorn/protocols/websockets/websockets_impl.py:16: DeprecationWarning: websockets.server.WebSocketServerProtocol is deprecated
  from websockets.server import WebSocketServerProtocol
[2024-12-01 10:39:32 TP0] TpModelWorkerClient hit an exception: Traceback (most recent call last):
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 99, in forward_thread_func
    self.forward_thread_func_()
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/managers/tp_worker_overlap_thread.py", line 130, in forward_thread_func_
    logits_output, next_token_ids = self.worker.forward_batch_generation(
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/managers/tp_worker.py", line 149, in forward_batch_generation
    logits_output = self.model_runner.forward(forward_batch)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/model_executor/model_runner.py", line 664, in forward
    return self.forward_extend(forward_batch)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/model_executor/model_runner.py", line 633, in forward_extend
    return self.model.forward(
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/models/llama.py", line 336, in forward
    hidden_states = self.model(input_ids, positions, forward_batch, input_embeds)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/models/llama.py", line 288, in forward
    hidden_states, residual = layer(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/models/llama.py", line 237, in forward
    hidden_states = self.self_attn(
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/models/llama.py", line 174, in forward
    attn_output = self.attn(q, k, v, forward_batch)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1736, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1747, in _call_impl
    return forward_call(*args, **kwargs)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/layers/radix_attention.py", line 58, in forward
    return forward_batch.attn_backend.forward(q, k, v, self, forward_batch)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/layers/attention/__init__.py", line 60, in forward
    return self.forward_extend(q, k, v, layer, forward_batch)
  File "/actions-runner/_work/sglang/sglang/python/sglang/srt/layers/attention/flashinfer_backend.py", line 242, in forward_extend
    o = prefill_wrapper_paged.forward(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 1144, in forward
    return self.run(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/prefill.py", line 1211, in run
    _check_cached_qkv_data_type(
  File "/usr/local/lib/python3.10/dist-packages/flashinfer/utils.py", line 199, in _check_cached_qkv_data_type
    raise ValueError(
ValueError: The dtype of q torch.bfloat16 does not match the q_data_type torch.float16 specified in plan function.

This looks like a nightly version issue. cc @yzh119

@zhyncs
Member Author

zhyncs commented Dec 1, 2024

This PR currently provides the option to manually enable nightly flashinfer, with the default still being the release version. There is an issue with dtype inconsistency in nightly flashinfer that needs to be fixed by flashinfer. This PR is safe to merge. cc @merrymercy @yzh119

@zhyncs zhyncs merged commit fc78640 into main Dec 1, 2024
6 of 30 checks passed
@zhyncs zhyncs deleted the zhyncs/nightly branch December 1, 2024 10:55
@yzh119
Collaborator

yzh119 commented Dec 1, 2024

For bf16 models, the data type needs to be specified explicitly in plan functions.
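The check that raised above comes from flashinfer's plan/run split: plan() caches an expected query dtype (float16 unless told otherwise), and run() rejects query tensors that disagree. Below is a minimal stand-in, not the real flashinfer API, illustrating why bf16 models must pass the dtype explicitly in plan():

```python
class AttentionPlan:
    """Toy stand-in for a flashinfer-style wrapper's plan/run contract."""

    def __init__(self):
        self.q_data_type = None

    def plan(self, q_data_type="float16"):
        # Like the real plan(), the query dtype defaults to float16 when
        # not specified explicitly.
        self.q_data_type = q_data_type

    def run(self, q_dtype):
        # Mirrors _check_cached_qkv_data_type from the traceback above.
        if q_dtype != self.q_data_type:
            raise ValueError(
                f"The dtype of q {q_dtype} does not match the q_data_type "
                f"{self.q_data_type} specified in plan function."
            )
        return "ok"

plan = AttentionPlan()
plan.plan()                          # dtype silently defaults to float16
try:
    plan.run("bfloat16")             # bf16 query against a float16 plan
except ValueError:
    print("mismatch")                # -> mismatch (the CI failure above)

plan.plan(q_data_type="bfloat16")    # the fix: state the dtype in plan()
print(plan.run("bfloat16"))          # -> ok
```

In the real backend, the analogous fix is passing the model's dtype through to the wrapper's plan call, which is the direction yzh119's comment points.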
