
hotfix: revert sampler CUDA Graph #1242
Merged 2 commits into main on Aug 28, 2024

Conversation

@zhyncs (Member) commented Aug 28, 2024

Motivation

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member, Author) commented Aug 28, 2024

  1. revert sampler CUDA Graph (see the sketch below)
  2. bench latency
  3. new release v0.2.14.post1
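
For context on item 1, here is a minimal sketch of what capturing a per-token sampling step under a CUDA Graph looks like in plain PyTorch. This is not sglang's actual implementation; the buffer names and shapes are made up for illustration. Graph replay always reuses the same memory addresses, so inputs have to be staged into static buffers before each replay, which is part of what makes a sampler under CUDA Graph delicate.

```python
import torch

# Hypothetical shapes for illustration only.
batch_size, vocab_size = 8, 128256

# Static buffers: CUDA Graph replay reads/writes fixed addresses.
static_logits = torch.zeros(batch_size, vocab_size, device="cuda")
static_out = torch.zeros(batch_size, dtype=torch.long, device="cuda")

def sample_step(logits: torch.Tensor) -> torch.Tensor:
    # Greedy stand-in; a real sampler also handles temperature/top-p etc.
    return torch.argmax(logits, dim=-1)

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out.copy_(sample_step(static_logits))
torch.cuda.current_stream().wait_stream(s)

# Capture the sampling step once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out.copy_(sample_step(static_logits))

# ...then replay it: stage fresh logits into the static buffer and replay.
static_logits.copy_(torch.randn(batch_size, vocab_size, device="cuda"))
g.replay()
print(static_out)
```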

@zhyncs (Member, Author) commented Aug 28, 2024

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3.1-8B-Instruct  --correct --output-len 16 --trust-remote-code
Init nccl begin.
Load weight begin. avail mem=78.73 GB
INFO 08-28 11:11:49 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  3.92it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.48it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.10it/s]

Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.65 GB
Memory pool end. avail mem=8.39 GB
Capture cuda graph begin. This can take up to several minutes.
max_total_num_tokens=443991
input_ids=[[128000, 791, 6864, 315, 9822, 374], [128000, 791, 6864, 315, 279, 3723, 17262, 316, 374], [128000, 15724, 374, 264, 40798, 1938, 323, 358, 1093]]
prefill logits (first half) tensor([[ 2.4688,  1.4844,  0.5117,  ..., -2.5000, -2.5000, -2.5000],
        [ 2.4688,  1.4844,  0.5117,  ..., -2.5000, -2.5000, -2.5000],
        [ 2.0312,  2.1562,  2.2344,  ..., -6.6875, -6.6875, -6.6875]],
       device='cuda:0')
prefill logits (final) tensor([[ 5.1875,  2.0156,  1.0469,  ..., -5.2188, -5.2188, -5.2188],
        [ 4.9688,  3.5156,  2.2812,  ..., -3.7812, -3.7812, -3.7812],
        [10.2500,  3.1719,  2.1719,  ..., -2.6875, -2.6875, -2.6875]],
       device='cuda:0')
<|begin_of_text|>The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must-
<|begin_of_text|>The capital of the United Kindom is London. London is a global city and a major financial center. It is the most
<|begin_of_text|>Today is a sunny day and I like to take a walk in the park. I put on my sunglasses and my favorite hat
python3 scripts/playground/reference_hf.py --model meta-llama/Meta-Llama-3.1-8B-Instruct
Loading checkpoint shards: 100%|██████████████| 4/4 [00:30<00:00,  7.69s/it]
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
prefill logits tensor([ 5.2227,  2.0332,  1.0615,  ..., -5.2305, -5.2305, -5.2305],
       device='cuda:0')
<|begin_of_text|>The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
prefill logits tensor([ 4.9688,  3.5391,  2.2910,  ..., -3.7285, -3.7285, -3.7285],
       device='cuda:0')
<|begin_of_text|>The capital of the United Kindom is London. London is a global city and a major financial center. It is the
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
prefill logits tensor([10.2422,  3.1367,  2.1719,  ..., -2.6777, -2.6777, -2.6777],
       device='cuda:0')
<|begin_of_text|>Today is a sunny day and I like to take a walk in the park. I put on my sunglasses and my favorite
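
The two runs above are a correctness check: the prefill logits and greedy decodes from `sglang.bench_latency --correct` should closely match the HuggingFace reference script. As a rough illustration of that comparison (not part of either script; `sglang_logits` and `hf_logits` are stand-ins for the full tensors whose truncated values are printed in the logs), one might check agreement within a bfloat16-level tolerance:

```python
import torch

def logits_close(sglang_logits: torch.Tensor, hf_logits: torch.Tensor) -> bool:
    # bfloat16 activations keep only ~2-3 significant decimal digits,
    # so a loose absolute/relative tolerance is appropriate here.
    return torch.allclose(
        sglang_logits.float(), hf_logits.float(), rtol=5e-2, atol=5e-2
    )

# Leading values visible in the logs for the first prompt: sglang printed
# [5.1875, 2.0156, 1.0469, ...], the HF reference printed
# [5.2227, 2.0332, 1.0615, ...] -- close, but not bit-identical.
print(logits_close(
    torch.tensor([5.1875, 2.0156, 1.0469]),
    torch.tensor([5.2227, 2.0332, 1.0615]),
))  # True under the loose tolerance above
```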

@zhyncs merged commit f25f4df into main on Aug 28, 2024
9 checks passed
@zhyncs deleted the hotfix branch on August 28, 2024 at 11:16