
hotfix: revert sampler CUDA Graph #1242
Merged 2 commits into main on Aug 28, 2024

Conversation

@zhyncs (Member) commented Aug 28, 2024

Motivation

Modifications

Checklist

  • Format your code according to the Contributor Guide.
  • Add unit tests as outlined in the Contributor Guide.
  • Update documentation as needed, including docstrings or example tutorials.

@zhyncs (Member, Author) commented Aug 28, 2024

  1. revert sampler CUDA Graph (see the sketch below)
  2. bench latency
  3. new release v0.2.14.post1
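
For context on item 1, here is a minimal sketch of what capturing a per-token sampling step under a CUDA Graph looks like in plain PyTorch. This is not sglang's actual implementation; the buffer names and shapes are made up for illustration. Graph replay always reuses the same memory addresses, so inputs have to be staged into static buffers before each replay, which is part of what makes a sampler under CUDA Graph delicate.

```python
import torch

# Hypothetical shapes for illustration only.
batch_size, vocab_size = 8, 128256

# Static buffers: CUDA Graph replay reads/writes fixed addresses.
static_logits = torch.zeros(batch_size, vocab_size, device="cuda")
static_out = torch.zeros(batch_size, dtype=torch.long, device="cuda")

def sample_step(logits: torch.Tensor) -> torch.Tensor:
    # Greedy stand-in; a real sampler also handles temperature/top-p etc.
    return torch.argmax(logits, dim=-1)

# Warm up on a side stream before capture, as the PyTorch docs recommend.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    static_out.copy_(sample_step(static_logits))
torch.cuda.current_stream().wait_stream(s)

# Capture the sampling step once...
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_out.copy_(sample_step(static_logits))

# ...then replay it: stage fresh logits into the static buffer and replay.
static_logits.copy_(torch.randn(batch_size, vocab_size, device="cuda"))
g.replay()
print(static_out)
```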

@zhyncs (Member, Author) commented Aug 28, 2024

python3 -m sglang.bench_latency --model meta-llama/Meta-Llama-3.1-8B-Instruct  --correct --output-len 16 --trust-remote-code
Init nccl begin.
Load weight begin. avail mem=78.73 GB
INFO 08-28 11:11:49 weight_utils.py:236] Using model weights format ['*.safetensors']
Loading safetensors checkpoint shards:   0% Completed | 0/4 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  25% Completed | 1/4 [00:00<00:00,  3.92it/s]
Loading safetensors checkpoint shards:  50% Completed | 2/4 [00:01<00:01,  1.48it/s]
Loading safetensors checkpoint shards:  75% Completed | 3/4 [00:02<00:00,  1.07it/s]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.03s/it]
Loading safetensors checkpoint shards: 100% Completed | 4/4 [00:03<00:00,  1.10it/s]

Load weight end. type=LlamaForCausalLM, dtype=torch.bfloat16, avail mem=63.65 GB
Memory pool end. avail mem=8.39 GB
Capture cuda graph begin. This can take up to several minutes.
max_total_num_tokens=443991
input_ids=[[128000, 791, 6864, 315, 9822, 374], [128000, 791, 6864, 315, 279, 3723, 17262, 316, 374], [128000, 15724, 374, 264, 40798, 1938, 323, 358, 1093]]
prefill logits (first half) tensor([[ 2.4688,  1.4844,  0.5117,  ..., -2.5000, -2.5000, -2.5000],
        [ 2.4688,  1.4844,  0.5117,  ..., -2.5000, -2.5000, -2.5000],
        [ 2.0312,  2.1562,  2.2344,  ..., -6.6875, -6.6875, -6.6875]],
       device='cuda:0')
prefill logits (final) tensor([[ 5.1875,  2.0156,  1.0469,  ..., -5.2188, -5.2188, -5.2188],
        [ 4.9688,  3.5156,  2.2812,  ..., -3.7812, -3.7812, -3.7812],
        [10.2500,  3.1719,  2.1719,  ..., -2.6875, -2.6875, -2.6875]],
       device='cuda:0')
<|begin_of_text|>The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must-
<|begin_of_text|>The capital of the United Kindom is London. London is a global city and a major financial center. It is the most
<|begin_of_text|>Today is a sunny day and I like to take a walk in the park. I put on my sunglasses and my favorite hat
python3 scripts/playground/reference_hf.py --model meta-llama/Meta-Llama-3.1-8B-Instruct
Loading checkpoint shards: 100%|██████████████| 4/4 [00:30<00:00,  7.69s/it]
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:567: UserWarning: `do_sample` is set to `False`. However, `temperature` is set to `0.6` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `temperature`.
  warnings.warn(
/usr/local/lib/python3.10/dist-packages/transformers/generation/configuration_utils.py:572: UserWarning: `do_sample` is set to `False`. However, `top_p` is set to `0.9` -- this flag is only used in sample-based generation modes. You should set `do_sample=True` or unset `top_p`.
  warnings.warn(
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
We detected that you are passing `past_key_values` as a tuple and this is deprecated and will be removed in v4.43. Please use an appropriate `Cache` class (https://huggingface.co/docs/transformers/v4.41.3/en/internal/generation_utils#transformers.Cache)
prefill logits tensor([ 5.2227,  2.0332,  1.0615,  ..., -5.2305, -5.2305, -5.2305],
       device='cuda:0')
<|begin_of_text|>The capital of France is a city of romance, art, fashion, and cuisine. Paris is a must
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
prefill logits tensor([ 4.9688,  3.5391,  2.2910,  ..., -3.7285, -3.7285, -3.7285],
       device='cuda:0')
<|begin_of_text|>The capital of the United Kindom is London. London is a global city and a major financial center. It is the
The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:128001 for open-end generation.
prefill logits tensor([10.2422,  3.1367,  2.1719,  ..., -2.6777, -2.6777, -2.6777],
       device='cuda:0')
<|begin_of_text|>Today is a sunny day and I like to take a walk in the park. I put on my sunglasses and my favorite
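
The two runs above are a correctness check: the prefill logits and greedy decodes from `sglang.bench_latency --correct` should closely match the HuggingFace reference script. As a rough illustration of that comparison (not part of either script; `sglang_logits` and `hf_logits` are stand-ins for the full tensors whose truncated values are printed in the logs), one might check agreement within a bfloat16-level tolerance:

```python
import torch

def logits_close(sglang_logits: torch.Tensor, hf_logits: torch.Tensor) -> bool:
    # bfloat16 activations keep only ~2-3 significant decimal digits,
    # so a loose absolute/relative tolerance is appropriate here.
    return torch.allclose(
        sglang_logits.float(), hf_logits.float(), rtol=5e-2, atol=5e-2
    )

# Leading values visible in the logs for the first prompt: sglang printed
# [5.1875, 2.0156, 1.0469, ...], the HF reference printed
# [5.2227, 2.0332, 1.0615, ...] -- close, but not bit-identical.
print(logits_close(
    torch.tensor([5.1875, 2.0156, 1.0469]),
    torch.tensor([5.2227, 2.0332, 1.0615]),
))  # True under the loose tolerance above
```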

@zhyncs merged commit f25f4df into main on Aug 28, 2024
9 checks passed
@zhyncs deleted the hotfix branch on August 28, 2024 at 11:16