[Speculative Decoding] Support draft model on different tensor-parallel size than target model #5414

Merged
merged 131 commits into from
Jun 25, 2024
131 commits
f5b5f94
tp1 draft worker
wooyeonlee0 Jun 10, 2024
709de21
refactor singlt_tp_worker
wooyeonlee0 Jun 10, 2024
0eacc96
update execute_model logic
wooyeonlee0 Jun 10, 2024
2011ed0
fix
wooyeonlee0 Jun 11, 2024
2e16c4e
DummyProposerWorker
wooyeonlee0 Jun 11, 2024
b412a51
fix
wooyeonlee0 Jun 11, 2024
593ccfa
init only partial workers
wooyeonlee0 Jun 11, 2024
c5d3476
Use multi_step_worker logic
wooyeonlee0 Jun 12, 2024
44e623b
self._patch_tp_group
wooyeonlee0 Jun 12, 2024
98caf17
refactor it to support other draft-tp than 1
wooyeonlee0 Jun 12, 2024
7fc4ff5
spec-tp configuarable
wooyeonlee0 Jun 12, 2024
a96e720
ngram worker support test
wooyeonlee0 Jun 12, 2024
db39576
minor refine
wooyeonlee0 Jun 12, 2024
b2e8595
cleanup
wooyeonlee0 Jun 12, 2024
756442a
return type fix
wooyeonlee0 Jun 12, 2024
32094f1
cleanup
wooyeonlee0 Jun 12, 2024
7890191
cleanup
wooyeonlee0 Jun 12, 2024
53b2ea9
typo
wooyeonlee0 Jun 12, 2024
a29c9c5
verify arg
wooyeonlee0 Jun 12, 2024
52ba09d
remove testing code
wooyeonlee0 Jun 12, 2024
d26ef08
cleanup
wooyeonlee0 Jun 12, 2024
80c4994
rename module
wooyeonlee0 Jun 12, 2024
0f16f3f
cleanup
wooyeonlee0 Jun 12, 2024
140f478
cleanup
wooyeonlee0 Jun 12, 2024
3fd7e91
remove unnecessary methods
wooyeonlee0 Jun 12, 2024
495aa30
fix
wooyeonlee0 Jun 12, 2024
3a5a47f
undo unrelated changes
wooyeonlee0 Jun 12, 2024
07ddbb8
minor fix
wooyeonlee0 Jun 12, 2024
b0a677d
fix ruff errors
wooyeonlee0 Jun 12, 2024
96782a2
Merge branch 'main' into spec-tp1-draft
wooyeonlee0 Jun 12, 2024
9998b9c
typo
wooyeonlee0 Jun 12, 2024
e92ecdc
temporal fix
wooyeonlee0 Jun 12, 2024
b421607
formatting
wooyeonlee0 Jun 12, 2024
386ab9b
isort
wooyeonlee0 Jun 12, 2024
b25f74e
line length
wooyeonlee0 Jun 12, 2024
8b51f08
fix
wooyeonlee0 Jun 13, 2024
d4b283c
Merge remote-tracking branch 'origin' into spec-tp1-draft
wooyeonlee0 Jun 13, 2024
dfc90cb
line length
wooyeonlee0 Jun 13, 2024
9bef5e4
comment
wooyeonlee0 Jun 13, 2024
85d087d
add type hint
wooyeonlee0 Jun 13, 2024
9af36b7
isort
wooyeonlee0 Jun 13, 2024
5a0bf45
add more type hints
wooyeonlee0 Jun 13, 2024
531c9f0
fix
wooyeonlee0 Jun 13, 2024
287da20
test
wooyeonlee0 Jun 13, 2024
08d1b2a
nit
wooyeonlee0 Jun 13, 2024
237c966
fix yapf
wooyeonlee0 Jun 13, 2024
0bb38c2
fix
wooyeonlee0 Jun 13, 2024
c097d6c
fix
wooyeonlee0 Jun 13, 2024
957a325
fix
wooyeonlee0 Jun 13, 2024
3ec8cb5
Merge remote-tracking branch 'origin' into spec-tp1-draft
wooyeonlee0 Jun 14, 2024
8a8a1e4
add comments
wooyeonlee0 Jun 14, 2024
7f06f64
combine smaller_tp_worker logic into multi_step_worker
wooyeonlee0 Jun 14, 2024
1e87579
fix
wooyeonlee0 Jun 14, 2024
abc546c
fix
wooyeonlee0 Jun 14, 2024
7880cb0
add small_tp correctness test
wooyeonlee0 Jun 14, 2024
2ebe6f3
nit
wooyeonlee0 Jun 14, 2024
90d46ee
fix
wooyeonlee0 Jun 14, 2024
7e1426c
refactor. remove log
wooyeonlee0 Jun 14, 2024
ad52d93
remove return
wooyeonlee0 Jun 14, 2024
355475b
fix
wooyeonlee0 Jun 14, 2024
9cfdb5b
fix about context managing
wooyeonlee0 Jun 14, 2024
6a6c5ff
nit
wooyeonlee0 Jun 14, 2024
ddef229
consistent condition. if self._is_dummy:
wooyeonlee0 Jun 14, 2024
965f648
fix ruff errors
wooyeonlee0 Jun 14, 2024
1bb5534
isort
wooyeonlee0 Jun 14, 2024
ea6b8f5
fix yapf
wooyeonlee0 Jun 14, 2024
71977d2
undo ngramworker support
wooyeonlee0 Jun 14, 2024
bc5f77a
add comment
wooyeonlee0 Jun 14, 2024
5655a49
remove smaller_tp_proposer_worker
wooyeonlee0 Jun 14, 2024
eabc16a
ruff
wooyeonlee0 Jun 14, 2024
f748edf
remove ranks arg
wooyeonlee0 Jun 17, 2024
c099c94
Merge remote-tracking branch 'origin' into spec-tp1-draft
wooyeonlee0 Jun 17, 2024
4b74a45
undo
wooyeonlee0 Jun 17, 2024
c9786ad
add dist test
wooyeonlee0 Jun 17, 2024
a42664a
nit
wooyeonlee0 Jun 17, 2024
ac7701a
fix
wooyeonlee0 Jun 17, 2024
eea6a7e
test fix
wooyeonlee0 Jun 17, 2024
a648f5d
yapf fix
wooyeonlee0 Jun 17, 2024
f23ba8c
update comment
wooyeonlee0 Jun 17, 2024
aa9af93
require 2 gpus
wooyeonlee0 Jun 17, 2024
56c8927
restore draft_ranks arg in MultiStepWorker.__init__
wooyeonlee0 Jun 18, 2024
385b4f8
comment
wooyeonlee0 Jun 18, 2024
43f37eb
ruff mypy
wooyeonlee0 Jun 18, 2024
99350e2
isort
wooyeonlee0 Jun 18, 2024
a9f3e23
yapf
wooyeonlee0 Jun 18, 2024
6ba250d
allow None for draft_ranks
wooyeonlee0 Jun 18, 2024
3e78613
spec-tp arg in benchmark_latency
wooyeonlee0 Jun 18, 2024
6532af7
yapf
wooyeonlee0 Jun 18, 2024
6839797
yapf
wooyeonlee0 Jun 18, 2024
aac586b
Merge remote-tracking branch 'origin' into spec-tp1-draft
wooyeonlee0 Jun 19, 2024
98e584d
remove is_dummy check from sampler_output
wooyeonlee0 Jun 19, 2024
2d5e64d
add comment
wooyeonlee0 Jun 20, 2024
ba88bd4
yapf
wooyeonlee0 Jun 20, 2024
46e5274
resolve cade comments
wooyeonlee0 Jun 21, 2024
85f4f25
refactoring patch_tp_group
wooyeonlee0 Jun 21, 2024
c1b5373
cleanup patch_tp_group logic
wooyeonlee0 Jun 21, 2024
4a58617
speculative_draft_tensor_parallel_size
wooyeonlee0 Jun 21, 2024
b09e7be
ruff, yapf
wooyeonlee0 Jun 21, 2024
7168d78
remove world group patch
wooyeonlee0 Jun 21, 2024
fe0bd5b
isort, yapf
wooyeonlee0 Jun 21, 2024
2e0d170
yield fix
wooyeonlee0 Jun 21, 2024
36f8aa5
debugging
wooyeonlee0 Jun 21, 2024
54bf514
log
wooyeonlee0 Jun 21, 2024
bfd7d2f
reintroduce smaller_tp_proposer_worker
wooyeonlee0 Jun 21, 2024
f337428
add lora methods
wooyeonlee0 Jun 21, 2024
4654b9f
missing method
wooyeonlee0 Jun 21, 2024
e39926e
remove world group related logics
wooyeonlee0 Jun 21, 2024
1c6eefd
Always wrapping MultiStepWorker
wooyeonlee0 Jun 21, 2024
f2d2ee5
remove unused logger
wooyeonlee0 Jun 21, 2024
302955c
isort. minor rename
wooyeonlee0 Jun 21, 2024
3d4754e
LoraNotSupported. return type
wooyeonlee0 Jun 21, 2024
620b224
yapf, ruff
wooyeonlee0 Jun 21, 2024
b245d3c
add skip_spec_test
wooyeonlee0 Jun 21, 2024
1e71e98
remove spec-tp 3 case
wooyeonlee0 Jun 21, 2024
a01c00d
spec-draft-tp
wooyeonlee0 Jun 21, 2024
debffc2
_TP_STATE_PATCHED
wooyeonlee0 Jun 24, 2024
39fe67f
remove stale comment
wooyeonlee0 Jun 24, 2024
af1b0be
dist_tp2, dist_tp4 tests
wooyeonlee0 Jun 24, 2024
834c6e0
remove unnecessary overriding methods
wooyeonlee0 Jun 24, 2024
5bc2bc3
comment
wooyeonlee0 Jun 24, 2024
8740369
yapf
wooyeonlee0 Jun 24, 2024
4d82ca1
comment
wooyeonlee0 Jun 24, 2024
7bf831c
undo change in test utils
wooyeonlee0 Jun 24, 2024
3fccc76
remove test_skip_speculation
wooyeonlee0 Jun 24, 2024
e8d0e93
tp4 test only for spec_tp1
wooyeonlee0 Jun 25, 2024
91c2e43
allow only value 1 for spec_tp
wooyeonlee0 Jun 25, 2024
fac7e68
yapf
wooyeonlee0 Jun 25, 2024
271822e
add todo comment
wooyeonlee0 Jun 25, 2024
ae0d7f1
add tests for check that test_skip fails even there's no spec_draft_t…
wooyeonlee0 Jun 25, 2024
b84a070
remove test_skip_speculation from dist tests
wooyeonlee0 Jun 25, 2024
86fda24
yapf
wooyeonlee0 Jun 25, 2024
3 changes: 2 additions & 1 deletion .buildkite/test-pipeline.yaml
@@ -46,7 +46,7 @@ steps:
   - TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
   - TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
   - TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
-  - pytest -v -s spec_decode/e2e/test_integration_dist.py
+  - pytest -v -s spec_decode/e2e/test_integration_dist_tp2.py
   - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py
   - CUDA_VISIBLE_DEVICES=0,1 pytest -v -s distributed/test_utils.py

@@ -60,6 +60,7 @@ steps:
   # See https://github.com/vllm-project/vllm/pull/5473#issuecomment-2166601837 for context.
   - TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
   - TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
+  - pytest -v -s spec_decode/e2e/test_integration_dist_tp4.py

 - label: Engine Test
   mirror_hardwares: [amd]
6 changes: 6 additions & 0 deletions benchmarks/benchmark_latency.py
@@ -24,6 +24,8 @@ def main(args: argparse.Namespace):
         model=args.model,
         speculative_model=args.speculative_model,
         num_speculative_tokens=args.num_speculative_tokens,
+        speculative_draft_tensor_parallel_size=\
+            args.speculative_draft_tensor_parallel_size,
         tokenizer=args.tokenizer,
         quantization=args.quantization,
         tensor_parallel_size=args.tensor_parallel_size,
@@ -125,6 +127,10 @@ def run_to_completion(profile_dir: Optional[str] = None):
     parser.add_argument('--model', type=str, default='facebook/opt-125m')
     parser.add_argument('--speculative-model', type=str, default=None)
     parser.add_argument('--num-speculative-tokens', type=int, default=None)
+    parser.add_argument('--speculative-draft-tensor-parallel-size',
+                        '-spec-draft-tp',
+                        type=int,
+                        default=None)
     parser.add_argument('--tokenizer', type=str, default=None)
     parser.add_argument('--quantization',
                         '-q',
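The new flag can be tried in isolation. The following is a minimal sketch, not the benchmark script itself: it rebuilds only a subset of the parser from the hunk above, and `build_parser` and `engine_kwargs` are hypothetical helper names introduced for illustration. The `--tensor-parallel-size` option is assumed from the `args.tensor_parallel_size` reference in the diff; its default here is a guess.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Illustrative subset of the benchmark's flags (hypothetical helper)."""
    parser = argparse.ArgumentParser(description="spec-decode flag sketch")
    parser.add_argument('--model', type=str, default='facebook/opt-125m')
    parser.add_argument('--speculative-model', type=str, default=None)
    parser.add_argument('--num-speculative-tokens', type=int, default=None)
    # Same option strings as in the diff. argparse derives the attribute
    # name from the first long option, replacing dashes with underscores.
    parser.add_argument('--speculative-draft-tensor-parallel-size',
                        '-spec-draft-tp',
                        type=int,
                        default=None)
    # Assumed flag; the default of 1 is an illustration, not taken from the PR.
    parser.add_argument('--tensor-parallel-size', type=int, default=1)
    return parser


def engine_kwargs(args: argparse.Namespace) -> dict:
    """Collect the keyword arguments that the diff forwards to the engine."""
    return {
        "model": args.model,
        "speculative_model": args.speculative_model,
        "num_speculative_tokens": args.num_speculative_tokens,
        "speculative_draft_tensor_parallel_size":
        args.speculative_draft_tensor_parallel_size,
        "tensor_parallel_size": args.tensor_parallel_size,
    }


# Parse an explicit argument list instead of sys.argv so the sketch is
# reproducible when pasted into a REPL.
example_args = build_parser().parse_args([
    '--speculative-model', 'JackFram/llama-68m',
    '--num-speculative-tokens', '5',
    '-spec-draft-tp', '1',
    '--tensor-parallel-size', '2',
])
example_kwargs = engine_kwargs(example_args)
```

Leaving the flag unset keeps the value `None`, which the `vllm/config.py` change in this PR resolves to the target model's tensor-parallel size.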
111 changes: 111 additions & 0 deletions tests/spec_decode/e2e/test_integration_dist_tp2.py
@@ -0,0 +1,111 @@
+"""Tests which cover integration of the speculative decoding framework with
+tensor parallelism.
+"""
+
+import pytest
+import torch
+
+from vllm.utils import is_hip
+
+from .conftest import run_greedy_equality_correctness_test
+
+
+@pytest.mark.skipif(torch.cuda.device_count() < 2,
+                    reason="Need at least 2 GPUs to run the test.")
+@pytest.mark.parametrize(
+    "common_llm_kwargs",
+    [{
+        "model": "JackFram/llama-68m",
+
+        # Skip cuda graph recording for fast test.
+        "enforce_eager": True,
+
+        # Required for spec decode.
+        "use_v2_block_manager": True,
+        "tensor_parallel_size": 2,
+
+        # Use AsyncLLM engine, so that the engine runs in its own process.
+        # Otherwise, since vLLM does not follow true SPMD, the test runner
+        # process will have both the engine and the rank0 worker. NCCL is not
+        # cleaned up properly, and its server host thread leaks, causing the
+        # second run of the test to fail with internal NCCL error.
+        "use_async": True,
+    }])
+@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
+@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
+@pytest.mark.parametrize("test_llm_kwargs", [
+    {
+        "speculative_model": "JackFram/llama-68m",
+        "num_speculative_tokens": 3,
+    },
+    {
+        "speculative_model": "[ngram]",
+        "num_speculative_tokens": 5,
+        "ngram_prompt_lookup_max": 3,
+    },
+])
+@pytest.mark.parametrize("batch_size", [2])
+@pytest.mark.parametrize(
+    "output_len",
+    [
+        # Use smaller output len for fast test.
+        32,
+    ])
+@pytest.mark.parametrize("seed", [1])
+def test_target_model_tp_gt_1(baseline_llm_generator, test_llm_generator,
+                              batch_size: int, output_len: int):
+    """Verify greedy equality when tensor parallelism is used.
+    """
+    if is_hip():
+        pytest.skip("hip is not well-supported yet")
+    run_greedy_equality_correctness_test(baseline_llm_generator,
+                                         test_llm_generator,
+                                         batch_size,
+                                         max_output_len=output_len,
+                                         force_output_len=True)
+
+
+@pytest.mark.skipif(torch.cuda.device_count() < 2,
+                    reason="Need at least 2 GPUs to run the test.")
+@pytest.mark.parametrize(
+    "common_llm_kwargs",
+    [{
+        # Use a small model for a fast test.
+        # Note this is repeated in the test body; to initialize a tokenizer.
+        "model": "JackFram/llama-68m",
+
+        # Skip cuda graph recording for fast test.
+        "enforce_eager": True,
+
+        # Required for spec decode.
+        "use_v2_block_manager": True,
+        "tensor_parallel_size": 2,
+
+        # Use AsyncLLM engine, so that the engine runs in its own process.
+        # Otherwise, since vLLM does not follow true SPMD, the test runner
+        # process will have both the engine and the rank0 worker. NCCL is not
+        # cleaned up properly, and its server host thread leaks, causing the
+        # second run of the test to fail with internal NCCL error.
+        "use_async": True,
+    }])
+@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
+@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
+@pytest.mark.parametrize("test_llm_kwargs", [
+    {
+        "speculative_model": "JackFram/llama-68m",
+        "num_speculative_tokens": 5,
+        "speculative_draft_tensor_parallel_size": 1,
+    },
+])
+@pytest.mark.parametrize("batch_size", [2])
+@pytest.mark.parametrize("seed", [1])
+def test_draft_model_tp_lt_target_model_tp2(test_llm_generator,
+                                            baseline_llm_generator,
+                                            batch_size: int):
+    """Verify spec decode works well with smaller tp for draft models.
+    """
+    run_greedy_equality_correctness_test(baseline_llm_generator,
+                                         test_llm_generator,
+                                         batch_size,
+                                         max_output_len=32,
+                                         force_output_len=True)
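Both tests above rely on the property that greedy speculative decoding must emit exactly the tokens that greedy decoding with the target model alone would emit, regardless of how the draft model is sharded. The toy sketch below illustrates that property with plain Python functions standing in for models. It is not vLLM's implementation and not the body of `run_greedy_equality_correctness_test`: all names (`greedy_decode`, `speculative_greedy_decode`, `toy_target`, `toy_draft`) are hypothetical, and verification is done one token at a time here, whereas a real engine would typically score all proposals in a single target forward pass.

```python
from typing import Callable, List

# A "model" here is just a function from a token context to the next token.
NextToken = Callable[[List[int]], int]


def greedy_decode(target: NextToken, prompt: List[int], n: int) -> List[int]:
    """Plain greedy decoding with the target model only."""
    tokens = list(prompt)
    for _ in range(n):
        tokens.append(target(tokens))
    return tokens[len(prompt):]


def speculative_greedy_decode(target: NextToken, draft: NextToken,
                              prompt: List[int], n: int, k: int) -> List[int]:
    """Greedy speculative decoding with k draft tokens per step."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < n:
        # 1) The draft model proposes k tokens autoregressively.
        draft_ctx = list(tokens)
        proposal = []
        for _ in range(k):
            proposal.append(draft(draft_ctx))
            draft_ctx.append(proposal[-1])
        # 2) The target checks the proposals in order. Its own token is
        #    always the one that is kept, so a mismatch only ends the step.
        for proposed in proposal:
            expected = target(tokens)
            tokens.append(expected)
            if expected != proposed:
                break
        else:
            # Every proposal was accepted: the target adds one bonus token.
            tokens.append(target(tokens))
    return tokens[len(prompt):len(prompt) + n]


def toy_target(ctx: List[int]) -> int:
    """Deterministic stand-in for the target model."""
    return (7 * ctx[-1] + len(ctx)) % 50


def toy_draft(ctx: List[int]) -> int:
    """Stand-in draft model that disagrees with the target now and then."""
    guess = toy_target(ctx)
    return guess if len(ctx) % 3 else (guess + 1) % 50


baseline_tokens = greedy_decode(toy_target, [1, 2, 3], 32)
spec_tokens = speculative_greedy_decode(toy_target, toy_draft, [1, 2, 3], 32, 5)
assert baseline_tokens == spec_tokens
```

Because only target-chosen tokens are ever appended, the draft model affects how many tokens are accepted per step but not which tokens are produced. That is why a greedy equality check is a reasonable correctness test for a change that only alters the draft model's parallelism.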
Original file line number Diff line number Diff line change
@@ -5,24 +5,24 @@
 import pytest
 import torch

-from vllm.utils import is_hip
-
 from .conftest import run_greedy_equality_correctness_test


-@pytest.mark.skipif(torch.cuda.device_count() < 2,
-                    reason="Need at least 2 GPUs to run the test.")
+@pytest.mark.skipif(torch.cuda.device_count() < 4,
+                    reason="Need at least 4 GPUs to run the test.")
 @pytest.mark.parametrize(
     "common_llm_kwargs",
     [{
+        # Use a small model for a fast test.
+        # Note this is repeated in the test body; to initialize a tokenizer.
         "model": "JackFram/llama-68m",

         # Skip cuda graph recording for fast test.
         "enforce_eager": True,

         # Required for spec decode.
         "use_v2_block_manager": True,
-        "tensor_parallel_size": 2,
+        "tensor_parallel_size": 4,

         # Use AsyncLLM engine, so that the engine runs in its own process.
         # Otherwise, since vLLM does not follow true SPMD, the test runner
@@ -31,35 +31,30 @@
         # second run of the test to fail with internal NCCL error.
         "use_async": True,
     }])
-@pytest.mark.parametrize("per_test_common_llm_kwargs", [{}])
-@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
-@pytest.mark.parametrize("test_llm_kwargs", [
+@pytest.mark.parametrize("per_test_common_llm_kwargs", [
     {
         "speculative_model": "JackFram/llama-68m",
-        "num_speculative_tokens": 3,
-    },
-    {
-        "speculative_model": "[ngram]",
         "num_speculative_tokens": 5,
-        "ngram_prompt_lookup_max": 3,
     },
 ])
-@pytest.mark.parametrize("batch_size", [2])
+@pytest.mark.parametrize("baseline_llm_kwargs", [{}])
 @pytest.mark.parametrize(
-    "output_len",
+    "test_llm_kwargs",
     [
-        # Use smaller output len for fast test.
-        32,
+        #TODO(wooyeon): add spec_draft_dp=2 case
+        {
+            "speculative_draft_tensor_parallel_size": 1,
+        },
     ])
+@pytest.mark.parametrize("batch_size", [2])
 @pytest.mark.parametrize("seed", [1])
-def test_target_model_tp_gt_1(baseline_llm_generator, test_llm_generator,
-                              batch_size: int, output_len: int):
-    """Verify greedy equality when tensor parallelism is used.
+def test_draft_model_tp_lt_target_model_tp4(test_llm_generator,
+                                            baseline_llm_generator,
+                                            batch_size: int):
+    """Verify spec decode works well with smaller tp for draft models.
     """
-    if is_hip():
-        pytest.skip("hip is not well-supported yet")
     run_greedy_equality_correctness_test(baseline_llm_generator,
                                          test_llm_generator,
                                          batch_size,
-                                         max_output_len=output_len,
+                                         max_output_len=32,
                                          force_output_len=True)
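In the 4-GPU test the speculative model moves into `per_test_common_llm_kwargs`, so the baseline and the test engine both run speculative decoding and differ only in the draft model's tensor-parallel size. The sketch below shows how such layered parameter sets could be combined. It assumes that the fixtures merge the layers with later layers overriding earlier ones; the `conftest.py` fixtures are not part of this diff, so `merge_layers` and the merge order are hypothetical.

```python
from itertools import product
from typing import Dict, List, Tuple

# Parameter layers copied from the decorators in the diff (trimmed).
common_llm_kwargs = [{
    "model": "JackFram/llama-68m",
    "enforce_eager": True,
    "use_v2_block_manager": True,
    "tensor_parallel_size": 4,
    "use_async": True,
}]
per_test_common_llm_kwargs = [{
    "speculative_model": "JackFram/llama-68m",
    "num_speculative_tokens": 5,
}]
baseline_llm_kwargs = [{}]
test_llm_kwargs = [{"speculative_draft_tensor_parallel_size": 1}]


def merge_layers(*layers: Dict) -> Dict:
    """Merge kwargs layers; later layers win (assumed fixture behaviour)."""
    merged: Dict = {}
    for layer in layers:
        merged.update(layer)
    return merged


# Stacked pytest.mark.parametrize decorators yield the cartesian product of
# their value lists; with one entry per list that is a single test case.
cases: List[Tuple[Dict, Dict]] = [
    (merge_layers(common, shared, baseline),
     merge_layers(common, shared, test))
    for common, shared, baseline, test in product(
        common_llm_kwargs, per_test_common_llm_kwargs, baseline_llm_kwargs,
        test_llm_kwargs)
]

baseline_config, test_config = cases[0]
```

Under this assumption the baseline engine leaves `speculative_draft_tensor_parallel_size` unset, which the config change in this PR resolves to the target's tensor-parallel size of 4, while the test engine pins the draft model to a single rank.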
24 changes: 19 additions & 5 deletions vllm/config.py
@@ -795,6 +795,7 @@ def maybe_create_spec_config(
         target_parallel_config: ParallelConfig,
         target_dtype: str,
         speculative_model: Optional[str],
+        speculative_draft_tensor_parallel_size: Optional[int],
         num_speculative_tokens: Optional[int],
         speculative_max_model_len: Optional[int],
         enable_chunked_prefill: bool,
@@ -817,6 +818,8 @@ def maybe_create_spec_config(
             target_dtype (str): The data type used for the target model.
             speculative_model (Optional[str]): The name of the speculative
                 model, if provided.
+            speculative_draft_tensor_parallel_size (Optional[int]): The degree
+                of the tensor parallelism for the draft model.
             num_speculative_tokens (Optional[int]): The number of speculative
                 tokens, if provided.
             speculative_max_model_len (Optional[int]): The maximum model len of
@@ -921,7 +924,8 @@ def maybe_create_spec_config(

         draft_parallel_config = (
             SpeculativeConfig.create_draft_parallel_config(
-                target_parallel_config))
+                target_parallel_config,
+                speculative_draft_tensor_parallel_size))

         return SpeculativeConfig(
             draft_model_config,
@@ -969,16 +973,26 @@ def _maybe_override_draft_max_model_len(

     @staticmethod
     def create_draft_parallel_config(
-            target_parallel_config: ParallelConfig) -> ParallelConfig:
+        target_parallel_config: ParallelConfig,
+        speculative_draft_tensor_parallel_size: Optional[int]
+    ) -> ParallelConfig:
         """Create a parallel config for use by the draft worker.

-        This is mostly a copy of the target parallel config. In the future the
-        draft worker can have a different parallel strategy, e.g. TP=1.
+        This is mostly a copy of the target parallel config, except the tp_size.
         """
+        if speculative_draft_tensor_parallel_size is None:
+            speculative_draft_tensor_parallel_size = \
+                target_parallel_config.tensor_parallel_size
+        elif speculative_draft_tensor_parallel_size != 1:
+            # TODO(wooyeon): allow tp values larger than 1
+            raise ValueError(
+                f"{speculative_draft_tensor_parallel_size=} cannot be"
+                f"other value than 1")
+
         draft_parallel_config = ParallelConfig(
             pipeline_parallel_size=target_parallel_config.
             pipeline_parallel_size,
-            tensor_parallel_size=target_parallel_config.tensor_parallel_size,
+            tensor_parallel_size=speculative_draft_tensor_parallel_size,
             distributed_executor_backend=target_parallel_config.
             distributed_executor_backend,
             max_parallel_loading_workers=target_parallel_config.
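The resolution rule added to `create_draft_parallel_config` can be exercised without vLLM. The following is a standalone sketch of that rule only; `resolve_draft_tp` is a hypothetical name, and the function takes plain integers instead of a `ParallelConfig`. One detail worth noting: Python concatenates adjacent string literals with no separator, so the two f-strings in the diff would render as "cannot beother value than 1". The sketch uses a single literal with the space included.

```python
from typing import Optional


def resolve_draft_tp(target_tp: int, draft_tp: Optional[int]) -> int:
    """Pick the draft model's tensor-parallel size (hypothetical helper).

    Mirrors the rule in the diff: None falls back to the target's size,
    1 is accepted, and every other explicit value is rejected for now.
    """
    if draft_tp is None:
        return target_tp
    if draft_tp != 1:
        # The "=" specifier (Python 3.8+) prints both the expression and
        # its value, e.g. "draft_tp=2".
        raise ValueError(f"{draft_tp=} cannot be other value than 1")
    return draft_tp


# Adjacent literals are joined as-is, which is how the missing space arises.
joined_without_space = "cannot be" "other value than 1"

default_tp = resolve_draft_tp(target_tp=4, draft_tp=None)
single_rank_tp = resolve_draft_tp(target_tp=4, draft_tp=1)

try:
    resolve_draft_tp(target_tp=4, draft_tp=2)
except ValueError as exc:
    rejected_message = str(exc)
```

Note that an explicit value equal to the target's size (for example 2 with a TP=2 target) is also rejected by this rule; only `None` selects the target's size.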