
[Model] Pipeline parallel support for Mixtral #6516

Merged

4 commits merged into vllm-project:main from mixtral-pp on Jul 18, 2024

Conversation

comaniac
Collaborator

@comaniac commented Jul 17, 2024

Taken from #6403. Co-authored by @binxuan.

cc @youkaichao


👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, they only trigger the fastcheck CI, which consists of a small and essential subset of tests to quickly catch errors, with the flexibility to run extra individual tests on top (you can do this by unblocking test steps in the Buildkite run).

A full CI run is still required to merge this PR, so once the PR is ready to go, please make sure to run it. If you need all test signals in between PR commits, you can trigger a full CI run as well.

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add the ready label to the PR
  • Enable auto-merge

🚀

@comaniac added the ready label (ONLY add when PR is ready to merge/full CI is needed) on Jul 17, 2024
@comaniac force-pushed the mixtral-pp branch 2 times, most recently from f83603e to d74f2e6, on July 17, 2024 at 21:50
@comaniac
Collaborator Author

Tested locally with PP=8 and it worked.
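
For context, a minimal sketch of the engine arguments such a run uses (illustrative only, not part of this PR; pipeline parallelism is typically exercised through vLLM's async engine / OpenAI-compatible server):

    # Illustrative sketch: engine arguments for serving Mixtral across 8 GPUs
    # with pipeline parallelism. tensor_parallel_size and pipeline_parallel_size
    # are standard vLLM engine arguments; the PP=8 value mirrors the run above.
    from vllm.engine.arg_utils import AsyncEngineArgs

    engine_args = AsyncEngineArgs(
        model="mistralai/Mixtral-8x7B-Instruct-v0.1",
        tensor_parallel_size=1,    # no tensor parallelism within a pipeline stage
        pipeline_parallel_size=8,  # one pipeline stage per GPU
    )

On the command line this corresponds to passing --tensor-parallel-size 1 --pipeline-parallel-size 8 to the server entrypoint.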

@youkaichao
Member

Can you test the correctness locally, using https://github.com/vllm-project/vllm/blob/main/tests/distributed/test_pipeline_parallel.py?

@comaniac
Collaborator Author

Passed with the following configurations. Note that I tested it on 8xL4, so I had to use 8 GPUs to host the model.

    "TP_SIZE, PP_SIZE, EAGER_MODE, CHUNKED_PREFILL, MODEL_NAME",
    [
        (2, 4, 0, 1, "mistralai/Mixtral-8x7B-Instruct-v0.1"),
        (2, 4, 1, 0, "mistralai/Mixtral-8x7B-Instruct-v0.1"),
        (1, 8, 0, 1, "mistralai/Mixtral-8x7B-Instruct-v0.1"),
        (1, 8, 1, 0, "mistralai/Mixtral-8x7B-Instruct-v0.1"),
    ])

Also fixed some issues in the test file (a rough sketch follows the list):

  • Use TP_SIZE x PP_SIZE as the TP size of the reference run. The current max(TP_SIZE, 2) doesn't work for larger models.
  • Do not use 0 as the token ID in the dummy prompts. This may generate random outputs for certain models/tokenizers.
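
A rough sketch of those two fixes, using illustrative names rather than the actual test-file code:

    # Fix 1: size the single-pipeline reference run by the total GPU count so
    # that large models such as Mixtral still fit; max(TP_SIZE, 2) does not.
    TP_SIZE, PP_SIZE = 2, 4
    ref_tp_size = TP_SIZE * PP_SIZE  # 8 GPUs for the reference run

    # Fix 2: avoid token ID 0 in the dummy prompts, since an all-zero prompt
    # can decode to unstable outputs for some models/tokenizers.
    dummy_prompt_token_ids = [[1] * 16]  # any valid non-zero token ID works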

Comment on lines 40 to 42
# Use the same number or at most 8 GPUs to hold the model.
# In this test we assume the model can fit in 8 GPUs.
str(min(TP_SIZE * PP_SIZE, 8)),
@youkaichao
Member


It's not going to work: this will run in multi-node tests with the mp backend, and we can use at most 2 GPUs.

You can revert this change and keep it only for your local testing.

@comaniac
Collaborator Author


Reverted with comments.

Member

@youkaichao left a comment


LGTM if tests pass

@youkaichao merged commit b5af8c2 into vllm-project:main on Jul 18, 2024
69 of 72 checks passed
@comaniac deleted the mixtral-pp branch on July 18, 2024 at 02:27
fialhocoelho pushed a commit to opendatahub-io/vllm that referenced this pull request Jul 19, 2024
xjpang pushed a commit to xjpang/vllm that referenced this pull request Jul 24, 2024
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 26, 2024
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Jul 27, 2024
gnpinkert pushed a commit to gnpinkert/vllm that referenced this pull request Aug 26, 2024
Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Labels
ready (ONLY add when PR is ready to merge/full CI is needed)