[train] Update vicuna release test example to use V2 #57767
Conversation
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Code Review
This PR updates the cluster configuration for the Vicuna example to use 16 worker GPUs for training, which is a valid change. However, this introduces an inconsistency with the documentation in the accompanying Jupyter Notebook, which has not been updated. I've left a specific comment on the configuration file change. Please update the notebook documentation to match the new cluster setup to avoid user confusion.
  instance_type: m5.4xlarge
  worker_node_types:
  - name: worker_node
    instance_type: g5.4xlarge
-   min_workers: 15
-   max_workers: 15
+   min_workers: 16
+   max_workers: 16
These changes correctly adjust the cluster to use 16 worker GPUs for training. However, this makes the documentation in the corresponding notebook (vicuna_13b_lightning_deepspeed_finetune.ipynb) outdated and misleading.
The notebook's "Cluster Setting" section needs to be updated to reflect:
- The head node is now m5.4xlarge (a CPU instance).
- There are now 16 worker nodes.
- The tip about using a GPU head node for inference is no longer accurate and should be revised, as inference will now run on a worker node.
Please update the notebook to ensure the example remains consistent and clear for users.
Looks good to me! I kicked off the release test so we should make sure it passes there too.
"name": "stderr",
"output_type": "stream",
"text": [
"2025-10-15 15:50:45,333\tINFO worker.py:1833 -- Connecting to existing Ray cluster at address: 10.0.171.127:6379...\n",
We should delete these output cells here and elsewhere so they don't show up in the docs.
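Since `.ipynb` files are plain JSON, the saved outputs can be stripped with a small standard-library script. This is a minimal sketch, not the project's actual cleanup tooling; the notebook structure below is an illustrative fragment:

```python
import json

def clear_outputs(nb: dict) -> dict:
    """Remove saved outputs and execution counts from every code cell."""
    for cell in nb.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return nb

# Illustrative notebook fragment with one executed code cell.
nb = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "outputs": [
                {"name": "stderr", "output_type": "stream", "text": ["..."]}
            ],
            "source": ["print('hi')"],
        }
    ]
}

cleaned = clear_outputs(nb)
print(json.dumps(cleaned["cells"][0]["outputs"]))  # → []
```

For a whole file, `jupyter nbconvert --clear-output --inplace <notebook>.ipynb` accomplishes the same thing without custom code.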
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Updates the vicuna lightning deepspeed example to run w/ Train V2.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Description
Updates the vicuna lightning deepspeed example to run w/ Train V2.
Related issues
Types of change
Checklist
Does this PR introduce breaking changes?
Testing:
Code Quality:
git commit -s)
Documentation:
doc/source/ (if applicable)
Additional context