[train][release] Add training ingest soak test #57120

TimothySeah · 2025-10-02T02:11:18Z

Summary

Run full_training.parquet for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold.
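The follow-up mentioned above (tracking a metric and failing once it exceeds a threshold) might look roughly like the sketch below. Everything in it is hypothetical: `check_memory_threshold`, the per-epoch sample list, and the limit value are illustrative and are not part of this PR or of Ray's release tooling.

```python
from typing import Sequence


def check_memory_threshold(samples_gib: Sequence[float], limit_gib: float) -> float:
    """Return the peak sample, raising if it exceeds the limit.

    Hypothetical helper: `samples_gib` stands in for memory readings
    that a future version of the soak test might collect per epoch.
    """
    if not samples_gib:
        raise ValueError("no memory samples were collected")
    peak = max(samples_gib)
    if peak > limit_gib:
        # Fail loudly so the release test is marked as failed.
        raise RuntimeError(
            f"peak memory {peak:.1f} GiB exceeded limit {limit_gib:.1f} GiB"
        )
    return peak
```

A test script could call this once after the final epoch, so that a slow leak fails the run instead of passing silently.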

Testing

OOD (out-of-disk) run: https://buildkite.com/ray-project/release/builds/61277#0199a61f-b2aa-4782-8eeb-83429cc50a36.

Run with head node CPU resources set to 0: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/jobs/prodjob_m5r3bthpdpf4x3gfpn4fcmyxvh/data?job-logs-section-tabs=application_logs&job-tab=metrics, https://buildkite.com/ray-project/release/builds/62401#_. Disk usage no longer grows. Memory still grows somewhat, but the growth tapers off and does not OOM the test.

Run with CPU resources set to 0 on the head node and on the GPU workers, plus dedicated CPU workers: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/jobs/prodjob_1akz6zmrr3mtgf1cpqyhc7mwgm?job-logs-section-tabs=application_logs&job-tab=metrics, https://buildkite.com/ray-project/release/builds/62479#0199c5cf-56fa-44c2-9895-3b8a863fe33e. Disk usage still does not grow, and memory growth stops much earlier.
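The observation that memory growth "tapers off" was made by eye from the job metrics. One way to express the same check in code is sketched below. The function name, the midpoint comparison, and the 5% tolerance are all assumptions chosen for illustration, not something these runs actually used.

```python
from typing import Sequence


def growth_has_tapered(samples: Sequence[float], tolerance: float = 0.05) -> bool:
    """Return True if the metric grew little over the second half of the run.

    Compares the last sample with the sample at the midpoint; growth counts
    as tapered when the relative increase is at most `tolerance`.
    """
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    mid = samples[len(samples) // 2]
    last = samples[-1]
    if mid <= 0:
        raise ValueError("samples must be positive")
    return (last - mid) / mid <= tolerance
```

Comparing against the midpoint rather than the first sample ignores the initial ramp-up, which is expected in any run and is not by itself a sign of a leak.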

Note

Adds a weekly GPU soak test that runs the image_classification parquet ingest benchmark for 1000 epochs with Ray Data.

Train tests:
- New soak test training_ingest_benchmark-soak-test in release/release_tests.yaml:
  - Weekly; team ml; GPU cluster compute_configs/compute_gpu_4x4_aws.yaml.
  - BYOD runtime env enables RAY_DATA_DEBUG_RESOURCE_MANAGER=1.
  - Runs train_tests/benchmark/train_benchmark.py with --task=image_classification --dataloader_type=ray_data --image_classification_data_format=parquet --num_workers=16 --num_epochs=1000; timeout: 86400.
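As a rough illustration of how the flags listed above combine into one invocation, the sketch below assembles the command string. The flag names and values are copied from this note, which describes commit d7e6e97; later commits in this PR reduce the epoch count. `build_command` and `SOAK_TEST_FLAGS` are hypothetical names, not part of the release tooling.

```python
import shlex

# Values as listed in the note for commit d7e6e97.
SOAK_TEST_FLAGS = {
    "task": "image_classification",
    "dataloader_type": "ray_data",
    "image_classification_data_format": "parquet",
    "num_workers": 16,
    "num_epochs": 1000,
}


def build_command(script: str, flags: dict) -> str:
    """Render `python <script> --name=value ...` as a shell-safe string."""
    args = [f"--{name}={value}" for name, value in flags.items()]
    return shlex.join(["python", script, *args])


command = build_command("train_tests/benchmark/train_benchmark.py", SOAK_TEST_FLAGS)
```

Keeping the flags in a dictionary makes a change such as lowering `num_epochs` a one-line edit.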

Written by Cursor Bugbot for commit d7e6e97.

Signed-off-by: Timothy Seah <tseah@anyscale.com>

gemini-code-assist

Code Review

This pull request adds a new soak test for training ingest. The test runs an image classification task for 1000 epochs on a 16-GPU cluster. My review focuses on the configuration of this new test. I've suggested removing a debug flag to prevent excessive logging and recommended adjusting the test timeout to be more conservative to avoid tying up resources unnecessarily. Overall, the changes are straightforward and look good with these minor adjustments.

release/release_tests.yaml

justinvyu · 2025-10-14T20:12:01Z

release/train_tests/benchmark/compute_configs/compute_gpu_4x4_cpu_4_aws.yaml

+    - name: worker_node_cpu
+      instance_type: m5.4xlarge
+      max_workers: 4
+      min_workers: 4
+      use_spot: false


Do we need to add CPU nodes for this test?

TimothySeah · Oct 14, 2025

See the testing notes (#57120 (comment)): doing this kept memory growth low. Let me know if this is fine.

justinvyu · 2025-10-15T19:20:43Z

interesting 🤔

justinvyu

Thanks!

Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>

Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: xgui <xgui@anyscale.com>

Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>

TimothySeah marked this pull request as draft October 2, 2025 02:11

[train][release] Add training ingest soak test

3c376f8

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah force-pushed the tseah/train-soak-test branch from d7e6e97 to 3c376f8 Compare October 2, 2025 02:13

gemini-code-assist bot reviewed Oct 2, 2025

TimothySeah added 7 commits October 2, 2025 10:07

reduce to 100 epochs

a9ca5d6

Signed-off-by: Timothy Seah <tseah@anyscale.com>

remove debug logs which are too long

72e9a5a

Signed-off-by: Timothy Seah <tseah@anyscale.com>

Merge remote-tracking branch 'upstream/master' into tseah/train-soak-test

c958ce6

try with fewer epochs and disabled progress bars

110b35f

Signed-off-by: Timothy Seah <tseah@anyscale.com>

add progress bar back

2525039

Signed-off-by: Timothy Seah <tseah@anyscale.com>

[DO NOT SUBMIT] set head node cpu resources to 0 for testing

ff6353c

Signed-off-by: Timothy Seah <tseah@anyscale.com>

try with 4 cpu workers and cpu:0 on gpu workers

7282f75

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah marked this pull request as ready for review October 9, 2025 17:52

TimothySeah added the `go` label (add ONLY when ready to merge, run all tests) Oct 9, 2025

TimothySeah requested a review from matthewdeng October 9, 2025 17:52

ray-gardener bot added the `train` (Ray Train Related Issue) and `release-test` (release test) labels Oct 9, 2025

justinvyu reviewed Oct 14, 2025

address comments

98d0318

Signed-off-by: Timothy Seah <tseah@anyscale.com>

TimothySeah requested a review from justinvyu October 14, 2025 23:47

Merge remote-tracking branch 'upstream/master' into tseah/train-soak-test

a3d430a

justinvyu approved these changes Oct 15, 2025

justinvyu merged commit c6b8c9f into ray-project:master Oct 15, 2025
6 checks passed
