Conversation

@TimothySeah
Contributor

@TimothySeah TimothySeah commented Oct 2, 2025

Summary

Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold.
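
As a rough illustration of the future work mentioned above, a threshold check over sampled memory usage could look like the sketch below. This is not part of the PR: `check_memory_threshold`, `MemoryLimitExceeded`, and the GiB limit are hypothetical names and values chosen for the example, and how the samples get collected is left open.

```python
from typing import Sequence


class MemoryLimitExceeded(RuntimeError):
    """Raised when sampled memory usage goes above the allowed limit (hypothetical)."""


def check_memory_threshold(samples_gib: Sequence[float], limit_gib: float) -> float:
    """Return the peak of the samples, or raise if the peak exceeds limit_gib.

    samples_gib is assumed to be memory usage readings (in GiB) taken
    periodically during the soak test; the sampling itself is out of scope.
    """
    if not samples_gib:
        raise ValueError("no memory samples were collected")
    peak = max(samples_gib)
    if peak > limit_gib:
        raise MemoryLimitExceeded(
            f"peak memory {peak:.1f} GiB exceeded limit {limit_gib:.1f} GiB"
        )
    return peak
```

A test would call this once at the end of the run (or after each epoch) and let the exception fail the job.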

Testing

OOD run: https://buildkite.com/ray-project/release/builds/61277#0199a61f-b2aa-4782-8eeb-83429cc50a36.

head node cpu 0 run: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/jobs/prodjob_m5r3bthpdpf4x3gfpn4fcmyxvh/data?job-logs-section-tabs=application_logs&job-tab=metrics, https://buildkite.com/ray-project/release/builds/62401#_. No more disk growth, but there is some memory growth that tapers off and fortunately doesn't OOM the test.

[Screenshot: 2025-10-09 at 10:18:50 AM]

head node cpu 0, gpu worker cpu 0, cpu worker run: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/jobs/prodjob_1akz6zmrr3mtgf1cpqyhc7mwgm?job-logs-section-tabs=application_logs&job-tab=metrics, https://buildkite.com/ray-project/release/builds/62479#0199c5cf-56fa-44c2-9895-3b8a863fe33e. Still no disk growth, and memory growth also stops much earlier.

[Screenshot: 2025-10-09 at 10:18:12 AM]
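
The testing notes describe memory growth that "tapers off". One way to turn that observation into a check is sketched below: compare the growth over the last part of the run against the growth over the whole run. The function name, the 25% tail window, and the 10% share are all assumptions made for this example, not values taken from the PR or from the dashboards linked above.

```python
from typing import Sequence


def growth_has_tapered(
    samples: Sequence[float],
    tail_fraction: float = 0.25,
    max_tail_share: float = 0.1,
) -> bool:
    """Return True if most of the growth happened before the tail of the run.

    samples are memory readings in time order. The tail is the last
    tail_fraction of the samples; growth has "tapered" if the tail accounts
    for at most max_tail_share of the total growth.
    """
    if len(samples) < 4:
        raise ValueError("need at least 4 samples to judge a trend")
    total_growth = samples[-1] - samples[0]
    if total_growth <= 0:
        # No net growth at all, so there is nothing left to taper.
        return True
    tail_len = max(1, int(len(samples) * tail_fraction))
    tail_start = len(samples) - tail_len
    tail_growth = samples[-1] - samples[tail_start]
    return tail_growth <= max_tail_share * total_growth
```

A flat tail such as `[10, 14, 16, 17, 17.5, 17.6, 17.6, 17.6]` passes, while a steadily rising series does not. A real check would likely need smoothing, since raw memory readings are noisy.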

Note

Adds a weekly GPU soak test that runs the image_classification parquet ingest benchmark for 1000 epochs with Ray Data.

  • Train tests:
    • New soak test training_ingest_benchmark-soak-test in release/release_tests.yaml:
      • Weekly; team ml; GPU cluster compute_configs/compute_gpu_4x4_aws.yaml.
      • BYOD runtime env enables RAY_DATA_DEBUG_RESOURCE_MANAGER=1.
      • Runs train_tests/benchmark/train_benchmark.py with --task=image_classification --dataloader_type=ray_data --image_classification_data_format=parquet --num_workers=16 --num_epochs=1000; timeout: 86400.

Written by Cursor Bugbot for commit d7e6e97.
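
For reference, the command described in the note above can be assembled as in the sketch below. The script path and flags are copied from the note. The `python` launcher prefix and the `build_soak_command` helper are assumptions for illustration. The epoch count is a parameter because the PR summary mentions 50 epochs while the note, written for an earlier commit, mentions 1000.

```python
import shlex
from typing import List


def build_soak_command(num_epochs: int, num_workers: int = 16) -> List[str]:
    """Build the benchmark invocation as an argument list (hypothetical helper)."""
    return [
        "python",  # assumed launcher; the note only lists the script and flags
        "train_tests/benchmark/train_benchmark.py",
        "--task=image_classification",
        "--dataloader_type=ray_data",
        "--image_classification_data_format=parquet",
        f"--num_workers={num_workers}",
        f"--num_epochs={num_epochs}",
    ]


# Print a shell-safe rendering of the command for inspection.
print(shlex.join(build_soak_command(num_epochs=1000)))
```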

@TimothySeah TimothySeah marked this pull request as draft October 2, 2025 02:11
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah force-pushed the tseah/train-soak-test branch from d7e6e97 to 3c376f8 Compare October 2, 2025 02:13
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request adds a new soak test for training ingest. The test runs an image classification task for 1000 epochs on a 16-GPU cluster. My review focuses on the configuration of this new test. I've suggested removing a debug flag to prevent excessive logging and recommended adjusting the test timeout to be more conservative to avoid tying up resources unnecessarily. Overall, the changes are straightforward and look good with these minor adjustments.

Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah marked this pull request as ready for review October 9, 2025 17:52
@TimothySeah TimothySeah added the `go` label (add ONLY when ready to merge, run all tests) Oct 9, 2025
@ray-gardener ray-gardener bot added the `train` (Ray Train Related Issue) and `release-test` (release test) labels Oct 9, 2025
Comment on lines +18 to +22
- name: worker_node_cpu
  instance_type: m5.4xlarge
  max_workers: 4
  min_workers: 4
  use_spot: false
Contributor

do we need to add cpu nodes for this test?

Contributor Author

See the testing notes (#57120 (comment)) - doing this kept memory growth low. Lmk if this is fine.

Contributor

interesting 🤔

Signed-off-by: Timothy Seah <tseah@anyscale.com>
@TimothySeah TimothySeah requested a review from justinvyu October 14, 2025 23:47
Contributor

@justinvyu justinvyu left a comment

Thanks!

@justinvyu justinvyu merged commit c6b8c9f into ray-project:master Oct 15, 2025
6 checks passed
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
Run `full_training.parquet` for 50 epochs as a soak test. In the future,
we can make this more meaningful by tracking metrics (e.g. memory usage)
and failing if they exceed a certain threshold.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: xgui <xgui@anyscale.com>
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Labels

`go` (add ONLY when ready to merge, run all tests), `release-test` (release test), `train` (Ray Train Related Issue)
