[train][release] Add training ingest soak test #57120
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Force-pushed from d7e6e97 to 3c376f8
Code Review
This pull request adds a new soak test for training ingest. The test runs an image classification task for 1000 epochs on a 16-GPU cluster. My review focuses on the configuration of this new test. I've suggested removing a debug flag to prevent excessive logging and recommended adjusting the test timeout to be more conservative to avoid tying up resources unnecessarily. Overall, the changes are straightforward and look good with these minor adjustments.
- name: worker_node_cpu
  instance_type: m5.4xlarge
  max_workers: 4
  min_workers: 4
  use_spot: false
do we need to add cpu nodes for this test?
See the testing notes (#57120 (comment)) - doing this kept memory growth low. Lmk if this is fine.
interesting 🤔
release/train_tests/benchmark/compute_configs/compute_gpu_4x4_cpu_4_aws.yaml
justinvyu left a comment
Thanks!
Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com>
Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: xgui <xgui@anyscale.com>
Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold. --------- Signed-off-by: Timothy Seah <tseah@anyscale.com> Signed-off-by: Aydin Abiar <aydin@anyscale.com>
Summary
Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold.
Testing
OOD run: https://buildkite.com/ray-project/release/builds/61277#0199a61f-b2aa-4782-8eeb-83429cc50a36.
head node cpu 0 run: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/jobs/prodjob_m5r3bthpdpf4x3gfpn4fcmyxvh/data?job-logs-section-tabs=application_logs&job-tab=metrics, https://buildkite.com/ray-project/release/builds/62401#_. No more disk growth, but there is some memory growth that tapers off and fortunately doesn't OOM the test.
head node cpu 0, gpu worker cpu 0, cpu worker run: https://console.anyscale-staging.com/cld_kvedZWag2qA8i5BjxUevf5i7/prj_92c7b71w55flm6gv6imv4m6vqg/jobs/prodjob_1akz6zmrr3mtgf1cpqyhc7mwgm?job-logs-section-tabs=application_logs&job-tab=metrics, https://buildkite.com/ray-project/release/builds/62479#0199c5cf-56fa-44c2-9895-3b8a863fe33e. Still no disk growth, and memory growth also stops much earlier.
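The summary mentions eventually failing the soak test when a tracked metric such as memory usage exceeds a threshold. A minimal sketch of what such a check could look like is below. It is not part of this PR: the function name `check_memory_growth`, the GiB units, the averaging window, and the sample values are all hypothetical, and how samples would be collected from the cluster is left open.

```python
from statistics import mean
from typing import Sequence


def check_memory_growth(
    samples_gib: Sequence[float],
    max_growth_gib: float,
    window: int = 3,
) -> float:
    """Compare early and late memory usage and fail on excessive growth.

    Hypothetical helper: averages the first and last `window` samples so a
    single noisy reading does not decide the outcome.
    """
    if len(samples_gib) < 2 * window:
        raise ValueError("need at least 2 * window samples")
    baseline = mean(samples_gib[:window])
    final = mean(samples_gib[-window:])
    growth = final - baseline
    if growth > max_growth_gib:
        raise AssertionError(
            f"memory grew by {growth:.2f} GiB (limit {max_growth_gib:.2f} GiB)"
        )
    return growth


# Made-up samples shaped like the runs above: growth that tapers off.
samples = [10.0, 10.0, 10.0, 12.0, 13.0, 13.0, 13.0]
print(check_memory_growth(samples, max_growth_gib=4.0))
```

Comparing windowed averages rather than a single peak would tolerate the kind of early growth that tapers off, while still catching a steady leak over a long run.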
Note
Adds a weekly GPU soak test that runs the image_classification parquet ingest benchmark for 1000 epochs with Ray Data.
`training_ingest_benchmark-soak-test` in `release/release_tests.yaml`: `ml`; GPU cluster `compute_configs/compute_gpu_4x4_aws.yaml`.
Sets `RAY_DATA_DEBUG_RESOURCE_MANAGER=1`.
Runs `train_tests/benchmark/train_benchmark.py` with `--task=image_classification --dataloader_type=ray_data --image_classification_data_format=parquet --num_workers=16 --num_epochs=1000`; `timeout: 86400`.
Written by Cursor Bugbot for commit d7e6e97. This will update automatically on new commits.
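To put the flags and timeout from the note in perspective, the sketch below parses the same command-line flags and works out how much wall-clock time each epoch may take before the timeout is hit. The parser is a hypothetical stand-in, not the actual argument handling in `train_benchmark.py`, and the numbers are the ones reported for commit d7e6e97 (the summary above describes a 50-epoch run).

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser that only mirrors the flags quoted in the note.
    parser = argparse.ArgumentParser(description="stand-in for train_benchmark.py flags")
    parser.add_argument("--task", default="image_classification")
    parser.add_argument("--dataloader_type", default="ray_data")
    parser.add_argument("--image_classification_data_format", default="parquet")
    parser.add_argument("--num_workers", type=int, default=1)
    parser.add_argument("--num_epochs", type=int, default=1)
    return parser


def seconds_per_epoch_budget(timeout_s: int, num_epochs: int) -> float:
    """Average wall-clock seconds available per epoch before the timeout."""
    return timeout_s / num_epochs


args = build_parser().parse_args(
    [
        "--task=image_classification",
        "--dataloader_type=ray_data",
        "--image_classification_data_format=parquet",
        "--num_workers=16",
        "--num_epochs=1000",
    ]
)
timeout_s = 86400  # `timeout: 86400` from the note, i.e. 24 hours
budget = seconds_per_epoch_budget(timeout_s, args.num_epochs)
print(f"{args.num_workers} workers, {args.num_epochs} epochs, {budget:.1f} s per epoch")
```

With these values the run has on average 86.4 seconds per epoch, including startup and any per-epoch overhead, before the 24-hour timeout ends the test.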