Commit c6b8c9f

[train][release] Add training ingest soak test (ray-project#57120)
Run `full_training.parquet` for 50 epochs as a soak test. In the future, we can make this more meaningful by tracking metrics (e.g. memory usage) and failing if they exceed a certain threshold.

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
1 parent c963d64 commit c6b8c9f
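
The commit message mentions, as future work, tracking metrics such as memory usage and failing the run when they exceed a threshold. Nothing like that is part of this commit. The following is only a minimal sketch of how such a check might look; the function names, the GiB unit, and the idea of sampling once per epoch are all assumptions made for illustration.

```python
from typing import Iterable, Optional


def first_violation(samples_gib: Iterable[float], limit_gib: float) -> Optional[int]:
    """Return the index of the first sample above limit_gib, or None if all pass."""
    for index, value in enumerate(samples_gib):
        if value > limit_gib:
            return index
    return None


def check_soak_memory(samples_gib: Iterable[float], limit_gib: float) -> None:
    """Raise RuntimeError so the test fails if any sample exceeds the limit.

    Hypothetical helper: how samples would be collected is not specified
    anywhere in this commit.
    """
    bad_index = first_violation(samples_gib, limit_gib)
    if bad_index is not None:
        raise RuntimeError(
            f"memory exceeded {limit_gib} GiB at sample {bad_index}"
        )
```

A soak test would presumably call something like `check_soak_memory` after each epoch, so that slow growth over 50 epochs shows up as a failure rather than a silent trend.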

2 files changed, +39 -0 lines changed

release/release_tests.yaml

Lines changed: 17 additions & 0 deletions
@@ -1800,6 +1800,23 @@
     timeout: 1200
     script: RAY_TRAIN_V2_ENABLED=1 python train_benchmark.py --task=recsys --dataloader_type=ray_data --num_workers=8 --train_batch_size=8192 --validation_batch_size=16384 --num_epochs=1
 
+- name: training_ingest_benchmark-soak_test
+  group: Train tests
+  working_dir: train_tests/benchmark
+
+  frequency: weekly
+  team: ml
+
+  cluster:
+    byod:
+      type: gpu
+    cluster_compute: compute_configs/compute_gpu_4x4_cpu_4_aws.yaml
+
+  run:
+    timeout: 43200
+    long_running: true
+    script: RAY_TRAIN_V2_ENABLED=1 python train_benchmark.py --task=image_classification --dataloader_type=ray_data --num_workers=16 --image_classification_data_format=parquet --num_epochs=50
+
 - name: train_multinode_persistence
   python: "3.10"
   group: Train tests
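
To put the new entry's `timeout: 43200` and `--num_epochs=50` in perspective, the arithmetic can be sketched as below. This assumes the timeout is expressed in seconds and ignores cluster startup, validation, and teardown time, so the per-epoch figure is only a rough upper bound. The helper name is hypothetical.

```python
TIMEOUT_S = 43200  # run.timeout from the soak test entry (assumed to be seconds)
NUM_EPOCHS = 50    # --num_epochs from the script line


def per_epoch_budget_s(timeout_s: int, num_epochs: int) -> float:
    """Average wall-clock time available per epoch before the timeout hits."""
    return timeout_s / num_epochs


timeout_hours = TIMEOUT_S / 3600                     # 12.0 hours
budget = per_epoch_budget_s(TIMEOUT_S, NUM_EPOCHS)   # 864.0 seconds per epoch

print(f"timeout: {timeout_hours} h, budget per epoch: {budget} s")
```

Under those assumptions, each epoch has about 14.4 minutes on average before the run would time out.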
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
+cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
+region: us-west-2
+
+head_node_type:
+  name: head_node
+  instance_type: m5.4xlarge
+  resources:
+    cpu: 0
+
+worker_node_types:
+  - name: worker_node_gpu
+    instance_type: g4dn.12xlarge
+    max_workers: 4
+    min_workers: 4
+    use_spot: false
+    resources:
+      cpu: 0
+  - name: worker_node_cpu
+    instance_type: m5.4xlarge
+    max_workers: 4
+    min_workers: 4
+    use_spot: false
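
As a sanity check on how this compute config lines up with `--num_workers=16` in the soak test script, the GPU count can be tallied as in the sketch below. It relies on two assumptions not stated in the commit: that each training worker uses one GPU, and that the GPU counts in the lookup table (4 GPUs per `g4dn.12xlarge`, none on `m5.4xlarge`) reflect the AWS instance specs. The dictionary and function names are hypothetical.

```python
# GPUs per instance type (assumed from AWS instance specs, not from this commit).
GPUS_PER_INSTANCE = {"g4dn.12xlarge": 4, "m5.4xlarge": 0}

# Worker groups as declared in the compute config above.
worker_node_types = [
    {"name": "worker_node_gpu", "instance_type": "g4dn.12xlarge", "min_workers": 4},
    {"name": "worker_node_cpu", "instance_type": "m5.4xlarge", "min_workers": 4},
]


def total_gpus(node_types: list) -> int:
    """Sum GPUs across all worker groups at their minimum size."""
    return sum(
        GPUS_PER_INSTANCE[group["instance_type"]] * group["min_workers"]
        for group in node_types
    )


NUM_WORKERS = 16  # --num_workers from the soak test script
assert total_gpus(worker_node_types) == NUM_WORKERS
```

With `min_workers` equal to `max_workers` for both groups, the cluster size is fixed, so the tally does not depend on autoscaling.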
