Commit 232341d

Sparks0219 and dayshah authored
[core] Nightly release test with cross AZ fault injection (#57579)
Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
1 parent 0a0754f commit 232341d

4 files changed: +93 -5 lines changed
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
+region: us-west-2
+
+head_node_type:
+  name: head
+  instance_type: m5.2xlarge
+  resources:
+    CPU: 0
+    GPU: 0
+
+worker_node_types:
+  - name: worker
+    instance_type: m5.2xlarge
+    min_workers: 250
+    max_workers: 350
+
+flags:
+  enable_multi_az_serve: true
+  allow-cross-zone-autoscaling: true

Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
+cloud_id: {{env["ANYSCALE_CLOUD_ID"]}}
+region: us-west1
+
+head_node_type:
+  name: head
+  instance_type: n2-standard-8
+  resources:
+    CPU: 0
+    GPU: 0
+
+worker_node_types:
+  - name: worker
+    instance_type: n2-standard-8
+    min_workers: 250
+    max_workers: 350
+
+flags:
+  enable_multi_az_serve: true
+  allow-cross-zone-autoscaling: true
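
The two compute configs above appear to be the AWS and GCE variants that release_data_tests.yaml refers to as cross_az_250_350_compute_aws.yaml and cross_az_250_350_compute_gce.yaml. As a rough sanity check of how the worker range relates to the `--concurrency 1024 2048` setting used by the test, the sketch below does the arithmetic. It assumes that each m5.2xlarge or n2-standard-8 worker contributes 8 CPUs to the cluster, that the head node contributes none (it is configured with `CPU: 0`), and that each actor reserves 1 CPU. The last assumption is not stated anywhere in this commit.

```python
import math

# Assumption: both instance types provide 8 vCPUs, exposed to Ray as 8 CPUs.
VCPUS_PER_WORKER = 8
# Worker range from the compute configs (min_workers / max_workers).
MIN_WORKERS, MAX_WORKERS = 250, 350
# Actor pool bounds from "--concurrency 1024 2048" in the test definition.
CONCURRENCY = (1024, 2048)
# Assumption: each actor reserves one CPU; not stated in this commit.
CPUS_PER_ACTOR = 1


def cluster_cpus(num_workers: int) -> int:
    """Total worker CPUs; the head node is excluded because it sets CPU: 0."""
    return num_workers * VCPUS_PER_WORKER


def workers_needed(num_actors: int) -> int:
    """Smallest number of workers whose CPUs can host num_actors actors."""
    return math.ceil(num_actors * CPUS_PER_ACTOR / VCPUS_PER_WORKER)


print(cluster_cpus(MIN_WORKERS), cluster_cpus(MAX_WORKERS))  # 2000 2800
print(workers_needed(CONCURRENCY[0]), workers_needed(CONCURRENCY[1]))  # 128 256
```

Under these assumptions the minimum cluster (2000 CPUs) can hold the lower bound of 1024 actors, while the upper bound of 2048 actors needs 256 workers, slightly more than `min_workers: 250`, so reaching the full pool size would depend on the cluster scaling up.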

release/nightly_tests/dataset/map_benchmark.py

Lines changed: 12 additions & 5 deletions
@@ -69,6 +69,13 @@ def parse_args() -> argparse.Namespace:
             "output of the first run as input."
         ),
     )
+    parser.add_argument(
+        "--concurrency",
+        default=[1, 1024],
+        nargs=2,
+        type=int,
+        help="Concurrency to use with 'map_batches'.",
+    )
     return parser.parse_args()
 
 
@@ -80,7 +87,8 @@ def main(args: argparse.Namespace) -> None:
     path = f"s3://ray-benchmark-data/tpch/parquet/sf{args.sf}/lineitem"
     path = [path] * args.repeat_inputs
 
-    def apply_map_batches(ds, use_actors=False):
+    def apply_map_batches(ds):
+        use_actors = args.compute == "actors"
         if not use_actors:
             return ds.map_batches(
                 functools.partial(
@@ -100,7 +108,7 @@ def apply_map_batches(ds, use_actors=False):
                 fn_constructor_args=[model_ref, args.map_batches_sleep_ms],
                 batch_format=args.batch_format,
                 batch_size=args.batch_size,
-                concurrency=(1, 1024),
+                concurrency=tuple(args.concurrency),
             )
 
     def benchmark_fn():
@@ -111,10 +119,9 @@ def benchmark_fn():
         if args.api == "map":
             ds = ds.map(increment_row)
         elif args.api == "map_batches":
-            use_actors = args.compute == "actors"
-            ds = apply_map_batches(ds, use_actors)
+            ds = apply_map_batches(ds)
             if args.repeat_map_batches == "repeat":
-                ds = apply_map_batches(ds, use_actors)
+                ds = apply_map_batches(ds)
         elif args.api == "flat_map":
             ds = ds.flat_map(flat_increment_row)
 
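
The change above replaces the hard-coded `concurrency=(1, 1024)` with a two-value `--concurrency` flag. The standalone sketch below shows how `argparse` handles that flag. The `add_argument` call is copied from the diff; the `build_parser` and `concurrency_from` helpers are hypothetical names used only for illustration. With `nargs=2`, argparse returns a list, which is why the diff wraps the value in `tuple(...)` before passing it on.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical helper: a parser with only the flag added in this commit."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--concurrency",
        default=[1, 1024],
        nargs=2,
        type=int,
        help="Concurrency to use with 'map_batches'.",
    )
    return parser


def concurrency_from(argv: List[str]) -> Tuple[int, ...]:
    """Parse argv and return the (min, max) pair as a tuple, as the diff does."""
    args = build_parser().parse_args(argv)
    return tuple(args.concurrency)


# No flag given: the previous hard-coded bounds remain the default.
print(concurrency_from([]))  # (1, 1024)
# The nightly test passes explicit bounds.
print(concurrency_from(["--concurrency", "1024", "2048"]))  # (1024, 2048)
```

Because the default matches the old literal, existing invocations of map_benchmark.py that omit the flag should behave as before.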

release/release_data_tests.yaml

Lines changed: 43 additions & 0 deletions
@@ -646,3 +646,46 @@
   run:
     timeout: 5400
     script: python tpch_q1.py --sf 100
+
+#################################################
+# Cross-AZ RPC fault tolerance test
+#################################################
+
+- name: "cross_az_map_batches_autoscaling"
+  frequency: nightly
+  env: gce
+
+  cluster:
+    cluster_compute: cross_az_250_350_compute_gce.yaml
+
+  run:
+    timeout: 10800
+    script: >
+      python map_benchmark.py --api map_batches --batch-format numpy
+      --compute actors --sf 1000 --repeat-inputs 1 --concurrency 1024 2048
+
+  variations:
+    - __suffix__: gce
+    - __suffix__: aws
+      env: aws
+      cluster:
+        cluster_compute: cross_az_250_350_compute_aws.yaml
+    # TODO(#58246): Enable these variations once RAY_testing_rpc_failure is supported.
+    # - __suffix__: gce_failure_injection
+    #   cluster:
+    #     byod:
+    #       # RAY_testing_rpc_failure is used to inject RPC failures across all RPCs (*) with no limit (-1) on the number of total failures,
+    #       # 10% request failures, 10% response failures, 1 guaranteed request failure and 1 guaranteed response failure.
+    #       # RAY_testing_rpc_failure_avoid_intra_node_failures=1 is used to avoid injecting RPC failures within the same node.
+    #       runtime_env:
+    #         - RAY_testing_rpc_failure="*=-1:10:10:1:1"
+    #         - RAY_testing_rpc_failure_avoid_intra_node_failures=1
+    #     cluster_compute: cross_az_250_350_compute_gce.yaml
+    # - __suffix__: aws_failure_injection
+    #   env: aws
+    #   cluster:
+    #     byod:
+    #       runtime_env:
+    #         - RAY_testing_rpc_failure="*=-1:10:10:1:1"
+    #         - RAY_testing_rpc_failure_avoid_intra_node_failures=1
+    #     cluster_compute: cross_az_250_350_compute_aws.yaml
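
The commented-out variations describe the value `*=-1:10:10:1:1` field by field. To make that mapping concrete, here is a small illustrative parser. It is not Ray's implementation: the `RpcFailureSpec` class and `parse_rpc_failure_spec` function are hypothetical, the field order is inferred only from the order in which the YAML comment lists them, and the sketch handles a single `method=...` entry.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RpcFailureSpec:
    """Hypothetical container; field order inferred from the YAML comment."""

    method: str                        # "*" means all RPCs
    max_failures: int                  # -1 means no limit on total failures
    request_failure_pct: int           # percent of requests to fail
    response_failure_pct: int          # percent of responses to fail
    guaranteed_request_failures: int   # request failures that always occur
    guaranteed_response_failures: int  # response failures that always occur

    @property
    def unlimited(self) -> bool:
        return self.max_failures < 0


def parse_rpc_failure_spec(spec: str) -> RpcFailureSpec:
    """Parse one 'method=a:b:c:d:e' entry into an RpcFailureSpec."""
    method, sep, rest = spec.partition("=")
    if not sep or not method:
        raise ValueError(f"expected 'method=...', got {spec!r}")
    fields = rest.split(":")
    if len(fields) != 5:
        raise ValueError(f"expected 5 ':'-separated fields, got {len(fields)}")
    return RpcFailureSpec(method, *(int(f) for f in fields))


spec = parse_rpc_failure_spec("*=-1:10:10:1:1")
print(spec.method, spec.unlimited, spec.request_failure_pct)  # * True 10
```

Read this way, the setting fails roughly one in ten requests and one in ten responses across every RPC for the whole run, with at least one failure of each kind guaranteed, and the companion variable restricts injection to RPCs that cross node boundaries.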
