@bveeramani
Member

@bveeramani bveeramani commented Oct 23, 2025

Summary

This PR removes the image_classification_chaos_no_scale_back release test and its associated setup script (setup_cluster_compute_config_updater.py). This test has become non-functional and is no longer providing useful signal.

Background

The image_classification_chaos_no_scale_back release test was designed to validate Ray Data's fault tolerance when many nodes abruptly get preempted at the same time.

The test worked by:

  1. Running on an autoscaling cluster with 1-10 nodes
  2. Updating the compute config mid-test to downscale to 5 nodes
  3. Asserting that there are dead nodes as a sanity check
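The sanity check in step 3 can be sketched as follows. This is a minimal illustration, not the deleted test's actual code: it assumes the node list has the shape returned by `ray.nodes()` (a list of dicts with an `"Alive"` flag), and the helper name `count_dead_nodes` is hypothetical.

```python
from typing import Any, Dict, List


def count_dead_nodes(nodes: List[Dict[str, Any]]) -> int:
    """Count node entries whose "Alive" flag is False.

    `nodes` is assumed to look like the output of `ray.nodes()`;
    this helper is hypothetical and not part of the removed test.
    """
    return sum(1 for node in nodes if not node["Alive"])


# Hand-written sample data standing in for `ray.nodes()` output.
sample_nodes = [
    {"NodeID": "head", "Alive": True},
    {"NodeID": "worker-1", "Alive": True},
    {"NodeID": "worker-2", "Alive": False},  # preempted by the downscale
]

dead = count_dead_nodes(sample_nodes)
# The test's sanity check amounts to requiring at least one dead node.
assert dead > 0, "expected the downscale to leave dead nodes"
```

On a live cluster the same helper could be called as `count_dead_nodes(ray.nodes())` after the compute config update.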

Why This Test Is Broken

After the removal of Parquet metadata fetching in #56105 (September 2, 2025), the autoscaling behavior changed significantly:

  • Before metadata removal: The cluster would autoscale more aggressively because metadata fetching created additional tasks that triggered faster scale-up. The cluster would scale past 5 nodes, then downscale, leaving dead nodes that the test could detect.

  • After metadata removal: Without the metadata fetching tasks, the cluster doesn't scale up fast enough to get past 5 nodes before the downscale happens. Because the cluster is still at or below 5 nodes, the downscale terminates nothing, so there are no dead nodes to detect and the sanity check fails.
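The timing dependence described above can be illustrated with a toy model. Everything here is an assumption for illustration: the linear scale-up, the rates, and the function name are hypothetical, and only the 1-10 node range and the 5-node downscale target come from the test description.

```python
def dead_nodes_after_downscale(
    nodes_added_per_minute: float,
    downscale_at_minute: float,
    min_nodes: int = 1,
    max_nodes: int = 10,
    target_nodes: int = 5,
) -> int:
    """Toy model: nodes are added linearly until the downscale fires.

    Returns how many nodes the downscale would terminate, which is
    the number of dead nodes the test could later observe.
    """
    added = int(nodes_added_per_minute * downscale_at_minute)
    peak = min(max_nodes, min_nodes + added)
    return max(0, peak - target_nodes)


# Hypothetical "fast" scale-up (extra metadata tasks drive demand).
print(dead_nodes_after_downscale(2.0, 4.0))  # 4

# Hypothetical "slow" scale-up (no metadata tasks): never passes 5 nodes.
print(dead_nodes_after_downscale(0.5, 4.0))  # 0
```

In this model the test only passes when the cluster exceeds the target before the downscale, which is why the check depends on scale-up speed rather than on fault-tolerance behavior.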

Why We're Removing It

  1. Test is fundamentally broken: The test's assumptions about autoscaling behavior are no longer valid after the metadata fetching removal.
  2. Not actively monitored: The test is marked as unstable and isn't closely watched.

Changes

  • Removed image_classification_chaos_no_scale_back test from release/release_data_tests.yaml
  • Deleted release/nightly_tests/setup_cluster_compute_config_updater.py (only used by this test)

Related

See #56105

Fixes #56528

…se test

This test became broken after the removal of Parquet metadata fetching
tasks in #56105. The test relies on specific autoscaling behavior that
no longer works as expected without metadata fetching.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
@bveeramani bveeramani changed the title [Data] Remove unstable image_classification_chaos_no_scale_back release test [Data] Remove unstable image_classification_chaos_no_scale_back release test Oct 23, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request removes the unstable image_classification_chaos_no_scale_back release test and its associated setup script. The justification for this removal is well-explained in the pull request description, citing that the test is broken due to changes in autoscaling behavior and is no longer providing a useful signal. The changes are straightforward and correctly implement the removal. I've added one comment on the deleted script regarding a hardcoded URL as a note for future reference.

@bveeramani bveeramani enabled auto-merge (squash) October 23, 2025 17:15
@github-actions github-actions bot added the "go" label (add ONLY when ready to merge, run all tests) Oct 23, 2025
@bveeramani bveeramani merged commit b8924aa into master Oct 23, 2025
7 checks passed
@bveeramani bveeramani deleted the remove-no-scale-back branch October 23, 2025 17:46
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 27, 2025
…ease test (ray-project#58048)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: xgui <xgui@anyscale.com>
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
…ease test (ray-project#58048)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
…ease test (ray-project#58048)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Signed-off-by: Aydin Abiar <aydin@anyscale.com>

Development

Successfully merging this pull request may close these issues.

Release test image_classification_chaos_no_scale_back failed

3 participants