
Conversation


@abrarsheikh abrarsheikh commented Oct 16, 2025

flaky test

```
RAY_SERVE_HANDLE_AUTOSCALING_METRIC_PUSH_INTERVAL_S=0.1 \
RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1 \
RAY_SERVE_COLLECT_AUTOSCALING_METRICS_ON_HANDLE=0 \
pytest -svvx "python/ray/serve/tests/test_autoscaling_policy.py::TestAutoscalingMetrics::test_basic[min]"
```

What I think is the likely cause

When using `RAY_SERVE_AGGREGATE_METRICS_AT_CONTROLLER=1` with `min` aggregation (see the sketch after this list):

1. **Replicas emit metrics at slightly different times** (even if only 10ms apart, due to timestamp bucketing/rounding).
2. **The merged timeseries reflects the ramp-up**:
   - At t=0: maybe only replica 1 is reporting → total = 25 requests
   - At t=0.01: replica 2 starts reporting → total = 40 requests
   - At t=0.02: replica 3 starts reporting → total = 50 requests
   - etc.
3. **`min` aggregation captures the starting point**:
   - `aggregate_timeseries(..., aggregation_function="min")` takes the minimum value from the merged timeseries.
   - That minimum is always one of the initial low values (like 25) from when only a subset of replicas had started reporting.
   - Such a value can never be ≥ 45, so the test is inherently flaky.
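A minimal standalone sketch of the failure mode, using made-up replica reports that mirror the numbers above (this is not Ray Serve's actual merging code):

```python
from collections import defaultdict

# Hypothetical per-replica reports as (timestamp, running_requests) pairs.
# Replicas start reporting ~10ms apart, so early buckets only include a subset.
replica_reports = {
    "replica_1": [(0.00, 25), (0.01, 25), (0.02, 25)],
    "replica_2": [(0.01, 15), (0.02, 15)],  # starts reporting 10ms late
    "replica_3": [(0.02, 10)],              # starts reporting 20ms late
}

# Merge: sum whatever values exist in each time bucket across replicas.
merged = defaultdict(float)
for points in replica_reports.values():
    for ts, value in points:
        merged[ts] += value
merged_series = sorted(merged.items())
# merged_series == [(0.0, 25.0), (0.01, 40.0), (0.02, 50.0)]

# "min" over the merged series always lands on the ramp-up bucket (25),
# which can never satisfy the test's >= 45 expectation -> flaky.
print(min(value for _, value in merged_series))  # 25.0
print(max(value for _, value in merged_series))  # 50.0
```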

Removing `min` from the test fixture.

I think a more robust solution is to keep the last report in the controller, generate the final time series using both reports, clip the data at the mid-point, and then apply the aggregation function.
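A hedged sketch of what that could look like; the function name, report format, and midpoint-clipping rule here are assumptions for illustration, not Ray Serve's actual API:

```python
from bisect import bisect_left


def aggregate_with_previous_report(prev_report, curr_report, aggregation_function):
    """prev_report / curr_report: lists of (timestamp, value), sorted by timestamp."""
    combined = sorted(prev_report + curr_report)
    if not combined:
        return None
    # Clip at the midpoint of the combined window so the retained half only
    # contains buckets where every replica has had a chance to report,
    # avoiding the ramp-up artifact described above.
    midpoint = (combined[0][0] + combined[-1][0]) / 2
    start = bisect_left(combined, (midpoint,))
    clipped = combined[start:] or combined[-1:]
    return aggregation_function(value for _, value in clipped)


# Example with the numbers above: "min" over the clipped half now sees the
# steady-state total (50) instead of the ramp-up value (25).
prev = [(0.00, 25.0), (0.01, 40.0)]
curr = [(0.02, 50.0), (0.03, 50.0)]
print(aggregate_with_previous_report(prev, curr, min))  # 50.0
```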

Signed-off-by: abrar <abrar@anyscale.com>
@abrarsheikh abrarsheikh requested a review from a team as a code owner October 16, 2025 05:36

@akyang-anyscale
Contributor

what's the reasoning behind the flakiness?

@abrarsheikh abrarsheikh added the go add ONLY when ready to merge, run all tests label Oct 16, 2025
@ray-gardener ray-gardener bot added the serve Ray Serve Related Issue label Oct 16, 2025
@zcin zcin merged commit 978a9af into master Oct 16, 2025
6 checks passed
@zcin zcin deleted the SERVE-1239-abrar-flaky branch October 16, 2025 17:51
justinyeh1995 pushed a commit to justinyeh1995/ray that referenced this pull request Oct 20, 2025
xinyuangui2 pushed a commit to xinyuangui2/ray that referenced this pull request Oct 22, 2025
elliot-barn pushed a commit that referenced this pull request Oct 23, 2025
landscapepainter pushed a commit to landscapepainter/ray that referenced this pull request Nov 17, 2025
Aydin-ab pushed a commit to Aydin-ab/ray-aydin that referenced this pull request Nov 19, 2025
Future-Outlier pushed a commit to Future-Outlier/ray that referenced this pull request Dec 7, 2025