Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from the master branch into the main branch.

📅 Created: 2025-11-17
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

owenowenisme and others added 30 commits October 28, 2025 13:28
…t#57579)

Signed-off-by: dayshah <dhyey2019@gmail.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
Co-authored-by: dayshah <dhyey2019@gmail.com>
…AggType, U]) (ray-project#57281)

## Why are these changes needed?
The current Generic types in `AggregateFnV2` are not tied to the class,
so they are not picked up properly by static type checkers such as mypy.

By adding the Generic[] in the class definition, we get full type
checking support.
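
A minimal sketch of the pattern, assuming an illustrative method rather than the exact Ray signature:

```python
from typing import Generic, TypeVar

AggType = TypeVar("AggType")
U = TypeVar("U")


class AggregateFnV2(Generic[AggType, U]):
    """Illustrative stand-in for the real class in ray.data."""

    def aggregate(self, accumulator: AggType, row: U) -> AggType:
        # Tying the TypeVars to the class via Generic[] lets mypy infer
        # AggType and U from concrete subclasses and call sites.
        raise NotImplementedError
```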


## Checks
- [x] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [N/A] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [N/A] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: Arthur <atte.book@gmail.com>
Co-authored-by: Goutam <goutam@anyscale.com>
```
REGRESSION 51.89%: single_client_get_calls_Plasma_Store (THROUGHPUT) regresses from 8378.589542828342 to 4030.5453313124744 in microbenchmark.json
REGRESSION 33.66%: n_n_actor_calls_with_arg_async (THROUGHPUT) regresses from 4105.951978131054 to 2723.9690685388855 in microbenchmark.json
REGRESSION 29.43%: client__tasks_and_put_batch (THROUGHPUT) regresses from 12873.97871447783 to 9085.541921590711 in microbenchmark.json
REGRESSION 27.77%: multi_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 13779.913284159866 to 9952.762154617178 in microbenchmark.json
REGRESSION 26.34%: client__get_calls (THROUGHPUT) regresses from 1129.1409512898194 to 831.689705073893 in microbenchmark.json
REGRESSION 25.07%: multi_client_put_gigabytes (THROUGHPUT) regresses from 36.697734067834084 to 27.49866375110667 in microbenchmark.json
REGRESSION 23.94%: actors_per_second (THROUGHPUT) regresses from 508.9808896382363 to 387.1219957094043 in benchmarks/many_actors.json
REGRESSION 17.22%: 1_1_async_actor_calls_async (THROUGHPUT) regresses from 4826.171895058453 to 3995.0258578261814 in microbenchmark.json
REGRESSION 16.58%: single_client_tasks_async (THROUGHPUT) regresses from 7034.736389002367 to 5868.239300602419 in microbenchmark.json
REGRESSION 14.30%: client__tasks_and_get_batch (THROUGHPUT) regresses from 0.9643252583791863 to 0.8264370292993273 in microbenchmark.json
REGRESSION 14.27%: client__1_1_actor_calls_concurrent (THROUGHPUT) regresses from 1037.1186014627438 to 889.14113884267 in microbenchmark.json
REGRESSION 12.90%: client__1_1_actor_calls_async (THROUGHPUT) regresses from 1014.64288638885 to 883.7347522770161 in microbenchmark.json
REGRESSION 11.37%: client__put_calls (THROUGHPUT) regresses from 805.4069136266919 to 713.82381443796 in microbenchmark.json
REGRESSION 11.02%: 1_1_actor_calls_concurrent (THROUGHPUT) regresses from 5222.99132120111 to 4647.56843532278 in microbenchmark.json
REGRESSION 7.95%: single_client_get_object_containing_10k_refs (THROUGHPUT) regresses from 12.272053704608084 to 11.296427707979271 in microbenchmark.json
REGRESSION 7.86%: client__1_1_actor_calls_sync (THROUGHPUT) regresses from 524.6134993747014 to 483.38098840508496 in microbenchmark.json
REGRESSION 7.26%: single_client_put_calls_Plasma_Store (THROUGHPUT) regresses from 4498.3519827438895 to 4171.572402867286 in microbenchmark.json
REGRESSION 6.47%: single_client_wait_1k_refs (THROUGHPUT) regresses from 4.700920788730696 to 4.396844484606209 in microbenchmark.json
REGRESSION 6.40%: 1_1_async_actor_calls_with_args_async (THROUGHPUT) regresses from 2766.907182403518 to 2589.906655726785 in microbenchmark.json
REGRESSION 5.06%: single_client_put_gigabytes (THROUGHPUT) regresses from 19.30103208209274 to 18.324991353469613 in microbenchmark.json
REGRESSION 1.73%: 1_n_actor_calls_async (THROUGHPUT) regresses from 7474.798821945149 to 7345.613928457275 in microbenchmark.json
REGRESSION 421.24%: dashboard_p99_latency_ms (LATENCY) regresses from 232.641 to 1212.615 in benchmarks/many_pgs.json
REGRESSION 377.77%: dashboard_p95_latency_ms (LATENCY) regresses from 11.336 to 54.16 in benchmarks/many_pgs.json
REGRESSION 306.02%: dashboard_p99_latency_ms (LATENCY) regresses from 749.022 to 3041.184 in benchmarks/many_tasks.json
REGRESSION 162.57%: dashboard_p50_latency_ms (LATENCY) regresses from 11.744 to 30.836 in benchmarks/many_actors.json
REGRESSION 94.47%: dashboard_p95_latency_ms (LATENCY) regresses from 487.355 to 947.76 in benchmarks/many_tasks.json
REGRESSION 35.48%: dashboard_p99_latency_ms (LATENCY) regresses from 49.716 to 67.355 in benchmarks/many_nodes.json
REGRESSION 33.15%: dashboard_p95_latency_ms (LATENCY) regresses from 2876.107 to 3829.61 in benchmarks/many_actors.json
REGRESSION 27.55%: dashboard_p95_latency_ms (LATENCY) regresses from 13.982 to 17.834 in benchmarks/many_nodes.json
REGRESSION 7.56%: time_to_broadcast_1073741824_bytes_to_50_nodes (LATENCY) regresses from 13.777409734000003 to 14.81957527099999 in scalability/object_store.json
REGRESSION 4.48%: 3000_returns_time (LATENCY) regresses from 6.1422604579999955 to 6.417386639000014 in scalability/single_node.json
REGRESSION 3.41%: avg_pg_remove_time_ms (LATENCY) regresses from 1.419495533032749 to 1.4678401576576676 in stress_tests/stress_test_placement_group.json
REGRESSION 1.48%: 10000_get_time (LATENCY) regresses from 25.136106761999997 to 25.508083513999992 in scalability/single_node.json
REGRESSION 0.76%: stage_2_avg_iteration_time (LATENCY) regresses from 36.08304100036621 to 36.358218574523924 in stress_tests/stress_test_many_tasks.json
```

Signed-off-by: kevin <kevin@anyscale.com>
…ay-project#57631)


## Why are these changes needed?
To prevent unknown issues with the streaming executor not completing after hours, we add assert statements to rule out potential issues. This PR ensures that when an operator is completed:
- the internal input queue is empty
- the internal output queue is empty
- the external input queue is empty

The external output queue can be non-empty, because the downstream operators will consume from it.
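
A minimal sketch of these invariants, assuming illustrative queue attribute names rather than Ray Data's actual operator API:

```python
def assert_queues_drained(op) -> None:
    # Checked only when the operator reports completion.
    assert len(op.internal_input_queue) == 0, "internal input queue not empty"
    assert len(op.internal_output_queue) == 0, "internal output queue not empty"
    assert len(op.external_input_queue) == 0, "external input queue not empty"
    # The external output queue may be non-empty: downstream operators drain it.
```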


## Checks

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…roject#58234)

## Description

We need to make sure we're running tests on at least SF100 so that we capture regressions that would otherwise fall under the noise level.


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…/5) (ray-project#58068)

Updating docgpubuild to run on Python 3.10.

Updating the minbuild-multiply job name to minbuild-serve.

Post merge test that uses the docgpubuild image:
https://buildkite.com/ray-project/postmerge/builds/14073

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
…optimized download function (ray-project#57854)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
…ay-project#58261)


## Description

This change makes Ray Data dump verbose telemetry for `ResourceManager` into `ray-data.log` by default.


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description
Update formatting of FailurePolicy log message to be more readable.

## Additional information

**Before:**

```
[FailurePolicy] Decision: FailureDecision.RAISE, Error source: controller, Error count / maximum errors allowed: 1/0, Error: Training failed due to controller error:
Worker group is not active. Call WorkerGroup.create() to create a new worker group.
```

**After:**

```
[FailurePolicy] RAISE
  Source: controller
  Error count: 1 (max allowed: 0)

Training failed due to controller error:
Worker group is not active. Call WorkerGroup.create() to create a new worker group.
```

Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
Signed-off-by: Yicheng-Lu-llll <luyc58576@gmail.com>
Signed-off-by: Yicheng-Lu-llll <51814063+Yicheng-Lu-llll@users.noreply.github.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
…#56742)


## Why are these changes needed?


Fifth split of ray-project#56416


## Checks

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---

> [!NOTE]
> Enables Ruff import sorting for rllib/examples by narrowing per-file
ignores and updates example files' imports accordingly with no
functional changes.
> 
> - **Tooling/Lint**:
> - Update `pyproject.toml` Ruff per-file-ignores (replace blanket
`rllib/*` with targeted subpaths) to enable import-order linting for
`rllib/examples`.
> - **Examples**:
> - Reorder and normalize imports across `rllib/examples/**` to satisfy
Ruff isort rules; no logic or behavior changes.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
101586e. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>

---------

Signed-off-by: Gagandeep Singh <gdp.1807@gmail.com>
Signed-off-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
Co-authored-by: Mark Towers <mark.m.towers@gmail.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
…_options. (ray-project#58275)

## Description
> Exclude IMPLICIT_RESOURCE_PREFIX from ReplicaConfig.ray_actor_options

## Related issues
Fixes ray-project#58085

Signed-off-by: xingsuo-zbz <zhao_abc_123@163.com>
… + discrepancy fix in Python API 'serve.start' function (ray-project#57622)


## Why are these changes needed?

1. Fix bug with 'proxy_location' set for 'serve run' CLI command

The `serve run` CLI command ignores `proxy_location` from the config and uses the default value `EveryNode`.

Steps to reproduce:
- have a script:
```python
# hello_world.py
from ray.serve import deployment

@deployment
async def hello_world():
    return "Hello, world!"

hello_world_app = hello_world.bind()
```
Execute:
```
ray stop
ray start --head
serve build -o config.yaml hello_world:hello_world_app
```
- change `proxy_location` in the `config.yaml`: EveryNode -> Disabled
```
serve run config.yaml
curl -s -X GET "http://localhost:8265/api/serve/applications/" | jq -r '.proxy_location'
```
Output:
```
Before change:
EveryNode - but Disabled expected
After change:
Disabled
```

2. Fix discrepancy for 'proxy_location' in the Python API 'start' method

The `serve.start` function in the Python API sets a different `http_options.location` depending on whether `http_options` is provided.

 Steps to reproduce:
- have a script:
```python
# discrepancy.py
import time

from ray import serve
from ray.serve.context import _get_global_client

if __name__ == '__main__':
    serve.start()
    client = _get_global_client()
    print(f"Empty http_options: `{client.http_config.location}`")

    serve.shutdown()
    time.sleep(5)

    serve.start(http_options={"host": "0.0.0.0"})
    client = _get_global_client()
    print(f"Non empty http_options: `{client.http_config.location}`")
```
Execute:
```
ray stop
ray start --head
python -m discrepancy
```
Output:
```
Before change:
Empty http_options: `EveryNode`
Non empty http_options: `HeadOnly`
After change:
Empty http_options: `EveryNode`
Non empty http_options: `EveryNode`
```

-------------------------------------------------------------
It changes the current behavior in the following ways:
1. The `serve run` CLI command respects the `proxy_location` parameter from the config instead of using the hardcoded `EveryNode`.
2. The `serve.start` function in the Python API stops using the default `HeadOnly` when `proxy_location` is empty and an `http_options` dictionary is provided without `location` specified.


## Related issue number


Aims to simplify changes in the PR: ray-project#56507

## Checks

- [x] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [x] I've run pre-commit jobs to lint the changes in this PR.
([pre-commit
setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

---------

Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
## Description


### Status Quo
PR ray-project#54667 addressed OOM issues by sampling a few lines of the file. However, that code always assumes the input file is seekable (i.e., not compressed). This breaks zipped files, as reported in ray-project#55356.

### Potential Workaround
- Refactor code shared between JsonDatasource and FileDatasource
- Default to 10000 if a zipped file is found
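
A minimal sketch of the fallback, assuming a caller-supplied sampling function (`sample_fn` is hypothetical, not the actual datasource helper):

```python
import io

DEFAULT_SAMPLE_ROWS = 10000  # assumed fallback for non-seekable inputs


def choose_sample_size(stream: io.IOBase, sample_fn) -> int:
    # Sampling needs to seek back to the start of the file; compressed
    # (e.g. zipped) streams are not seekable, so use a fixed default there.
    if stream.seekable():
        return sample_fn(stream)
    return DEFAULT_SAMPLE_ROWS
```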

## Related issues
ray-project#55356


---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ray-project#58180)

## Expose Route Patterns in Proxy Metrics

fixes ray-project#52212

### Problem
Proxy metrics (`ray_serve_num_http_requests_total`,
`ray_serve_http_request_latency_ms`) only expose `route_prefix` (e.g.,
`/api`) instead of actual route patterns (e.g., `/api/users/{user_id}`).
This prevents granular monitoring of individual endpoints without
causing high cardinality from unique request paths.

### Design
**Route Pattern Extraction & Propagation:**
- Replicas extract route patterns from ASGI apps (FastAPI/Starlette) at
initialization using `extract_route_patterns()`
- Patterns propagate: Replica → `ReplicaMetadata` → `DeploymentState` → `EndpointInfo` → Proxy
- Works with both normal patterns (routes in class) and factory patterns
(callable returns app)

**Proxy Route Matching:**
- `ProxyRouter.match_route_pattern()` matches incoming requests to
specific patterns using cached mock Starlette apps
- Metrics tag requests with parameterized routes (e.g.,
`/api/users/{user_id}`) instead of prefixes
- Fallback to `route_prefix` if patterns unavailable or matching fails
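
For illustration, a hypothetical deployment whose parameterized route would become the metric tag (not code from this PR):

```python
from fastapi import FastAPI
from ray import serve

app = FastAPI()


@serve.deployment
@serve.ingress(app)
class UserApi:
    @app.get("/api/users/{user_id}")
    async def get_user(self, user_id: int):
        # Metrics are tagged "/api/users/{user_id}" rather than the raw
        # request path, keeping cardinality bounded while staying per-endpoint.
        return {"user_id": user_id}
```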

**Performance:**

| Metric | Before | After |
| --- | --- | --- |
| Requests per second (RPS) | 403.39 | 397.82 |
| Mean latency (ms) | 247.9 | 251.37 |
| p50 (ms) | 224 | 223 |
| p90 (ms) | 415 | 428 |
| p99 (ms) | 526 | 544 |

### Testing
- Unit tests for `extract_route_patterns()`
- Integration test verifying metrics use patterns and avoid high
cardinality
- Parametrized for both normal and factory patterns

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…#58060)

This PR replaces STATS with Metric as the way to define metrics inside Ray (a unification effort) in all core worker components. For the most part, metrics are defined at the top-level component (core_worker_process.cc) and passed down as an interface to the sub-components.

**Details**
Full context of this refactoring work.
- Each component (e.g., gcs, raylet, core_worker, etc.) now has a
metrics.h file located in its top-level directory. This file defines all
metrics for that component.
- In most cases, metrics are defined once in the main entry point of
each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for
Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.).
These metrics are then passed down to subcomponents via the
ray::observability::MetricInterface.
- This approach significantly reduces rebuild time when metric
infrastructure changes. Previously, a change would trigger a full Ray
rebuild; now, only the top-level entry points of each component need
rebuilding.
- There are a few exceptions where metrics are tracked inside object
libraries (e.g., task_specification). In these cases, metrics are
defined within the library itself, since there is no corresponding
top-level entry point.
- Finally, the obsolete metric_defs.h and metric_defs.cc files can now
be completely removed. This paves the way for further dead code cleanup
in a future PR.
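
A language-agnostic sketch of the injection pattern, written in Python for brevity (the real interface is C++, ray::observability::MetricInterface; all names below are illustrative):

```python
from abc import ABC, abstractmethod


class MetricInterface(ABC):
    @abstractmethod
    def record(self, value: float) -> None: ...


class Counter(MetricInterface):
    def __init__(self, name: str) -> None:
        self.name = name
        self.total = 0.0

    def record(self, value: float) -> None:
        self.total += value


class SubComponent:
    # Depends only on the interface: tests can inject a fake metric, and
    # changes to the metric infrastructure don't force a rebuild of this code.
    def __init__(self, metric: MetricInterface) -> None:
        self._metric = metric

    def do_work(self) -> None:
        self._metric.record(1.0)
```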

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
)

## Description

We want to keep limit pushdown enabled by default, so we set `udf_modifying_row_count` to false by default.


Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Updating Tune release tests to run on Python 3.10.

Successful release test run:
https://buildkite.com/ray-project/release/builds/65655
(failing tests are already disabled)

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Remove the actor handle from the object that gets passed around in long poll communication.

Returning an actor handle in nested objects from a task makes the caller of that task a borrower from the reference-counting POV. This pattern, although allowed, is not very well tested, so we stop relying on it by passing actor_name from listen_for_change instead.

---------

Signed-off-by: abrar <abrar@anyscale.com>
## Description

The full name was probably hallucinated by an LLM.


Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…ross-node parallelism (ray-project#57261)

Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Nikhil G <nrghosh@users.noreply.github.com>
…imit` (ray-project#58303)

## Related issues

Fix comment: ray-project#58264 (comment)

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
…egy (ray-project#58306)

Signed-off-by: wei-chenglai <qazwsx0939059006@gmail.com>
…ist with nixl (ray-project#58263)

## Description
For NIXL, reuse previous metadata when transferring the same tensor list. This avoids repeated `register_memory` calls before `deregister_memory`.
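
A minimal sketch of the reuse, assuming a hypothetical stand-in for the NIXL registration call:

```python
_registration_cache: dict = {}


def get_or_register(tensors, register_fn):
    # The same tensor list maps to the same key, so repeated transfers
    # reuse the cached metadata instead of re-registering the memory.
    key = tuple(id(t) for t in tensors)
    if key not in _registration_cache:
        _registration_cache[key] = register_fn(tensors)
    return _registration_cache[key]
```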

---------

Signed-off-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Dhyey Shah <dhyey2019@gmail.com>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
…tuned_examples/`` in ``rllib`` (ray-project#56746)

<!-- Thank you for your contribution! Please review
https://github.com/ray-project/ray/blob/master/CONTRIBUTING.rst before
opening a pull request. -->

<!-- Please add a reviewer to the assignee section when you create a PR.
If you don't have the access to it, we will shortly find a reviewer and
assign them to your PR. -->

## Why are these changes needed?


Seventh split of ray-project#56416


## Checks

- [ ] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [ ] I've run `scripts/format.sh` to lint the changes in this PR.
- [ ] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [ ] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [ ] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [ ] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: Gagandeep Singh <gdp.1807@gmail.com>
Signed-off-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
Co-authored-by: Mark Towers <mark.m.towers@gmail.com>
…ct#57835)

## Description
Builds atop ray-project#58047. This PR ensures the following when `auth_mode` is `token`:
- Calling `ray.init()` (without passing an existing cluster address): check if a token is present; if not, generate one and store it in the default path.
- Calling `ray.init(address="xyz")` (connecting to an existing cluster): check if a token is present; if not, raise an exception.
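
A minimal sketch of this behavior, assuming a default token path (not Ray's actual implementation):

```python
import secrets
from pathlib import Path

TOKEN_PATH = Path.home() / ".ray" / "auth_token"  # assumed location


def ensure_token(connecting_to_existing_cluster: bool) -> str:
    if TOKEN_PATH.exists():
        return TOKEN_PATH.read_text().strip()
    if connecting_to_existing_cluster:
        # ray.init(address=...) with no token on disk: fail loudly.
        raise RuntimeError("auth_mode=token: no token found for existing cluster")
    # Fresh ray.init(): generate and persist a token at the default path.
    token = secrets.token_hex(32)
    TOKEN_PATH.parent.mkdir(parents=True, exist_ok=True)
    TOKEN_PATH.write_text(token)
    return token
```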

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
aslonnie and others added 22 commits November 14, 2025 19:08
not used anymore; all tests moved to Python 3.10

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
it is already imported, and the version being imported is not really
being used.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Description

In IMPALA, we access an attribute that does not exist anymore; it should be `self._minibatch_size`. While this check is nice, it's effectively untested code. This PR adds a small test that triggers the relevant code path.
Minor follow-ups from: ray-project#58539

Example error message:
```
Task failed because the node it was running on is dead or unavailable. Node IP: 127.0.0.1, node ID: e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646. This can happen if the node was preempted, had a hardware failure, or its raylet crashed unexpectedly. To see node death information, use `ray list nodes --filter node_id=e55b8ca03ebf3f7418f51533d8d55abeaab75fa9b29e2e6282b47646`, check the Ray dashboard cluster page, search the node ID in the GCS logs, or use `ray logs raylet.out -ip 127.0.0.1`.
```

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ators (ray-project#58555)

## Description
Push `Filter` past Join (depending on the join op), into Union branches, past projections (accounting for renames), and past all shuffle ops.
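
An illustrative example of the rule's effect on a user pipeline (conceptual; the pushdown itself happens inside the optimizer):

```python
import ray

ds = ray.data.range(100).union(ray.data.range(100))
even = ds.filter(lambda row: row["id"] % 2 == 0)
# With the rule, the filter is evaluated inside each union branch before
# the branches are merged, shrinking the data that downstream ops see.
print(even.count())
```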

---------

Signed-off-by: Goutam <goutam@anyscale.com>
## Description

Similar to ray-project#58599, we should have added an avg metric for generation length as well.


---------

Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
…ay-project#57241)

## Why are these changes needed?

The gcs_client behavior today is hard to overload/override. This is bad for a few reasons, one being that it can be difficult to instrument testing. There are examples today amongst accessors that need to grant 'friend'-level access to methods to different classes in order to accomplish some of the tests.

To resolve this, we're going to introduce some new abstractions. This is the first PR that sets up some of the framework. From here, we'll do the remainder of the accessors.

### New Classes

```
actor_info_accessor.cc <-- Concretion implementations 
actor_info_accessor.h  <-- Concretion declarations (contains private methods and members)
actor_info_accessor_interface.h <-- Interface function declarations
```

gcs_client and all other accessor users will now interact with an ActorInfoAccessorInterface object. Next, in order to make this injectable, we need to pull away the inline concretion instantiation going on in gcs_client.connect(). To do this, we are going to create a new factory interface called AccessorFactoryInterface, which is included in this PR.

There is one other class/abstraction we need to introduce. Ideally, in order to make this pluggable, we need separated build targets between interfaces, implementations, and gcs_client. Before, everything was bundled together. But since we're moving things out, we need to break the circular dependency between the different accessors and the gcs_client. The main reason the gcs_client passes itself into each accessor is that today these accessors share a subscriber and an rpc_client (and potentially more in the future). So we introduce a new interface to break this cycle: the GcsClientContext. The intention of this object is to be a 'grab bag' of objects needed by accessor implementations.


ray-project#54805

---------

Signed-off-by: zac <zac@anyscale.com>
… deleted (ray-project#58605)

The tests that exercised actor failures when they go out of scope, such
as `test_actor_ray_shutdown_called_on_del` and
`test_actor_ray_shutdown_called_on_scope_exit` [were
flaky](https://buildkite.com/ray-project/postmerge/builds/14336#019a7abe-73d3-46e0-8dc2-13351e12b7c3/613-1919).
This PR fixes the flakiness by ensuring actors use graceful shutdown
when GCS polling detects actor refs are deleted.

**Problem**
When actors go out of scope, GCS uses two mechanisms to detect reference
deletion:
1. Push model (`GcsActorManager::HandleReportActorOutOfScope`) - already
fixed in ray-project#57090
2. Pull model (`GcsActorManager::PollOwnerForActorRefDeleted`) - was
still using force kill

The pull model was calling DestroyActor(..., force_kill=true), which
skips `__ray_shutdown__` and immediately terminates the actor. This
created a race condition: whichever mechanism completed first determined
whether cleanup callbacks ran, causing test flakiness.

To fix the issue, this PR changes `PollOwnerForActorRefDeleted` to use graceful shutdown with a timeout (same as `HandleReportActorOutOfScope`). I ran all the actor failure tests that exercise this shutdown path 20 times locally; where they previously failed 3/20, they succeeded every time after the fix.

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
…ray-project#57982)

## Description
Currently, when running `ray start --block`, Ray prints the exit codes of subprocesses such as the raylet only to stdout. If users don't redirect or save the console output, this diagnostic information is lost.

## Related issues
Closes ray-project#57941

## Additional information
![ray_process_exit.log example](https://github.com/user-attachments/assets/9042e750-5cc2-4f2e-882d-ced114bdfe67)

- Added a persistent log file ray_process_exit.log under the node's logs directory.
- On unexpected subprocess termination, exit codes are now:
  - Printed to stdout as before.
  - Appended to ray_process_exit.log for later inspection.

---------

Signed-off-by: Dongjun Na <kmu5544616@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
## Description
Before this PR, the metrics would follow this path
1. `StreamingExecutor` collects metrics per operator
2. `_StatsManager` creates a thread to export metrics
3. `StreamingExecutor` sends metrics to `_StatsManager`, which performs
a copy and holds a `_stats_lock`.
4. Stats Thread reads the metrics sent from 2)
5. Stats Thread sleeps every 5-10 seconds before exporting metrics to
`_StatsActor`. These metrics can come in 2 forms: iteration and
execution metrics.

I believe the purpose of the stats thread created in 2) was two-fold:
- Don't export stats very frequently
- Don't export iteration and execution stats separately (have them sent in the same RPC call)

However, this creates a lot of complexity (handling idle threads,
etc...) and also makes it harder to perform histogram metrics, which
need to copy an entire list of values. See
ray-project#57851 for more details.

By removing the stats thread in 2), we can reduce the complexity of management and also avoid wasteful copying of metrics. The downside is that iteration and execution metrics are now sent separately, increasing the number of RPC calls. I don't think this is a concern, because the async updates to the `_StatsActor` were already happening previously, and we can also tweak the update interval.

~~It's important to note that `_stats_lock` still lives on to update the
last timestamps of each dataset. See * below for more details.~~

Now the new flow is:
1. `StreamingExecutor` collects metrics per operator
2. `StreamingExecutor` checks the last time `_StatsActor` was updated.
If more than a default 5 seconds has passed since last updated, we send
metrics to `_StatsActor` through the `_StatsManager`. Afterwards, we
update the last updated timestamp. See * below for caveat.

~~\*[important] Ray Data supports running multiple datasets
concurrently. Therefore, I must keep track of each dataset last updated
timestamp. `_stats_lock` is used to update that dictionary[dataset,
last_updated] safely on `register_dataset` and on `shutdown`. On update,
we don't require the lock because it does not update the dictionary's
size. If we want to remove the lock entirely, I can think of 2
workarounds.~~
1. ~~Create a per dataset `StatsManager`. Pros: no thread lock. Cons:
Much more code changes. The iteration metrics go through a separate code
path that is independent of the streaming executor, which will make this
more challenging.~~
2. ~~Update on every unix_epoch_timestamp % interval == 0, so that at
12:00, 12:05, etc.. the updates will be on that interval. Pros: easy to
implement and it's stateless. Cons: Breaks down for slower streaming
executors.~~
3. I removed the lock by keeping the state in these two areas:
- BatchIterator
- StreamingExecutor
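
A minimal sketch of the throttled export in step 2 above, assuming a 5-second interval (names are illustrative, not Ray Data's internals):

```python
import time

UPDATE_INTERVAL_S = 5.0


class ThrottledExporter:
    def __init__(self, export_fn, interval_s: float = UPDATE_INTERVAL_S) -> None:
        self._export_fn = export_fn  # e.g. an RPC to the stats actor
        self._interval_s = interval_s
        self._last_update = 0.0

    def maybe_export(self, metrics) -> None:
        # No lock needed: the timestamp state lives with its single caller.
        now = time.monotonic()
        if now - self._last_update >= self._interval_s:
            self._export_fn(metrics)
            self._last_update = now
```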

I also verified that ray-project#55163 still solves the original issue.

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
)

Signed-off-by: Seiji Eicher <seiji@anyscale.com>
…58601)

## Description
Now, on `write()`, the raw data is written to the underlying Parquet files and the metadata, namely `DataFiles`, is returned.

On `on_write_complete()` we commit the transaction. For upsert, the data has to be read back into memory, and we do that in a separate Ray task.


---------

Signed-off-by: Goutam <goutam@anyscale.com>
## Description
Adds Ray Data metrics documentation for visibility. This should be periodically updated with the latest metrics.

## Related issues
None

## Additional information
None

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
…ay-project#58466)

Running multimodal inference tests on Python 3.10.

Successful release test runs:
https://buildkite.com/ray-project/release/builds/66846#019a60d7-521e-414a-b1ab-0d58b7d8074e

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Deprecate the `raw_metrics` API and replace all its uses with the `raw_metric_timeseries` API. `raw_metrics` returns the current snapshot of a set of metrics, while `raw_metric_timeseries` returns the full time series. The latter is more reliable when checking the latest instance of several independent metrics.

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…eterministic (ray-project#58631)

## Description
N/A

## Related issues
Fixes ray-project#58560

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
so that it does not need / create project config files

we just need the self-contained binary

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
## Description

This PR supports returning a generator object from a map_groups UDF. If the UDF has a large output, we return an iterator to reduce memory cost.

## Related issues

Closes ray-project#57935

## Additional information

This change centers on the `_apply_udf_to_groups` helper function within
the file ray/data/grouped_data.py.

`map_groups` internally calls map_batches, providing a wrapper function
(wrapped_fn) that in turn calls `_apply_udf_to_groups` to apply the
user's UDF to each group.

The key modification is that instead of directly yielding the UDF's
return value, the logic now inspects the result first. If the result is
an Iterator, it is consumed with `yield from` to produce each data batch
individually. If it is not an iterator, the single data batch is yielded
directly, preserving the original behavior.
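
A minimal sketch of that dispatch (not the exact Ray Data helper):

```python
from collections.abc import Iterator


def apply_udf_to_group(udf, group):
    result = udf(group)
    if isinstance(result, Iterator):
        # Generator UDF: stream each batch to keep memory bounded.
        yield from result
    else:
        # Non-iterator UDF: yield the single batch, preserving old behavior.
        yield result
```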

---------

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
…URI columns (ray-project#58517)

The `_sample_sizes` method was using `as_completed()` to collect file
sizes, which returns results in completion order rather than submission
order. This scrambled the file sizes list so it no longer corresponded
to the input URI order.

When multiple URI columns are used, `_estimate_nrows_per_partition`
calls `zip(*sampled_file_sizes_by_column.values())` on line 284, which
assumes file sizes from different columns align by row index. The
scrambled ordering caused file sizes from different rows to be
incorrectly combined, producing incorrect partition size estimates.

## Changes

- Pre-allocate the `file_sizes` list with the correct size
- Use a `future_to_file_index` mapping to track the original submission
order
- Place results at their correct positions regardless of completion
order
- Add assertion to verify list length matches expected size
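
An illustrative sketch of the ordering fix using concurrent.futures (function names are hypothetical, not the exact Ray Data helpers):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def sample_sizes_in_order(uris, get_file_size):
    file_sizes = [None] * len(uris)  # pre-allocated to the expected size
    with ThreadPoolExecutor() as pool:
        future_to_index = {
            pool.submit(get_file_size, uri): i for i, uri in enumerate(uris)
        }
        for future in as_completed(future_to_index):
            # Place each result at its submission index, regardless of
            # completion order, so columns stay aligned by row.
            file_sizes[future_to_index[future]] = future.result()
    assert len(file_sizes) == len(uris)
    return file_sizes
```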

## Related issues

ray-project#58464 (comment)

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

## Why are these changes needed?

Adding a version arg to read_delta_lake to support reading from a specific version.

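Hypothetical usage of the new argument (the exact parameter name may differ; check the API reference):

```python
import ray

# Read the table as of a specific Delta Lake version.
ds = ray.data.read_delta_lake("s3://bucket/table", version=3)
```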

## Checks

- [x] I've signed off every commit (by using the `-s` flag, i.e., `git
commit -s`) in this PR.
- [x] I've run `scripts/format.sh` to lint the changes in this PR.
- [x] I've included any doc changes needed for
https://docs.ray.io/en/master/.
- [x] I've added any new APIs to the API Reference. For example, if I
added a
method in Tune, I've added it in `doc/source/tune/api/` under the
corresponding `.rst` file.
- [x] I've made sure the tests are passing. Note that there might be a
few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
   - [x] Unit tests
   - [ ] Release tests
   - [ ] This PR is not tested :(

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
it uses `enum.Enum` values that are not deepcopy-able

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

@sourcery-ai sourcery-ai bot left a comment


The pull request #678 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5397.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is an automated daily merge from the master branch into the main branch, incorporating a wide array of updates to the project's build, test, and deployment infrastructure. Key changes include a significant overhaul of CI/CD pipeline definitions, enhancements to Bazel build configurations, and a shift in Python environment management tools. These updates aim to streamline development workflows, improve build reproducibility, and ensure robust code quality across the project.

Highlights

  • CI/CD Configuration Updates: Numerous updates to .buildkite YAML files, including the introduction of new files like _images.rayci.yml, dependencies.rayci.yml, and doc.rayci.yml, to reorganize and enhance the build and test pipelines.
  • Bazel Build System Enhancements: Modifications to .bazelrc to enable strict action environments, add platform-specific compiler options, and improve warning suppression for third-party code. The Bazel BUILD files have also been refactored to use new pkg_files and pkg_zip rules for artifact packaging.
  • Python Environment Management: Transition from miniconda to miniforge3 in various build scripts and Dockerfiles, along with the introduction of uv for Python dependency management in the CI environment.
  • Code Ownership and Linting: Consolidation and refinement of CODEOWNERS entries for better module-specific ownership. Pre-commit hooks have been updated to include semgrep, vale, cython-lint, and eslint for improved code quality checks.
  • Refactored C++ API and Runtime: Changes in the C++ API to use lang_type_ for remote function holders and updates in the runtime to use ray::GetNodeIpAddressFromPerspective() and ray::BuildAddress() for network addressing. The C++ build process now generates a ray_cpp_pkg.zip artifact.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request is an automated daily merge from master to main. It contains a large number of changes, primarily focused on refactoring the CI/CD pipelines and Bazel build system. Key changes include moving to uv for dependency management, modularizing Bazel build files, refactoring Docker image build processes, and improving the test selection logic. I've identified a potential issue with error handling in one of the C++ files.

Comment on lines +47 to +50

```cpp
memory_store_->Put(
    ::ray::RayObject(buffer, nullptr, std::vector<rpc::ObjectReference>()),
    object_id,
    /*has_reference=*/false);
```


Severity: high

The status check for memory_store_->Put has been removed. While the method signature might have changed to void and now throws exceptions on error, the GetRaw method in this same file still performs a status check on the returned Status object from memory_store_->Get. This inconsistency suggests that error handling might have been unintentionally removed here. Please verify if memory_store_->Put can fail and if so, how errors are propagated. If it can fail without throwing, the status check should be restored.

