Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-26
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

aslonnie and others added 30 commits November 10, 2025 14:00
be consistent with doc build environment

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
migrating all doc related things to run on python 3.12

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
excluding `*_tests` directories for now to reduce the impact

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
using `bazelisk run //java:gen_ray_java_pkg` everywhere

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics keep track of the count and size of objects
owned by the worker, as well as their states. States are
defined as:

- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)

The approach used by these new metrics is to examine the state 'before
and after' any mutation on the reference in the reference_counter. This
is required in order to do the appropriate bookkeeping (decrementing
some values and incrementing others). Admittedly, there is potential for
momentarily inconsistent counts in between the decrements/increments,
depending on when the RecordMetrics loop runs. This unfortunate side
effect, however, seems preferable to doing mutual exclusion with metric
collection, as this is potentially a high-throughput code path.

In addition, performing live counts seemed preferable to doing full
accounting of the object store and across all references at the time of
metric collection. The reason is that the reference counter may be
tracking millions of objects, so each full metric scan could be very
expensive. Running the live accounting (despite being potentially
inaccurate for short periods) seemed the right call.

This PR also allows an object's size to change due to potentially
non-deterministic instantiation (say an object is initially created,
but its primary copy dies, and then the recreation fails). This is an
edge case, but it seems important for completeness' sake.
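
As an illustration of the before/after bookkeeping pattern described above, here is a minimal Python sketch. The real implementation lives in the C++ reference counter; the class, method, and state names below are illustrative only.

```python
from collections import Counter

# Illustrative state names mirroring the list above.
STATES = ("PendingCreation", "InPlasma", "Spilled", "InMemory")

class OwnedObjectMetrics:
    """Tracks the count and total size of owned objects per state."""

    def __init__(self):
        self.count = Counter()
        self.bytes = Counter()

    def on_reference_mutation(self, before, after, before_size, after_size):
        # Decrement the old bucket and increment the new one, so the periodic
        # RecordMetrics loop only needs to read the current counters.
        if before is not None:
            self.count[before] -= 1
            self.bytes[before] -= before_size
        if after is not None:
            self.count[after] += 1
            self.bytes[after] += after_size

# Example: an object finishes creation and lands in plasma.
metrics = OwnedObjectMetrics()
metrics.on_reference_mutation(None, "PendingCreation", 0, 0)
metrics.on_reference_mutation("PendingCreation", "InPlasma", 0, 1024)
print(metrics.count["InPlasma"], metrics.bytes["InPlasma"])  # 1 1024
```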

---------

Signed-off-by: zac <zac@anyscale.com>
to 0.21.0; supports wanda priority now.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…#58286)

## Description
Predicate pushdown (ray-project#58150) in
conjunction with this PR should speed up reads from Iceberg.


Once the above change lands, we can add the pushdown interface support
for IcebergDatasource

---------

Signed-off-by: Goutam <goutam@anyscale.com>
## Description
* Does the work to bump pydoclint up to the latest version
* And allowlist any new violations it finds

## Related issues
n/a

## Additional information
n/a

---------

Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>
fix pattern_async_actor demo typo. Add `self.`.

---------

Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com>
…hboard agent (ray-project#58405)

Add a gRPC service interceptor to intercept all dashboard agent RPC
calls and validate the presence of the auth token (when auth mode is
token).
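
For reference, here is a minimal sketch of a token-validating server interceptor using the synchronous Python gRPC API. Ray's dashboard agent uses the asyncio server and its own header name, so the header key and class name below are assumptions for illustration only.

```python
import grpc

class TokenAuthInterceptor(grpc.ServerInterceptor):
    """Rejects unary RPCs that do not carry the expected auth token."""

    def __init__(self, expected_token: str):
        self._expected = expected_token

        def deny(request, context):
            context.abort(grpc.StatusCode.UNAUTHENTICATED, "missing or invalid auth token")

        # Only handles unary-unary methods; enough for a sketch.
        self._deny_handler = grpc.unary_unary_rpc_method_handler(deny)

    def intercept_service(self, continuation, handler_call_details):
        metadata = dict(handler_call_details.invocation_metadata)
        if metadata.get("authorization") == self._expected:
            return continuation(handler_call_details)
        return self._deny_handler

# Usage (sketch): grpc.server(executor, interceptors=[TokenAuthInterceptor(token)])
```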

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…tests (ray-project#58528)

The auth token test setup in `conftest.py` is breaking the macOS tests.
There are two test scripts (`test_microbenchmarks.py` and
`test_basic.py`) that run after the wheel is installed but without
editable mode. For these tests to pass, `conftest.py` cannot import
anything under `ray.tests`.

This PR moves `authentication_test_utils` into `ray._private` to fix
this issue.

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
This PR enables OpenTelemetry as the default backend for the Ray metrics
stack. The bulk of this PR is actually fixing tests that were written
with assumptions that no longer hold true. For ease of reviewing, I
inline the reason for each test change together with the change itself
in the comments.

This PR also depends on a release of vLLM (so that we can update the
minimum supported version of vLLM in Ray).

Test:
- CI


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Enable OpenTelemetry metrics backend by default and refactor
metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
metric names.
> 
> - **Core/Config**:
> - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
`true` in `ray_constants.py` and `ray_config_def.h`.
> - Metrics `Counter`: use `CythonCount` by default; keep legacy
`CythonSum` only when OTEL is explicitly disabled.
> - **Serve/Metrics Tests**:
> - Replace text scraping with `PrometheusTimeseries` and
`fetch_prometheus_metric_timeseries` throughout.
> - Update metric names/tags to `ray_serve_*` and counter suffixes
`*_total`; adjust latency metric names and processing/queued gauges.
> - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
pass through helpers.
> - **General Test Fixes**:
> - Remove OTEL parametrization/fixtures; simplify expectations where
counters-as-gauges no longer apply; drop related tests.
> - Cardinality tests: include `"low"` level and remove OTEL gating;
stop injecting `enable_open_telemetry` in system config.
> - Actor/state/thread tests: migrate to cluster fixtures, wait for
dashboard agent, and adjust expected worker thread counts.
> - **Build**:
> - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
from C++ stats test.
> 
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…mmended (ray-project#57726)


## Description

If users schedule a detached actor into a placement group, Raylet will
kill the actor when the placement group is removed. The actor will be
stuck in the `RESTARTING` state forever if it's restartable until users
explicitly kill it.

In that case, if users try to `get_actor` with the actor's name, it can
still return the restarting actor, but no process exists. It will no
longer be restarted because the PG is gone, and no PG with the same ID
will be created during the cluster's lifetime.

The better behavior would be for Ray to transition a task/actor's state
to dead when it is impossible to restart. However, this would add too
much complexity to the core, so I think it's not worth it. Therefore,
this PR adds a warning log, and users should use detached actors or PGs
correctly.

Example: Run the following script and run `ray list actors`.

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.placement_group import placement_group, remove_placement_group

@ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
class Actor:
  pass

ray.init()

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()

ray.get(actor.__ray_ready__.remote())
```

## Related issues


## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)

## Additional context


---------

Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…y-project#57715)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>
The python test step is failing on master now because of this. Probably
a logical merge conflict.
```
FAILED: //python/ray/tests:test_grpc_authentication_server_interceptor (Summary)
...

[2025-11-11T22:11:54Z]     from ray.tests.authentication_test_utils import (
[2025-11-11T22:11:54Z] ModuleNotFoundError: No module named 'ray.tests.authentication_test_utils'
```

Signed-off-by: dayshah <dhyey2019@gmail.com>
be consistent with the default build environment

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#58543)

Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
## Description
- rename RAY_auth_mode β†’ RAY_AUTH_MODE environment variable across
codebase
- Excluded healthcheck endpoints from authentication for Kubernetes
compatibility
- Fixed dashboard cookie handling to respect auth mode and clear stale
tokens when switching clusters

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ls (ray-project#58424)

## Description
- Use a client interceptor for adding auth tokens to gRPC calls when
`AUTH_MODE=token` (see the sketch after this list)
- BuildChannel() will automatically include the interceptor
- Removed the `auth_token` parameter from `ClientCallImpl`
- Removed manual auth from `python_gcs_subscriber.cc`
- Added tests to verify auth works for autoscaler APIs
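
The PR itself is implemented in C++; as an illustration of the client-interceptor pattern, here is a minimal sketch in Python gRPC that attaches a token header to every outgoing unary call. The header name and class names are assumptions for illustration only.

```python
import collections
import grpc

class _CallDetails(
    collections.namedtuple("_CallDetails", ("method", "timeout", "metadata", "credentials")),
    grpc.ClientCallDetails,
):
    pass

class TokenAuthClientInterceptor(grpc.UnaryUnaryClientInterceptor):
    """Adds an auth token to the metadata of every unary-unary call."""

    def __init__(self, token: str):
        self._token = token

    def intercept_unary_unary(self, continuation, client_call_details, request):
        metadata = list(client_call_details.metadata or [])
        metadata.append(("authorization", self._token))  # illustrative header name
        details = _CallDetails(
            client_call_details.method,
            client_call_details.timeout,
            metadata,
            client_call_details.credentials,
        )
        return continuation(details, request)

# Usage (sketch):
# channel = grpc.intercept_channel(
#     grpc.insecure_channel("localhost:50051"), TokenAuthClientInterceptor(token)
# )
```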

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…`) (ray-project#57090)

When actors terminate gracefully, Ray calls the actor's
`__ray_shutdown__()` method if defined, allowing for cleanup of
resources. However, this is not invoked when the actor goes out of scope
due to `del actor`.

### Why `del actor` doesn't invoke `__ray_shutdown__`

I traced through the entire code path, and here's what happens:

Flow when `del actor` is called:

1. **Python side**: `ActorHandle.__del__()` ->
`worker.core_worker.remove_actor_handle_reference(actor_id)`

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/python/ray/actor.py#L2040

2. **C++ ref counting**: `CoreWorker::RemoveActorHandleReference()` ->
`reference_counter_->RemoveLocalReference()`
- When ref count reaches 0, triggers `OnObjectOutOfScopeOrFreed`
callback

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L2503-L2506

3. **Actor manager callback**: `MarkActorKilledOrOutOfScope()` ->
`AsyncReportActorOutOfScope()` to GCS

https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/actor_manager.cc#L180-L183
https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/task_submission/actor_task_submitter.cc#L44-L51

4. **GCS receives notification**: `HandleReportActorOutOfScope()` 
- **THE PROBLEM IS HERE** ([line 279 in
`src/ray/gcs/gcs_actor_manager.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/gcs/gcs_actor_manager.cc#L279)):
   ```cpp
   DestroyActor(actor_id,
                GenActorOutOfScopeCause(actor),
                /*force_kill=*/true,  // <-- HARDCODED TO TRUE!
                [reply, send_reply_callback]() {
   ```

5. **Actor worker receives kill signal**: `HandleKillActor()` in
[`src/ray/core_worker/core_worker.cc`](https://github.com/ray-project/ray/blob/3b1de771d5bb0e5289c4f13e9819bc3e8a0ad99e/src/ray/core_worker/core_worker.cc#L3970)
   ```cpp
   if (request.force_kill()) {  // This is TRUE for OUT_OF_SCOPE
       ForceExit(...)  // Skips __ray_shutdown__
   } else {
       Exit(...)  // Would call __ray_shutdown__
   }
   ```

6. **ForceExit path**: Bypasses graceful shutdown -> No
`__ray_shutdown__` callback invoked.

This PR simply changes the GCS to use graceful shutdown for OUT_OF_SCOPE
actors. It also updates the docs.
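
For reference, a minimal usage sketch of the behavior this enables: an actor that defines `__ray_shutdown__()` now gets its cleanup hook invoked even when the handle is simply deleted. The resource being cleaned up here is illustrative.

```python
import ray
import tempfile

@ray.remote
class LogWriter:
    def __init__(self):
        # Illustrative resource that needs cleanup on shutdown.
        self._file = tempfile.NamedTemporaryFile(mode="w", delete=False)

    def log(self, msg: str) -> None:
        self._file.write(msg + "\n")

    def __ray_shutdown__(self):
        # Graceful-shutdown hook. With this change it also runs when the
        # handle goes out of scope, since the GCS no longer force-kills
        # OUT_OF_SCOPE actors.
        self._file.close()

ray.init()
actor = LogWriter.remote()
ray.get(actor.log.remote("hello"))
del actor  # graceful exit now invokes __ray_shutdown__
```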

---------

Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
Co-authored-by: Ibrahim Rabbani <israbbani@gmail.com>
Currently, a node is considered idle while pulling objects from the
remote object store. This can lead to situations where a node is
terminated as idle, causing the cluster to enter an infinite loop when
pulling large objects that exceed the node idle termination timeout.

This PR fixes the issue by treating object pulling as a busy activity.
Note that nodes can still accept additional tasks while pulling objects
(since pulling consumes no resources), but the auto-scaler will no
longer terminate the node prematurely.

Closes ray-project#54372

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…_FACTOR` to 2 (ray-project#58262)


## Description

This was setting the value to be aligned with the previous default of 4.

However, after some consideration, I've realized that 4 is too high, so
this lowers it to 2.


Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…y-project#58523)

## Description

This PR improves documentation consistency in the `python/ray/data`
module by converting all remaining rST-style docstrings (`:param:`,
`:return:`, etc.) to Google-style format (`Args:`, `Returns:`, etc.).
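
For example, the conversion looks like this (the function and parameter names below are illustrative, not copied from the files listed below):

```python
# Before: rST-style docstring
def hash_column(values, seed):
    """Hash the column values.

    :param values: The column values to hash.
    :param seed: Seed for the hash function.
    :return: A list of hash values.
    """

# After: Google-style docstring
def hash_column(values, seed):
    """Hash the column values.

    Args:
        values: The column values to hash.
        seed: Seed for the hash function.

    Returns:
        A list of hash values.
    """
```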

## Additional information

**Files modified:**
- `python/ray/data/preprocessors/utils.py` - Converted
`StatComputationPlan.add_callable_stat()`
- `python/ray/data/preprocessors/encoder.py` - Converted
`unique_post_fn()`
- `python/ray/data/block.py` - Converted `BlockColumnAccessor.hash()`
and `BlockColumnAccessor.is_composed_of_lists()`
- `python/ray/data/_internal/datasource/delta_sharing_datasource.py` -
Converted `DeltaSharingDatasource.setup_delta_sharing_connections()`

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…oject#58549)

## Description

The original `test_concurrency` function combined multiple test
scenarios into a single test with complex control flow and expensive Ray
cluster initialization. This refactoring extracts the parameter
validation tests into focused, independent tests that are faster,
clearer, and easier to maintain.

Additionally, the original test included "validation" cases that tested
valid concurrency parameters but didn't actually verify that concurrency
was being limited correctly; they only checked that the output was
correct, which isn't useful for validating the concurrency feature
itself.

**Key improvements:**
- Split validation tests into `test_invalid_func_concurrency_raises` and
`test_invalid_class_concurrency_raises`
- Use parametrized tests for different invalid concurrency values
- Switch from `shutdown_only` with explicit `ray.init()` to
`ray_start_regular_shared` to eliminate cluster initialization overhead
- Minimize test data from 10 blocks to 1 element since we're only
validating parameter errors
- Remove non-validation tests that didn't verify concurrency behavior

## Related issues

N/A

## Additional information

The validation tests now execute significantly faster and provide
clearer failure messages. Each test has a single, well-defined purpose
making maintenance and debugging easier.
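
As a rough sketch of the resulting test shape (the specific invalid values and the exact exception type are assumptions, not copied from the PR):

```python
import pytest
import ray

@pytest.mark.parametrize("concurrency", [0, -1])
def test_invalid_func_concurrency_raises(ray_start_regular_shared, concurrency):
    # A single element is enough: we only care that parameter validation
    # rejects the value, not that any real work happens.
    ds = ray.data.range(1)
    with pytest.raises(ValueError):
        ds.map(lambda row: row, concurrency=concurrency).materialize()
```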

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
Previously it was actually using 0.4.0, which is set up by the gRPC
repo. The declaration in the workspace file was being shadowed.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
srinathk10 and others added 22 commits November 24, 2025 15:57
…roject#58864)


## Description

### [Data] Fix obj_store_mem_max_pending_output_per_task reporting

Fix `obj_store_mem_max_pending_output_per_task` so that, when no sample
is available, it falls back to

- `bytes_per_output` = `MAX_SAFE_BLOCK_SIZE_FACTOR` *
`target_max_block_size`.
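
A minimal sketch of this fallback logic; the constant value and parameter names are assumptions for illustration, not the actual Ray Data defaults:

```python
MAX_SAFE_BLOCK_SIZE_FACTOR = 2  # assumed value for illustration
TARGET_MAX_BLOCK_SIZE = 128 * 1024 * 1024  # assumed 128 MiB target block size

def max_pending_output_bytes_per_task(sampled_bytes_per_output=None):
    # When no output sample is available yet, fall back to a conservative
    # upper bound instead of reporting a misleading value.
    if sampled_bytes_per_output is None:
        return MAX_SAFE_BLOCK_SIZE_FACTOR * TARGET_MAX_BLOCK_SIZE
    return sampled_bytes_per_output
```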


---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…te matching (ray-project#58927)

The correct route value is already part of RequestMetadata after
ray-project#58180, no need to recompute it
again.

No observed perf diff in the microbenchmark.

After
```
Type	Name	# Requests	# Fails	Median (ms)	95%ile (ms)	99%ile (ms)	Average (ms)	Min (ms)	Max (ms)	Average size (bytes)	Current RPS	Current Failures/s
GET	/echo?message=hello	28068	0	200	410	470	228.27	80	592	26	430.3	0
Aggregated	28068	0	200	410	470	228.27	80	592	26	430.3	0
```

Before
```
Type	Name	# Requests	# Fails	Median (ms)	95%ile (ms)	99%ile (ms)	Average (ms)	Min (ms)	Max (ms)	Average size (bytes)	Current RPS	Current Failures/s
GET	/echo?message=hello	27427	0	210	410	470	232.12	76	604	26	429.7	0
Aggregated	27427	0	210	410	470	232.12	76	604	26	429.7	0
```

Additionally, the old implementation wrongly assumed that there would
only be one method (GET, PUT) corresponding to a route. This PR fixes
that assumption and adds tests for it (see the sketch below).
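
A minimal illustration of why the lookup must map a route to a set of methods rather than a single method (names are illustrative):

```python
# One route can serve multiple HTTP methods.
route_methods = {
    "/echo": {"GET", "PUT"},
}

def is_registered(route: str, method: str) -> bool:
    return method in route_methods.get(route, set())

assert is_registered("/echo", "GET")
assert is_registered("/echo", "PUT")
```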

---------

Signed-off-by: abrar <abrar@anyscale.com>
Signed-off-by: irabbani <irabbani@anyscale.com>

## Description

### [Data] Add iter_prefetched_blocks stats

Report prefetched bytes per iterator as stats.



---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <68668616+srinathk10@users.noreply.github.com>
…58299)

This PR replaces STATS with Metric as a way to define metrics inside Ray
(as a unification effort) in all common components. Normally, metrics
are defined at the top-level component and passed down to
sub-components. However, in this case, because the common components are
used as APIs across the codebase, doing so would feel unnecessarily
cumbersome, so I decided to define the metrics inline within each client
and server class instead.

Note that the metric classes (Metric, Gauge, Sum, etc.) are simply
wrappers around static OpenCensus/OpenTelemetry entities.

**Details**
Full context of this refactoring work.
- Each component (e.g., gcs, raylet, core_worker, etc.) now has a
metrics.h file located in its top-level directory. This file defines all
metrics for that component.
- In most cases, metrics are defined once in the main entry point of
each component (gcs/gcs_server_main.cc for GCS, raylet/main.cc for
Raylet, core_worker/core_worker_process.cc for the Core Worker, etc.).
These metrics are then passed down to subcomponents via the
ray::observability::MetricInterface.
- This approach significantly reduces rebuild time when metric
infrastructure changes. Previously, a change would trigger a full Ray
rebuild; now, only the top-level entry points of each component need
rebuilding.
- There are a few exceptions where metrics are tracked inside object
libraries (e.g., task_specification). In these cases, metrics are
defined within the library itself, since there is no corresponding
top-level entry point.

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
ray-project#58710)


## Description

ray-project#58711 decreased the scale of the
`map_groups` tests from scale-factor 100 to scale-factor 10 because some
of the `map_groups` release tests were failing. However, after more
investigation, I realized that the only variant that doesn't work with
scale-factor 100 is the hash shuffle with autoscaling variant (see
ray-project#58734).

This PR re-increases the scale and only disables the cases that fail.


---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Summary
This PR removes `test_large_args_scheduling_strategy` from
`test_stats.py` because it's flaky and not worth keeping (it tests
implementation details rather than behavior and conflates multiple
concerns).

See
https://buildkite.com/ray-project/premerge/builds/54495#019ab720-249f-49c5-8e25-5e9005cc41e2

## Motivation

1. **Hardcodes scheduling strategy values** - The test assumes large
args use `'DEFAULT'` and small args use `'SPREAD'`. If these defaults
change in `context.py`, the test fails even though the system is working
correctly.

2. **Tests stats format, not scheduling behavior** - The test doesn't
verify that the correct scheduling strategy is actually passed to Ray
tasks. It only checks that a specific string appears in stats output.

3. **Mixes two concerns** - The test conflates:
- Scheduling strategy selection based on data size (belongs in a
map-related test)
- Stats output including scheduling strategy info (belongs in a general
stats formatting test)

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Why are these changes needed?

We introduced an improved error message when environments fail in
ray-project#55567.
At the same time, this bypasses the silencing of env step errors.
This PR consolidates the messages.

---------

Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
…-project#58915)

# Description
This PR refactors the `PhysicalOperator` class to eliminate hidden side
effects in the `completed()` method. Previously, calling `completed()`
could inadvertently modify the internal state of the operator, which
could lead to unexpected behavior. This change separates the logic for
checking if the operator is marked as finished from the logic that
computes whether it is actually finished.

Key changes include:
- Renaming `_execution_finished` to `_is_execution_marked_finished` to
clarify its purpose.
- Renaming `execution_finished()` to `has_execution_finished()` and
making it a pure computed property without side effects.
- Updating the `completed()` method to call `has_execution_finished()`
instead of modifying internal state.
- Ensuring that `mark_execution_finished()` correctly sets the renamed
field.


## Related issues
Fixes ray-project#58884

## Additional information
This refactor ensures that both `has_execution_finished()` and
`completed()` are pure query methods, allowing them to be called
multiple times without altering the state of the operator.
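
A rough sketch of the resulting shape; the method names follow the PR description, while the internal fields are assumptions for illustration:

```python
class OperatorCompletionSketch:
    def __init__(self):
        self._is_execution_marked_finished = False
        self._inputs_complete = False
        self._internal_queues_empty = True

    def mark_execution_finished(self) -> None:
        # The only method that mutates the completion flag.
        self._is_execution_marked_finished = True

    def has_execution_finished(self) -> bool:
        # Pure query: computes completion without mutating any state.
        return self._is_execution_marked_finished or (
            self._inputs_complete and self._internal_queues_empty
        )

    def completed(self) -> bool:
        # Also a pure query now: safe to call repeatedly.
        return self.has_execution_finished() and self._internal_queues_empty
```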

---------

Signed-off-by: Simeet Nayan <simeetnayan.8100@gmail.com>
Co-authored-by: Balaji Veeramani <balaji@anyscale.com>
## Description

The links for APPO were referencing the PPO paper. I updated them to
link to the IMPACT paper.

Signed-off-by: Philipp Schmutz <2059887+pschmutz@users.noreply.github.com>
… completed episodes when sampling a fixed number of episodes (ray-project#58931)

## Description
The `MultiAgentEnvRunner` would previously call the callback twice for
the final episode of a batch (when sampling a fixed number of episodes).
This PR fixes this problem by ensuring that the callback only happens
once per finished episode.

## Related issues
Closes ray-project#55452

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
## Description
When the Autoscaler receives a resource request and decides which type
of node to scale up, only the `UtilizationScore` is considered (that
is, Ray tries to avoid launching a large node for a small resource
request, which would lead to resource waste). If multiple node types in
the cluster have the same `UtilizationScore`, Ray always requests the
same node type.

In Spot scenarios, cloud resources are dynamically changing. Therefore,
we want the Autoscaler to be aware of cloud resource availability: if a
certain node type becomes unavailable, the Autoscaler should be able to
automatically switch to requesting other node types.

In this PR, I added the `CloudResourceMonitor` class, which records node
types that have failed resource allocation, and in future scaling
events, reduces the weight of these node types.

## Related issues
Related to ray-project#49983 
Fixes ray-project#53636  ray-project#39788 ray-project#39789 


## Implementation details
1. `CloudResourceMonitor`
This is a subscriber of Instances. When an Instance gets the status
`ALLOCATION_FAILED`, `CloudResourceMonitor` records the node_type and
lowers its availability score.
2. `ResourceDemandScheduler`
This class determines how to select the best node_type to handle a
resource request. I modified the part that selects the best node type:
```python
# Sort the results by score.
results = sorted(
    results,
    key=lambda r: (
        r.score,
        cloud_resource_availabilities.get(r.node.node_type, 1),
    ),
    reverse=True
)
```
The sorting includes:
2.1. UtilizationScore: to maximize resource utilization.
2.2. Cloud resource availabilities: prioritize node types with the most
available cloud resources, in order to minimize allocation failures.

---------

Signed-off-by: xiaowen.wxw <wxw403883@alibaba-inc.com>
Co-authored-by: 葌筠 <wxw403883@alibaba-inc.com>
This is for the KubeRay 1.5.1 release, for Ray auth token mode.

Docs link:
https://anyscale-ray--58885.com.readthedocs.build/en/58885/cluster/getting-started.html

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
…t events (ray-project#58953)

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
## Why are these changes needed?

The memory leak being tested
([apache/arrow#45493](apache/arrow#45493))
specifically occurs when inferring types from **ndarray objects**, not
from lists containing ndarrays. Testing the `list` case added no value
since the leak doesn't manifest there; it only added execution time and
obscured the test's purpose.

More importantly, the previous 1 MiB threshold was too tight and caused
flaky failures. Memory measurements via RSS are inherently noisy due to
OS-level allocation behavior, garbage collection timing, and memory
fragmentation. A test that occasionally uses 1.1 MiB would fail despite
no actual leak.

The new approach:
- **Calls `_infer_pyarrow_type` 8 times in a loop**, which leaks 1 GiB
without Ray Data's workaround (admittedly, 8 is a magic number here)
- **Uses a 64 MiB threshold**, providing a much larger margin above
normal variation while still catching any real leak with a clear signal

This creates a much stronger test: if the leak exists, we'd see memory
growth approaching 1 GiB (with repeated runs), making failures
unambiguous. Meanwhile, normal RSS fluctuations of a few MiB won't
trigger false positives.
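
A rough sketch of this kind of RSS-threshold check (`infer_type_fn` stands in for the internal `_infer_pyarrow_type` helper, whose import path isn't shown here; the array shape is illustrative):

```python
import numpy as np
import psutil

def rss_mib() -> float:
    return psutil.Process().memory_info().rss / (1024 ** 2)

def assert_no_type_inference_leak(infer_type_fn, iterations: int = 8, threshold_mib: float = 64.0):
    # Inferring from an ndarray (not a list of ndarrays) is what triggered the leak.
    data = np.zeros((1024, 1024), dtype=np.float64)
    baseline = rss_mib()
    for _ in range(iterations):
        infer_type_fn(data)
    growth = rss_mib() - baseline
    # 64 MiB leaves plenty of headroom for RSS noise while still catching a
    # leak that would otherwise grow toward ~1 GiB over repeated calls.
    assert growth < threshold_mib, f"possible leak: RSS grew by {growth:.1f} MiB"
```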

---------

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
## Description

Based on the comment here:
ray-project#58630 (comment)

Currently, the `IssueDetector` base class requires all its subclasses to
take the `StreamingExecutor` as an argument, making the classes hard to
mock and test because we have to mock all of `StreamingExecutor`.

In this PR, we did the following (see the sketch after this list):
1. Remove the constructor in the `IssueDetector` base class and add
`from_executor()` to set up the class based on the executor
2. Refactor subclasses of `IssueDetector` to use this format
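
A rough sketch of the resulting shape; the executor attribute accessed in `from_executor()` is an assumption for illustration:

```python
class IssueDetectorSketch:
    def __init__(self, data_context):
        # Subclasses take only the narrow dependencies they need, so tests
        # can construct them directly with a small fake instead of mocking
        # the whole StreamingExecutor.
        self._data_context = data_context

    @classmethod
    def from_executor(cls, executor) -> "IssueDetectorSketch":
        # Factory that pulls the needed pieces out of the executor.
        return cls(data_context=executor._data_context)

    def detect(self):
        raise NotImplementedError
```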

## Related issues

Related to ray-project#58562


---------

Signed-off-by: machichima <nary12321@gmail.com>
## Description
`asv.conf.json` appears to be a legacy file in `python` and `rllib` used
for benchmarking that hasn't been modified in 5 years. Core already has a
nightly benchmark and RLlib is moving to adding one; therefore, this file
shouldn't be necessary anymore.

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
## Description

`test_backpressure_e2e` occasionally fails without any traceback or
warning message:
```
[2025-11-24T21:42:12Z] ==================== Test output for //python/ray/data:test_backpressure_e2e:
--
[2025-11-24T21:42:12Z] /opt/miniforge/lib/python3.12/site-packages/paramiko/pkey.py:82: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
[2025-11-24T21:42:12Z]   "cipher": algorithms.TripleDES,
[2025-11-24T21:42:12Z] /opt/miniforge/lib/python3.12/site-packages/paramiko/transport.py:253: CryptographyDeprecationWarning: TripleDES has been moved to cryptography.hazmat.decrepit.ciphers.algorithms.TripleDES and will be removed from cryptography.hazmat.primitives.ciphers.algorithms in 48.0.0.
[2025-11-24T21:42:12Z]   "class": algorithms.TripleDES,
[2025-11-24T21:42:12Z] ============================= test session starts ==============================
[2025-11-24T21:42:12Z] platform linux -- Python 3.12.9, pytest-7.4.4, pluggy-1.3.0 -- /opt/miniforge/bin/python3
[2025-11-24T21:42:12Z] cachedir: .pytest_cache
[2025-11-24T21:42:12Z] rootdir: /root/.cache/bazel/_bazel_root/1df605deb6d24fc8068f6e25793ec703/execroot/io_ray
[2025-11-24T21:42:12Z] configfile: pytest.ini
[2025-11-24T21:42:12Z] plugins: repeat-0.9.3, anyio-3.7.1, fugue-0.8.7, aiohttp-1.1.0, asyncio-0.17.2, docker-tools-3.1.3, forked-1.4.0, pytest_httpserver-1.1.3, lazy-fixtures-1.1.2, mock-3.14.0, remotedata-0.3.2, rerunfailures-11.1.2, sphinx-0.5.1.dev0, sugar-0.9.5, timeout-2.1.0, typeguard-2.13.3
[2025-11-24T21:42:12Z] asyncio: mode=Mode.AUTO
[2025-11-24T21:42:12Z] timeout: 180.0s
[2025-11-24T21:42:12Z] timeout method: signal
[2025-11-24T21:42:12Z] timeout func_only: False
[2025-11-24T21:42:12Z] collecting ... collected 12 items
[2025-11-24T21:42:12Z]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_large_e2e_backpressure_no_spilling PASSED [  8%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-3-500] PASSED [ 16%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-4-100] PASSED [ 25%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[False-3-100] PASSED [ 33%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-3-500] PASSED [ 41%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-4-100] PASSED [ 50%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_small_cluster_resources[True-3-100] PASSED [ 58%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_resource_contention[False] PASSED [ 66%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_on_resource_contention[True] PASSED [ 75%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_no_deadlock_with_preserve_order PASSED [ 83%]
[2025-11-24T21:42:12Z] python/ray/data/tests/test_backpressure_e2e.py::test_input_backpressure_e2e PASSED [ 91%]================================================================================
```

To make this easier to debug, this PR enables the `-s` flag to log more
information.

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…ect#58754)

## Description
Previously, if
`DataContext.get_current().enable_get_object_locations_for_metrics=False`
(which it is by default), then we would return `(-1, -1, -1)` by default.
This wasn't being handled properly, so we would get negative metrics.
This PR addresses that.

This PR also fixes run_index=-1 for **streaming split**. All iterators
except streaming split:
1. Create the executor with the `dataset_tag` from step 2
2. Increment `dataset_tag`
3. Get the dataset_tag (dataset_-1)

However, streaming_split skips step 2. This PR addresses that.

## Related issues

## Additional information

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
## Description
RLlib is missing nightly testing, making it difficult to track training
performance over time.
This PR re-enables it, just for APPO to start with, on Atari and MuJoCo
environments.

I've removed the AutoROM comment as it's no longer used by ALE to install
Atari ROMs.

---------

Signed-off-by: Mark Towers <mark@anyscale.com>
Co-authored-by: Mark Towers <mark@anyscale.com>
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
…ashboard's reporter_head. (ray-project#58978)

There's no user of this endpoint in the codebase. This has the added
benefit of reducing the surface area for our cython-bindings for
GcsClient by removing ActorInfoAccessor::AsyncKillActor.

Signed-off-by: irabbani <irabbani@anyscale.com>

@sourcery-ai sourcery-ai bot left a comment


The pull request #689 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5512.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request represents a significant refactoring and modernization of the project's continuous integration (CI) pipeline, build system, and dependency management. It introduces a more modular CI structure, adopts uv and raydepsets for enhanced Python dependency control, and updates the Bazel build configurations for improved efficiency and maintainability. These changes aim to streamline development workflows, ensure more reproducible builds, and prepare the project for future scalability and platform support.

Highlights

  • CI Pipeline Modernization: The Buildkite CI pipeline has undergone extensive refactoring, introducing new modular YAML configurations for image builds and dependency management, and consolidating various build and test steps for improved efficiency and maintainability.
  • Python Dependency Management with uv and raydepsets: The project transitions to uv and a new raydepsets system for managing Python dependencies, replacing older pip-compile and miniconda setups, leading to more hermetic and reproducible Python environments across various platforms.
  • Bazel Build System Enhancements: Significant updates to Bazel configurations include new packaging rules for C++ and Python artifacts, more granular C++ target definitions, and improved Python toolchain management, alongside enabling strict_action_env by default.
  • Updated Python and CUDA Support: CI configurations have been updated to reflect changes in supported Python versions (e.g., dropping Python 3.9 in some areas, defaulting to 3.10) and expanding CUDA versions, ensuring compatibility with newer environments.
  • C++ API and Runtime Refinements: The C++ API and runtime components have undergone refactoring, including changes to remote function handling, metric recording, object store behavior, and network utility functions, enhancing consistency and maintainability.
  • Documentation Tooling Integration: New linting tools like vale and semgrep have been integrated into the pre-commit hooks and CI, alongside updates to the documentation build process and style guide, improving code quality and consistency.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This PR is a massive and impressive refactoring of the entire CI/CD and build system. The changes introduce better modularity, adopt modern tools like uv and pre-commit, and improve dependency management with the new raydepsets tool. The build process is now more structured with multi-stage Docker builds and pre-built components. The overall direction is excellent and will significantly improve maintainability and developer experience. I've reviewed the changes and have a couple of minor corrections for test cases to align with the new build ID handling logic.

with mock.patch("subprocess.check_call", side_effect=_mock_subprocess):
LinuxTesterContainer("team", build_type="debug")
docker_image = f"{_DOCKER_ECR_REPO}:{_RAYCI_BUILD_ID}-team"
docker_image = f"{_DOCKER_ECR_REPO}:team"


medium

The _RAYCI_BUILD_ID is set to a1b2c3d4 in the test setup, and the get_docker_image utility function prepends it to the docker tag. The expected image name here should include the build ID to match the implementation.

Suggested change
docker_image = f"{_DOCKER_ECR_REPO}:team"
docker_image = f"{_DOCKER_ECR_REPO}:{os.environ.get('RAYCI_BUILD_ID')}-team"

"C:\\rayci",
"029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:unknown-test",
"029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:test",
"bash",


medium

The _get_docker_image method in WindowsContainer will produce an image tag with a leading hyphen (e.g., ...:-test) when RAYCI_BUILD_ID is empty, which is its new default. This test seems to expect the hyphen to be absent. The test should be updated to reflect the actual output. A better long-term fix would be to update WindowsContainer._get_docker_image to use the shared get_docker_image utility, which handles empty build IDs gracefully.

Suggested change
"bash",
"029272617770.dkr.ecr.us-west-2.amazonaws.com/rayproject/citemp:-test",

