Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-12
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

akyang-anyscale and others added 30 commits October 22, 2025 11:52
…t#57620)

## Why are these changes needed?

This will be used to help control the targets that are returned.



Signed-off-by: akyang-anyscale <alexyang@anyscale.com>

## Description

This PR adds a new check to make sure proxies are ready to serve traffic
before finishing serve.run. For now, the check immediately finishes.
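The shape of such a readiness gate can be sketched as a simple poll loop. This is a hypothetical sketch, not the actual Serve implementation; `check_ready` stands in for whatever proxy health probe Serve wires in later (for now, per the description, it returns immediately):

```python
import time


def wait_for_proxies_ready(check_ready, timeout_s=30.0, poll_interval_s=0.1):
    """Block until check_ready() returns True, or raise on timeout.

    check_ready is a caller-supplied callable; in this PR the initial
    implementation effectively always returns True, so serve.run
    finishes as before.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check_ready():
            return
        time.sleep(poll_interval_s)
    raise TimeoutError("Proxies did not become ready within the timeout")
```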





---------

Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
…roject#57793)

When deploying Ray on YARN using Skein, it's useful to expose Ray's
dashboard via Skein's web UI. This PR shows how to expose it and
updates the related documentation.

Signed-off-by: Zakelly <zakelly.lan@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…cgroup even if they are drivers (ray-project#57955)

For more details about the resource isolation project see
ray-project#54703.

Driver processes that are registered in ray's internal namespace (such
as ray dashboard's job and serve modules) are considered system
processes. Therefore, they will not be moved into the workers cgroup
when they register with the raylet.

---------

Signed-off-by: irabbani <israbbani@gmail.com>
…ray-project#57938)

This PR adds persistent epoch data to the checkpointing logic in the
[FSDP2
Template](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html).

This PR includes:
- New logic for saving the epoch into a distributed checkpoint
- New logic for resuming training from the saved epoch in a loaded
checkpoint
- Updates the [OSS FSDP2
example](https://docs.ray.io/en/master/train/examples/pytorch/pytorch-fsdp/README.html)
to include the new logic

Passing release test:
https://buildkite.com/ray-project/release/builds/64867#019a08e3-1a3e-4fc5-9633-b8e3a0b0f34f

---------

Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
Signed-off-by: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Co-authored-by: matthewdeng <matthew.j.deng@gmail.com>
Makes the Cancel Remote Task RPC idempotent and fault tolerant. Added a
Python test to verify retry behavior; there is no C++ test since the
handler just calls the CancelTask RPC, so there is nothing to test there.
Also renamed uses of RemoteCancelTask to CancelRemoteTask for consistency.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
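Idempotency here boils down to making repeated cancel requests for the same task a no-op. A minimal sketch of that bookkeeping (illustrative only; the real change lives in Ray's C++ task management code):

```python
class TaskCancelState:
    """Track cancellation requests so retried CancelRemoteTask RPCs are no-ops."""

    def __init__(self):
        self._canceled = set()

    def cancel(self, task_id):
        # Returns True on the first cancel request; retries of the same
        # task_id return False and perform no further work.
        if task_id in self._canceled:
            return False
        self._canceled.add(task_id)
        return True
```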
…utside of a Ray Train worker (ray-project#57863)

Introduce a decorator to mark functions that require running inside a
worker process spawned by Ray Train.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
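Such a decorator can be sketched as follows. The names are assumptions for illustration; Ray Train's real check would consult its session state rather than a module-level flag:

```python
import functools

# Stand-in for Ray Train's actual "am I inside a Train worker?" check.
_IN_TRAIN_WORKER = False


def requires_train_worker(fn):
    """Raise a clear error if fn is called outside a Ray Train worker."""

    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        if not _IN_TRAIN_WORKER:
            raise RuntimeError(
                f"`{fn.__name__}` must be called inside a worker process "
                "spawned by Ray Train."
            )
        return fn(*args, **kwargs)

    return wrapper
```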
## Description
Fix the typing for UDFs. This should not accept an instance as it is
currently defined.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
…ock sizing (ray-project#58013)

## Summary

Add a `repartition` call with `target_num_rows_per_block=BATCH_SIZE` to
the audio transcription benchmark. This ensures blocks are appropriately
sized to:
- Prevent out-of-memory (OOM) errors
- Ensure individual tasks don't take too long to complete

## Changes

- Added `ds = ds.repartition(target_num_rows_per_block=BATCH_SIZE)`
after reading the parquet file in `ray_data_main.py:98`

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
…57044)

Running core scalability tests on Python 3.10 and updating the unit test.
Successful release test:
https://buildkite.com/ray-project/release/builds/60890#01999c8a-6fdc-446a-a9da-2b9b006692d3

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description
We are using `read_parquet` in two of our tests in
`test_operator_fusion.py`; this switches those to use `range` to make
the tests less brittle.

Signed-off-by: Matthew Owen <mowen@anyscale.com>
with comments to github issues

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
otherwise, the ordering of messages looks strange on Windows.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Updates the vicuna lightning deepspeed example to run w/ Train V2.

---------

Signed-off-by: Justin Yu <justinvyu@anyscale.com>
…8020)

## Description

Currently, streaming repartition doesn't combine blocks up to
`target_num_rows_per_block`, which is problematic: it can only split
blocks, not recombine them.

This PR addresses that by allowing it to recombine smaller blocks into
bigger ones. One caveat, however, is that the remainder of a block
could still be under `target_num_rows_per_block`.
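The combining behavior, including the under-sized remainder, can be pictured with a small sketch over block row-counts (illustrative only, not the actual operator code):

```python
def combine_blocks(block_sizes, target):
    """Greedily merge (and split) consecutive block row-counts toward target.

    The final block may fall short of target -- the remainder caveat
    noted above.
    """
    out, acc = [], 0
    for n in block_sizes:
        acc += n
        while acc >= target:
            out.append(target)
            acc -= target
        # Any leftover rows stay in acc to be combined with the next block.
    if acc:
        out.append(acc)
    return out
```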


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
…e buildup (ray-project#57996)



### [Data] ConcurrencyCapBackpressurePolicy - Handle internal output
queue buildup

**Issue**

- When there is internal output queue buildup (specifically when
`preserve_order` is set), the streaming executor doesn't limit task
concurrency and only honors the static concurrency cap.
- When the concurrency cap is unlimited, we keep queuing more blocks into
the internal output queue, leading to spilling and a steep spill curve.


**Solution**

In ConcurrencyCapBackpressurePolicy, detect internal output queue
buildup and then limit task concurrency.

- Keep a history of the internal output queue and detect trends in
percentage and size (GB). Based on the trend, increase or decrease the
concurrency cap.
- Given that queue-based buffering is needed for `preserve_order`, allow
an adaptive queuing threshold. This still results in spilling, but
flattens the spill curve and avoids runaway buffering queue growth.
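One way to picture the trend-based cap adjustment (all names and thresholds here are hypothetical; the actual policy tracks percentage and GB trends over a history window):

```python
def adjust_concurrency_cap(queue_history_gb, cap, min_cap=1, growth_pct=0.05):
    """Hypothetical trend check over sampled internal-output-queue sizes.

    If the queue grew by more than growth_pct across the window, halve the
    cap; if it shrank, allow one more concurrent task; otherwise hold.
    """
    if len(queue_history_gb) < 2:
        return cap
    first, last = queue_history_gb[0], queue_history_gb[-1]
    if first > 0 and (last - first) / first > growth_pct:
        return max(min_cap, cap // 2)
    if last < first:
        return cap + 1
    return cap
```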



---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…#57999)

We have a feature flag to control the rolling out of ray export event,
but the feature flag is missing the controlling of
`StartExportingEvents`. This PR fixes the issue.

Test:
- CI

Signed-off-by: Cuong Nguyen <can@anyscale.com>
otherwise they are failing windows core python tests

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…y-project#58023)


### [Data] ConcurrencyCapBackpressurePolicy - Only increase threshold

When `_update_queue_threshold` adjusts the queue threshold used to cap
concurrency based on currently queued bytes:

- Only allow increasing the threshold or maintaining it.
- Never decrease the threshold, because the steady state of queued bytes
is not known.
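The monotonic update reduces to a one-line ratchet (a sketch; the real `_update_queue_threshold` also derives its candidate value from the current queued bytes):

```python
def update_queue_threshold(current_threshold, queued_bytes):
    """Monotonic update: the threshold only ratchets upward, never down."""
    return max(current_threshold, queued_bytes)
```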


---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
Signed-off-by: Srinath Krishnamachari <68668616+srinathk10@users.noreply.github.com>
combining all depset checks into a single job

TODO: add raydepset feature to build all depsets for the depset graph

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
- The default deployment name was changed to `_TaskConsumerWrapper` after
the async inference implementation; this fixes it.

Signed-off-by: harshit <harshit@anyscale.com>
…#58033)

## Description

This change properly handles pushing renaming projections down into
read ops that support projections (like Parquet reads).


---------

Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
## Description

This PR adds support for reading Unity Catalog Delta tables in Ray Data
with automatic credential vending. This enables secure, temporary access
to Delta Lake tables stored in Databricks Unity Catalog without
requiring users to manage cloud credentials manually.

### What's Added

- **`ray.data.read_unity_catalog()`** - Updated public API for reading
Unity Catalog Delta tables
- **`UnityCatalogConnector`** - Handles Unity Catalog REST API
integration and credential vending
- **Multi-cloud support** - Works with AWS S3, Azure Data Lake Storage,
and Google Cloud Storage
- **Automatic credential management** - Obtains temporary,
least-privilege credentials via Unity Catalog API
- **Delta Lake integration** - Properly configures PyArrow filesystem
for Delta tables with session tokens

### Key Features

✅ **Production-ready credential vending API** - Uses stable, public
Unity Catalog APIs
✅ **Secure by default** - Temporary credentials with automatic cleanup
✅ **Multi-cloud** - AWS (S3), Azure (Blob Storage), and GCP (Cloud
Storage)
✅ **Delta Lake optimized** - Handles session tokens and PyArrow
filesystem configuration
✅ **Comprehensive error handling** - Helpful messages for common issues
(deletion vectors, permissions, etc.)
✅ **Full logging support** - Debug and info logging throughout

### Usage Example

```python
import ray

# Read a Unity Catalog Delta table
ds = ray.data.read_unity_catalog(
    table="main.sales.transactions",
    url="https://dbc-XXXXXXX-XXXX.cloud.databricks.com",
    token="dapi...",
    region="us-west-2"  # Optional, for AWS
)

# Use standard Ray Data operations
ds = ds.filter(lambda row: row["amount"] > 100)
ds.show(5)
```

### Implementation Notes

This is a **simplified, focused implementation** that:
- Supports **Unity Catalog tables only** (no volumes - that's in private
preview)
- Assumes **Delta Lake format** (most common Unity Catalog use case)
- Uses **production-ready APIs** only (no private preview features)
- Provides ~600 lines of clean, reviewable code

The full implementation with volumes and multi-format support is
available in the `data_uc_volumes` branch and can be added in a future
PR once this foundation is reviewed.

### Testing

- ✅ All ruff lint checks pass
- ✅ Code formatted per Ray standards
- ✅ Tested with real Unity Catalog Delta tables on AWS S3
- ✅ Proper PyArrow filesystem configuration verified
- ✅ Credential vending flow validated

## Related issues

Related to Unity Catalog and Delta Lake support requests in Ray Data.

## Additional information

### Architecture

The implementation follows the **connector pattern** rather than a
`Datasource` subclass because Unity Catalog is a metadata/credential
layer, not a data format. The connector:

1. Fetches table metadata from Unity Catalog REST API
2. Obtains temporary credentials via credential vending API
3. Configures cloud-specific environment variables
4. Delegates to `ray.data.read_delta()` with proper filesystem
configuration

### Delta Lake Special Handling

Delta Lake on AWS requires explicit PyArrow S3FileSystem configuration
with session tokens (environment variables alone are insufficient). This
implementation correctly creates and passes the filesystem object to the
`deltalake` library.

### Cloud Provider Support

| Provider | Credential Type | Implementation |
|----------|----------------|----------------|
| AWS S3 | Temporary IAM credentials | PyArrow S3FileSystem with session
token |
| Azure Blob | SAS tokens | Environment variables
(AZURE_STORAGE_SAS_TOKEN) |
| GCP Cloud Storage | OAuth tokens / Service account | Environment
variables (GCP_OAUTH_TOKEN, GOOGLE_APPLICATION_CREDENTIALS) |
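The table above can be read as a mapping from vended credentials to backend configuration. A simplified sketch of that mapping, with assumed credential key names (not the connector's actual code):

```python
def credential_env_vars(provider, creds):
    """Map vended Unity Catalog credentials to the env vars each cloud
    backend reads. Key names in `creds` are illustrative assumptions."""
    if provider == "aws":
        # On AWS, the connector additionally builds an explicit PyArrow
        # S3FileSystem with the session token (see note below the table).
        return {
            "AWS_ACCESS_KEY_ID": creds["access_key"],
            "AWS_SECRET_ACCESS_KEY": creds["secret_key"],
            "AWS_SESSION_TOKEN": creds["session_token"],
        }
    if provider == "azure":
        return {"AZURE_STORAGE_SAS_TOKEN": creds["sas_token"]}
    if provider == "gcp":
        return {"GCP_OAUTH_TOKEN": creds["oauth_token"]}
    raise ValueError(f"Unknown provider: {provider}")
```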

### Error Handling

Comprehensive error messages for common issues:
- **Deletion Vectors**: Guidance on upgrading deltalake library or
disabling the feature
- **Column Mapping**: Compatibility information and solutions
- **Permissions**: Clear list of required Unity Catalog permissions
- **Credential issues**: Detailed troubleshooting steps

### Future Enhancements

Potential follow-up PRs:
- Unity Catalog volumes support (when out of private preview)
- Multi-format support (Parquet, CSV, JSON, images, etc.)
- Custom datasource integration
- Advanced Delta Lake features (time travel, partition filters)

### Dependencies

- Requires `deltalake` package for Delta Lake support
- Uses standard Ray Data APIs (`read_delta`, `read_datasource`)
- Integrates with existing PyArrow filesystem infrastructure

### Documentation

- Full docstrings with examples
- Type hints throughout
- Inline comments with references to external documentation
- Comprehensive error messages with actionable guidance

---------

Signed-off-by: soffer-anyscale <stephen.offer@anyscale.com>
…ease test (ray-project#58048)

## Summary

This PR removes the `image_classification_chaos_no_scale_back` release
test and its associated setup script
(`setup_cluster_compute_config_updater.py`). This test has become
non-functional and is no longer providing useful signal.

## Background

The `image_classification_chaos_no_scale_back` release test was designed
to validate Ray Data's fault tolerance when many nodes abruptly get
preempted at the same time.

The test worked by:
1. Running on an autoscaling cluster with 1-10 nodes
2. Updating the compute config mid-test to downscale to 5 nodes
3. Asserting that there are dead nodes as a sanity check

## Why This Test Is Broken

After the removal of Parquet metadata fetching in ray-project#56105 (September 2,
2025), the autoscaling behavior changed significantly:

- **Before metadata removal**: The cluster would autoscale more
aggressively because metadata fetching created additional tasks that
triggered faster scale-up. The cluster would scale past 5 nodes, then
downscale, leaving dead nodes that the test could detect.

- **After metadata removal**: Without the metadata fetching tasks, the
cluster doesn't scale up fast enough to get past 5 nodes before the
downscale happens. This means there are no dead nodes to detect, causing
the test to fail.

## Why We're Removing It

1. **Test is fundamentally broken**: The test's assumptions about
autoscaling behavior are no longer valid after the metadata fetching
removal
2. **Not actively monitored**: The test is marked as unstable and isn't
closely watched

## Changes

- Removed `image_classification_chaos_no_scale_back` test from
`release/release_data_tests.yaml`
- Deleted
`release/nightly_tests/setup_cluster_compute_config_updater.py` (only
used by this test)

## Related

See ray-project#56105

Fixes ray-project#56528

Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
These numbers are outdated, and the ones we report are not very useful.
We will refresh them soon.

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…54857)

Signed-off-by: EkinKarabulut <ekarabulut@nvidia.com>
Signed-off-by: EkinKarabulut <82878945+EkinKarabulut@users.noreply.github.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
Co-authored-by: fscnick <6858627+fscnick@users.noreply.github.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
## Description


https://arrow.apache.org/docs/python/generated/pyarrow.Array.html#pyarrow.Array.to_numpy


`zero_copy_only` actually defaults to True, so we should explicitly pass
False for pyarrow versions < 13.0.0.

https://github.com/ray-project/ray/blob/1e38c9408caa92c675f0aa3e8bb60409c2d9159f/python/ray/data/_internal/arrow_block.py#L540-L546
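The version gate can be sketched like this (a hypothetical helper for illustration; Ray's actual fix passes the flag directly at the `to_numpy` call site):

```python
def to_numpy_kwargs(pyarrow_version):
    """For pyarrow < 13, Array.to_numpy defaults zero_copy_only=True, which
    raises for arrays that require a copy; pass False explicitly there."""
    major = int(pyarrow_version.split(".")[0])
    return {} if major >= 13 else {"zero_copy_only": False}
```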

## Related issues
Closes ray-project#57819


---------

Signed-off-by: You-Cheng Lin (Owen) <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <youchenglin@youchenglin-L3DPGF50JG.local>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Co-authored-by: You-Cheng Lin <youchenglin@youchenglin-L3DPGF50JG.local>
)

Updating the default value calculation in the docstrings for the public
API.

Signed-off-by: irabbani <israbbani@gmail.com>
…#58025)

Signed-off-by: Kourosh Hakhamaneshi <Kourosh@anyscale.com>
omatthew98 and others added 20 commits November 10, 2025 10:54
…ject#57233)

Update remaining multimodal release tests to use new depsets.
…y-project#58441)

## Description
Currently, we clear _external_ queues when an operator is manually
marked as finished, but we don't clear their _internal_ queues. This PR
fixes that.
## Related issues
Fixes this test
https://buildkite.com/ray-project/postmerge/builds/14223#019a5791-3d46-4ab8-9f97-e03ea1c04bb0/642-736

---------

Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
be consistent with doc build environment

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
migrating all doc related things to run on python 3.12

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
excluding `*_tests` directories for now to reduce the impact

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
using `bazelisk run //java:gen_ray_java_pkg` everywhere

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
This PR adds 2 new metrics to core_worker by way of the reference
counter. The two new metrics keep track of the count and size of objects
owned by the worker as well as keeping track of their states. States are
defined as:

- **PendingCreation**: An object that is pending creation and hasn't
finished its initialization (and is sizeless)
- **InPlasma**: An object which has an assigned node address and isn't
spilled
- **Spilled**: An object which has an assigned node address and is
spilled
- **InMemory**: An object which has no assigned address but isn't
pending creation (and therefore, must be local)
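The four states above can be derived from a reference's fields roughly as follows (a Python sketch of the classification; the actual implementation is C++ inside the reference counter):

```python
def classify_object_state(pending_creation, node_address, spilled):
    """Derive the metric state from reference-counter fields (sketch)."""
    if pending_creation:
        return "PendingCreation"
    if node_address is not None:
        # Object has an assigned node address; distinguish by spill status.
        return "Spilled" if spilled else "InPlasma"
    # No assigned address and not pending creation: must be local.
    return "InMemory"
```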

The approach used by these new metrics is to examine the state 'before
and after' any mutation on the reference in the reference counter. This
is required to do the appropriate bookkeeping (decrementing some values
and incrementing others). Admittedly, there is potential for miscounting
in between the decrements/increments, depending on when the
RecordMetrics loop runs. This unfortunate side effect seems preferable
to doing mutual exclusion with metric collection, as this is potentially
a high-throughput code path.

In addition, performing live counts seemed preferable to doing a full
accounting of the object store and across all references at
metric-collection time. The reason is that the reference counter may be
tracking millions of objects, so each metric scan could be very
expensive. So running the live accounting (despite being potentially
inaccurate for short periods) seemed the right call.

This PR also allows an object's size to change due to potentially
non-deterministic instantiation (say an object is initially created, but
its primary copy dies, and then the re-creation fails). This is an edge
case, but seems important for completeness.

---------

Signed-off-by: zac <zac@anyscale.com>
to 0.21.0; supports wanda priority now.

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…#58286)

## Description
Predicate pushdown (ray-project#58150) in
conjunction with this PR should speed up reads from Iceberg.


Once the above change lands, we can add the pushdown interface support
for IcebergDatasource

---------

Signed-off-by: Goutam <goutam@anyscale.com>
## Description
* Does the work to bump pydoclint up to the latest version
* And allowlist any new violations it finds

## Related issues
n/a

## Additional information
n/a

---------

Signed-off-by: Thomas Desrosiers <thomas@anyscale.com>
Fix a typo in the pattern_async_actor demo: add `self.`.

---------

Signed-off-by: curiosity-hyf <curiooosity.h@gmail.com>
…hboard agent (ray-project#58405)

Add a grpc service interceptor to intercept all dashboard agent rpc
calls and validate the presence of auth token (when auth mode is token)
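The per-RPC check such an interceptor applies can be sketched as follows (a hypothetical helper with an assumed metadata key; the real interceptor is wired into the gRPC server and its actual metadata key may differ):

```python
def validate_auth_token(metadata, expected_token, auth_mode):
    """Return True if the RPC should be allowed through.

    metadata is a list of (key, value) pairs, mirroring gRPC's
    invocation metadata. When auth mode is not "token", all calls pass.
    """
    if auth_mode != "token":
        return True
    token = dict(metadata).get("authorization")
    return token == f"Bearer {expected_token}"
```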

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…tests (ray-project#58528)

The auth token test setup in `conftest.py` is breaking the macOS tests.
There are two test scripts (`test_microbenchmarks.py` and
`test_basic.py`) that run after the wheel is installed, but without
editable mode. For these tests to pass, `conftest.py` cannot import
anything under `ray.tests`.

This PR moves `authentication_test_utils` into `ray._private` to fix
this issue.

Signed-off-by: sampan <sampan@anyscale.com>
Co-authored-by: sampan <sampan@anyscale.com>
This PR enables OpenTelemetry as the default backend for the Ray metrics
stack. The bulk of this PR actually fixes tests that were written with
assumptions that no longer hold. For ease of reviewing, I inline the
reasons for each change in the comments alongside the affected tests.

This PR also depends on a vLLM release (so that we can update the
minimal supported vLLM version in Ray).

Test:
- CI


<!-- CURSOR_SUMMARY -->
---

> [!NOTE]
> Enable OpenTelemetry metrics backend by default and refactor
metrics/Serve tests to use timeseries APIs and updated `ray_serve_*`
metric names.
> 
> - **Core/Config**:
> - Default-enable OpenTelemetry: set `RAY_enable_open_telemetry` to
`true` in `ray_constants.py` and `ray_config_def.h`.
> - Metrics `Counter`: use `CythonCount` by default; keep legacy
`CythonSum` only when OTEL is explicitly disabled.
> - **Serve/Metrics Tests**:
> - Replace text scraping with `PrometheusTimeseries` and
`fetch_prometheus_metric_timeseries` throughout.
> - Update metric names/tags to `ray_serve_*` and counter suffixes
`*_total`; adjust latency metric names and processing/queued gauges.
> - Reduce ad-hoc HTTP scrapes; plumb a reusable `timeseries` object and
pass through helpers.
> - **General Test Fixes**:
> - Remove OTEL parametrization/fixtures; simplify expectations where
counters-as-gauges no longer apply; drop related tests.
> - Cardinality tests: include `"low"` level and remove OTEL gating;
stop injecting `enable_open_telemetry` in system config.
> - Actor/state/thread tests: migrate to cluster fixtures, wait for
dashboard agent, and adjust expected worker thread counts.
> - **Build**:
> - Remove OTEL-specific Bazel test shard/env overrides; clean OTEL env
from C++ stats test.
> 
> <sup>Written by [Cursor
Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit
1d0190f. This will update automatically
on new commits. Configure
[here](https://cursor.com/dashboard?tab=bugbot).</sup>
<!-- /CURSOR_SUMMARY -->

---------

Signed-off-by: Cuong Nguyen <can@anyscale.com>
…mmended (ray-project#57726)


## Description

If users schedule a detached actor into a placement group, Raylet will
kill the actor when the placement group is removed. The actor will be
stuck in the `RESTARTING` state forever if it's restartable until users
explicitly kill it.

In that case, if users try to `get_actor` with the actor's name, it can
still return the restarting actor, but no process exists. It will no
longer be restarted because the PG is gone, and no PG with the same ID
will be created during the cluster's lifetime.

The better behavior would be for Ray to transition a task/actor's state
to dead when it is impossible to restart. However, this would add too
much complexity to the core, so I think it's not worth it. Therefore,
this PR adds a warning log, and users should use detached actors or PGs
correctly.

Example: Run the following script and run `ray list actors`.

```python
import ray
from ray.util.scheduling_strategies import PlacementGroupSchedulingStrategy
from ray.util.placement_group import placement_group, remove_placement_group

@ray.remote(num_cpus=1, lifetime="detached", max_restarts=-1)
class Actor:
  pass

ray.init()

pg = placement_group([{"CPU": 1}])
ray.get(pg.ready())

actor = Actor.options(
    scheduling_strategy=PlacementGroupSchedulingStrategy(
        placement_group=pg,
    )
).remote()

ray.get(actor.__ray_ready__.remote())
```


## Types of change

- [ ] Bug fix 🐛
- [ ] New feature ✨
- [x] Enhancement 🚀
- [ ] Code refactoring 🔧
- [ ] Documentation update 📖
- [ ] Chore 🧹
- [ ] Style 🎨

## Checklist

**Does this PR introduce breaking changes?**
- [ ] Yes ⚠️
- [x] No

**Testing:**
- [ ] Added/updated tests for my changes
- [x] Tested the changes manually
- [ ] This PR is not tested ❌ _(please explain why)_

**Code Quality:**
- [x] Signed off every commit (`git commit -s`)
- [x] Ran pre-commit hooks ([setup
guide](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting))

**Documentation:**
- [ ] Updated documentation (if applicable) ([contribution
guide](https://docs.ray.io/en/latest/ray-contribute/docs.html))
- [ ] Added new APIs to `doc/source/` (if applicable)


---------

Signed-off-by: Kai-Hsun Chen <khchen@x.ai>
Signed-off-by: Robert Nishihara <robertnishihara@gmail.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Robert Nishihara <robertnishihara@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…y-project#57715)

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Signed-off-by: dayshah <dhyey2019@gmail.com>

@sourcery-ai sourcery-ai bot left a comment


The pull request #675 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5341.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request performs a routine daily merge from the master branch into the main branch. The changes primarily focus on significant refactoring of the CI/CD pipeline, C++ core components, and Python packaging. It aims to streamline the build process, improve modularity, and update development toolchains, ensuring a more robust and maintainable codebase. Key updates include consolidating Buildkite image definitions, modernizing C++ RPC structures, and transitioning to miniforge3 for Python environments, alongside various test and linting enhancements.

Highlights

  • Bazel Configuration Updates: The Bazel configuration (.bazelrc) now globally enables incompatible_strict_action_env and adds specific workspace_status_command for Linux builds. Windows builds will now use /utf-8 for C++ compilation. Warnings for third-party files (src/ray/thirdparty/) are now ignored, and a new try-import .user.bazelrc allows for custom user options. macOS builds will now ignore deprecated declaration warnings.
  • CI Pipeline Refactoring: A major refactoring of the Buildkite CI pipeline has occurred. Image definitions previously scattered across _forge.rayci.yml are now consolidated into a new _images.rayci.yml file. Doc-related CI steps have been moved to a new doc.rayci.yml, and dependency compilation steps to dependencies.rayci.yml. Python 3.10 has been added to several oss-ci-base build matrices, and new ray-core-build, ray-dashboard-build, ray-java-build steps have been introduced to pre-build core components.
  • C++ Core Refactoring: Extensive refactoring of C++ RPC components has taken place, with numerous ray_cc_library and cc_grpc_library definitions removed from BUILD.bazel. The ray_mock target has been updated, and several core C++ components like ray_common, raylet, and various RPC-related libraries have been removed from the top-level BUILD.bazel file, indicating a more modular structure.
  • Python Packaging and Build System Changes: The Python packaging process has been updated, replacing copy_to_workspace rules with pkg_files and pkg_zip for artifact management. New Python scripts (gen_py_proto.py, gen_ray_pkg.py, gen_redis_pkg.py, gen_ray_cpp_pkg.py, gen_ray_java_pkg.py) have been added to extract various packages. miniconda installations in Dockerfiles and scripts have been replaced with miniforge3.
  • Test and Linting Updates: The PR template has been updated to a more concise format. Pre-commit hooks have been enhanced with semgrep, vale, cython-lint, check-train-circular-imports, and eslint. pydoclint has been updated and split into local and CI stages. Several test configurations have been adjusted, including splitting Dask and Modin tests, adding RAYCI_DISABLE_TEST_DB=1 to C++ tests, and updating Python versions for various test matrices.
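
As a rough illustration of the .bazelrc changes described in the first highlight, a fragment along these lines would enable the strict action environment globally, switch Windows C++ compilation to UTF-8, and allow per-user overrides. This is a sketch only: the flag names come from the summary above, and the exact values and config names in the PR may differ.

```
# Illustrative .bazelrc fragment; flag names are from the PR summary,
# exact values in the actual change may differ.
common --incompatible_strict_action_env

# Windows: compile C++ sources with the MSVC /utf-8 flag.
build:windows --cxxopt=/utf-8

# Let developers layer on local options without editing the checked-in file.
try-import %workspace%/.user.bazelrc
```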
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

  • Code Review (/gemini review): Performs a code review for the current pull request in its current state.
  • Pull Request Summary (/gemini summary): Provides a summary of the current pull request in its current state.
  • Comment (@gemini-code-assist): Responds in comments when explicitly tagged, both in pull request comments and review comments.
  • Help (/gemini help): Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by adding files to a .gemini/ folder at the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with πŸ‘ and πŸ‘Ž on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This is a massive automated PR that refactors large parts of the build and CI system. The changes are generally positive, moving towards more modularity, standard tooling (e.g. pkg_* rules, pre-commit), and optimizations (pre-built images). I've pointed out several areas where test coverage seems to have been reduced, which should be addressed. I also have a couple of questions about specific changes to ensure they are intentional and don't introduce issues. Overall, this is a great step forward for the project's infrastructure.

@github-actions

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

@github-actions github-actions bot added the stale label Nov 27, 2025