daily merge: master → main 2025-11-07 #672
Conversation
…57817) Signed-off-by: dayshah <dhyey2019@gmail.com>
…igurable (ray-project#57705) Recently, when we ran performance tests with task event generation turned on, we saw some performance regression when the workloads ran on very small CPU machines. Further investigation showed that the overhead mainly comes from the field-name conversion applied when converting the proto message to a JSON payload in the aggregator agent. This PR adds an env var to control the name-conversion behavior and updates the corresponding tests. Also note that we eventually plan to remove this config and turn off the field-name conversion by default once all current event usage has been migrated. --------- Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
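A rough sketch of the kind of toggle described above. The env var name and helper function are illustrative assumptions, not the actual Ray config; only the `preserving_proto_field_name` option of protobuf's JSON conversion is real:

```python
import os

from google.protobuf import json_format

# Hypothetical env var name; the flag added in the PR may be named differently.
_PRESERVE_FIELD_NAMES = (
    os.environ.get("RAY_EVENTS_PRESERVE_PROTO_FIELD_NAMES", "0") == "1"
)


def event_to_json(event_proto) -> str:
    # Skipping the snake_case -> camelCase field-name conversion avoids the
    # per-field string processing that caused the regression described above.
    return json_format.MessageToJson(
        event_proto, preserving_proto_field_name=_PRESERVE_FIELD_NAMES
    )
```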
…57861) Signed-off-by: joshlee <joshlee@anyscale.com>
It used to be in 3 different groups; now they are unified into 1. Signed-off-by: kevin <kevin@anyscale.com>
…nter (ray-project#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends. --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
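For intuition, a minimal sketch of what a value_counts-style aggregation does over a pandas-backed column; the helper name is illustrative and the actual BlockColumnAccessor implementation differs:

```python
import pandas as pd


def value_counts_pandas(column: pd.Series) -> dict:
    # Count occurrences of each distinct value, including NaN/None entries.
    return column.value_counts(dropna=False).to_dict()


# Example: prints {'a': 2, 'b': 1}
print(value_counts_pandas(pd.Series(["a", "b", "a"])))
```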
… only once." (ray-project#57917) This PR fixes the Ray check failure `RayEventRecorder::StartExportingEvents() should be called only once.` The failure can occur in the following scenario: - The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events. - At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events. This PR introduces two fixes: - In metric_agent_client, the connection-success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger). - Do not call StartExportingEvents if the connection cannot be established. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com>
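A minimal Python sketch of the invariant described above (the real code is C++; class and callback names here are assumptions): start exporting at most once, and never when the connection could not be established.

```python
import threading


class StartOnceGuard:
    """Illustrative guard: run the start-exporting callback at most once, even
    if the connection-success and retry-exhausted callbacks race."""

    def __init__(self, start_exporting):
        self._start_exporting = start_exporting  # assumed callable
        self._lock = threading.Lock()
        self._started = False

    def on_connection_result(self, connected: bool) -> bool:
        if not connected:
            # Do not start exporting if the connection could not be established.
            return False
        with self._lock:
            if self._started:
                return False
            self._started = True
        self._start_exporting()
        return True
```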
## Description
Ray data can't serialize zero (byte) length numpy arrays:
```python3
import numpy as np
import ray.data
array = np.empty((2, 0), dtype=np.int8)
ds = ray.data.from_items([{"array": array}])
for batch in ds.iter_batches(batch_size=1):
    print(batch)
```
What I expect to see:
```
{'array': array([], shape=(1, 2, 0), dtype=int8)}
```
What I see:
```
/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py:736: RuntimeWarning: invalid value encountered in scalar divide
offsets = np.arange(
2025-10-17 17:18:09,499 WARNING arrow.py:189 -- Failed to convert column 'array' into pyarrow array due to: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []; falling back to serialize as pickled python objects
Traceback (most recent call last):
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 672, in from_numpy
return cls._from_numpy(arr)
^^^^^^^^^^^^^^^^^^^^
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 736, in _from_numpy
offsets = np.arange(
^^^^^^^^^^
ValueError: arange: cannot compute length
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 141, in convert_to_pyarrow_array
return ArrowTensorArray.from_numpy(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 678, in from_numpy
raise ArrowConversionError(data_str) from e
ray.air.util.tensor_extensions.arrow.ArrowConversionError: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []
2025-10-17 17:18:09,789 INFO logging.py:293 -- Registered dataset logger for dataset dataset_0_0
2025-10-17 17:18:09,815 WARNING resource_manager.py:134 -- ⚠️ Ray's object store is configured to use only 33.5% of available memory (2.0GiB out of 6.0GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.
{'array': array([array([], shape=(2, 0), dtype=int8)], dtype=object)}
```
This PR fixes the issue so that zero-length arrays are serialized
correctly, and the shape and dtype are preserved.
## Additional information
This is `ray==2.50.0`.
---------
Signed-off-by: Chris O'Hara <cohara87@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
use awscli directly; stop installing extra dependencies Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: joshlee <joshlee@anyscale.com>
## Description Found this while reading the docs. Not sure what this "Note that" is referring to or why it is there. Signed-off-by: Max van Dijck <50382570+MaxVanDijck@users.noreply.github.com>
…ray-project#57891) Signed-off-by: Seiji Eicher <seiji@anyscale.com>
it should not run on macOS with Intel silicon anymore. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ect#57877) so that we are not tied to using public s3 buckets Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ject#57925) This PR moves the error handling of the metric+event exporter agent one level up, inside the `metrics_agent_client` callback. Previously, the errors were handled by either the metric or the event recorder, which led to confusion and buggy code. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com>
## Description Bumping from small to medium because it's timing out for Python 3.12. Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>
…project#57931) Signed-off-by: dayshah <dhyey2019@gmail.com>
…project#57932) ## Description This PR adds Prometheus metrics to the selected RLlib components. --------- Signed-off-by: joshlee <joshlee@anyscale.com> Signed-off-by: Kamil Kaczmarek <kamil@anyscale.com> Signed-off-by: kevin <kevin@anyscale.com> Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Signed-off-by: Cuong Nguyen <can@anyscale.com> Signed-off-by: Chris O'Hara <cohara87@gmail.com> Signed-off-by: Lonnie Liu <lonnie@anyscale.com> Signed-off-by: Max van Dijck <50382570+MaxVanDijck@users.noreply.github.com> Signed-off-by: Seiji Eicher <seiji@anyscale.com> Co-authored-by: Joshua Lee <73967497+Sparks0219@users.noreply.github.com> Co-authored-by: Kevin H. Luu <kevin@anyscale.com> Co-authored-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com> Co-authored-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com> Co-authored-by: Chris O'Hara <cohara87@gmail.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com> Co-authored-by: Max van Dijck <50382570+MaxVanDijck@users.noreply.github.com> Co-authored-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
This PR makes the `ray.get` public API thread-safe. It also cleans up a lot of tech debt with respect to: * Workers yielding CPU to the raylet when blocked. * Cleaning up finished/inflight Get requests. Previously, the raylet coalesced all get requests from the same worker into one Get (and Pull) request. However, Get request cleanup could happen on multiple threads, meaning **one thread could cancel inflight get requests for all threads in a worker**. This issue was reported in ray-project#54007. ### Changes in this PR: Raylet (server-side) 1. AsyncGetObjects will return a request_id. 2. LeaseDependencyManager no longer coalesces AsyncGetObjects requests from the same worker. 3. LeaseDependencyManager has two methods for cleanup: delete all requests for a worker (during worker disconnect/lease cleanup) and delete a specific request (called through CancelGetRequest). 4. Wait no longer cancels all Get requests for the worker (this was probably a bug). 5. NotifyWorkerUnblock does not cancel get requests anymore. CoreWorker (client-side) 1. PlasmaStoreProvider::Get will make 1 call to AsyncGetObjects per batch. 2. PlasmaStoreProvider::Get will store scoped cleanup handlers that call CancelGetRequest for each call to AsyncGetObjects to guarantee RAII-style cleanup. Closes ray-project#54007. --------- Signed-off-by: irabbani <israbbani@gmail.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
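A Python sketch of the RAII-style cleanup idea described above (the actual CoreWorker code is C++; `async_get_objects` and `cancel_get_request` are stand-ins for the real calls): one request per batch, and every request is cancelled when the scope exits, success or failure.

```python
from contextlib import ExitStack


def get_with_guaranteed_cleanup(batches, async_get_objects, cancel_get_request):
    """Issue one AsyncGetObjects-style call per batch and guarantee that each
    returned request_id is cancelled when this scope exits, even on error."""
    request_ids = []
    with ExitStack() as stack:
        for batch in batches:
            request_id = async_get_objects(batch)  # now returns a request_id
            # Register cleanup immediately so an exception mid-loop still
            # cancels every request issued so far.
            stack.callback(cancel_get_request, request_id)
            request_ids.append(request_id)
        # ... block here until the requested objects are available ...
    return request_ids
```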
…acing work (ray-project#57908) Tracing code hasn't been maintained, and it can't be run by relying on the docs alone. 1. [https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#installation](https://docs.ray.io/en/latest/ray-observability/user-guides/ray-tracing.html#installation) `opentelemetry-api==1.1.0` Version 1.1.0 is too old; https://github.com/ray-project/ray/blob/b988ce4e9b0fb618b40865600c0d98f1714c3bcf/ci/docker/serve.build.Dockerfile#L47 we're already using 1.3.0+, which is incompatible with 1.1.0. 2. A legacy issue? This prevents the help information from being displayed. --------- Signed-off-by: justwph <2732352+wph95@users.noreply.github.com> Signed-off-by: JustWPH <2732352+wph95@users.noreply.github.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
## Why are these changes needed? Fixes a unit test that was broken but not running in CI. ## Related issue number Fixes ray-project#53478 ## Checks - [y] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [y] I've run `scripts/format.sh` to lint the changes in this PR. - [y] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [y] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/. - Testing Strategy - [y] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Jason <jcarlson212@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…ject#57843) ## Description This PR will: 1- Fix Ray Data operator fusion tests by aligning them with the updated `Filter` signature and the `Dataset.stats()` output. 2- Add the standard pytest.main footer so that the test can be run directly and re-enable the Semgrep coverage that enforces it. ## Related issues Closes ray-project#57822 --------- Signed-off-by: Youssef Esseddiq <47015407+YoussefEssDS@users.noreply.github.com> Signed-off-by: YoussefEssDS <oyoussefesseddiq@gmail.com>
…ray-project#57918) This PR removes the overridden `completed()` method from `ActorPoolMapOperator` that is no longer needed. ray-project#52754 overrode `ActorPoolMapOperator.completed()` to fix a bug by checking `_bundle_queue.is_empty()` in addition to the parent class checks. However, PR ray-project#52806 more holistically fixed this issue by: 1. Adding `InternalQueueOperatorMixin` to force proper implementation of queue accounting methods 2. Fixing `OpState` methods to properly distinguish between bundled pending dispatch and internally queued items As a result, the parent class `completed()` method now correctly handles internal queue accounting, making the override in `ActorPoolMapOperator` redundant. Related PRs - ray-project#52754 - Original workaround that added the override - ray-project#52806 - Holistic fix that made the override unnecessary --------- Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
) Third split of ray-project#56416 Signed-off-by: Gagandeep Singh <gdp.1807@gmail.com> Signed-off-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com> Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
…ay-project#57017) ## Why are these changes needed? `SingleAgentEpisode.concat` would only support numpy array based observations due to `np.all(old_episode.observations[-1] == new_episode.observations[0])`. I've changed the implementation to use `tree.assert_same_structure` and `np.all` on the flattened structures to verify that observations are equivalent even for complex observation structures. In addition, I've added a test using a dict obs environment to verify this works. ## Related issue number Closes ray-project#54659 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run `scripts/format.sh` to lint the changes in this PR. - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [x] This PR is not tested :( <!-- CURSOR_SUMMARY --> --- > [!NOTE] > Use structure-aware equality for observations during episode concatenation and add a test with dict observations; minor docstring tweaks. > > - **rllib/env**: > - **`single_agent_episode.py`**: > - `concat_episode`: Replace `np.all(a == b)` with `tree.assert_same_structure` and per-leaf `np.array_equal` to compare complex/nested observations. > - Add `tree` import. > - Minor docstring wording tweaks for `len_lookback_buffer`. > - **Tests**: > - **`rllib/env/tests/test_single_agent_episode.py`**: > - Add `DictTestEnv` and `test_concat_episode_with_complex_obs` to validate concatenation with dict observations. > - Fix test class name typo. > > <sup>Written by [Cursor Bugbot](https://cursor.com/dashboard?tab=bugbot) for commit dc4856f. This will update automatically on new commits. Configure [here](https://cursor.com/dashboard?tab=bugbot).</sup> <!-- /CURSOR_SUMMARY --> --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>
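A small sketch of the structure-aware comparison described in this item. It follows the stated approach (`tree.assert_same_structure` plus per-leaf comparison), but the function name and exact placement are assumptions, not the actual RLlib code:

```python
import numpy as np
import tree  # dm-tree, already used by RLlib


def observations_equal(a, b) -> bool:
    """Require the same nested structure, then compare every leaf."""
    try:
        tree.assert_same_structure(a, b)
    except (ValueError, TypeError):
        return False
    return all(
        np.array_equal(x, y) for x, y in zip(tree.flatten(a), tree.flatten(b))
    )


# Works for dict observations as well as plain arrays:
assert observations_equal({"pos": np.zeros(3)}, {"pos": np.zeros(3)})
```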
…#57004) 1. Currently, the reporter agent is spawned by the raylet process. It's assumed that all core workers are direct children of the raylet, but that's not the case with new features (uv, image_url). The reporter agent needs another way to find all core workers. https://github.com/ray-project/ray/blob/10eacfd6ddf3b84827d016e37294bc5f2577ad3f/python/ray/dashboard/modules/reporter/reporter_agent.py#L911 2. The driver is not spawned by the raylet, thus is never monitored. Implementation: 1. Add a gRPC endpoint in the raylet process (node manager), and allow the reporter agent to connect. 2. The reporter agent fetches worker lists via the gRPC reply, including the driver. It creates a raylet client with a dedicated thread. Closes ray-project#56739 --------- Signed-off-by: tianyi-ge <tianyig@outlook.com>
## Why are these changes needed? `SegmentTree` is a component of the rllib `PrioritizedEpisodeReplayBuffer`; however, for extreme edge-case prefix-sum values, `find_prefixsum_idx` will return an invalid out-of-bounds value. I couldn't find a single bug; rather, if the `prefixsum_value` is equal to `SegmentTree.sum()`, then traversing down the tree could cause it to return invalid indexes. I've added unittests to reproduce the original error and check against it. ## Related issue number Close ray-project#54284 ## Checks - [x] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: Mark Towers <mark@anyscale.com> Co-authored-by: Mark Towers <mark@anyscale.com> Co-authored-by: simonsays1980 <simon.zehnder@gmail.com>
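A toy illustration of the edge case over a plain list, not the actual `SegmentTree` code or necessarily the fix that was merged: when the sampled prefix sum equals the total, a naive search can run past the last valid index, so the value is clamped just below the total.

```python
def find_prefixsum_idx(weights, prefixsum):
    """Return the smallest index i such that sum(weights[:i+1]) > prefixsum.
    Toy flat-list version; the rllib SegmentTree walks a binary tree instead."""
    total = sum(weights)
    # Guard the prefixsum == total edge case, which would otherwise walk past
    # the last valid index.
    prefixsum = min(prefixsum, total * (1 - 1e-12))
    running = 0.0
    for i, w in enumerate(weights):
        running += w
        if prefixsum < running:
            return i
    return len(weights) - 1  # defensive fallback


# prefixsum equal to the total still maps to the last valid index.
assert find_prefixsum_idx([1, 2, 7], 10.0) == 2
```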
…l_raylet_dies` (ray-project#57951) Example failure: https://buildkite.com/ray-project/postmerge/builds/13835#0199f4a3-5657-46cf-a498-bde0b2ba774e/615-2139 This happens because we tune the Raylet health check threshold to be very tight, so it's marked as dead extraneously during startup. Setting the timeout to 1s, which should increase test runtime slightly but deflake. Also removed the usage of `internal_kv`. --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…or_crash_restart` (ray-project#57952) The test is periodically failing due to an OOM in CI: https://buildkite.com/ray-project/postmerge/builds/13835#0199f4a3-5657-46cf-a498-bde0b2ba774e/615-2139 There's no clear reason for it aside from just consuming a lot of memory. I've attempted to reduce the memory consumption by reducing the size of the objects generated. The test was also just a mess and very hard to understand, so I've cleaned it up and hopefully made it more clear. Also reduced the Raylet health check threshold to speed it up (~20s -> ~10s). --------- Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…scription` (ray-project#57953) Failures in CI: - https://buildkite.com/ray-project/postmerge-macos/builds/8825#0199f6fb-e711-468c-875f-04bc66f9a545/2385-4866 - https://buildkite.com/ray-project/postmerge-macos/builds/8878#019a052b-723a-4a17-a563-f3b49d0d48a0/2384-3840 Looks like the timeout is just tight -- bumping from 5s -> 10s. Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
upgrading reef tests to run on 3.10 Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
The issue with the current implementation of core worker HandleKillActor is that it won't send a reply when the RPC completes because the worker is dead. The application code from the GCS doesn't really care since it just logs the response if one is received, a response is only sent if the actor ID of the actor on the worker and in the RPC don't match, and the GCS will just log it and move on with its life. Hence we can't differentiate in the case of a transient network failure whether there was a network issue, or the actor was successfully killed. What I think is the most straightforward approach is instead of the GCS directly calling core worker KillActor, we have the GCS talk to the raylet instead and call a new RPC KillLocalActor that in turn calls KillActor. Since the raylet that receives KillLocalActor is local to the worker that the actor is on, we're guaranteed to kill it at that point (either through using KillActor, or if it hangs falling back to SIGKILL). Thus the main intuition is that the GCS now talks to the raylet, and this layer implements retries. Once the raylet receives the KillLocalActor request, it routes this to KillActor. This layer between the raylet and core worker does not have retries enabled because we can assume that RPCs between the local raylet and worker won't fail (same machine). We then check on the status of the worker after a while (5 seconds via kill_worker_timeout_milliseconds) and if it still hasn't been killed then we call DestroyWorker that in turn sends the SIGKILL. --------- Signed-off-by: joshlee <joshlee@anyscale.com>
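A Python sketch of the raylet-side flow described above. The real code is C++, and the function names and 5-second default are taken from the description; the polling loop is an illustrative simplification:

```python
import time


def kill_local_actor(kill_actor, worker_is_alive, destroy_worker,
                     timeout_s=5.0, poll_interval_s=0.5):
    """Try a graceful KillActor first, then fall back to DestroyWorker
    (SIGKILL) if the worker outlives the timeout."""
    kill_actor()  # local RPC to the worker; assumed not to need retries
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not worker_is_alive():
            return "graceful"
        time.sleep(poll_interval_s)
    destroy_worker()  # hard fallback, sends SIGKILL
    return "sigkill"
```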
upgrading data ci tests to py3.10 postmerge build: https://buildkite.com/ray-project/postmerge/builds/14192 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
upgrading serve tests to run on python 3.10 Post merge run: https://buildkite.com/ray-project/postmerge/builds/14190 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…roject#58307) There was a video object detection Ray Data workload hang reported. An initial investigation by @jjyao and @dayshah observed that it was due to an actor restart and the actor creation task was being spilled to a raylet that had an outdated resource view. This was found by looking at the raylet state dump. This actor creation task required 1 GPU and 1 CPU, and the raylet where this actor creation task was being spilled to had a cluster view that reported no available GPUs. However there were many available GPUs, and all the other raylet state dumps correctly reported this. Furthermore, in the raylet logs for the outdated raylet there was a "Failed to send a message to node: " originating from the ray syncer. Hence an initial hypothesis was formed that the ray syncer retry policy was not working as intended. A follow-up investigation by @edoakes and me revealed an incorrect usage of the grpc streaming callback API. Currently, how retries work in the ray syncer on a failure to send/write is: - OnWriteDone/OnReadDone(ok = false) is called after a failed read/write - Disconnect() (the one in *_bidi_reactor.h!) is called which flips _disconnected to true and calls DoDisconnect() - DoDisconnect() notifies grpc we will no longer write to the channel via StartWritesDone() and removes the hold via RemoveHold() - GRPC will see that the channel is idle and has no hold so will call OnDone() - we've overridden OnDone() to hold a cleanup_cb that contains the retry policy that reinitializes the bidi reactor and connects to the same server at a repeated interval of 2 seconds until it succeeds - fault tolerance accomplished! :) However, from logs that we added we weren't seeing OnDone() being called after DoDisconnect() happens. From reading the grpc streaming callback best practices here: https://grpc.io/docs/languages/cpp/best_practices/#callback-streaming-api it states that "The best practice is always to read until ok=false on the client side" From the OnDone grpc documentation: https://grpc.github.io/grpc/cpp/classgrpc_1_1_client_bidi_reactor.html#a51529f76deeda6416ce346291577ffa9: it states that "Notifies the application that all operations associated with this RPC have completed and all Holds have been removed" Since we call StartWritesDone() and removed the hold, this should notify grpc that all operations associated with this bidi reactor are completed. HOWEVER reads may not be finished, i.e. we have not read all incoming data. Consider the following scenario: 1.) We receive a bunch of resource view messages from the GCS and have not processed all of them 2.) OnWriteDone(ok = false) is called => Disconnected() => disconnected_ = true 3.) OnReadDone(ok = true) is called however because disconnected_ = true we early return and STOP processing any more reads as shown below: https://github.com/ray-project/ray/blob/275a585203bef4e48c04b46b2b7778bd8265cf46/src/ray/ray_syncer/ray_syncer_bidi_reactor_base.h#L178-L180 4.) Pending reads are left in the queue and prevent grpc from calling OnDone since not all operations are done 5.) Hang, we're left in a zombie state and drop all incoming resource view messages and don't send any resource view updates due to the disconnected check Hence the solution is to remove the disconnected check in OnReadDone and simply allow all incoming data to be read. There are a couple of interesting observations/questions remaining: 1.) The raylet with the outdated view is the local raylet to the gcs and we're seeing read/write errors despite being on the same node 2.) 
From the logs I see that the GCS syncer thinks that the channel to the raylet syncer is still available. There are no error logs on the GCS side; it's still sending messages to the raylet. Hence, even though the raylet gets the "Failed to write error: ", we don't see a corresponding error log on the GCS side. --------- Signed-off-by: joshlee <joshlee@anyscale.com>
…project#58161) ## Description kai-scheduler supports gang scheduling at [v0.9.3](NVIDIA/KAI-Scheduler#500 (comment)). But gang scheduling doesn't work at v0.9.4. However, it works again at v0.10.0-rc1. ## Related issues ## Additional information The reason might be as follows. The `numOfHosts` is taken into consideration at v0.9.3. https://github.com/NVIDIA/KAI-Scheduler/blob/0a680562b3cdbae7d81688a81ab4d829332abd0a/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162 That snippet of code is missing at v0.9.4. https://github.com/NVIDIA/KAI-Scheduler/blob/281f4269b37ad864cf7213f44c1d64217a31048f/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L131-L140 Then, it shows up again at v0.10.0-rc1. https://github.com/NVIDIA/KAI-Scheduler/blob/96b4d22c31d5ec2b7375b0de0e78e59a57baded6/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162 --------- Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
It is sometimes intuitive for users to provide their extensions with a '.'
at the start. This PR takes care of that and removes the '.' when it is
provided (see the normalization sketch after the examples below).
For example, when using `ray.data.read_parquet`, the parameter
`file_extensions` needs to be something like `['parquet']`. However,
intuitively some users may interpret this parameter as being able to use
`['.parquet']`.
This commit allows users to switch from:
```python
train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['parquet'],
)
```
to
```python
train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['.parquet'],  # Now will read files, instead of silently not reading anything
)
```
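The normalization itself can be pictured with a small sketch; the helper name is illustrative and not the exact code added to Ray Data:

```python
def normalize_file_extensions(file_extensions):
    """Accept both 'parquet' and '.parquet' by stripping one leading dot."""
    if file_extensions is None:
        return None
    return [ext[1:] if ext.startswith(".") else ext for ext in file_extensions]


assert normalize_file_extensions([".parquet", "csv"]) == ["parquet", "csv"]
```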
…roject#58372) When starting a Ray cluster in a KubeRay environment, the startup process may sometimes be slow. In such cases, it is necessary to increase the timeout duration for proper startup; otherwise, the error "ray client connection timeout" will occur. Therefore, we need to make the timeout and retry policies for the Ray worker configurable. --------- Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
…#58277) ## Description Rich progress currently doesn't support reporting progress from workers. As this is expected to require a lot of design consideration, default to using tqdm progress (which supports progress reporting from workers). Furthermore, we don't have auto-detection to set `use_ray_tqdm`, so the requirement is for that to be disabled as well. In summary, the requirements for rich progress as of now are: - rich progress bars enabled - use_ray_tqdm disabled. ## Related issues Fixes ray-project#58250 ## Additional information N/A --------- Signed-off-by: kyuds <kyuseung1016@gmail.com> Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
…#58381) and also use the 12.8.1 CUDA base image as the default Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
python 3.9 is out of its life cycle Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
it is using the same docker file, but was not updated. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Updating Ray examples to run on Python 3.10 as the minimum. Release build link: https://buildkite.com/ray-project/release/builds/66525 Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
upgrading rllib release tests to run on python 3.10 Release link: https://buildkite.com/ray-project/release/builds/66495#_ All failing tests are disabled Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
…y-project#58389) Updating core daily tests Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
## Description Adding missing test for issue detection --------- Signed-off-by: Matthew Owen <mowen@anyscale.com>
…#58414) Sorting requirements and constraints for raydepsets --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
PyArrow URL-encodes partition values when writing to cloud storage. To ensure the values are consistent when you read them back, this PR updates the partitioning logic to URL-decode them. See apache/arrow#34905. Closes ray-project#57564 --------- Signed-off-by: Lucas Lam <laml2@github.com> Signed-off-by: lucaschadwicklam97 <52645624+lucaschadwicklam97@users.noreply.github.com> Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu> Co-authored-by: Lucas Lam <laml2@github.com> Co-authored-by: Balaji Veeramani <bveeramani@berkeley.edu>
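The decode step can be pictured with a tiny sketch; the helper is illustrative, not the exact code path changed in the PR:

```python
from urllib.parse import unquote


def decode_partition_value(raw: str) -> str:
    # PyArrow writes Hive-style partition values URL-encoded (e.g. 'a%20b'
    # for 'a b'); decode when reading them back so round-trips are consistent.
    return unquote(raw)


assert decode_partition_value("a%20b") == "a b"
```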
…58345) ## Summary Adds a new method to expose all downstream deployments that a replica calls into, enabling dependency graph construction. ## Motivation Deployments call downstream deployments via handles in two ways: 1. **Stored handles**: Passed to `__init__()` and stored as attributes → `self.model.func.remote()` 2. **Dynamic handles**: Obtained at runtime via `serve.get_deployment_handle()` → `model.func.remote()` Previously, there was no way to programmatically discover these dependencies from a running replica. ## Implementation ### Core Changes - **`ReplicaActor.list_outbound_deployments()`**: Returns `List[DeploymentID]` of all downstream deployments - Recursively inspects user callable attributes to find stored handles (including nested in dicts/lists) - Tracks dynamic handles created via `get_deployment_handle()` at runtime using a callback mechanism - **Runtime tracking**: Modified `get_deployment_handle()` to register handles when called from within a replica via `ReplicaContext._handle_registration_callback` Next PR: ray-project#58350 --------- Signed-off-by: abrar <abrar@anyscale.com>
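A generic sketch of the recursive attribute inspection described above; `handle_type` stands in for Serve's deployment handle class, and the real implementation additionally tracks handles created via `get_deployment_handle()`:

```python
def find_handles(obj, handle_type, _seen=None):
    """Walk an object's attributes, dicts, and lists, collecting anything
    that is an instance of `handle_type`. Cycle-safe via an id() set."""
    _seen = set() if _seen is None else _seen
    if id(obj) in _seen:
        return []
    _seen.add(id(obj))
    if isinstance(obj, handle_type):
        return [obj]
    if isinstance(obj, dict):
        children = list(obj.values())
    elif isinstance(obj, (list, tuple, set)):
        children = list(obj)
    elif hasattr(obj, "__dict__"):
        children = list(vars(obj).values())
    else:
        children = []
    found = []
    for child in children:
        found.extend(find_handles(child, handle_type, _seen))
    return found
```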
The pull request #672 has too many files changed.
The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5208.
Summary of Changes: Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request introduces a comprehensive set of changes aimed at modernizing the build and test infrastructure, improving dependency management, and enhancing code quality. The changes span multiple areas, including Bazel configurations, Buildkite pipelines, Dockerfiles, and linting scripts, resulting in a more streamlined, reliable, and maintainable development process.
Code Review
This pull request is a large automated merge from master to main, containing a significant number of changes. The primary focus of these changes is a major refactoring of the CI/CD pipelines and the Bazel build system. Key changes include modularizing Buildkite steps, expanding test matrices for better coverage, and updating dependency management. The Bazel build system has been modernized, with a cleaner root BUILD.bazel file and adoption of standard packaging rules. Additionally, there are updates to linting configurations, C++ code modernization, and a switch to Apple Silicon runners for macOS CI. While the scope of changes is vast, they appear to be well-structured improvements to the project's infrastructure. I have one suggestion to improve the robustness of the CI utility code.
```python
def get_docker_image(docker_tag: str, build_id: Optional[str] = None) -> str:
    """Get rayci image for a particular tag."""
    if not build_id:
        build_id = _RAYCI_BUILD_ID
    if build_id:
        return f"{_DOCKER_ECR_REPO}:{build_id}-{docker_tag}"
    return f"{_DOCKER_ECR_REPO}:{docker_tag}"
```
The global variable _RAYCI_BUILD_ID is initialized at module import time. This can cause issues in tests where os.environ is patched after import, as this variable will not be updated. This has led to incorrect assertions in several tests (e.g., test_linux_tester_container.py, test_windows_container.py). To make this more robust, RAYCI_BUILD_ID should be read from os.environ directly inside this function.
```diff
 def get_docker_image(docker_tag: str, build_id: Optional[str] = None) -> str:
     """Get rayci image for a particular tag."""
     if not build_id:
-        build_id = _RAYCI_BUILD_ID
+        build_id = os.environ.get("RAYCI_BUILD_ID", "")
     if build_id:
         return f"{_DOCKER_ECR_REPO}:{build_id}-{docker_tag}"
     return f"{_DOCKER_ECR_REPO}:{docker_tag}"
```
This pull request has been automatically marked as stale because it has not had recent activity. You can always ask for help on our discussion forum or Ray's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed.
This Pull Request was created automatically to merge the latest changes from `master` into the `main` branch.
- Created: 2025-11-07
- Merge direction: `master` → `main`
- Triggered by: Scheduled

Please review and merge if everything looks good.