🔄 daily merge: master → main 2025-11-05 #670

antfin-oss · 2025-11-05T02:56:00Z

This Pull Request was created automatically to merge the latest changes from master into main branch.

📅 Created: 2025-11-05
🔀 Merge direction: master → main
🤖 Triggered by: Scheduled

Please review and merge if everything looks good.

## Why are these changes needed? The `num_module_steps_trained_(lifetime)_throughput` metrics are biased because of how throughput times are recorded in a loop over module batches. This PR fixes the bias. ## Related issue number ## Checks - [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [x] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: simonsays1980 <simon.zehnder@gmail.com> Co-authored-by: Kamil Kaczmarek <kaczmarek.poczta@gmail.com>

…orker` (ray-project#57859) ## Description The type annotation for `actor_location_tracker` is currently `ActorLocationTracker`, but it should be `ray.actor.ActorHandle[ActorLocationTracker]`. This PR fixes that issue. Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>

ray-project#57834) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

…r'. (ray-project#57673) ## Why are these changes needed? The type hint for `learner_connector` in `AlgorithmConfig.training` was outdated: it still used `RLModule` as the parameter. This PR adjusts the type hint to the form of the callable that is actually expected. ## Related issue number ## Checks - [x] I've signed off every commit (by using the -s flag, i.e., `git commit -s`) in this PR. - [x] I've run pre-commit jobs to lint the changes in this PR. ([pre-commit setup](https://docs.ray.io/en/latest/ray-contribute/getting-involved.html#lint-and-formatting)) - [x] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [x] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>

`result_of_t` is deprecated Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ectural Design (ray-project#57889) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

- Disables Java tests; Ray Java is not supported on Apple silicon yet.
- Skips C++ tests that are not passing yet.

We already stopped releasing macOS wheels for Intel silicon, and the tests that are disabled or skipped were never passing on Apple silicon, so nothing has regressed. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ay-project#57876) ## Description ## Related issues Closes ray-project#57847 ## Additional information Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

…ystem cgroup (ray-project#57864) For more details about the resource isolation project see ray-project#54703. When starting the head node, move the dashboard api server's subprocesses into the system cgroup. I updated the integration test and added a helpful error message because the test will break in the future when a new dashboard module is added. I ran the integration tests 25 times locally.

> (ray2) ubuntu@devbox:~/code/ray2$ python -m pytest -s python/ray/tests/resource_isolation/test_resource_isolation_integration.py --count 25 -x
> ...
> Results (366.41s): 100 passed

--------- Signed-off-by: irabbani <israbbani@gmail.com>

…roject#57037) While `tail_job_logs()` runs after the job submission, a broken connection to the Ray head does not cause `tail_job_logs()` to raise any error, although it should. This PR queries the RayJob status when receiving the message and raises an error if the connection closed while the RayJob is not in a terminal state. ## Related issue number Closes: ray-project#57002 --------- Signed-off-by: machichima <nary12321@gmail.com>
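The behavior described above can be sketched roughly as follows. This is only an illustration, not Ray's implementation: `tail_logs`, `LogStreamClosedError`, the `None` "connection closed" marker, and the `TERMINAL_STATES` set are all hypothetical names chosen for the example.

```python
from typing import Callable, Iterable, Iterator, Optional

# Hypothetical set of terminal job states, used for illustration only.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


class LogStreamClosedError(RuntimeError):
    """Raised when the log stream closes while the job is still running."""


def tail_logs(
    messages: Iterable[Optional[str]],
    get_status: Callable[[], str],
) -> Iterator[str]:
    """Yield log lines; a `None` message stands in for 'connection closed'."""
    for message in messages:
        if message is None:
            # A closed connection is a clean exit only if the job has finished.
            status = get_status()
            if status not in TERMINAL_STATES:
                raise LogStreamClosedError(
                    f"log stream closed while job status is {status}"
                )
            return
        yield message
```

With this shape, a caller that loses the connection mid-run gets an exception instead of a silent end of the log stream.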

…ect#57897)

…ay-project#57802) ## Description 1. This PR added the `jax.distributed.shutdown()` for JaxBackend in order to free up any leaked resources on TPU RayTrainWorkers. 2. if `jax.distributed` is not on, it is a noop: https://docs.jax.dev/en/latest/_autosummary/jax.distributed.shutdown.html 3. Tested on Anyscale workspace. <img width="1264" height="62" alt="image" src="https://github.com/user-attachments/assets/f28102ff-f6d1-4da0-b41a-6cc785603e72" />

…ay Serve LLM (ray-project#57830) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

we are not releasing `x86_64` wheels anymore Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…57817) Signed-off-by: dayshah <dhyey2019@gmail.com>

…igurable (ray-project#57705) Recently, when we ran performance tests with task event generation turned on, we saw a performance regression when the workloads ran on machines with very few CPUs. Further investigation showed that the overhead mainly comes from the field name conversion when converting the proto message to a JSON payload in the aggregator agent. This PR adds an env var for the config to control the name conversion behavior and updates the corresponding tests. Note that we eventually plan to remove this config and turn off the field name conversion by default, once all current event usage has been migrated. --------- Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>

…57861) Signed-off-by: joshlee <joshlee@anyscale.com>

It used to be in 3 different groups, now unionized in 1. Signed-off-by: kevin <kevin@anyscale.com>

…nter (ray-project#56848) * Updated preprocessors to use a callback-based approach for stat computation. This improves code organization and reduces duplication. * Added ValueCounter aggregator and value_counts method to BlockColumnAccessor. Includes implementations for both Arrow and Pandas backends.   ## Why are these changes needed?  ## Related issue number  ## Checks - [ ] I've signed off every commit(by using the -s flag, i.e., `git commit -s`) in this PR. - [ ] I've run `scripts/format.sh` to lint the changes in this PR. - [ ] I've included any doc changes needed for https://docs.ray.io/en/master/. - [ ] I've added any new APIs to the API Reference. For example, if I added a method in Tune, I've added it in `doc/source/tune/api/` under the corresponding `.rst` file. - [ ] I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/ - Testing Strategy - [ ] Unit tests - [ ] Release tests - [ ] This PR is not tested :( --------- Signed-off-by: cem <cem@anyscale.com> Signed-off-by: cem-anyscale <cem@anyscale.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

… only once." (ray-project#57917) This PR fixes the Ray check failure `RayEventRecorder::StartExportingEvents() should be called only once.` The failure can occur in the following scenario:
- The metric_agent_client successfully establishes a connection with the dashboard agent. In this case, RayEventRecorder::StartExportingEvents is correctly invoked to start sending events.
- At the same time, the metric_agent_client exceeds its maximum number of connection retries. In this case, RayEventRecorder::StartExportingEvents is invoked again incorrectly, causing duplicate attempts to start exporting events.

This PR introduces two fixes:
- In metric_agent_client, the connection success and retry logic are now synchronized (previously they ran asynchronously, allowing both paths to trigger).
- Do not call StartExportingEvents if the connection cannot be established.

Test:
- CI

--------- Signed-off-by: Cuong Nguyen <can@anyscale.com>
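The two fixes listed above amount to a "resolve once, start only on success" guard. A minimal Python sketch of that idea follows; the actual change is in Ray's C++ code, and `ExportStarter` and `on_connect_result` are hypothetical names used only for illustration.

```python
import threading
from typing import Callable


class ExportStarter:
    """Runs a start callback at most once, and only after a successful connect."""

    def __init__(self, start_exporting: Callable[[], None]) -> None:
        self._start_exporting = start_exporting
        self._lock = threading.Lock()
        self._resolved = False  # True once either path has handled the outcome

    def on_connect_result(self, connected: bool) -> bool:
        """Handle 'connected' or 'retries exhausted'; return True if export started."""
        with self._lock:
            if self._resolved:
                # The other path already handled the outcome; do nothing.
                return False
            self._resolved = True
        if connected:
            self._start_exporting()
            return True
        # Connection could not be established: never start exporting.
        return False
```

Because both the success path and the retries-exhausted path go through the same lock-protected flag, only the first one to arrive has any effect.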

## Description

Ray data can't serialize zero (byte) length numpy arrays:

```python3
import numpy as np
import ray.data

array = np.empty((2, 0), dtype=np.int8)
ds = ray.data.from_items([{"array": array}])
for batch in ds.iter_batches(batch_size=1):
    print(batch)
```

What I expect to see:

```
{'array': array([], shape=(1, 2, 0), dtype=int8)}
```

What I see:

```
/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py:736: RuntimeWarning: invalid value encountered in scalar divide
  offsets = np.arange(
2025-10-17 17:18:09,499 WARNING arrow.py:189 -- Failed to convert column 'array' into pyarrow array due to: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []; falling back to serialize as pickled python objects
Traceback (most recent call last):
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 672, in from_numpy
    return cls._from_numpy(arr)
           ^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 736, in _from_numpy
    offsets = np.arange(
              ^^^^^^^^^^
ValueError: arange: cannot compute length

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 141, in convert_to_pyarrow_array
    return ArrowTensorArray.from_numpy(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/chris.ohara/Downloads/.venv/lib/python3.12/site-packages/ray/air/util/tensor_extensions/arrow.py", line 678, in from_numpy
    raise ArrowConversionError(data_str) from e
ray.air.util.tensor_extensions.arrow.ArrowConversionError: Error converting data to Arrow: column: 'array', shape: (1, 2, 0), dtype: int8, data: []
2025-10-17 17:18:09,789 INFO logging.py:293 -- Registered dataset logger for dataset dataset_0_0
2025-10-17 17:18:09,815 WARNING resource_manager.py:134 -- ⚠️ Ray's object store is configured to use only 33.5% of available memory (2.0GiB out of 6.0GiB total). For optimal Ray Data performance, we recommend setting the object store to at least 50% of available memory. You can do this by setting the 'object_store_memory' parameter when calling ray.init() or by setting the RAY_DEFAULT_OBJECT_STORE_MEMORY_PROPORTION environment variable.
{'array': array([array([], shape=(2, 0), dtype=int8)], dtype=object)}
```

This PR fixes the issue so that zero-length arrays are serialized correctly, and the shape and dtype are preserved.

## Additional information

This is `ray==2.50.0`.

--------- Signed-off-by: Chris O'Hara <cohara87@gmail.com> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
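The traceback points at an offsets computation built with `np.arange(`. One plausible reading (an assumption, since neither the failing call's arguments nor the actual fix is shown here) is that the step equals the number of items per element, which is zero for a `(2, 0)` array. The pure-Python sketch below shows the same pitfall with `range` and a formulation that avoids a zero step; `stepped_offsets` and `element_offsets` are hypothetical helpers, not Ray functions.

```python
from typing import List


def stepped_offsets(outer_len: int, items_per_element: int) -> List[int]:
    """Offsets built with a stepped range; breaks when items_per_element == 0."""
    stop = (outer_len + 1) * items_per_element
    return list(range(0, stop, items_per_element))


def element_offsets(outer_len: int, items_per_element: int) -> List[int]:
    """Offsets built by scaling an index range; well-defined for zero-size elements."""
    return [i * items_per_element for i in range(outer_len + 1)]
```

For non-empty elements both helpers agree; for zero-size elements only the second one returns a list of `outer_len + 1` zero offsets, which is the shape a variable-offset layout needs.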

use awscli directly; stop installing extra dependencies Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

Signed-off-by: joshlee <joshlee@anyscale.com>

## Description Found this while reading the docs. Not sure what this "Note that" is referring to or why it is there. Signed-off-by: Max van Dijck <50382570+MaxVanDijck@users.noreply.github.com>

…ray-project#57891) Signed-off-by: Seiji Eicher <seiji@anyscale.com>

it should not run on macos intel silicon anymore Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ect#57877) so that we are not tied to using public s3 buckets Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ject#57925) This PR moves the error handling of metric+event exporter agent one level up, inside the `metrics_agent_client` callback. Previously, the errors handled were handled by either the metric or event recorder, which leads to confusion and buggy code. Test: - CI --------- Signed-off-by: Cuong Nguyen <can@anyscale.com>

## Description Bumping from small to medium because it's timing out for Python 3.12. Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>

…project#57931) Signed-off-by: dayshah <dhyey2019@gmail.com>

Add "WORKDIR /home/ray" in build-docker.sh. If "WORKDIR" is not set, it defaults to /root, causing permission issues with conda init.

```
31.00 # >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
31.00
31.00 Traceback (most recent call last):
31.00   File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/exception_handler.py", line 18, in __call__
31.00     return func(*args, **kwargs)
31.00            ^^^^^^^^^^^^^^^^^^^^^
31.00   File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/cli/main.py", line 44, in main_subshell
31.00     context.__init__(argparse_args=pre_args)
31.00   File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/base/context.py", line 517, in __init__
31.00     self._set_search_path(
31.00   File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/common/configuration.py", line 1430, in _set_search_path
31.00     self._search_path = IndexedSet(self._expand_search_path(search_path, **kwargs))
31.00                                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
31.00   File "/home/ray/anaconda3/lib/python3.12/site-packages/boltons/setutils.py", line 118, in __init__
31.00     self.update(other)
31.00   File "/home/ray/anaconda3/lib/python3.12/site-packages/boltons/setutils.py", line 351, in update
31.00     for o in other:
31.00              ^^^^^
31.00   File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/common/configuration.py", line 1403, in _expand_search_path
31.00     if path.is_file() and (
31.00        ^^^^^^^^^^^^^^
31.00   File "/home/ray/anaconda3/lib/python3.12/pathlib.py", line 892, in is_file
31.00     return S_ISREG(self.stat().st_mode)
31.00                    ^^^^^^^^^^^
31.00   File "/home/ray/anaconda3/lib/python3.12/pathlib.py", line 840, in stat
31.00     return os.stat(self, follow_symlinks=follow_symlinks)
31.00            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
31.00 PermissionError: [Errno 13] Permission denied: '$XDG_CONFIG_HOME/conda/.condarc'
31.00
31.00 `$ /home/ray/anaconda3/bin/conda init`
31.00
31.00 environment variables:
31.00 CIO_TEST=<not set>
31.00 CONDA_ROOT=/home/ray/anaconda3
31.00 CURL_CA_BUNDLE=<not set>
31.00 HTTPS_PROXY=<set>
31.00 HTTP_PROXY=<set>
31.00 LD_LIBRARY_PATH=:/usr/local/nvidia/lib64
31.00 LD_PRELOAD=<not set>
31.00 NO_PROXY=<set>
31.00 PATH=/home/ray/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/
31.00 bin:/sbin:/bin:/usr/local/nvidia/bin
31.00 PYTHON_VERSION=3.9
31.00 REQUESTS_CA_BUNDLE=<not set>
31.00 SSL_CERT_FILE=<not set>
31.00 http_proxy=<set>
31.00 https_proxy=<set>
31.00 no_proxy=<set>
```

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

…ct#58320) Signed-off-by: win5923 <ken89@kimo.com>

…oject#58329) Created by release automation bot. Update with commit a69004e Signed-off-by: kevin <kevin@anyscale.com>

… and GRPO. (ray-project#57961) ## Description Example for first blog in the RDT series using NIXL for GPU-GPU tensor transfers. --------- Signed-off-by: Ricardo Decal <public@ricardodecal.com> Signed-off-by: Stephanie Wang <smwang@cs.washington.edu> Co-authored-by: Ricardo Decal <public@ricardodecal.com> Co-authored-by: Stephanie Wang <smwang@cs.washington.edu> Co-authored-by: Qiaolin Yu <liin1211@outlook.com>

Python 3.9 is now out of the support window; use Python 3.12 wheel names everywhere for unit testing. Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

we will stop releasing them Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

and move them into bazel dir. getting ready for python version upgrade Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

python 3.9 is out of support window Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

…ect#58375) Starting with KubeRay 1.5.0, KubeRay supports gang scheduling for RayJob custom resources. Just add a mention for Yunikorn scheduler. Related to ray-project/kuberay#3948. Signed-off-by: win5923 <ken89@kimo.com>

This PR adds support for token-based authentication in the Ray bi-directional syncer, for both client and server sides. It also includes tests to verify the functionality. --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

Support token-based authentication in the runtime env (client and server). Refactor the existing dashboard head code so that the utils and middleware can be reused by the runtime env agent as well. --------- Signed-off-by: sampan <sampan@anyscale.com> Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com> Co-authored-by: sampan <sampan@anyscale.com> Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>

…un. (ray-project#58335) ## Description > Add spark master model validation to let Ray run on Spark-On-YARN mode. ## Why need this? > If we run Ray directly on a YARN cluster, we need to do more testing and integration, and we also need to set up related tools and environments. If Ray-on-Spark-on-YARN is supported and a Spark environment is already set up, nothing else is needed: we can use Spark and let the user run pyspark. Signed-off-by: Cai Zhanqi <zhanqi.cai@shopee.com> Co-authored-by: Cai Zhanqi <zhanqi.cai@shopee.com>

upgrading reef tests to run on 3.10 Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

The issue with the current implementation of core worker HandleKillActor is that it does not send a reply when the RPC completes, because the worker is dead. The GCS application code doesn't really care: a response is sent only if the actor ID of the actor on the worker and the actor ID in the RPC don't match, and the GCS just logs any response it receives and moves on with its life. Hence, in the case of a transient network failure, we can't tell whether there was a network issue or the actor was successfully killed.

What I think is the most straightforward approach is that, instead of the GCS calling core worker KillActor directly, the GCS talks to the raylet and calls a new RPC, KillLocalActor, which in turn calls KillActor. Since the raylet that receives KillLocalActor is local to the worker that the actor is on, we're guaranteed to kill it at that point (either through KillActor or, if it hangs, by falling back to SIGKILL).

Thus the main intuition is that the GCS now talks to the raylet, and this layer implements retries. Once the raylet receives the KillLocalActor request, it routes it to KillActor. This layer between the raylet and core worker does not have retries enabled, because we can assume that RPCs between the local raylet and worker won't fail (same machine). We then check the status of the worker after a while (5 seconds via kill_worker_timeout_milliseconds), and if it still hasn't been killed we call DestroyWorker, which in turn sends the SIGKILL. --------- Signed-off-by: joshlee <joshlee@anyscale.com>
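The "graceful kill, then force kill after a timeout" step can be sketched in Python as below. This is a simplified illustration, not the raylet's C++ logic: `kill_with_fallback` and its callable parameters are hypothetical, and the sketch polls until the deadline, whereas the description above mentions a single status check after the timeout.

```python
import time
from typing import Callable


def kill_with_fallback(
    request_graceful_exit: Callable[[], None],
    is_alive: Callable[[], bool],
    force_kill: Callable[[], None],
    timeout_s: float = 5.0,
    poll_interval_s: float = 0.05,
) -> str:
    """Ask a worker to exit; force-kill it if it is still alive after timeout_s."""
    request_graceful_exit()  # stands in for the KillActor request
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if not is_alive():
            return "exited"
        time.sleep(poll_interval_s)
    if is_alive():
        force_kill()  # stands in for DestroyWorker sending SIGKILL
        return "force_killed"
    return "exited"
```

Either return value means the worker is gone, which is what lets the caller treat a reply as confirmation that the kill happened.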

upgrading data ci tests to py3.10 postmerge build: https://buildkite.com/ray-project/postmerge/builds/14192 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

upgrading serve tests to run on python 3.10 Post merge run: https://buildkite.com/ray-project/postmerge/builds/14190 --------- Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com> Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>

@jjyao

…roject#58307) There was a video object detection Ray Data workload hang reported. An initial investigation by @jjyao and @dayshah observed that it was due to an actor restart, where the actor creation task was being spilled to a raylet that had an outdated resource view. This was found by looking at the raylet state dump. The actor creation task required 1 GPU and 1 CPU, and the raylet it was being spilled to had a cluster view that reported no available GPUs. However, there were many available GPUs, and all the other raylet state dumps correctly reported this. Furthermore, the raylet logs for the outdated raylet contained a "Failed to send a message to node: " originating from the ray syncer. Hence an initial hypothesis was formed that the ray syncer retry policy was not working as intended.

A follow-up investigation by @edoakes and me revealed an incorrect usage of the grpc streaming callback API. Currently, retries in the ray syncer on a failed send/write work as follows:
- OnWriteDone/OnReadDone(ok = false) is called after a failed read/write
- Disconnect() (the one in *_bidi_reactor.h!) is called, which flips disconnected_ to true and calls DoDisconnect()
- DoDisconnect() notifies grpc that we will no longer write to the channel via StartWritesDone() and removes the hold via RemoveHold()
- GRPC will see that the channel is idle and has no hold, so it will call OnDone()
- we've overridden OnDone() to hold a cleanup_cb that contains the retry policy, which reinitializes the bidi reactor and connects to the same server at a repeated interval of 2 seconds until it succeeds
- fault tolerance accomplished! :)

However, from logs that we added, we weren't seeing OnDone() being called after DoDisconnect() happens. The grpc streaming callback best practices here: https://grpc.io/docs/languages/cpp/best_practices/#callback-streaming-api state that "The best practice is always to read until ok=false on the client side". The OnDone grpc documentation: https://grpc.github.io/grpc/cpp/classgrpc_1_1_client_bidi_reactor.html#a51529f76deeda6416ce346291577ffa9 states that it "Notifies the application that all operations associated with this RPC have completed and all Holds have been removed".

Since we call StartWritesDone() and remove the hold, this should notify grpc that all operations associated with this bidi reactor are completed. HOWEVER, reads may not be finished, i.e. we have not read all incoming data. Consider the following scenario:
1.) We receive a bunch of resource view messages from the GCS and have not processed all of them
2.) OnWriteDone(ok = false) is called => Disconnect() => disconnected_ = true
3.) OnReadDone(ok = true) is called; however, because disconnected_ = true, we return early and STOP processing any more reads, as shown below: https://github.com/ray-project/ray/blob/275a585203bef4e48c04b46b2b7778bd8265cf46/src/ray/ray_syncer/ray_syncer_bidi_reactor_base.h#L178-L180
4.) Pending reads are left in the queue and prevent grpc from calling OnDone, since not all operations are done
5.) Hang: we're left in a zombie state, drop all incoming resource view messages, and don't send any resource view updates due to the disconnected check

Hence the solution is to remove the disconnected check in OnReadDone and simply allow all incoming data to be read.

There are a couple of interesting observations/questions remaining:
1.) The raylet with the outdated view is the local raylet to the gcs, and we're seeing read/write errors despite being on the same node
2.) From the logs I see that the gcs syncer thinks that the channel to the raylet syncer is still available. There are no error logs on the gcs side; it's still sending messages to the raylet. Hence even though the raylet gets the "Failed to write error: " we don't see a corresponding error log on the GCS side.

--------- Signed-off-by: joshlee <joshlee@anyscale.com>
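The scenario above can be mimicked with a toy model. The sketch below is not gRPC and not the ray syncer code: it only encodes the explanation given here, namely that completion is signalled once no reads remain pending, and that an early return on the disconnected flag leaves reads in the queue. `simulate_reactor` and its parameters are hypothetical.

```python
from collections import deque
from typing import Iterable, List, Tuple


def simulate_reactor(
    pending_reads: Iterable[str],
    skip_reads_when_disconnected: bool,
) -> Tuple[List[str], bool]:
    """Return (processed messages, whether the toy 'OnDone' would fire)."""
    queue = deque(pending_reads)
    disconnected = True  # state right after OnWriteDone(ok = false)
    processed: List[str] = []
    while queue:
        if disconnected and skip_reads_when_disconnected:
            # Models the early return in OnReadDone: nothing else is consumed.
            break
        processed.append(queue.popleft())
    # Toy rule taken from the explanation above: done only when reads are drained.
    on_done_called = not queue
    return processed, on_done_called
```

With the check in place the model never reaches "done", which corresponds to the hang; with the check removed, every pending message is read and the retry path can run.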

…project#58161) ## Description kai-scheduler supports gang scheduling at [v0.9.3](NVIDIA/KAI-Scheduler#500 (comment)). However, gang scheduling doesn't work at v0.9.4, and it works again at v0.10.0-rc1. ## Related issues ## Additional information The reason might be as follows. The `numOfHosts` is taken into consideration at v0.9.3. https://github.com/NVIDIA/KAI-Scheduler/blob/0a680562b3cdbae7d81688a81ab4d829332abd0a/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162 The snippet of code is missing at v0.9.4. https://github.com/NVIDIA/KAI-Scheduler/blob/281f4269b37ad864cf7213f44c1d64217a31048f/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L131-L140 Then, it shows up again at v0.10.0-rc1. https://github.com/NVIDIA/KAI-Scheduler/blob/96b4d22c31d5ec2b7375b0de0e78e59a57baded6/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162 --------- Signed-off-by: fscnick <fscnick.dev@gmail.com>

sourcery-ai

The pull request #670 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5203.

gemini-code-assist · 2025-11-05T03:02:49Z

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request automates the daily merge from master to main, focusing on a significant overhaul of the project's build and continuous integration infrastructure. The changes aim to modernize Python dependency management using uv and raydepsets, standardize Docker image creation across various platforms and components, and refine CI pipelines for improved efficiency and maintainability. This extensive refactoring touches upon Bazel configurations, Docker image definitions, test execution strategies, and documentation-related tooling.

Highlights

Bazel Configuration Updates: The build --incompatible_strict_action_env flag is now enabled by default. New configurations were added for Linux (workspace status command) and Windows (UTF-8 C++ compiler option). Warnings for src/ray/thirdparty/ files are now ignored. Custom user options can be imported via .user.bazelrc, and macOS C++ builds will suppress deprecated declaration warnings.
CI Pipeline Refactoring: Several base Docker image build steps were consolidated and redefined in a new _images.rayci.yml file. Python 3.10 support was expanded across various CI matrices. New dedicated build steps were introduced for ray-core, ray-dashboard, and ray-java components. The raydepsets tool is now integrated for dependency compilation, and semgrep has been added for linting.
Python Dependency Management Migration: The project is migrating to uv and raydepsets for Python dependency management. This includes changes in Dockerfiles (Miniconda to Miniforge3), build scripts, and the introduction of new .depsets.yaml configuration files for various components like docs, Ray images, LLM, and ML.
macOS Build and Test Enhancements: macOS bisect jobs now use macos-arm64 instance types. The macOS wheel build process has been streamlined, and C++ tests are now executed via run_ray_cpp with updated commands.
Docker Image Tagging and Registry Updates: Docker image tagging logic has been refined to include RAYCI_BUILD_ID and support new ray-extra, ray-ml-extra, and ray-llm-extra image types. Azure Container Registry login and push commands were added to the Anyscale Docker container build process.
C++ API and Runtime Changes: C++ API headers were updated to use _ suffix for member variables in RemoteFunctionHolder. Metric recording in C++ now uses std::vector<std::pair<std::string_view, std::string>> for tags. GetNodeIpAddress was replaced with ray::GetNodeIpAddressFromPerspective for consistency.
Documentation Style and Tools: A new .cursor/rules/ray-docs-style.mdc file was added to enforce documentation style. pydantic version was updated in doc/requirements-doc.txt. Google Tag Manager IDs and event tracking were updated in _templates/extrahead.html and _static/js/custom.js.

Ignored Files

Ignored by pattern: .gemini/** (1)
- .gemini/config.yaml
Ignored by pattern: .github/workflows/** (1)
- .github/workflows/stale_pull_request.yaml

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request is an automated daily merge from master to main. It includes a large number of changes, primarily focused on refactoring and improving the build and CI systems. Key changes include:

Modularization of the Bazel build files, moving many targets from the root BUILD.bazel to more specific locations.
Extensive updates to Buildkite pipeline configurations, improving organization and expanding test coverage.
A switch from miniconda to miniforge for Python environment setup.
A move to ARM64-only for macOS CI builds.
Introduction of a new dependency management system raydepsets.
Modernization of C++ code, such as replacing std::result_of with std::invoke_result.

The changes are extensive but appear to be well-executed and consistent. I have one suggestion to improve the efficiency of a script used for generating Docker multi-arch indexes. Overall, the changes represent a significant improvement to the project's infrastructure.

gemini-code-assist · 2025-11-05T03:07:37Z

ci/ray_ci/automation/generate_index.py

+    tags = [f"rayproject/ray:{tag}" for tag in tags]
    indexes_to_publish = []
    for tag in tags:
-        if "-aarch64" not in tag and tag + "-aarch64" in tags:
+        if not tag.endswith("-aarch64") and tag + "-aarch64" in tags:
            indexes_to_publish.append((tag, tag + "-aarch64"))


For better performance, it's recommended to convert the tags list to a set before iterating over it. Checking for an element's presence in a list is an O(n) operation, which makes the loop O(n^2). Using a set for lookups will reduce the complexity to O(n) on average.

Suggested change

-    tags = [f"rayproject/ray:{tag}" for tag in tags]
-    indexes_to_publish = []
-    for tag in tags:
-        if not tag.endswith("-aarch64") and tag + "-aarch64" in tags:
-            indexes_to_publish.append((tag, tag + "-aarch64"))
+    tags_set = {f"rayproject/ray:{tag}" for tag in tags}
+    indexes_to_publish = []
+    for tag in tags_set:
+        if not tag.endswith("-aarch64") and f"{tag}-aarch64" in tags_set:
+            indexes_to_publish.append((tag, f"{tag}-aarch64"))
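One thing to keep in mind with the set-based approach: iterating over a Python set yields elements in arbitrary order, so the order of the published pairs may differ from the input. A minimal sketch of a variant that keeps the input order is shown here, iterating over the original list and using the set only for membership checks. The function name `pair_multiarch_tags` and the sample tag values are hypothetical and chosen for illustration; they are not taken from `generate_index.py`.

```python
def pair_multiarch_tags(tags):
    """Pair each base tag with its "-aarch64" sibling, keeping input order.

    Hypothetical helper for illustration; not the code in generate_index.py.
    """
    tag_set = set(tags)  # average O(1) membership checks instead of O(n)
    return [
        (tag, f"{tag}-aarch64")
        for tag in tags  # iterate the list, not the set, to preserve order
        if not tag.endswith("-aarch64") and f"{tag}-aarch64" in tag_set
    ]


# Illustrative tag values (assumed, not real publish data).
tags = [f"rayproject/ray:{t}" for t in ["2.51.0", "2.51.0-aarch64", "nightly"]]
pairs = pair_multiarch_tags(tags)
print(pairs)
```

Here `nightly` is skipped because no matching `-aarch64` tag is present in the input, and the `-aarch64` tag itself is never treated as a base tag.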

gemini-code-assist · 2025-11-05T03:09:04Z

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the latest changes from the master branch into main, focusing on a comprehensive overhaul of the project's build and continuous integration infrastructure. The updates aim to modernize the Bazel build configurations, streamline Docker image creation, enhance test execution across various environments and Python versions, and improve dependency management. These changes collectively contribute to a more robust, efficient, and maintainable development workflow.

Highlights

Bazel Configuration Updates: The .bazelrc file has been significantly updated to enable --incompatible_strict_action_env by default, add a workspace_status_command for Linux builds, include /utf-8 cxxopt for Windows, and ignore warnings from src/ray/thirdparty files. It also introduces a try-import for .user.bazelrc and adds a flag to suppress deprecated declaration errors on macOS.
CI Pipeline Refactoring: The Buildkite CI configuration has undergone a major refactoring. Image build steps were moved from _forge.rayci.yml to a new _images.rayci.yml file, centralizing Docker image definitions. Various CI jobs now support Python 3.10, and new base images for CUDA 12.8 (e.g., oss-ci-base_cu128-multipy) have been introduced.
Build and Test Process Enhancements: New dedicated build steps for ray-core, ray-dashboard, and ray-java have been added. The CI now includes ray-extra Docker image builds and updated ray-llm builds to CUDA 12.8. Test configurations have been streamlined, with cgraph and workflow tests removed, dask and modin tests split, and new java worker tests introduced. Instance types for macOS tests have been standardized to macos-arm64.
Dependency Management with Raydepsets: A new dependencies.rayci.yml file has been added to manage pip-compile and raydepsets compilation. The ci/compile_llm_requirements.sh script has been refactored to leverage the new raydepsets tool for dependency compilation.
Docker Image Base Updates: The base Dockerfiles for forge and manylinux images now use miniforge3 instead of miniconda3 and include uv for Python environment management. Docker group IDs have been updated, and Azure CLI installation has been integrated into the forge image.
Code Ownership and Linting Updates: The .github/CODEOWNERS file has been updated to reflect current team ownerships. Pre-commit hooks have been enhanced with new linters like semgrep, vale, cython-lint, and eslint, and the pydoclint hook now includes CI-specific baseline management.

Ignored Files

Ignored by pattern: .gemini/** (1)
- .gemini/config.yaml
Ignored by pattern: .github/workflows/** (1)
- .github/workflows/stale_pull_request.yaml

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	`@gemini-code-assist`	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

gemini-code-assist

Code Review

This pull request introduces a massive and impressive refactoring of the project's build and CI/CD infrastructure. The changes are extensive, touching upon Bazel configurations, Buildkite pipelines, dependency management, linting, and documentation. Key improvements include a major cleanup and modularization of the Bazel build system, a comprehensive overhaul of the CI pipelines for better organization and efficiency, the introduction of a new raydepsets tool for dependency management, and a switch to a more robust pre-commit-based linting workflow. The C++ code has also been modernized, and documentation has been updated. Overall, these are high-quality changes that significantly enhance the project's maintainability and developer experience. I have not found any issues of medium or higher severity and approve of these changes.

github-actions · 2025-11-20T01:40:42Z

This pull request has been automatically marked as stale because it has not had any activity for 14 days. It will be closed in another 14 days if no further activity occurs.
Thank you for your contributions.

You can always ask for help on our discussion forum or Ray's public Slack channel.

If you'd like to keep this open, just leave any comment, and the stale label will be removed.

simonsays1980 and others added 30 commits October 18, 2025 19:58

[Core] Reschedule leases in local lease manager when draining the node (

993139e

ray-project#57834) Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>

[core] use invoke_result_t in cpp worker example (ray-project#57885)

697c7bc

`result_of_t` is deprecated Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[serve][llm][refactor] Align Ray Serve LLM Code Structure with Archit…

de50b23

…ectural Design (ray-project#57889) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

[Doc][Serve] Import AutoscalingContext in autoscaling policy example (r…

b988ce4

…ay-project#57876) ## Description ## Related issues Closes ray-project#57847 ## Additional information Signed-off-by: daiping8 <dai.ping88@zte.com.cn>

removed adding the TaskPoolStrategy as it's not needed here (ray-proj…

b4f7a70

…ect#57897)

[docs][serve][llm] Add comprehensive architecture documentation for R…

3287523

…ay Serve LLM (ray-project#57830) Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>

[release auto] remove x86_64 wheel verification (ray-project#57913)

6d51184

we are not releasing `x86_64` wheels anymore Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[core] Kill raylet file and just keep node manager file (ray-project#…

532ac12

…57817) Signed-off-by: dayshah <dhyey2019@gmail.com>

[core] Make DrainRaylet + ShutdownRaylet Fault Tolerant (ray-project#…

2bbd13a

…57861) Signed-off-by: joshlee <joshlee@anyscale.com>

[release] Group all hello world tests together (ray-project#57920)

670151e

It used to be in 3 different groups, now unionized in 1. Signed-off-by: kevin <kevin@anyscale.com>

[ci] fix postmerge tests that require credentials (ray-project#57915)

034c54f

use awscli directly; stop installing extra dependencies Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[core] Make ReleaseUnusedBundles Fault Tolerant (ray-project#57786)

a9065a3

Signed-off-by: joshlee <joshlee@anyscale.com>

[doc] remove "Note that" in dataset.py documentation (ray-project#57884)

f2aa5a8

## Description Found this while reading the docs. Not sure what this "Note that" is referring to or why it is there. Signed-off-by: Max van Dijck <50382570+MaxVanDijck@users.noreply.github.com>

[codeowners] Reorder CODEOWNERS for resolution order + organization (…

2c680d7

…ray-project#57891) Signed-off-by: Seiji Eicher <seiji@anyscale.com>

[ci] change macos bisect job to use arm64 (ray-project#57914)

65bb37d

it should not run on macos intel silicon anymore Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[doc build] use rayci.anyscale.dev to fetch doc build cache (ray-proj…

4badd82

…ect#57877) so that we are not tied to using public s3 buckets Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[train] bump test_util timeout (ray-project#57939)

f97b6a6

## Description Bumping from small to medium because it's timing out for Python 3.12. Signed-off-by: Matthew Deng <matthew.j.deng@gmail.com>

[core] Don't log actor restart warning if arg is detached actor (ray-…

d86484d

…project#57931) Signed-off-by: dayshah <dhyey2019@gmail.com>

my-vegetable-has-exploded and others added 18 commits November 1, 2025 21:48

[Docs][KubeRay] Add Volcano RayJob gang scheduling example (ray-proje…

91ac4c7

…ct#58320) Signed-off-by: win5923 <ken89@kimo.com>

[docker] Update latest Docker dependencies for 2.51.0 release (ray-pr…

c90aacc

…oject#58329) Created by release automation bot. Update with commit a69004e Signed-off-by: kevin <kevin@anyscale.com>

[wheel] stop uploading python 3.9 wheels on release (ray-project#58363)

a64b756

python 3.9 is now out of the support window all using python 3.12 wheel names for unit testing Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[ci] stop verifying python 3.9 wheels (ray-project#58365)

8f466d7

we will stop releasing them Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[bazel] rename python runtime to py39 runtime (ray-project#58362)

44e8b1d

and move them into bazel dir. getting ready for python version upgrade Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[image] stop building python 3.9 release images (ray-project#58374)

d3d6b6b

python 3.9 is out of support window Signed-off-by: Lonnie Liu <lonnie@anyscale.com>

[ci] reef tests on py310 (ray-project#58379)

01ad74f

upgrading reef tests to run on 3.10 Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>

antfin-oss requested review from SongGuyang and kfstorm as code owners November 5, 2025 02:56

antfin-oss added auto-generated daily-merge labels Nov 5, 2025

antfin-oss assigned ffbin Nov 5, 2025

sourcery-ai bot reviewed Nov 5, 2025

View reviewed changes

gemini-code-assist bot reviewed Nov 5, 2025

View reviewed changes

github-actions bot added the stale label Nov 20, 2025


82 participants
