Conversation

@antfin-oss

This Pull Request was created automatically to merge the latest changes from the master branch into the main branch.

πŸ“… Created: 2025-11-20
πŸ”€ Merge direction: master β†’ main
πŸ€– Triggered by: Scheduled

Please review and merge if everything looks good.

goutamvenkat-anyscale and others added 30 commits October 31, 2025 14:54
## Description
Replace `map_batches` and NumPy invocations with `with_column` and Arrow kernels.

Release test:
https://buildkite.com/ray-project/release/builds/66243#019a37da-4d9d-4f19-9180-e3f3dc3f8043
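
For illustration, a minimal sketch of the pattern (assuming the `ray.data.expressions` API; the column names here are made up):

```python
import ray
from ray.data.expressions import col

ds = ray.data.range(3)  # one column: "id"

# Before: a NumPy-based map_batches UDF.
# ds = ds.map_batches(lambda batch: {"id": batch["id"], "doubled": batch["id"] * 2})

# After: a native expression evaluated with Arrow compute kernels.
ds = ds.with_column("doubled", col("id") * 2)
print(ds.take_all())
```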


---------

Signed-off-by: Goutam <goutam@anyscale.com>
…collate_fn (ray-project#58327)

Signed-off-by: Gang Zhao <gang@gang-JQ62HD2C37.local>
Co-authored-by: Gang Zhao <gang@gang-JQ62HD2C37.local>
## Description

This fixes the `ray symmetric-run` CLI workflow.

Right now if you use `ray symmetric-run` on 2.51 like 
```
 ray symmetric-run --address 127.0.0.1:6379 -- python my_script.py   
```

it will throw, since the `symmetric-run` argument is not handled. This was only caught once `symmetric-run` became part of the CLI.


---------

Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
…t#58247)

Updating hello world release & cluster release tests to run on py3.10

Passing release tests:
https://buildkite.com/ray-project/release/builds/65844

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Fix typos

Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
The current examples describe label bundles as being written as

`[{"ray.io/accelerator-type": "H100"} * 2]`, i.e. a dict multiplied by an integer.
This is wrong; it is the list that has to be multiplied.
This PR fixes that.
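
A minimal sketch of the corrected form:

```python
# Multiplying the dict raises a TypeError; multiply the list to repeat the bundle.
bundles = [{"ray.io/accelerator-type": "H100"}] * 2
assert bundles == [
    {"ray.io/accelerator-type": "H100"},
    {"ray.io/accelerator-type": "H100"},
]
```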

Signed-off-by: Daraan <github.blurry@9ox.net>
## Description

This PR implements `Result.from_path` in Ray Train v2, which reconstructs a `Result` object from the checkpoints. The implementation leverages `CheckpointManager` and refers to
https://github.com/ray-project/ray/blob/master/python/ray/train/v2/_internal/execution/controller/controller.py#L512-L540
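
For example (a hypothetical usage sketch; the exact import path and signature may differ):

```python
# Hypothetical usage; assumes a Ray Train v2 experiment directory with checkpoints.
from ray.train.v2.api.result import Result

result = Result.from_path("/tmp/ray_results/my_experiment")
print(result.metrics)
print(result.checkpoint)
```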

---------

Signed-off-by: xgui <xgui@anyscale.com>
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
Co-authored-by: Justin Yu <justinvyu@anyscale.com>
Add "WORKDIR /home/ray" in build-docker.sh.

If "WORKDIR" is not set, it defaults to /root, causing permission issues
with conda init.

```
31.00 # >>>>>>>>>>>>>>>>>>>>>> ERROR REPORT <<<<<<<<<<<<<<<<<<<<<<
31.00 
31.00     Traceback (most recent call last):
31.00       File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/exception_handler.py", line 18, in __call__
31.00         return func(*args, **kwargs)
31.00                ^^^^^^^^^^^^^^^^^^^^^
31.00       File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/cli/main.py", line 44, in main_subshell
31.00         context.__init__(argparse_args=pre_args)
31.00       File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/base/context.py", line 517, in __init__
31.00         self._set_search_path(
31.00       File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/common/configuration.py", line 1430, in _set_search_path
31.00         self._search_path = IndexedSet(self._expand_search_path(search_path, **kwargs))
31.00                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
31.00       File "/home/ray/anaconda3/lib/python3.12/site-packages/boltons/setutils.py", line 118, in __init__
31.00         self.update(other)
31.00       File "/home/ray/anaconda3/lib/python3.12/site-packages/boltons/setutils.py", line 351, in update
31.00         for o in other:
31.00                  ^^^^^
31.00       File "/home/ray/anaconda3/lib/python3.12/site-packages/conda/common/configuration.py", line 1403, in _expand_search_path
31.00         if path.is_file() and (
31.00            ^^^^^^^^^^^^^^
31.00       File "/home/ray/anaconda3/lib/python3.12/pathlib.py", line 892, in is_file
31.00         return S_ISREG(self.stat().st_mode)
31.00                        ^^^^^^^^^^^
31.00       File "/home/ray/anaconda3/lib/python3.12/pathlib.py", line 840, in stat
31.00         return os.stat(self, follow_symlinks=follow_symlinks)
31.00                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
31.00     PermissionError: [Errno 13] Permission denied: '$XDG_CONFIG_HOME/conda/.condarc'
31.00 
31.00 `$ /home/ray/anaconda3/bin/conda init`
31.00 
31.00   environment variables:
31.00                  CIO_TEST=<not set>
31.00                CONDA_ROOT=/home/ray/anaconda3
31.00            CURL_CA_BUNDLE=<not set>
31.00               HTTPS_PROXY=<set>
31.00                HTTP_PROXY=<set>
31.00           LD_LIBRARY_PATH=:/usr/local/nvidia/lib64
31.00                LD_PRELOAD=<not set>
31.00                  NO_PROXY=<set>
31.00                      PATH=/home/ray/anaconda3/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/
31.00                           bin:/sbin:/bin:/usr/local/nvidia/bin
31.00            PYTHON_VERSION=3.9
31.00        REQUESTS_CA_BUNDLE=<not set>
31.00             SSL_CERT_FILE=<not set>
31.00                http_proxy=<set>
31.00               https_proxy=<set>
31.00                  no_proxy=<set>
```

Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…oject#58329)

Created by release automation bot.

Update with commit a69004e

Signed-off-by: kevin <kevin@anyscale.com>
… and GRPO. (ray-project#57961)

## Description
Example for first blog in the RDT series using NIXL for GPU-GPU tensor
transfers.

---------

Signed-off-by: Ricardo Decal <public@ricardodecal.com>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Ricardo Decal <public@ricardodecal.com>
Co-authored-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Qiaolin Yu <liin1211@outlook.com>
python 3.9 is now out of the support window

all using python 3.12 wheel names for unit testing

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
we will stop releasing them

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
and move them into bazel dir.
getting ready for python version upgrade

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
python 3.9 is out of support window

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…ect#58375)

Starting with KubeRay 1.5.0, KubeRay supports gang scheduling for RayJob
custom resources.
This just adds a mention for the YuniKorn scheduler.

Related to ray-project/kuberay#3948.

Signed-off-by: win5923 <ken89@kimo.com>
This PR adds support for token-based authentication in the Ray
bi-directional syncer, for both client and server sides. It also
includes tests to verify the functionality.

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Support token-based authentication in the runtime env agent (client and server). Refactor the existing dashboard head code so that the utils and middleware can be reused by the runtime env agent as well.
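
As a rough illustration of the kind of shared token middleware this enables (an aiohttp-style sketch, not Ray's actual implementation):

```python
from aiohttp import web

def make_auth_middleware(expected_token: str):
    """Reusable middleware: reject requests that lack the expected token."""
    @web.middleware
    async def auth_middleware(request, handler):
        token = request.headers.get("Authorization", "").removeprefix("Bearer ")
        if token != expected_token:
            return web.Response(status=401, text="Invalid or missing auth token.")
        return await handler(request)
    return auth_middleware
```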

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…un. (ray-project#58335)

## Description
Add Spark master mode validation to let Ray run in Spark-on-YARN mode.

## Why is this needed?
Running Ray directly on a YARN cluster would require more tests and integration, plus setting up the related tools and environments. If Ray-on-Spark-on-YARN is supported and Spark environments are already set up, nothing else is needed: users can use Spark and run PySpark.

Signed-off-by: Cai Zhanqi <zhanqi.cai@shopee.com>
Co-authored-by: Cai Zhanqi <zhanqi.cai@shopee.com>
upgrading reef tests to run on 3.10

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
The issue with the current implementation of core worker HandleKillActor is that it won't send a reply when the RPC completes, because the worker is dead. The application code in the GCS doesn't really care, since it just logs the response if one is received; a response is only sent if the actor ID of the actor on the worker and the one in the RPC don't match, and the GCS will just log it and move on with its life.

Hence, in the case of a transient network failure, we can't differentiate whether there was a network issue or the actor was successfully killed. The most straightforward approach is that instead of the GCS directly calling core worker KillActor, the GCS talks to the raylet and calls a new RPC KillLocalActor, which in turn calls KillActor. Since the raylet that receives KillLocalActor is local to the worker that the actor is on, we're guaranteed to kill the actor at that point (either through KillActor, or by falling back to SIGKILL if it hangs).

Thus the main intuition is that the GCS now talks to the raylet, and
this layer implements retries. Once the raylet receives the
KillLocalActor request, it routes this to KillActor. This layer between
the raylet and core worker does not have retries enabled because we can
assume that RPCs between the local raylet and worker won't fail (same
machine). We then check the status of the worker after a while (5 seconds via kill_worker_timeout_milliseconds), and if it still hasn't been killed, we call DestroyWorker, which in turn sends the SIGKILL.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
upgrading data ci tests to py3.10

postmerge build:
https://buildkite.com/ray-project/postmerge/builds/14192

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
upgrading serve tests to run on python 3.10

Post merge run: https://buildkite.com/ray-project/postmerge/builds/14190

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
…roject#58307)

A hang was reported in a video object detection Ray Data workload.

An initial investigation by @jjyao and @dayshah observed that it was due
to an actor restart and the actor creation task was being spilled to a
raylet that had an outdated resource view. This was found by looking at
the raylet state dump. This actor creation task required 1 GPU and 1
CPU, and the raylet where this actor creation task was being spilled to
had a cluster view that reported no available GPUs. However there were
many available GPUs, and all the other raylet state dumps correctly
reported this. Furthermore, in the raylet logs for the outdated raylet there was a "Failed to send a message to node: " error originating from the ray syncer. Hence an initial hypothesis was formed that the ray syncer retry policy was not working as intended.

A follow-up investigation by @edoakes and me revealed an incorrect usage of the grpc streaming callback API.
Currently, retries in the ray syncer on a failed send/write work as follows:
- OnWriteDone/OnReadDone(ok = false) is called after a failed read/write
- Disconnect() (the one in *_bidi_reactor.h!) is called which flips
_disconnected to true and calls DoDisconnect()
- DoDisconnect() notifies grpc we will no longer write to the channel
via StartWritesDone() and removes the hold via RemoveHold()
- GRPC will see that the channel is idle and has no hold so will call
OnDone()
- we've overridden OnDone() to hold a cleanup_cb that contains the retry policy, which reinitializes the bidi reactor and reconnects to the same server at a repeated interval of 2 seconds until it succeeds
- fault tolerance accomplished! :) 

However, from the logs we added, we weren't seeing OnDone() being called after DoDisconnect(). From reading the grpc streaming callback best practices here:

https://grpc.io/docs/languages/cpp/best_practices/#callback-streaming-api
it states that "The best practice is always to read until ok=false on
the client side"
From the OnDone grpc documentation:
https://grpc.github.io/grpc/cpp/classgrpc_1_1_client_bidi_reactor.html#a51529f76deeda6416ce346291577ffa9:
it states that "Notifies the application that all operations associated
with this RPC have completed and all Holds have been removed"

Since we call StartWritesDone() and removed the hold, this should notify grpc that all operations associated with this bidi reactor are completed. HOWEVER, reads may not be finished, i.e. we may not have read all incoming data.
Consider the following scenario:
1.) We receive a bunch of resource view messages from the GCS and have
not processed all of them
2.) OnWriteDone(ok = false) is called => Disconnect() => disconnected_ = true
3.) OnReadDone(ok = true) is called, however because disconnected_ = true we early return and STOP processing any more reads as shown below:

https://github.com/ray-project/ray/blob/275a585203bef4e48c04b46b2b7778bd8265cf46/src/ray/ray_syncer/ray_syncer_bidi_reactor_base.h#L178-L180
4.) Pending reads are left in the queue and prevent grpc from calling OnDone, since not all operations are done
5.) Hang: we're left in a zombie state, dropping all incoming resource view messages and not sending any resource view updates due to the disconnected check

Hence the solution is to remove the disconnected check in OnReadDone and
simply allow all incoming data to be read.

There are a couple of interesting observations/questions remaining:
1.) The raylet with the outdated view is the raylet local to the GCS, and we're seeing read/write errors despite being on the same node
2.) From the logs I see that the GCS syncer thinks that the channel to the raylet syncer is still available. There are no error logs on the GCS side; it's still sending messages to the raylet. Hence even though the raylet gets the "Failed to write error: ", we don't see a corresponding error log on the GCS side.

---------

Signed-off-by: joshlee <joshlee@anyscale.com>
…project#58161)

## Description
kai-scheduler supports gang scheduling at
[v0.9.3](NVIDIA/KAI-Scheduler#500 (comment)).

But gang scheduling doesn't work at v0.9.4. However, it works again at
v0.10.0-rc1.

## Additional information
The reason might be as follows.

The `numOfHosts` is taken into consideration at v0.9.3.

https://github.com/NVIDIA/KAI-Scheduler/blob/0a680562b3cdbae7d81688a81ab4d829332abd0a/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

The snippet of code is missing at v0.9.4.

https://github.com/NVIDIA/KAI-Scheduler/blob/281f4269b37ad864cf7213f44c1d64217a31048f/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L131-L140

Then, it shows up at v0.10.0-rc1.

https://github.com/NVIDIA/KAI-Scheduler/blob/96b4d22c31d5ec2b7375b0de0e78e59a57baded6/pkg/podgrouper/podgrouper/plugins/ray/ray_grouper.go#L156-L162

---------

Signed-off-by: fscnick <fscnick.dev@gmail.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
It is sometimes intuitive for users to provide their extensions with a '.' at the start. This PR takes care of that and removes the '.' when it is provided.

For example, when using `ray.data.read_parquet`, the parameter
`file_extensions` needs to be something like `['parquet']`. However,
intuitively some users may interpret this parameter as being able to use
`['.parquet']`.

This commit allows users to switch from:

```python
train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['parquet'],
)
```

to

```python
train_data = ray.data.read_parquet(
    'example_parquet_folder/',
    file_extensions=['.parquet'],  # Now will read files, instead of silently not reading anything
)
```
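
Internally this implies a normalization along these lines (a hypothetical helper, not the actual Ray Data code):

```python
def _normalize_file_extensions(extensions):
    # Treat ".parquet" and "parquet" the same by stripping leading dots.
    return [ext.lstrip(".") for ext in extensions] if extensions else extensions

assert _normalize_file_extensions([".parquet"]) == ["parquet"]
assert _normalize_file_extensions(["parquet"]) == ["parquet"]
```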
…roject#58372)

When starting a Ray cluster in a KubeRay environment, the startup process may sometimes be slow. In such cases, it is necessary to increase the timeout duration for proper startup; otherwise, the error "ray client connection timeout" will occur. Therefore, we need to make the timeout and retry policies for the Ray worker configurable.

---------

Signed-off-by: OneSizeFitsQuorum <txypotato@gmail.com>
…#58277)

## Description
Rich progress currently doesn't support reporting progress from workers. As this is expected to take a lot of design consideration, default to using tqdm progress (which supports progress reporting from workers).

Furthermore, we don't have auto-detection to set `use_ray_tqdm`, so the requirement is for that to be disabled as well.

In summary, the requirements for rich progress as of now:
- rich progress bars enabled
- use_ray_tqdm disabled.

## Related issues
Fixes ray-project#58250

## Additional information
N/A

---------

Signed-off-by: kyuds <kyuseung1016@gmail.com>
Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
…#58381)

and also use the CUDA 12.8.1 base image as the default

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
400Ping and others added 22 commits November 19, 2025 09:14
…t#58568)

## Description
Document that we can now use Kueue + Ray autoscaler in KubeRay for
RayCluster and RayService.

## Related issues
Closes
[kuberay-ray-project#4186](ray-project/kuberay#4186)


---------

Signed-off-by: Ping <fourhundredping@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: 400Ping <fourhundredping@gmail.com>
Co-authored-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Jun-Hao Wan <ken89@kimo.com>
Fix unordered list rendering issues in reStructuredText (`.rst`) files. A blank line needs to be inserted before each unordered list group.

---------

Signed-off-by: curiosity-hyf <1184581135@qq.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ay-project#58763)

the old versions do not run on python 3.10+

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…th token in dashboard (ray-project#58819)

Support the `X-Ray-Authorization` header for accepting an auth token. This is used by KubeRay to pass the auth token when it makes requests to the Ray dashboard through the Kubernetes API via proxy.

This only affects APIs using the middleware (dashboard head and runtime env agent server).
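
For example (a minimal sketch; assumes a dashboard reachable at 127.0.0.1:8265 and a token in `RAY_AUTH_TOKEN`; the exact token format is an assumption):

```python
import os
import requests

resp = requests.get(
    "http://127.0.0.1:8265/api/version",
    headers={"X-Ray-Authorization": f"Bearer {os.environ['RAY_AUTH_TOKEN']}"},
)
resp.raise_for_status()
print(resp.json())
```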

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
…t + `rows_same` handle complex types (ray-project#58752)

## Description
1. Make `rows_same` handle unhashable types
2. Use that in `test_namespace_expressions`
3. Guard `test_datatype` tests to use pyarrow >= 19.0

## Related issues
Closes ray-project#58727 


---------

Signed-off-by: Goutam <goutam@anyscale.com>
…roject#58477)

**Summary**

Modified replica rank assignment to defer rank allocation until the
replica is actually allocated, rather than assigning it during the
startup call. This is necessary when we want to add node local rank in
future, in order to support node rank and node local rank we need to
know the node_id which is only known after replica is allocated.

**Changes**

- Changed `start()` method signature to accept `assign_rank_callback`
instead of a pre-assigned `rank` parameter
- Rank is now assigned after `_allocated_obj_ref` is resolved, ensuring
the replica is allocated before rank assignment
- Pass rank to `initialize_and_get_metadata()` method on the replica
actor, allowing rank to be set during initialization
- Updated `ReplicaBase.initialize()` to accept rank as a parameter and
set it along with the internal replica context
- Added `PENDING_INITIALIZATION` status check to handle cases where
`_ready_obj_ref` is not yet set

Next PR ray-project#58479
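
A simplified sketch of the deferred-rank pattern described above (illustrative only, not the actual Serve internals):

```python
import asyncio

class RankCounter:
    """Hands out ranks; an instance method is passed as assign_rank_callback."""
    def __init__(self):
        self._next = 0

    def assign_rank(self) -> int:
        rank, self._next = self._next, self._next + 1
        return rank

async def start_replica(allocated: "asyncio.Future", assign_rank_callback):
    await allocated                # placement resolved; node_id is now known
    rank = assign_rank_callback()  # rank assigned only after allocation
    return rank
```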

---------

Signed-off-by: abrar <abrar@anyscale.com>
…c Implementations (ray-project#58469)

Signed-off-by: ahao-anyscale <ahao@anyscale.com>
…ct#58818)

Replace placeholder.

---------

Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
to latest version 6.0.2

4.9.4 does not work with python 3.13

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
…-project#58659)

## Description
Internally, the `ApproximateTopK` aggregator uses
`frequent_strings_sketch` to implement efficient top-k calculations. As
hinted in the name `frequent_strings_sketch`, the current implementation
casts all data to strings before inputting it into the sketch, so the output data is also strings.

Therefore, when we have numeric data, for instance, we would get:
```
[{"id": "1", "count": 5} ... ]  # notice 1 is not an integer, but string
```
instead of 
```
[{"id": 1, "count": 5} ... ]
```
which would be expected.

Other types, like lists, tuples, etc., will also be cast to strings, making it hard for users to recover the data.

This PR (with offline discussion with some Ray Data team members)
attempts to use the `pickle` library to pickle and unpickle data so that
when you input the data to `frequent_strings_sketch`, you insert the hex
string of the pickle dump.

As a further improvement, this PR also supports an `encode_lists` flag to encode individual list values. This will be useful for our encoders (specifically `MultiHotEncoder` and `OrdinalEncoder`) in the future.
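
The encode/decode round trip looks roughly like this (hypothetical helper names):

```python
import pickle

def encode_item(value) -> str:
    # The hex string of the pickle dump is what goes into the string sketch.
    return pickle.dumps(value).hex()

def decode_item(encoded: str):
    # Recover the original, fully typed value from the sketch output.
    return pickle.loads(bytes.fromhex(encoded))

assert decode_item(encode_item(1)) == 1            # ints stay ints
assert decode_item(encode_item([1, 2])) == [1, 2]  # lists survive too
```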

## Related issues
N/A

## Additional information
N/A

---------

Signed-off-by: Daniel Shin <kyuseung1016@gmail.com>
Signed-off-by: kyuds <kyuseung1016@gmail.com>
## Description
 - Document the internal logic of the Streaming Repartition implementation
 - Add `num_rows_per_block` to the Streaming Repartition operator name

## Related issues


## Additional information

---------

Signed-off-by: You-Cheng Lin <mses010108@gmail.com>
Signed-off-by: You-Cheng Lin <106612301+owenowenisme@users.noreply.github.com>
Co-authored-by: Xinyuan <43737116+xinyuangui2@users.noreply.github.com>
…#58271)

`get_all_reported_checkpoints` can have 2 different consistency levels:
1) Return when all reported checkpoints have been assembled. Those were
the intended semantics before this PR, though there was a bug with
get_all_reported_checkpoints + async checkpointing in which we might not
wait for the most recently reported checkpoint to be assembled, which
this PR also fixes. This could be useful if users want to end training
after they have their desired checkpoint.
2) Return when all reported checkpoints have been validated. This is
useful for the original purpose of `get_all_reported_checkpoints`, which
was to wait until every single checkpoint has been reported/validated
before saving them to experiment tracking from the workers themselves
(not the driver).

This PR toggles between these semantics with the new `CheckpointConsistencyMode` enum.
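
A sketch of the toggle (the member names here are assumptions, not the actual API):

```python
from enum import Enum

class CheckpointConsistencyMode(Enum):
    ASSEMBLED = "assembled"  # return once all reported checkpoints are assembled
    VALIDATED = "validated"  # return once all reported checkpoints are validated
```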

---------

Signed-off-by: Timothy Seah <tseah@anyscale.com>
…ion (ray-project#58729)

## Description

Replace the existing KubeRay authentication guide based on kube-rbac-proxy with the native Ray token authentication introduced in Ray 2.52.0.


---------

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…ect#57047)


## Why are these changes needed?



### [Data] Handle FS serialization issue in get_parquet_dataset
### Issue

- In read_parquet, FS resolution is done by calling
_resolve_paths_and_filesystem on the driver node.
- However, on the worker nodes, get_parquet_dataset may not be able to
deserialize the FS.


```
2025-09-24 14:02:35,151 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ReadFiles() (pid=10215, ip=10.242.32.217)
    for b_out in map_transformer.apply_transform(iter(blocks), ctx):
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 532, in __call__
    for data in iter:
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/data/_internal/execution/operators/map_transformer.py", line 327, in __call__
    yield from self._batch_fn(input, ctx)
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/anyscale/data/_internal/planner/plan_read_files_op.py", line 79, in read_paths
    yield from reader.read_paths(
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/anyscale/data/_internal/readers/parquet_reader.py", line 137, in read_paths
    fragments = self._create_fragments(paths, filesystem=filesystem)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/anyscale/data/_internal/readers/parquet_reader.py", line 193, in _create_fragments
    parquet_dataset = call_with_retry(
                      ^^^^^^^^^^^^^^^^
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/data/_internal/util.py", line 1400, in call_with_retry
    raise e from None
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/data/_internal/util.py", line 1386, in call_with_retry
    return f()
           ^^^
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/anyscale/data/_internal/readers/parquet_reader.py", line 194, in <lambda>
    lambda: get_parquet_dataset(paths, filesystem, self._dataset_kwargs),
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_ray_cp311_cp311_manylinux_2_17_x86_64_d3f8d8c8/site-packages/ray/data/_internal/datasource/parquet_datasource.py", line 628, in get_parquet_dataset
    dataset = pq.ParquetDataset(
              ^^^^^^^^^^^^^^^^^^
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_pyarrow_cp311_cp311_manylinux_2_17_x86_64_06ff1264/site-packages/pyarrow/parquet/core.py", line 1793, in __new__
    return _ParquetDatasetV2(
           ^^^^^^^^^^^^^^^^^^
  File "/models/ray_pipelines/container/container_bin.runfiles/rules_python~~pip~python_deps_311_pyarrow_cp311_cp311_manylinux_2_17_x86_64_06ff1264/site-packages/pyarrow/parquet/core.py", line 2498, in __init__
    filesystem=fragment.filesystem
               ^^^^^^^^^^^^^^^^^^^
  File "pyarrow/_dataset.pyx", line 1901, in pyarrow._dataset.FileFragment.filesystem.__get__
  File "pyarrow/_fs.pyx", line 500, in pyarrow._fs.FileSystem.wrap
TypeError: Cannot wrap FileSystem pointer
2025-09-24 14:02:35,151 ERROR worker.py:421 -- Unhandled error (suppress with 'RAY_IGNORE_UNHANDLED_ERRORS=1'): ray::ReadFiles() (pid=10215, ip=1

```

**Fix**

Upon a `TypeError`, invoke `_resolve_paths_and_filesystem` on the worker again.
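
A minimal sketch of the fallback (simplified; assumes Ray's internal `_resolve_paths_and_filesystem` helper, whose import path may differ):

```python
import pyarrow.parquet as pq

# Internal Ray helper; exact module path is an assumption.
from ray.data.datasource.path_util import _resolve_paths_and_filesystem

def get_parquet_dataset(paths, filesystem, dataset_kwargs):
    try:
        return pq.ParquetDataset(paths, filesystem=filesystem, **dataset_kwargs)
    except TypeError:
        # The driver-resolved filesystem failed to deserialize on this worker;
        # re-resolve paths and filesystem locally, then retry once.
        paths, filesystem = _resolve_paths_and_filesystem(paths, filesystem)
        return pq.ParquetDataset(paths, filesystem=filesystem, **dataset_kwargs)
```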


---

> [!NOTE]
> Adds a worker-side filesystem resolution fallback in
`get_parquet_dataset` to handle PyArrow FS serialization errors, with a
new test validating the behavior.
> 
> - **Parquet datasource**:
> - `get_parquet_dataset`: On `TypeError`, re-resolves
`paths`/filesystem via `_resolve_paths_and_filesystem` on the worker and
wraps with `RetryingPyFileSystem` using `DataContext.retried_io_errors`;
continues to handle `OSError` via `_handle_read_os_error`.
>   - Import `_resolve_paths_and_filesystem` for local resolution.
> - **Tests**:
> - Add `test_get_parquet_dataset_fs_serialization_fallback` validating
failure with a problematic fsspec-backed FS and success via the helper
fallback (uses `PyFileSystem(FSSpecHandler(...))`).

---------

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…ject#58757)

## Description

In addition to block shaping by block size and number of rows, add an option to skip block shaping altogether in BlockOutputBuffer.

Signed-off-by: Srinath Krishnamachari <srinath.krishnamachari@anyscale.com>
…8760)

Building a platform-independent wheel to enable users to run raydepsets on any platform.

The Ray_img lock files are unchanged; only the headers have changed.

---------

Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Found some code that was put up for the old cgroup implementation. We
can delete it now since it's not used.

Signed-off-by: irabbani <irabbani@anyscale.com>
…on enum (ray-project#58509)

PR ray-project#56871 introduced the AggregationFunction enum for autoscaling metrics aggregation. However, the serve build command's YAML serialization path was not updated to handle Enum objects, causing a RepresenterError when serializing AggregationFunction enum values in the autoscaling config.

e.g. `yaml.representer.RepresenterError: ('cannot represent an object', <AggregationFunction.MEAN: 'mean'>)`

This fix adds:
1. a helper function that recursively converts Enum objects to their
string values before YAML dump, ensuring proper serialization of all
enum types in the configuration.
2. a test that
(a) Creates a temporary output YAML file
(b) Reads the config from that file
(c) Verifies that AggregationFunction.MEAN is correctly serialized as
"mean" (string)

Fixes ray-project#58485
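
The helper boils down to something like this (a sketch; the actual function name is an assumption):

```python
from enum import Enum

def _convert_enums(obj):
    """Recursively replace Enum members with their values before yaml.dump."""
    if isinstance(obj, Enum):
        return obj.value
    if isinstance(obj, dict):
        return {key: _convert_enums(value) for key, value in obj.items()}
    if isinstance(obj, list):
        return [_convert_enums(value) for value in obj]
    return obj
```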

---------

Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
Signed-off-by: Nikhil G <nrghosh@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
…th Ray Data (ray-project#58492)

This pull request updates the documentation for reading Hugging Face
datasets, recommending the use of ray.data.read_parquet with
HfFileSystem for better performance and scalability.
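
The recommended pattern looks roughly like this (the dataset path is illustrative):

```python
import ray
from huggingface_hub import HfFileSystem

ds = ray.data.read_parquet(
    "hf://datasets/username/my_dataset/data/train-00000-of-00001.parquet",
    filesystem=HfFileSystem(),
)
```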

---------

Signed-off-by: Robert Nishihara <rkn@anyscale.com>
- introduce new exception class for authentication related exceptions
(`ray.exceptions.AuthenticationError`)
- add docs link in auth exception
- fix minor papercuts
- use `RAY_LOG(FATAL)` instead of `RAY_CHECK` to avoid including `An
unexpected system state has occurred. You have likely discovered a bug
in Ray ...` in error message
- update dashboard dialogue
- update ray start error message

---------

Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
properly add `wait_cluster.py` in wait for nodes

Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
@sourcery-ai sourcery-ai bot left a comment

The pull request #684 has too many files changed.

The GitHub API will only let us fetch up to 300 changed files, and this pull request has 5434.

@gemini-code-assist

Summary of Changes

Hello @antfin-oss, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request is an automated daily merge from the master branch into main, incorporating a wide array of updates across the Ray project. The changes primarily focus on modernizing the build and continuous integration (CI) infrastructure, streamlining Python dependency management, and refining C++ build targets. These updates aim to improve build determinism, enhance CI pipeline efficiency, and prepare for future development by adopting newer tools and practices.

Highlights

  • Bazel Configuration Updates: The default Bazel build configuration now enables --incompatible_strict_action_env globally. Windows builds will use /utf-8 for C++ compilation, and warnings for src/ray/thirdparty/ files are now suppressed. Linux builds will execute a workspace status command, and macOS builds will suppress deprecated declaration errors. A new .user.bazelrc file can be imported for custom user options.
  • Buildkite CI Pipeline Refactoring: The Buildkite CI configuration has undergone significant restructuring. Image building steps previously defined in _forge.rayci.yml have been moved to a new dedicated _images.rayci.yml file. New ray-core-build, ray-dashboard-build, and ray-java-build steps have been introduced for core components. Python 3.10 has been added to various build matrices across different CI configurations, and several docbuild and docgpubuild steps have been relocated to a new doc.rayci.yml.
  • Python Dependency Management with uv and raydepsets: The CI now leverages uv for Python dependency management, replacing miniconda with miniforge in several Dockerfiles and scripts. A new dependencies.rayci.yml file has been added to manage pip-compile and raydepsets compilation. The raydepsets tool is introduced for building and managing Python dependency sets, enabling more granular control over requirements for different Ray components and environments.
  • C++ Build and Runtime Enhancements: The C++ build system has been refactored, removing numerous RPC-related ray_cc_library and cc_grpc_library targets, indicating a move towards more modular C++ components. The _raylet Cython extension now includes setproctitle and uses updated GCS and Raylet RPC clients. C++ tests now include RAYCI_DISABLE_TEST_DB=1 to prevent test database interactions in certain contexts. The RemoteFunctionHolder in the C++ API now uses lang_type_ for consistency.
  • Docker Image Tagging and Registry Updates: Docker image tagging logic has been updated to include RAYCI_BUILD_ID in internal tags and to use DEFAULT_PYTHON_TAG_VERSION (3.10) as the default. Azure Container Registry login and image pushing have been integrated into the Anyscale Docker container build process. New 'extra' image types (ray-extra, ray-ml-extra, ray-llm-extra) have been introduced for base dependencies and ML/LLM components.
Ignored Files
  • Ignored by pattern: .gemini/** (1)
    • .gemini/config.yaml
  • Ignored by pattern: .github/workflows/** (1)
    • .github/workflows/stale_pull_request.yaml

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This PR is a massive and impressive refactoring of the entire CI/CD and build system. The changes are well-structured and introduce many improvements, such as:

  • Modularizing CI configurations into separate files (_images.rayci.yml, dependencies.rayci.yml, doc.rayci.yml).
  • Standardizing on miniforge and uv for dependency management.
  • Refactoring the root BUILD.bazel and WORKSPACE files for better structure and use of standard Bazel practices like pkg_* rules.
  • Introducing the raydepsets tool for more robust dependency set management.
  • Improving the linting setup with pre-commit and adding new hooks like semgrep and eslint.
  • Cleaning up and modernizing C++ code.
  • Moving macOS CI to arm64.

The overall direction is excellent and will significantly improve the developer experience and CI stability. The changes are extensive, but they appear to be consistent and well-motivated. I have one minor suggestion for a newly added script.

Comment on lines +11 to +18
```python
def _find_pr_number(line: str) -> str:
    start = line.find("(#")
    if start < 0:
        return ""
    end = line.find(")", start + 2)
    if end < 0:
        return ""
    return line[start + 2 : end]
```


medium

This implementation is a bit fragile as it would also match non-numeric PR identifiers like (#abc). Adding a check with isdigit() would make it more robust.

Suggested change

```python
def _find_pr_number(line: str) -> str:
    start = line.find('(#')
    if start < 0:
        return ''
    end = line.find(')', start + 2)
    if end < 0:
        return ''
    pr_number_str = line[start + 2 : end]
    if pr_number_str.isdigit():
        return pr_number_str
    return ''
```
