[Train] Implement a JaxTrainer to support SPMD with TPUs #55207
Conversation
Summary of Changes
Hello @ryanaoleary, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a new JaxTrainer in Ray Train to enable distributed JAX training, specifically targeting SPMD workloads on Google TPUs. This involved significant enhancements to Ray's TPU accelerator management, including new APIs for detecting TPU topology and worker IDs, and integrating these details into Ray's node labeling system. I've also updated the ScalingConfig to allow users to specify TPU usage and topology, and introduced specialized placement group logic to correctly schedule multi-host TPU slices. Additionally, a mechanism to handle TPU lock files has been added to improve robustness.
Highlights
- **New JaxTrainer for TPU Workloads:** I've introduced a new `JaxTrainer` class within `ray.train.jax`, specifically designed to facilitate distributed JAX training on TPUs. This trainer leverages Ray's distributed capabilities to manage JAX workloads across multiple TPU devices.
- **Expanded Scaling Configuration for TPUs:** I've enhanced the `ScalingConfig` in both `ray.air.config` and `ray.train.v2.api.config` by adding `use_tpu` and `topology` parameters. These allow users to explicitly request TPU resources and specify the desired TPU topology, which is crucial for proper resource allocation and worker placement in SPMD scenarios.
- **Improved TPU Accelerator Management and Metadata Discovery:** Significant updates have been made to the `TPUAcceleratorManager` in `python/ray/_private/accelerators/tpu.py`. This includes adding an `SPMDHost` dataclass, functions to infer TPU pod types from topology, and a mechanism to fetch TPU metadata from remote hosts. Crucially, I've exposed previously private methods and added `get_current_node_tpu_topology` and `get_current_node_accelerator_labels` to provide detailed TPU-specific node information for scheduling.
- **Integration of TPU-Specific Ray Node Labels:** To support advanced TPU scheduling, I've integrated new TPU-specific node labels into Ray's core. New constants like `kLabelKeyTpuTopology`, `kLabelKeyTpuSliceName`, `kLabelKeyTpuWorkerId`, and `kLabelKeyTpuPodType` have been added to `src/ray/common/constants.h` and exposed in Python. These labels enable precise placement of TPU workers for SPMD execution.
- **Specialized Placement Group Creation for SPMD on TPUs:** I've implemented specialized placement group creation logic, `create_placement_group_with_spmd`, within `_backend_executor.py`. This method handles the complex task of reserving a head node in a multi-host TPU slice and then atomically scheduling the remaining workers to that specific slice, ensuring proper SPMD setup.
- **TPU Lock File Management Utility:** A new utility, `release_tpu_lock`, has been added to `python/ray/train/jax/config.py`. This function helps manage and release potential TPU lock files, which can prevent JAX processes from utilizing TPUs, especially after job restarts or failures.
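The scheduling idea behind the node-label highlights — pinning every worker of a multi-host slice to the same slice via labels — can be sketched with plain dictionaries. The label key strings and the helper name below are illustrative assumptions for this sketch, not the PR's exact constants:

```python
# Hypothetical sketch: build one label selector per SPMD worker so that all
# workers of a worker group are pinned to a single multi-host TPU slice.
# The "ray.io/..." label keys are assumed for illustration.

def build_slice_label_selectors(slice_name: str, num_workers: int) -> list:
    """Return one label selector dict per worker, all pinned to slice_name."""
    selectors = []
    for worker_id in range(num_workers):
        selectors.append(
            {
                "ray.io/tpu-slice-name": slice_name,      # assumed label key
                "ray.io/tpu-worker-id": str(worker_id),   # assumed label key
            }
        )
    return selectors


selectors = build_slice_label_selectors("tpu-slice-0", 4)
```

A list like this could then be passed as a per-bundle `bundle_label_selector` so the scheduler places each bundle on the matching TPU host.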
Code Review
This pull request introduces a JaxTrainer to support SPMD workloads on TPUs, which is a significant feature addition. The changes include new configurations in ScalingConfig, TPU-specific logic in the TPUAcceleratorManager, and a new JAX backend for Ray Train. The implementation looks mostly correct, but there are several critical issues that need to be addressed. These include a TypeError due to a function call with an incorrect signature in the JAX backend, another TypeError from incorrect list creation for placement group selectors, and a missing method call that would lead to an AttributeError. Additionally, there's a high-severity issue with the use of sudo in the code, which poses a security risk. I've also found an incorrect test case and some minor issues. Addressing these points will be crucial for the stability and security of this new feature.
Force-pushed from 8fac670 to e898ed9.
Commits:
- respect encapsulation and add tests
- Fix comments
- Remove leading underscores and update how topology is retrieved from GCE
- Replace tpu head label with tpu pod type
- JaxTrainer support with SPMD in V2
- Fix errors and remove release_tpu_lock logic
- Fix errors and move jax trainer to v2
- Move all JaxTrainer logic to V2 Train

All signed off by Ryan O'Leary <ryanaoleary@google.com>.
Force-pushed from e898ed9 to 76983cf.
cc: @matthewdeng @andrewsykim I moved all the code under V2 since this change adds new API fields to the
```python
placement_group = None
backend_config = self._train_run_context.backend_config

if getattr(backend_config, "use_tpu", False):
    try:
        placement_group = reserve_tpu_slice(
            num_workers=num_workers,
            resources_per_worker=resources_per_worker,
            topology=getattr(backend_config, "topology", None),
            accelerator_type=getattr(backend_config, "accelerator_type", None),
        )
    except Exception as e:
        return ControllerError(e)
```
This logic is a bit specific to TPUs so I don't know if it belongs in this generic Controller code.
I think we should aim to find a way to modularize this more, perhaps as part of a Callback that gets created/injected whenever use_tpu=True.
Done in a341722. I added a new `TPUReservationCallback` (a `ControllerCallback`) that gets called in `_start_worker_group` before creating the immutable `WorkerGroupContext`. The callback returns a `bundle_label_selector` to influence the scheduling of the `WorkerGroup`. Currently `TPUReservationCallback` is the only callback that ends up getting called, but this way the logic in the controller does not directly reference TPUs and is extensible to other backends.
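The shape of that hook can be sketched in plain Python. The base-class name, method name, and label key below are illustrative assumptions, not the PR's actual `ControllerCallback` API:

```python
# Hypothetical sketch of a controller callback that contributes per-worker
# label selectors before the worker group is created. Interface names are
# assumptions for illustration.

class ControllerCallback:
    def before_worker_group_start(self, num_workers: int) -> list:
        """Return a bundle_label_selector (one dict per worker), or []."""
        return []


class TPUReservationCallback(ControllerCallback):
    def __init__(self, slice_name: str):
        # Slice name obtained from the earlier head-node reservation step.
        self.slice_name = slice_name

    def before_worker_group_start(self, num_workers: int) -> list:
        # Pin every worker bundle to the reserved TPU slice.
        return [{"ray.io/tpu-slice-name": self.slice_name}] * num_workers


def collect_bundle_label_selector(callbacks, num_workers):
    # The controller stays TPU-agnostic: it merges whatever the registered
    # callbacks return, without referencing TPUs directly.
    for cb in callbacks:
        selector = cb.before_worker_group_start(num_workers)
        if selector:
            return selector
    return []
```

The controller only depends on the generic callback interface, which is what makes the approach extensible to other backends.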
```python
num_workers: int
resources_per_worker: Dict[str, float]
placement_strategy: str = "PACK"
placement_group: Optional[PlacementGroup] = None
```
Not sure if it's a good idea to pass in a created PlacementGroup. The divergence in PlacementGroup creation logic may lead to unexpected behavior depending on which path it goes down.
I was worried about the head PlacementGroup being released before the full slice was scheduled, causing a race condition. Thinking about it more I think that the PG shouldn't be removed until the JaxTrainer is cleaned up, so this shouldn't be a concern. I'll update the implementation to just return the slice name and pass a bundle_label_selector to the WorkerGroup.
Done in a341722.
python/ray/train/v2/api/config.py (outdated)
```
topology: [Experimental] If specified, Ray Train will launch the training
    coordinator and workers on nodes with the specified topology. Topology is
    auto-detected for TPUs and added as Ray node labels. This arg enables
    SPMD execution of the training workload.
```
Is topology a first-class Ray Core concept? We'd want to make sure it's easy to understand from the API what inputs this takes in and how it'll be used.
Also for TPU users how familiar are topology/accelerator? Would it be easier for the user to just specify the pod type directly?
I don't see topology used in Ray core at all, except to configure TPU env vars and node labels - but any users of multi-host TPUs should be familiar with the concept. The concept is also already introduced in KubeRay through the numOfHosts field: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/tpu.html.
I think topology and accelerator type are the best top-level variables for users to specify, since currently in GKE these are the two values users configure when creating their GKE nodepool and when scheduling pods to it using the cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology nodeSelectors: https://cloud.google.com/kubernetes-engine/docs/how-to/tpus.
Topology is quite standard TPU concept. TPU type / Pod Type is in some cases not uniquely mapped to a topology.
Awesome, is it safe to say that this API would then be super intuitive for a TPU user? Is there any other grouping/organization that might be more natural to how a user thinks about setting up their workload?
```python
scaling_config=ScalingConfig(
    use_tpu=True,
    num_workers=4,
    topology="2x2x4",
    accelerator_type="TPU-V4",
    resources_per_worker={"TPU": 4},
    placement_strategy="SPREAD",
),
```
Yeah I think this top level API should be clear for TPU users - the only thing I can think of is that we could have num_workers, resources_per_worker and placement_strategy be auto-set based on the topology if not provided. For example, if we have a multi-host topology of 4x4 v6e we could automatically detect that num_workers should be 4, resources_per_worker should be TPU: 4 since that's the number of chips on each host, and placement_strategy should be SPREAD.
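That auto-detection idea can be sketched as a small helper that derives the scaling fields from the topology string. The helper, its name, and the fixed chips-per-host constant are illustrative assumptions (real logic would key chips-per-host off the accelerator type), not part of the PR:

```python
import math

# Hypothetical sketch: derive ScalingConfig-style defaults from a TPU
# topology string such as "2x2x4" or "4x4". Assumes 4 chips per host,
# which matches the 4x4 v6e example discussed above; production logic
# would look this up per accelerator type.
CHIPS_PER_HOST = 4

def defaults_from_topology(topology: str) -> dict:
    dims = [int(d) for d in topology.split("x")]
    total_chips = math.prod(dims)
    num_hosts = max(1, total_chips // CHIPS_PER_HOST)
    return {
        "num_workers": num_hosts,
        "resources_per_worker": {"TPU": min(total_chips, CHIPS_PER_HOST)},
        # Spread workers so each lands on a distinct TPU host.
        "placement_strategy": "SPREAD" if num_hosts > 1 else "PACK",
    }
```

For the "2x2x4" example above, this yields 4 workers with `{"TPU": 4}` each and a `SPREAD` strategy, matching the manually specified `ScalingConfig`.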
Validated the JaxTrainer manually using the following MaxText example:

Verified that TPU-specific Ray node labels are set:
python/ray/train/v2/jax/tpu_utils.py (outdated)
```python
head_placement_group = ray.util.placement_group(
    bundles=[{f"TPU-{pod_type}-head": 1}],
    bundle_label_selector=[head_label_selector],
)
```
Since we are only using this for finding the head host, we can simplify the logic to schedule an Actor rather than a Placement Group.
Do we know when the Ray Actor will get cleaned up? My concern is that the Ray Actor will go out of scope and release the head resource before the full slice schedules, causing a race condition. Conversely, if this does work, is it okay to create a bunch of Ray Actors that are never cleaned up, running on each TPU head? I wasn't sure which of the two, actors or placement groups, has lower overhead.
@MengjinYan do you know what is recommended from Core side?
I believe the lifecycle of Actors will be easier to track due to reference counting. We can create a local attribute in this Callback which will hold the head resource actor reference. If this Callback goes out of scope or if another job is launched, that reference would go away and get cleaned up.
A Ray actor is automatically cleaned up when all handles to it go out of scope. So as long as an actor handle exists, the actor will not be cleaned up.
I might have misunderstood this, but from the code it looks like after we create the placement group for the TPU head, we don't schedule anything onto it other than getting the slice ID from the head, and when the function returns we also don't return the placement group for further training task scheduling. Is that the expected behavior? If so, then with an actor, the actor would actually go out of scope and be released.
```python
"JAX_PLATFORMS": "tpu",
"ENABLE_PJRT_COMPATIBILITY": "true",
"TPU_SLICE_BUILDER_DUMP_CHIP_FORCE": "true",
"TPU_SLICE_BUILDER_DUMP_ICI": "true",
"XLA_FLAGS": "--xla_dump_to=/tmp/xla_dump_file --xla_dump_hlo_as_proto",
```
For my understanding, are these always needed? Should they be set up automatically as part of the JaxBackend?
I believe the requirements differ across TPU generations; I think they're currently only needed on TPU v6e. We could start passing the accelerator type into the JaxConfig and set the PJRT vars automatically for v6e? Or we could just leave it to the user to configure. My initial thought was to keep the JaxConfig as user-specified as possible, and then later add logic to autodetect fields like num_workers, placement_strategy, resources_per_worker, and env vars based on the topology and accelerator type.
I see. We can keep it manual here for now, but over time, as we discover patterns and boilerplate that we can abstract away from the user, we can fold this directly into the default JaxBackend logic to reduce the barrier to entry for users.
```
    per worker). Defaults to False. The number of TPUs reserved by each
    worker can be overridden with the ``resources_per_worker``
    argument. This arg enables SPMD execution of the training workload.
topology: [Experimental] If specified, Ray Train will launch the training
```
Is this specific to TPUs? Should it be tpu_topology instead?
I'm not super familiar with GPUs, but I think the field can probably be extended to set fields automatically in the Config (when left out) for GPUs too - so leaving it as topology might be fine. I don't have much of a preference either way though.
matthewdeng
left a comment
Very nice!
```python
if scaling_config.use_tpu and (
    num_workers > 1 or scaling_config.num_workers > 1
):
```
Going back on what I said before (oops), we can just check num_workers here.
Suggested change:
```python
if scaling_config.use_tpu and num_workers > 1:
```
done
python/ray/train/v2/api/config.py (outdated)
```python
from ray.tune.search.sample import Domain

SampleRange = Union["Domain", Dict[str, List]]
```
We can remove this.
done
@ryanaoleary don't forget to add
Force-pushed from 2f12fa5 to 1028bfe.
commit a86bb60df41987bfee65b227fcce69a7eee44b9e
Author: Justin Yu <justinvyu@anyscale.com>
Date: Tue Aug 19 08:58:10 2025 -0700
[core] Fix actor import error message for async actors (#55722)
When the Ray actor class fails to import upon actor creation, we create
a TemporaryActor in its place to emit an error message. However, for
async actors, the TemporaryActor creation fails to initialize due having
no async methods. This PR adds a dummy async method to handle this case.
```python
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 35, in <module>
File "/Users/justin/Developer/ray/python/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/justin/Developer/ray/python/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/justin/Developer/ray/python/ray/_private/worker.py", line 2896, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/justin/Developer/ray/python/ray/_private/worker.py", line 970, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::Foo.__init__() (pid=42078, ip=127.0.0.1, actor_id=7000b00899a3a8b1d05bbdc601000000, repr=<__main__.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x10732dc10>)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: TemporaryActor
actor_id: 7000b00899a3a8b1d05bbdc601000000
Failed to create actor. You set the async flag, but the actor does not have any coroutine functions.
(TemporaryActor pid=42078) The original cause of the RayTaskError (<class 'ray.exceptions.ActorDiedError'>) isn't serializable: cannot pickle 'google._upb._message.Descriptor' object. Overwriting the cause to a RayError.
```
---------
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
commit f53e38b119ab19c27db6f32d76710a3dd8c6e9c1
Author: tannerdwood <71387269+tannerdwood@users.noreply.github.com>
Date: Tue Aug 19 08:44:44 2025 -0700
[Core] Update DLAMI Information in aws.md (#55702)
Signed-off-by: Tanner Wood <tanwood@amazon.com>
Co-authored-by: Tanner Wood <tanwood@amazon.com>
commit c4482d2fc6d7956104c5b0208a7cc14120737652
Author: Ibrahim Rabbani <irabbani@anyscale.com>
Date: Tue Aug 19 07:47:57 2025 -0700
[core] Remove job submission code for using JobAgent on a random worker node. (#55718)
When a Job is submitted through the SDK/JobClient, the request goes to
the dashboard's JobHead.
The JobHead submits a request to a JobAgent which has a JobManager. The
JobManager creates a JobSupervisor actor which manages the lifecycle of
the job.
In #47147, the `RAY_JOB_AGENT_USE_HEAD_NODE_ONLY` feature flag to force
head node's JobAgent to be used for job submission. The flag was
intended to be a temporary kill switch if head_node only scheduling had
issues.
Now that #47147 has been merged for over a year, I'm cleaning up the
flag in this PR and making it the default (and only behavior).
---------
Signed-off-by: irabbani <irabbani@anyscale.com>
commit f797480b014262ffdf7b33a431fcbc34c0d95b2f
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Aug 19 00:10:43 2025 -0700
[core] Correct bytes in flight when objects <5mb (#54349)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit be33b6fb411b21d2bb2cadfc8755a66c195d2272
Author: avigyabb <98926738+avigyabb@users.noreply.github.com>
Date: Mon Aug 18 21:41:43 2025 -0700
[Core] Bind runtime env agent and dashboard agent http server to specified ip instead of 0.0.0.0 (#55431)
Signed-off-by: avigyabb <avigyabb@stanford.edu>
Signed-off-by: avibasnet31 <avigyabb@anyscale.com>
Co-authored-by: avibasnet31 <avigyabb@anyscale.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
commit 69bc6c1e8394ef5846bf3d3d36a7fd384441c5a1
Author: Ibrahim Rabbani <irabbani@anyscale.com>
Date: Mon Aug 18 21:38:58 2025 -0700
[core] ray.put returns an ObjectRef without an owner_address. (#55636)
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 28d1dc9fbdc57b3c33dcc244924e520fa158104b
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Mon Aug 18 21:36:12 2025 -0700
[Serve.llm] Support colocating local DP ranks in DPRankAssigner (#55720)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
commit 30c8122962dcb1285fd4324313770a53693ce863
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Aug 18 16:46:06 2025 -0700
[image] refactor apt package installation (#55701)
avoid reinstalling packages that are already installed in the base image
also rename the saved requirements file to `extra-test-requirements.txt`
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 6993ba79da529a44fb23b1717acac3d83aa5dcef
Author: Jeffrey Wang <jeffrey31415926@gmail.com>
Date: Mon Aug 18 16:19:27 2025 -0700
[data.llm] Adjust LLM engine timing logic (#55595)
Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
commit 7424ffbdc7b5df15141c66c23c1adafa36cd431b
Author: vincenthhan <46981434+BestVIncent@users.noreply.github.com>
Date: Tue Aug 19 07:18:28 2025 +0800
[llm] support custom s3 endpoint when downloading models from remote (#55458)
Signed-off-by: vincenthhan <vincenthhan@tencent.com>
Co-authored-by: vincenthhan <vincenthhan@tencent.com>
commit e9160b72338c4d682af2eb0249f442bd1ff4992d
Author: Qiaolin Yu <liin1211@outlook.com>
Date: Mon Aug 18 15:39:46 2025 -0700
[core] Not overriding accelerator id env vars when num_accelerators is 0 or not set (#54928)
commit fd3f23593de38fec41c8321da7c169b08eb768cc
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Mon Aug 18 17:32:25 2025 -0500
[core] Remove unnecessary dependency of raylet->gcs (#55710)
The raylet binary was depending on all of the `gcs/` directory for
absolutely no reason :(
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 3ea021227eaeb0404c42cf09015bc685eb097cfb
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Mon Aug 18 17:28:48 2025 -0500
[core] Separate targets for pubsub interfaces (#55681)
Move publisher & subscriber interfaces into their own header files &
build targets.
Update relevant callsites to use them.
Unfortunately, `reference_count_test` reaches into internal
implementation details of the publisher and this dependency was a little
tricky to break, so not touching it here.
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 1cb4c2c212e5a153e74d86f1e0d2e48942a19502
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Mon Aug 18 15:12:37 2025 -0700
[core] rename ray/telemetry to ray/observability (#55703)
As title. According to @edoakes, ray telemetry has a different meaning
in the ray eco-system. Observability directory will consists for
metrics, events and log related infra.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 9128c40da7a8166bb7a9ca7025b01d8a7a5e38db
Author: Sven Mika <svenmika1977@gmail.com>
Date: Mon Aug 18 22:40:55 2025 +0200
[RLlib] Fix MetricsLogger/Stats throughput bugs. (#55696)
commit 01b9e5b1a6b913041b299d7cd262254cfc99503a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Aug 18 12:33:07 2025 -0700
[ci] release test: use rayci build id for image tags (#55619)
rather than using commit based tags.
this avoids runs across different runs on the same commit to crosstalk
to each other.
---------
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 6326b2c539d4337019dc5107b569e272fc8a8fcf
Author: Sagar Sumit <sagarsumit09@gmail.com>
Date: Tue Aug 19 00:34:29 2025 +0530
[core] Call `__ray_shutdown__` method during actor graceful shutdown (#54584)
This PR introduces a new `__ray_shutdown__ ` method mechanism for Ray
actors to perform deterministic resource cleanup before actor
termination. This addresses issue #53169 by providing a reliable
alternative to `__del__` methods for critical cleanup operations.
The new `__ray_shutdown__ ` method can be explicitly overriden and
provides:
- Deterministic execution: Called explicitly by Ray during actor
shutdown.
- Reliable timing: Executes at the exact right moment before process
termination.
- Optionality: Actors without the method continue to work normally.
Main changes:
1. `core_worker.cc` - Add cleanup call in Shutdown()
2. `_raylet.pyx` - Add callback registration
3. `worker.py` - Register callback when actor is created
Closes #53169
---------
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
commit ec4056ea67e4226fea2f11abaf4e16bf5a3aba14
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Mon Aug 18 13:53:47 2025 -0500
[ci] Add ability for users to include `.user.bazelrc` file (#55698)
I wanted a way to turn on `--incompatible_strict_action_env` by default
without having an untracked change in my `.bazelrc` constantly and
without needing to pass the `--config` flag all the time. This PR allows
users to define a `.user.bazelrc` file for such changes.
For example, to turn on `--incompatible_strict_action_env` by default,
I've added this file:
```
build --config=strict
test --config=strict
```
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 5afa2abcb65980e2ab558076b39e9a44bd2e3566
Author: Potato <tanxinyu@apache.org>
Date: Tue Aug 19 02:46:17 2025 +0800
[Data]Fix sort_benchmark url not found error (#55692)
The url is invalid as we changed the name for `sort.py` in
https://github.com/ray-project/ray/pull/49017
---------
Signed-off-by: Potato <tanxinyu@apache.org>
commit 81856dfad0ab26dffc5d9209ae297f8acd16ce9a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Aug 18 11:28:28 2025 -0700
[wheel] when `RAY_DISABLE_EXTRA_CPP=1`, do not build cpp stuff (#55697)
this gives us a way to safely skip the ray-cpp building parts when
building ray wheel.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b95fc3e0757a89dea38f243c9a29f3768f82b98f
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Mon Aug 18 23:47:39 2025 +0530
[core] Add logic to convert TaskProfileEvent to RayEvent before sending to event aggregator (#55138)
As part of oneEvent effort, all individual task event objects (such as
task definition event, task execution event, etc) are being consolidated
under one type: RayEvent.
This pr adds the translation logic to convert the `TaskProfileEvent` ->`
rpc::events::RayEvent object` + tests to verify that the translation and
subsequent section of the `TaskEventBufferImpl` correctly deal with the
constructed RayEvent.
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
commit 6c90c0de34f5b3f618db076c4f3197f78aefc8bf
Author: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com>
Date: Tue Aug 19 02:00:27 2025 +0800
[Data] explain API for dataset (#55482)
Introduce `explain()` for datasets, which outputs the logical plan and the
physical plan.
part of #55052
---------
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
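The explain() idea above can be illustrated with a toy plan tree. This is a sketch only: the `Plan` class, node names, and output format here are made up for illustration and are not Ray Data's actual implementation or output.

```python
# Illustrative sketch only: plan/node names are hypothetical, not Ray Data's.
class Plan:
    def __init__(self, name, child=None):
        self.name = name
        self.child = child

    def lines(self):
        # Render the plan chain as an indented list, root first.
        out, node, depth = [], self, 0
        while node is not None:
            out.append("  " * depth + node.name)
            node, depth = node.child, depth + 1
        return out


def explain(logical, physical):
    """Return the logical plan followed by the physical plan, explain()-style."""
    return "\n".join(
        ["-- Logical Plan --", *logical.lines(),
         "-- Physical Plan --", *physical.lines()]
    )


logical = Plan("Map", Plan("Read"))
physical = Plan("TaskPoolMapOperator[Map]", Plan("InputDataBuffer[Read]"))
print(explain(logical, physical))
```

The key design point is that one call surfaces both levels: the logical plan the user wrote and the physical operators the engine actually runs.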
commit a81c7c9fbd5ed70e8aaae5cc2f1bc3e284ec8723
Author: Timothy Seah <timothy.seah777@yahoo.com>
Date: Mon Aug 18 10:38:44 2025 -0700
[train][tune] Train Controller is always actor + fix tune integration to enable this (#55556)
In the past, we used `RUN_CONTROLLER_AS_ACTOR_ENV_VAR` to toggle whether
to run the controller as a separate actor (we want this in most cases)
or on the current actor (we wanted this in Tune so we can propagate
`ray.train.report` from Train to Tune using the `TuneReportCallback`).
However, in order to implement `get_all_reported_checkpoints`
(https://github.com/ray-project/ray/pull/54555), we need to pass the
Train Controller actor to all the Train Worker actors. This method
wouldn't work when using Train from Tune because the Train Controller
actor handle would be the Tune Trainable actor handle which does not
have the async `get_all_reported_checkpoints` method.
This PR gets rid of `RUN_CONTROLLER_AS_ACTOR_ENV_VAR` once and for all
by making all communication between Train and Tune happen through a
lightweight `ray.util.Queue` actor instead of forcing Train and Tune to
run in the same process.
---------
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
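The queue-relay pattern described above can be sketched in plain Python, using `queue.Queue` and a thread as stand-ins for the `ray.util.Queue` actor and the separate Tune process; the function names here are illustrative, not the actual Train/Tune internals.

```python
import queue
import threading

# Stand-in for the lightweight ray.util.Queue actor: Train workers push
# report dicts onto the queue, and the Tune side drains it from elsewhere
# instead of requiring the Train controller to run inside the Tune trainable.
reports = queue.Queue()


def train_worker(num_steps):
    # Plays the role of ray.train.report: publish metrics for each step.
    for step in range(num_steps):
        reports.put({"step": step, "loss": 1.0 / (step + 1)})
    reports.put(None)  # sentinel: training finished


def tune_listener(collected):
    # Plays the role of the Tune side consuming reports from the queue.
    while True:
        item = reports.get()
        if item is None:
            break
        collected.append(item)


collected = []
t = threading.Thread(target=tune_listener, args=(collected,))
t.start()
train_worker(3)
t.join()
```

Because the two sides only share the queue, the controller can always be its own actor, which is what makes passing its handle to all Train workers possible.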
commit 796858a91c9a98b7785fdf012096b4a3e5f22cca
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 19:10:31 2025 +0200
[RLlib] Set default to 'log_gradients=False' to stabilize tests (#55695)
Right now `log_gradients` is by default `True` and this appears to
destabilize tests (see #47717). This PR switches the default to `False`.
Closes #47717
---------
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
commit ae0d4fc04f7d56e77c080a24bf998a67a3e88631
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Mon Aug 18 09:58:37 2025 -0700
[Serve] Update test_deploy_2.py with get_application_url (#55665)
We remove the hardcoded URL within the test and use
`get_application_url()` instead.
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
commit be423b042d0370456a8abead58fd6502eeb6c6d4
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Mon Aug 18 09:37:47 2025 -0700
[ci] allowing spaces in append args field on depsets (3/4) (#55625)
- Allowing for spaces in append args (splitting append-arg flags before
appending)
- Adding a couple of unit tests
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit faf06e09e55558fb36c72e91a5cf8a7e3da8b8c6
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Mon Aug 18 07:33:47 2025 -0700
[core] Follow-up to address comments of BaseException PR #55602 (#55690)
Address comments from #55602
- Moving the base exception and exception group tests into their own
file so they can use a shared fixture
- Adding comment for SystemExit and KeyboardInterrupt behavior
- Adding tests to test behavior if user code raises SystemExit or
KeyboardInterrupt
---------
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit e0d8e6f46a8734e16f28831941d937d5961c1d12
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 15:26:58 2025 +0200
[RLlib] - Fix `TensorType` (#55694)
commit 1e5094fd5cbfef1de738243b84436b94a7499304
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 15:13:05 2025 +0200
[RLlib - Offline RL] Fix bug in `return_iterator` in multi-learner settings. (#55693)
commit b830b8d3ee64f7c661d4bfa5fb0e7be99ff871a5
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 12:30:24 2025 +0200
[RLlib - Offline] Fix some bugs in the docs for IQL and CQL (#55614)
commit dde4dbad440ada233d5b3e13a990cf25c20ec60e
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Sun Aug 17 21:33:48 2025 -0700
[Serve.llm] Fix DPServer allocation to CPU node (#55688)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 7321aeed2957a5a71ccb34c2212cd8f4c63a9fab
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Sun Aug 17 18:34:27 2025 -0500
[core] Remove unnecessary publisher dependency from raylet (#55678)
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 6561061f79b31be4f7cecb20e34bdc92e374ef16
Author: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Date: Sun Aug 17 11:56:41 2025 -0700
Fixing Circular Import in ray.train.v2.lightning.lightning_utils (#55668)
Importing `RayTrainReportCallback` from
`ray.train.lightning._lightning_utils` in
`ray.train.v2.lightning.lightning_utils` causes a circular import in the
case that `ray.train.v2.lightning.lightning_utils` is loaded before
`ray.train.lightning`.
This PR removes the `ray.train.v2.lightning` module and migrates the
changes upstream to the original `RayTrainReportCallback` class.
---------
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
commit 03b07db82ab52c5886edd94885fa12d7c30b7b39
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Sat Aug 16 19:41:24 2025 -0700
[core] Fix test_failure on windows (#55687)
Mixing ray_start_regular and ray_start_regular_shared in the same file
can lead to unexpected behavior where cluster state unexpectedly
carries over into the setup for another test. Here on Windows,
*test_put_error1*, *test_put_error2*, and *test_version_mismatch* are
skipped, so *test_export_large_objects* runs directly after
*test_baseexception_actor_creation*, causing issues during its setup.
In a follow-up, we will create another test file for all BaseException-related
tests so they can use a shared cluster.
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 9ae08276c6c466557281dca28477e9ad1d374687
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Sat Aug 16 11:16:44 2025 -0700
[core] Update base exception group tests (#55684)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit a44df1655f3031860f3afd4cc81fc0dc6ab5d6f0
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Fri Aug 15 23:38:03 2025 -0700
[ci] release test: fix to use small for test init (#55677)
otherwise the permission is incorrect
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 418f56258e2085a3f370696930a04ae83e7e0103
Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Date: Sat Aug 16 03:19:20 2025 +0200
[serve.llm] Add reset_prefix_cache remote method to llm server (#55658)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
commit 628df247832fa0e51274a6d53ae750eb9b54a794
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Fri Aug 15 17:12:20 2025 -0700
[serve.llm] Handle push telemetry race conditions (#55558)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 4c6993ee347e3a4d1ff9a26fb3daddd9bf50783c
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Fri Aug 15 16:51:13 2025 -0700
[Data] Decouple actor and node autoscaling (#55673)
Actor pool autoscaling and node autoscaling are currently tied together
in a single `Autoscaler` base class, even though they work mostly
independently. This coupling makes testing harder (you have to mock
unused dependencies), complicates the interface, and forces you to touch
unrelated code when extending one type of autoscaling.
This PR splits `Autoscaler` into `ActorAutoscaler` and
`ClusterAutoscaler` to simplify testing, reduce complexity, and make
future extensions easier.
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
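The shape of the split described above might look roughly like this. This is a sketch under assumptions: the class names come from the PR, but the method names and the fake subclass are illustrative, not the exact Ray Data interfaces.

```python
from abc import ABC, abstractmethod


class ActorAutoscaler(ABC):
    """Scales an operator's actor pool, independent of the cluster."""

    @abstractmethod
    def try_scale_up_or_down(self) -> int:
        """Return the change in the number of actors (may be negative)."""


class ClusterAutoscaler(ABC):
    """Requests nodes from the cluster based on pending resource demand."""

    @abstractmethod
    def try_trigger_scaling(self) -> None:
        """Send a resource request to the cluster autoscaler if needed."""


# With the interfaces decoupled, each side can be tested with a trivial fake
# instead of mocking the other autoscaler's unused dependencies:
class FixedStepActorAutoscaler(ActorAutoscaler):
    def __init__(self, step):
        self.step = step

    def try_scale_up_or_down(self):
        return self.step
```

The design benefit is exactly what the PR describes: a test for actor-pool scaling no longer needs to stub out node-autoscaling machinery, and vice versa.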
commit 9fdea0314ef90cedc341285398bb51d79475b6fd
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Fri Aug 15 15:16:27 2025 -0700
[Serve.llm] Support multi-node data parallel with set_dp_master_info() (#55653)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
commit 5fbeff61f889af7eddb7ca7b55ec6a6c8939bc2b
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Fri Aug 15 16:48:53 2025 -0500
[core] Unify test directory layout on `.../tests/` (#55652)
We currently have multiple different patterns for test files:
- `*_test.cc` in the same directory as the implementation.
- `test/*_test.cc` (with `BUILD.bazel` in the test dir or sometimes in
the parent dir).
- `tests/*_test.cc` (with `BUILD.bazel` in the test dir or sometimes in
the parent dir).
Unifying on:
- `tests/*_test.cc`
- `tests/BUILD.bazel` for test targets
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit fc967a55018cf85d7f73381985273f429d14cb81
Author: Jiajun Yao <jeromeyjj@gmail.com>
Date: Fri Aug 15 13:30:44 2025 -0700
[Core] Simplify get_event_aggregator_grpc_stub to not depend on webui_url (#55640)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
commit d6dce722f0ff25a55a3b3a4749bd32821bcccbec
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Fri Aug 15 14:54:28 2025 -0500
[serve] Fix easy `ray._private` dependency (#55659)
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 10af9d897bbdaae4202580ba14dea1d6efcb525b
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Fri Aug 15 12:37:21 2025 -0700
[ci] raydepsets: generating llm lock files (4/4) (#55500)
- generating llm lock files with raydepsets
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit b819ed4add79492dcdc58d7df277bbd1d438f11b
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Fri Aug 15 12:07:22 2025 -0700
[core] Fix objects_valid with except from BaseException (#55602)
We would encounter a Ray check failure on `objects_valid` whenever a
function throws an exception that extends from `BaseException`
instead of `Exception`. Fix this by catching `BaseException`
instead of `Exception` wherever we are vulnerable to exceptions thrown from
user Python code. We still have to special-case `SystemExit` and
`KeyboardInterrupt` because we consider those critical errors
ourselves and treat them as worker shutdown or task cancellation signals,
respectively.
Closes https://github.com/ray-project/ray/issues/43411
Signed-off-by: dayshah <dhyey2019@gmail.com>
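The catching pattern can be sketched in pure Python. This is a simplified illustration of the idea, not the actual core-worker code; `run_user_code` and `WeirdError` are made-up names.

```python
def run_user_code(fn):
    """Run user code, capturing any error so the task's result slot stays valid."""
    try:
        return ("ok", fn())
    except (SystemExit, KeyboardInterrupt):
        # Special-cased: treated as worker-shutdown / task-cancellation
        # signals, so they propagate instead of becoming task errors.
        raise
    except BaseException as e:
        # Catching BaseException (not just Exception) means exotic user
        # errors still produce a valid error object instead of a check failure.
        return ("error", e)


class WeirdError(BaseException):
    """A user error deriving from BaseException rather than Exception."""


def bad():
    raise WeirdError("boom")


status, payload = run_user_code(bad)  # ("error", WeirdError("boom"))
```

A plain `except Exception` would have let `WeirdError` escape, which is the path that previously tripped the `objects_valid` check.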
commit 44e0aea628f1f221345aeddaafce3b82d91cf9fa
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Fri Aug 15 20:44:34 2025 +0200
[RLlib] Fix `ImportError` in Atari examples. (#54967)
Running Atari with RLlib results in the error described in #53836. This is
related to the version of `gymnasium` installed with
`ray[rllib]` and then later installing
`gymnasium[atari,accept-rom-license]`. Using `gymnasium==1.1.1` resolves
this error.
Closes #53836
---------
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
commit 2cdb27e49d3c4935fe90236f9affa15b5696a42f
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Fri Aug 15 11:07:19 2025 -0700
[Serve] Update route prefix assignment for ReplicaBase.reconfigure() (#55657)
Update an assigned value that slipped through in #55407
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
commit 616b9a19b42305ba5602e4f3bcab81c1e19cf3a0
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Fri Aug 15 13:05:36 2025 -0500
[core] Clean up `RayletIpcClientInterface` (#55651)
Splits out `raylet_ipc_client_interface.h` into its own target.
Sub-interfaces that use the client should only depend on this interface,
not the full `raylet_ipc_client` target.
This improves incremental builds. For example, now if
`raylet_ipc_client.{h,cc}` changes (including any of its transitive
dependencies), the core worker `store_provider` targets no longer need
to be recompiled. They'll only be recompiled if
`raylet_ipc_client_interface.h` changes, which should be much less
frequent.
I've also moved the `FakeRayletIpcClient` into the source tree.
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 8d6d9fa4c63e7d1e7ecd7f14347c1a565efe4d95
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Fri Aug 15 10:56:53 2025 -0700
[serve.llm] Correct Pyright lints for Ray Serve LLM examples (#55284)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 0b77c72a0133d407fb58a9114764e652a37e963c
Author: Justin Yu <justinvyu@anyscale.com>
Date: Fri Aug 15 10:48:54 2025 -0700
[data] Wrap batch index in a `BatchMetadata` class (#55643)
Wrap batch metadata in a dataclass that we can extend in the future.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
commit a39bc679bace4dfaa334c88572effbc5b952a59f
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Fri Aug 15 10:14:33 2025 -0700
[serve] pin the version of wrk used in serve ci base (#55650)
and clone with depth=1
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 20c84e6193d22d29f25cc36e76ea455417349562
Author: akyang-anyscale <alexyang@anyscale.com>
Date: Fri Aug 15 09:56:09 2025 -0700
[serve] Add model composition serve benchmarks (#55549)
Model composition is a common paradigm we should also track performance
for.
---------
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
commit c5a16768c71c354738fc4bef552bd4a58c6b3089
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Fri Aug 15 09:43:09 2025 -0700
[Serve] Update test_http_routes to use get_application_url (#55623)
Updates one of the serve tests, test_http_routes, so it can start using
get_application_url instead of hardcoded urls.
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
Signed-off-by: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
commit c7a7d41b4bbd7509b0cb7cc112fd5ac9af5e55af
Author: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Date: Fri Aug 15 18:42:41 2025 +0200
[serve] Add a function with a Warning to migrate constants that use `or` expression. (#55464)
In the `serve` package, some of the constants that are initialized from
environment variables are silently replaced with their default values
when set to `0`, even if a user set them to `0` explicitly. In
addition, they can also be set to negative values, which is likely
not expected.
The list of the constants:
```
PROXY_HEALTH_CHECK_TIMEOUT_S
PROXY_HEALTH_CHECK_PERIOD_S
PROXY_READY_CHECK_TIMEOUT_S
PROXY_MIN_DRAINING_PERIOD_S
--
RAY_SERVE_KV_TIMEOUT_S
```
It happens because of the `or value` structure.
This PR introduces:
- a temporary function `get_env_float_non_zero_with_warning` with a
`FutureWarning`. The function shows a warning in the following
format when it encounters an unexpected value:
```
FutureWarning: Got unexpected value `0.0` for `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` environment variable! Starting from version `2.50.0`, the environment variable will require a positive value. Setting `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` to `10.0`.
PROXY_HEALTH_CHECK_TIMEOUT_S = get_env_float_non_zero_with_warning(
-- or
FutureWarning: Got unexpected value `-1.0` for `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` environment variable! Starting from version `2.50.0`, the environment variable will require a positive value. Setting `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` to `-1.0`.
PROXY_HEALTH_CHECK_TIMEOUT_S = get_env_float_non_zero_with_warning(
-- or
FutureWarning: Got unexpected value `0.0` for `RAY_SERVE_KV_TIMEOUT_S` environment variable! Starting from version `2.50.0`, the environment variable will require a positive value. Setting `RAY_SERVE_KV_TIMEOUT_S` to `None`.
RAY_SERVE_KV_TIMEOUT_S = get_env_float_non_zero_with_warning(
```
If the input value is positive, no warning will be emitted.
- `None` default value support for env variables (introduced for the
`RAY_SERVE_KV_TIMEOUT_S`)
- `todo` comment for removing the function: `todo: replace this function
with 'get_env_float_positive' for the '2.50.0' release.`
Closes #55454
---------
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
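The failure mode and the warning-based fix described above can be sketched as follows. The helper name comes from the PR, but its body and the `DEMO_TIMEOUT_S` variable are simplified assumptions for illustration, not Serve's actual constants module.

```python
import os
import warnings


def get_env_float_buggy(name, default):
    # The `or default` pattern: float("0") == 0.0 is falsy, so an explicit
    # user-set 0 is silently replaced by the default.
    return float(os.environ.get(name, default)) or default


def get_env_float_non_zero_with_warning(name, default):
    # Simplified sketch of the temporary helper: same result as today for
    # zero/negative values, but with a FutureWarning instead of silence.
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = float(raw)
    if value > 0:
        return value
    fallback = default if value == 0 else value  # 0 -> default, negatives kept
    warnings.warn(
        f"Got unexpected value `{value}` for `{name}` environment variable! "
        f"A future version will require a positive value. "
        f"Setting `{name}` to `{fallback}`.",
        FutureWarning,
    )
    return fallback


os.environ["DEMO_TIMEOUT_S"] = "0"
assert get_env_float_buggy("DEMO_TIMEOUT_S", 10.0) == 10.0  # explicit 0 lost
```

The behavior is unchanged for now; the `FutureWarning` gives users a release cycle to fix non-positive settings before the stricter `get_env_float_positive` lands.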
commit de1494e57497b6c57037edf83044ee507fb80159
Author: akyang-anyscale <alexyang@anyscale.com>
Date: Fri Aug 15 09:34:30 2025 -0700
[serve] Refactor the router and handle (#55635)
Refactor Serve deployment handle and router.
---------
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
commit d95ef0c74138e5a529b5f4b0134177d5aa9bdee0
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Fri Aug 15 00:00:49 2025 -0700
[ci] release test: use rayci to perform test init (#55629)
so that rayci buildid can be populated
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit fe54c9554106b1e4b89c52833b6251143b0092e5
Author: Qiaolin Yu <liin1211@outlook.com>
Date: Thu Aug 14 22:02:24 2025 -0700
[ci] Add hook to clean the Ray address file before the test run starts (#54715)
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
commit 486935db5ede79b419623f29e2593c76a0df57c9
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Aug 14 19:25:41 2025 -0700
[core] add test rules for container tests (#55622)
The `core: container` test is pretty flaky on premerge and blocks PRs
from time to time. This PR adds a test rule to only run this test on a
change that touches `python/ray/runtime_env`.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit c62889c8d2c72e4e3466f31995c43d2f0189b10e
Author: goutamvenkat-anyscale <goutam@anyscale.com>
Date: Thu Aug 14 18:53:49 2025 -0700
[Train] - Bump up test size for test_data_integration (#55633)
Signed-off-by: Goutam V <goutam@anyscale.com>
commit c7c7e7c8fb99bd1081fe4949ccdff2614e6ce8ca
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Thu Aug 14 17:45:05 2025 -0700
[ci] upgrading uv binary and updating test (2/4) (#55626)
- upgrading uv from 0.7.20 -> 0.8.10 to gain parity with the uv version
used by the compile-llm-lock-files job
- updating the unit test
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit 69f421884419c8c39a363eeb6b459bd77b6f0017
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Thu Aug 14 17:35:01 2025 -0700
[Serve] Add route_prefix field to DeploymentVersion (#55407)
This PR adds `route_prefix` to the `DeploymentVersion` class to allow a robust
lightweight config update with `route_prefix`.
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
commit f8ee5c9629f99c88af1e919a8ba2191a0c07f607
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 16:44:58 2025 -0700
[ci] pipe through `RAYCI_DISABLE_JAVA` for manylinux base image building (#55606)
so that when we do not need java, we can skip installing JDK in the
image.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 078d055ad2520b433db28ddc5e48a45bdc0d64a2
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Thu Aug 14 16:44:08 2025 -0700
[ci] raydepsets changing load to build (1/4) (#55627)
updating cli command from load to build
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit 21bc4528339420623c2f2a1958c7fb68b5dd8a8c
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Thu Aug 14 14:42:57 2025 -0700
[core] Fix ubsan for publisher_test (#55621)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 1c55991ce455632e1ab9839cb4c25f3e4ddc379c
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Aug 14 14:10:44 2025 -0700
[core][otel] change+simplify the feature flag for open telemetry (#55592)
Change and simplify the feature flag that enables OpenTelemetry. This will
let us enable OpenTelemetry for the next Ray release version
without worrying about messing up previous Ray release versions.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit fc4ace25a81cf68b71e21c00f1be2532d5c6c148
Author: Kevin H. Luu <kevin@anyscale.com>
Date: Thu Aug 14 13:59:45 2025 -0700
[release] Script to build custom BYOD image (#55577)
Add `custom_byod_build` as a python binary that the Buildkite jobs can
call to build & push custom BYOD images
---------
Signed-off-by: kevin <kevin@anyscale.com>
commit 61bc2e8139e21429d487b0824391c26dcd596cc3
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 12:56:37 2025 -0700
[ci] read gce credentials file from global config when building anyscale images (#55580)
rather than using the hard-coded filename
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 49d336cb332da4cdfff894e95ea6f0189f1b05ff
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Thu Aug 14 11:53:36 2025 -0700
[Serve.llm] Improve PrefixCacheAffinityRouter text normalization compat (#55588)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 37158a22a44edb10d499b53d1f38f00315234a14
Author: harshit-anyscale <harshit@anyscale.com>
Date: Fri Aug 15 00:21:29 2025 +0530
skip test task processor for windows (#55616)
- skipping the task processor test on Windows to unblock CI
Signed-off-by: harshit <harshit@anyscale.com>
commit 400ea7716c50afe006ab69a5398fa5d3c2e08373
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Thu Aug 14 11:46:59 2025 -0700
[serve.llm][docs] Documentation for prefix cache-aware router (#55218)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
commit 6d7234b1b54ebc8d77ed9a127ce02b9ff4f9854c
Author: coqian <cong.qian@anyscale.com>
Date: Thu Aug 14 11:06:05 2025 -0700
[Data] Update the export API to refresh the dataset and operator states (#55355)
This PR is a revert of
[#55333](https://github.com/ray-project/ray/pull/55333) and resolves
conflict by [#55163](https://github.com/ray-project/ray/pull/55163)
Original description:
Some frequently used metadata fields are missing in the export API
schema:
- For both dataset and operator: state, execution start and end time
These fields are important for us to observe the lifecycle of the
datasets and operators, and can be used to improve the accuracy of
reported metrics, such as throughput, which relies on the duration.
Summary of change:
- Add state, execution start and end time at the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent
the state when they are not running yet.
- Refresh the metadata whenever the state of the dataset/operator gets
updated. The event will always contain the latest snapshot of all
the metadata.
Signed-off-by: cong.qian <cong.qian@anyscale.com>
commit 6a9938a73ff6d39ee72dcb68667a52b0ba658e8b
Author: Mengjin Yan <mengjinyan3@gmail.com>
Date: Thu Aug 14 11:05:39 2025 -0700
[Core] Add Logic to Check Label Selector in PG Scheduling (#55599)
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
commit c4d990cafe01ce4f6caec38e814217310fcc0a1c
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 11:02:48 2025 -0700
[ci] add rayci build id tags for release test images (#55605)
in addition to current tags.
first step to migrate to use rayci build id tags to stop release test
jobs from cross-talking to each other
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit af41960a49e85863709ef36fb4968f0021d730b3
Author: Stephanie Wang <smwang@cs.washington.edu>
Date: Thu Aug 14 10:02:16 2025 -0700
[core][gpu-object] Add a user-facing call to wait for tensor to be freed (#55076)
This adds a call `ray.experimental.wait_tensor_freed` that allows user
code to check when a tensor that it put into Ray's GPU object store has
been freed. Unlike the normal Ray object store, the GPU object store is
just a Python data structure on the actor, which allows us to avoid
copying. This means that the actor can keep a reference to an object in
its store. The API call allows the actor to check when the object has
been freed from the store, so that it can safely write to the tensor
again.
Closes #52341.
---------
Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
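The mechanism in miniature: a per-actor store that keeps objects by reference and lets callers block until an object is freed. This is a toy model using a `threading.Event`, with a bytearray standing in for a GPU tensor; it illustrates the semantics of `wait_tensor_freed`, not the real API or implementation.

```python
import threading


class InActorObjectStore:
    """Toy model of the per-actor GPU object store: objects are kept by
    reference (no copy), and callers can wait until an object is freed
    before reusing the underlying buffer."""

    def __init__(self):
        self._objects = {}
        self._freed_events = {}

    def put(self, obj_id, tensor):
        # Store a reference, not a copy: the caller still aliases the tensor.
        self._objects[obj_id] = tensor
        self._freed_events[obj_id] = threading.Event()

    def free(self, obj_id):
        del self._objects[obj_id]
        self._freed_events[obj_id].set()

    def wait_tensor_freed(self, obj_id, timeout=None):
        """Block until obj_id is freed; then writing to the tensor is safe."""
        return self._freed_events[obj_id].wait(timeout)


store = InActorObjectStore()
tensor = bytearray(8)        # stand-in for a GPU tensor
store.put("obj-1", tensor)
store.free("obj-1")
assert store.wait_tensor_freed("obj-1", timeout=1.0)
tensor[0] = 42               # safe to mutate again
```

Because the store avoids copies, the wait call is what tells the actor that no consumer can still observe the tensor, so overwriting it cannot corrupt an in-flight read.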
commit f0b0aadd65b3a842ed42ef870ac3067ea42f30af
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 10:01:39 2025 -0700
[image] add base-extra for aarch64 images (#55586)
for easier use on ray cluster hosters like anyscale.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit ea27578265182b3b721b0b6b5a9f2d6a49e6e61b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 10:01:25 2025 -0700
[ci] remove unused `use_base_extra` (#55604)
added incorrectly in a past change
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 7518fd8be262c5f1bdc8246e0a3c5cc7db5d1bd6
Author: Jun-Hao Wan <ken89@kimo.com>
Date: Fri Aug 15 00:09:47 2025 +0800
[Doc][KubeRay] Add InteractiveMode description for `ray-job-quick-start.md` (#55570)
Signed-off-by: win5923 <ken89@kimo.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
commit 6afaeda7dc7eb700076ae98b5b356568a293cde2
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Thu Aug 14 17:08:01 2025 +0200
[RLlib] Add docs for Implicit Q-Learning. (#55422)
commit 4b6dba34d50d647a7929b1e9079954511a69c759
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 00:59:20 2025 -0700
[ci] fix incorrect ml-baseextra depends_on (#55596)
to depends on the right wanda job
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit e4410d09cd0de2a7b2e6e507c12b92d2741cd6ea
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Wed Aug 13 22:52:11 2025 -0700
[serve.llm] fix: improve error handling for invalid model_id (#55589)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
commit 02340e1f402b8ebde104e92c9941b149e5555acb
Author: harshit-anyscale <harshit@anyscale.com>
Date: Thu Aug 14 10:21:53 2025 +0530
add support for async inference (#54824)
This PR aims to provide basic support for asynchronous inference in
Ray Serve.
The RFC can be found at: https://github.com/ray-project/ray/issues/54652
This PR doesn't contain all the implementation pieces, as having all the
code changes in a single PR would be very difficult to review. The missing
pieces are:
- implementation of failed and unprocessed task queue for the celery
task processor
- add more detailed and thorough tests for the same.
These missing pieces will be taken care of in the subsequent PRs.
---------
Signed-off-by: harshit <harshit@anyscale.com>
commit 4dd73213096635cf78a1a69db84f244bb05ec50f
Author: lkchen <github@lkchen.net>
Date: Wed Aug 13 21:39:54 2025 -0700
[data.llm] Add FAQ to doc, explain STRICT_PACK strategy used in data.llm (#55505)
Signed-off-by: Linkun <github@lkchen.net>
commit 15887001ded1eca621f6890952c5c2a90d4e58a8
Author: Joshua Lee <73967497+Sparks0219@users.noreply.github.com>
Date: Wed Aug 13 20:56:08 2025 -0700
[core] Store local_raylet_rpc_client in raylet_client_pool (#55490)
Signed-off-by: joshlee <joshlee@anyscale.com>
commit fd681ee6e3a74f08918eec34ea7a5d2f9b502f39
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Aug 13 20:36:49 2025 -0700
[ci] raydepsets: implementing build arg sets (2/2) (#55423)
1/2 here: https://github.com/ray-project/ray/pull/55408
- implementing get depset by name and optional build arg set
- adding unit tests
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Elliot Barnwell <elliot.barnwell@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit f677f564cc56c07e7c93d29c33e2f7314ef34fa1
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Wed Aug 13 19:42:02 2025 -0700
[core] Improve gcs publish perf + clean up publisher in general (#55560)
This PR is focused on two things: removing a lot of unnecessary copies
when publishing from the GCS and when subscribing to the GCS, and
cleaning up publisher-related code, e.g. publish functions took
callbacks that were always nullptr, always returned Status::OK, etc.
There are no actual functional changes in this PR.
Copy killing that matters:
https://github.com/ray-project/ray/blob/4e5f03e7a1d06b9da8f3a9329400d426055f8ea4/src/ray/gcs/gcs_server/pubsub_handler.cc#L49-L59
Every GCS publish will result in an extra copy here because the
`pubsub_reply` we create is heap allocated while the actual reply is
arena allocated, so the swap will result in a copy of everything every
time we publish to every subscriber.
Also, there were multiple extra copies of messages inside gcs_pub_sub.cc
when the PythonGcsPublisher publishes and when the PythonGcsSubscriber
gets messages.
---------
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 1699dc367f71ac05db8486ac70758090c37403a7
Author: Neil Girdhar <mistersheik@gmail.com>
Date: Wed Aug 13 21:33:45 2025 -0400
Suppress type error (#50994)
Signed-off-by: Neil Girdhar <mistersheik@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
commit ceaa4fb6f5db3189f77a1ed0f2c407de47ce4792
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Wed Aug 13 18:22:54 2025 -0700
[Serve.llm] Use DEFAULT_MAX_ONGOING_REQUESTS for DPServer (#55583)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 54ae92386d2b4600e1a9327b4f83c4c48742a412
Author: Timothy Seah <timothy.seah777@yahoo.com>
Date: Wed Aug 13 17:40:01 2025 -0700
[train] Change DEFAULT variables from strings to bools (#55581)
All of these constants are used as the default value of
[`env_bool`](https://github.com/ray-project/ray/blob/master/python/ray/_private/ray_constants.py#L41),
which returns a bool.
Technically this is a no-op since "1" evaluates to True anyway, but this
is misleading because "0" actually also evaluates to True.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
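The pitfall this commit fixes can be seen with a minimal sketch of an `env_bool`-style helper (simplified for illustration; the real `ray_constants.env_bool` may differ in details):

```python
import os


def env_bool(key: str, default):
    # Simplified sketch: parse the env var if set, otherwise return the
    # caller-provided default as-is.
    if key in os.environ:
        return os.environ[key].lower() in ("1", "true")
    return default


# With string defaults, both "1" and "0" are truthy, which is the
# misleading behavior the commit removes by switching defaults to bools.
assert bool("1") is True
assert bool("0") is True   # a "0" string default still acts as True
assert env_bool("SOME_DEFINITELY_UNSET_FLAG", False) is False
```

With bool defaults, `if env_bool(...)` behaves as the default's name suggests, instead of always taking the truthy branch.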
commit 9838ad64d43dbd25b77acfd834500cd96f793e28
Author: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com>
Date: Thu Aug 14 08:32:54 2025 +0800
[DOC][Tune] fix: remove extra space in tune documentation (#55125)
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
commit 1216e15c32de9ab44cbc9c5532b0571c6499732f
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Aug 13 17:06:31 2025 -0700
[ci] raydepsets: implementing build arg sets (1/2) (#55408)
- converting build arg sets into a dictionary instead of a list
- updating the naming convention for depsets with build_arg_sets (suffix:
`_${BUILD_ARG_SET}` for the depset name in the config)
- adding unit tests
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Elliot Barnwell <elliot.barnwell@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit ecc4c93af0308ccf4b5e08135865766e9a1fbd30
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Aug 13 16:35:17 2025 -0700
[image] add base-extra layer (#55513)
this is the layer required to run on the anyscale cloud and for running in
ray release tests.
we have been sourcing this layer from a tarball in s3; this change
builds it from the source.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 3e34885814e4da9a83123e22b042a7ee684074ad
Author: Kishanthan Thangarajah <kshanth2101@gmail.com>
Date: Wed Aug 13 19:21:05 2025 -0400
[serve] Support custom autoscaling at deployment level for ray serve (#55253)
This PR adds initial changes to support custom auto scaling with ray
serve. Two new classes (AutoscalingContext and AutoscalingPolicy) have
been introduced as per discussions in
https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?usp=sharing.
Related RFC
https://github.com/ray-project/ray/issues/41135#issuecomment-3156717488
The changes will have two phases.
Phase 1 adds the required changes to support custom autoscaling at the
deployment level. Phase 2 extends the changes to support custom
autoscaling at the application level. This PR is part of Phase 1
(deployment-level custom autoscaling).
Related to #41135
---------
Signed-off-by: Kishanthan Thangarajah <kshanth2101@gmail.com>
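As a rough illustration of the deployment-level idea (these are not the actual Ray Serve interfaces; the real `AutoscalingContext` and `AutoscalingPolicy` fields may differ), a custom policy can be modeled as a function from observed metrics to a target replica count:

```python
from dataclasses import dataclass


@dataclass
class AutoscalingContext:
    # Assumed shape, for illustration only.
    current_num_replicas: int
    avg_ongoing_requests: float
    target_ongoing_requests: float


def queue_based_policy(ctx: AutoscalingContext) -> int:
    """Scale replicas proportionally to observed vs. target load."""
    if ctx.target_ongoing_requests <= 0:
        return ctx.current_num_replicas
    desired = ctx.current_num_replicas * (
        ctx.avg_ongoing_requests / ctx.target_ongoing_requests
    )
    return max(1, round(desired))


ctx = AutoscalingContext(current_num_replicas=2, avg_ongoing_requests=8.0,
                         target_ongoing_requests=2.0)
# 2 replicas observing 4x the target load scale out proportionally
```

Attaching such a policy per deployment (Phase 1) rather than per application (Phase 2) is what this PR enables.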
commit 2c7bd7d06930e5cc302a01c5baedef43911e3582
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Wed Aug 13 14:35:25 2025 -0700
[core][ci] Kill debug wheel step (#55571)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 52bef607fd4349e70a1874fb2d6a8a9f6d447111
Author: Matvei Pashkovskii <matvei.pashkovskii@amd.com>
Date: Thu Aug 14 00:10:21 2025 +0300
[Serve.llm] Add LMCacheConnectorV1 support for kv_transfer_config (#54579)
Signed-off-by: Matvei Pashkovskii <mpashkov@amd.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
commit 32304ab50a5f1c94504d2610a338fef1e84ecef7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Aug 13 13:21:41 2025 -0700
[release test] remove "multi" test frequency (#55561)
not used anywhere any more
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit c47048e6ebf1b7a705cdb1be18b027889623e1a4
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Wed Aug 13 12:56:01 2025 -0700
[core][obsclean/02] de-static more internal ray metrics (#55537)
Ray core currently offers two APIs for defining internal metrics: a
static object-oriented (OO) API and a template/extern-based API. The OO
API is also used for defining custom metrics at the Ray application
level, and I personally find it easier to read. This series of PRs aims
to unify all metric definitions under the OO API.
---------
This PR migrates **all** metrics from static to runtime definition, as
part of the effort to eliminate all statically defined metrics.
Currently, the OO interface attempts to register a metric at the same
time its first value is recorded, due to the [C++ static initialization
order fiasco](https://en.cppreference.com/w/cpp/language/siof.html),
which is awkward and potentially inefficient. We can fix this by
removing all statically defined metrics.
Test:
- CI
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 6ebd7d013933dfa990b11ffcad63cfd6f78db6cd
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Wed Aug 13 12:22:56 2025 -0700
[data] Sanitization of Dataset Metadata Export (#55379)
A couple of things that have been improved
- updating structs should have string keys
- More tests for bytes, bytearrays, dataclasses
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit 3d44e3d17b56e993f1fd7407bdf1288c852c8c41
Author: Mengjin Yan <mengjinyan3@gmail.com>
Date: Wed Aug 13 11:57:54 2025 -0700
[Core][TaskEventFollowup/03] Improve the Target Http Endpoint in Aggregator Agent (#55529)
This PR improves the target http endpoint in `aggregator_agent.py`:
- Merge the address and port into one env var that specifies the target http endpoint.
- Set the default value of the endpoint to be empty; events are only sent out when the endpoint is specified.
- Update corresponding tests.
-----------
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: myan <myan@anyscale.com>
commit 8d810e2667fc728e45ca990ff7d7dc8547eae99b
Author: Alexey Kudinkin <ak@anyscale.com>
Date: Wed Aug 13 14:32:25 2025 -0400
[Data] Fixing `AutoscalingActorPool` to properly downscale upon completion of the execution (#55565)
In 2.48, a change introduced debouncing that disallows downscaling of the
Actor Pool for 30s after the latest upscaling, to give the Actor Pool
operator enough time to start utilizing the upscaled actors.
However, that affected the ability of the Actor Pool to downscale upon
completion of the execution: when an operator completes execution, it
should start downscaling immediately. This change addresses that.
---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
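The gating logic described above can be sketched as follows (hypothetical names; the actual `AutoscalingActorPool` implementation is more involved):

```python
# Cool-down window after an upscale during which downscaling is blocked,
# matching the 30s debounce described in the commit message.
DEBOUNCE_S = 30.0


def can_downscale(now: float, last_upscale_at: float,
                  execution_completed: bool) -> bool:
    if execution_completed:
        # The fix: a completed operator may downscale immediately,
        # regardless of how recently it upscaled.
        return True
    return now - last_upscale_at >= DEBOUNCE_S
```

Before the fix, the `execution_completed` short-circuit was effectively missing, so finished operators held on to actors for the remainder of the debounce window.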
commit 64feab4b01583023cec89bc2d199b0ff0de4c3cd
Author: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Date: Wed Aug 13 18:09:39 2025 +0000
[Train] Implement a JaxTrainer to support SPMD with TPUs (#55207)
This PR builds off previous efforts to add a `JaxTrainer` and the
[ray-tpu package](https://github.com/AI-Hypercomputer/ray-tpu/tree/main)
to implement support for a `JaxTrainer` in Ray Train that supports SPMD
workloads with TPUs. Support for more types of workloads (i.e. better
support for CPU and GPU) can be added incrementally.
In order to support SPMD locality-aware scheduling at the TPU slice
level, we alter the `WorkerGroup` construction in V2 Ray Train to
optionally accept multiple placement group specs to apply to a range of
workers. This enables us to reserve the "TPU head" using a placement
group with label selectors, retrieve its unique `ray.io/tpu-slice-name`,
and then schedule the remaining workers on that slice in a separate
placement group.
---------
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Co-authored-by: Andrew Sy Kim <andrewsy@google.com>
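The worker-range-to-placement-group mapping described above can be sketched in isolation (a hypothetical helper; the real V2 `WorkerGroup` changes are more involved and schedule real Ray placement groups):

```python
def assign_workers_to_pgs(num_workers, pg_specs):
    """Map each worker rank to a placement-group spec.

    pg_specs: list of (spec_name, workers_in_spec) pairs, applied in order
    to consecutive rank ranges, mirroring the PR's idea of applying
    multiple placement group specs to ranges of workers.
    """
    assignment = {}
    rank = 0
    for name, count in pg_specs:
        for _ in range(count):
            assignment[rank] = name
            rank += 1
    assert rank == num_workers, "specs must cover all workers exactly"
    return assignment


# Rank 0 reserves the "TPU head" placement group; once its slice name is
# known, the remaining ranks are scheduled on that slice's placement group.
mapping = assign_workers_to_pgs(4, [("tpu-head-pg", 1), ("slice-pg", 3)])
```

The spec names here are illustrative only; in the PR the second group is constructed with a label selector on the `ray.io/tpu-slice-name` retrieved from the head worker.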
commit 6d318ce84ddeacf67dc0c66f6e2fb6f6a8fef2e4
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Wed Aug 13 10:57:54 2025 -0700
[Serve.llm] Add missing data_parallel/__init__.py (#55573)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 3c1314afb82128f30e5a445462c7277717e62863
Author: William Lin <SolitaryThinker@users.noreply.github.com>
Date: Wed Aug 13 10:55:47 2025 -0700
[docs] Add documentation for using type hints in Ray Core (#55013)
---------
Signed-off-by: will.lin <will.lin@anyscale.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
commit a24defd4c4773879a834762ba414d3c0cea9b1e9
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Aug 13 10:51:07 2025 -0700
[release test] remove release image build step from postmerge (#55564)
they should be always building from release test pipeline directly
we used to run release tests on postmerge; we are no longer doing that.
also add oss tag for those steps.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit dda42b2d97768dbebdbaf766a7ed2e2e2372cc8b
Author: William Lin <SolitaryThinker@users.noreply.github.com>
Date: Wed Aug 13 08:36:19 2025 -0700
[core] Add return type to ActorClass.options (#55563)
Currently the following pattern throws many lint errors, as
`ActorDemoRay.options(name="demo_ray")` returns an instance of
`ActorOptionWrapper`, which confuses the IDE's static type checker:
```python
import ray
from ray import ObjectRef
from ray.actor import ActorProxy, ActorClass


class DemoRay:
    def __init__(self, init: int):
        self.init = init

    @ray.method
    def calculate(self, v1: int, v2: int) -> int:
        return self.init + v1 + v2


ActorDemoRay: ActorClass[DemoRay] = ray.remote(DemoRay)


def main():
    p: ActorProxy[DemoRay] = ActorDemoRay.options(name="demo_ray").remote(1)
    actor: ActorProxy[DemoRay] = ray.get_actor("demo_ray")
    a = actor.calculate.remote(1, 2)
    print(ray.get(a))
    return


if __name__ == "__main__":
    main()
```
This PR changes `ActorClass[T].options(...)` to return a new instance of
`ActorClass[T]` instead, allowing IDEs to correctly infer the type of
subsequent `.remote(...)` calls.
https://github.com/ray-project/ray/issues/54149
---------
Signed-off-by: will.lin <will.lin@anyscale.com>
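The typing fix can be illustrated with a stripped-down stand-in (simplified stubs, not Ray's actual classes): returning the same generic class from `options()` lets a static checker carry `T` through to the eventual `.remote()` call:

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class ActorClass(Generic[T]):
    """Simplified stand-in for Ray's ActorClass, for illustration only."""

    def __init__(self, cls: type) -> None:
        self._cls = cls
        self._options: dict = {}

    def options(self, **opts) -> "ActorClass[T]":
        # Returning ActorClass[T] (rather than an untyped wrapper object)
        # lets a static type checker follow the generic parameter through
        # chained calls like Cls.options(...).remote(...).
        new: "ActorClass[T]" = ActorClass(self._cls)
        new._options = {**self._options, **opts}
        return new
```

Because each `options()` call returns a fresh `ActorClass[T]`, options also compose immutably across chained calls.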

Why are these changes needed?
This PR builds off previous efforts to add a `JaxTrainer` and the ray-tpu package to implement support for a `JaxTrainer` in Ray Train that supports SPMD workloads with TPUs. Support for more types of workloads (i.e. better support for CPU and GPU) can be added incrementally.
In order to support SPMD locality-aware scheduling at the TPU slice level, we alter the `WorkerGroup` construction in V2 Ray Train to optionally accept multiple placement group specs to apply to a range of workers. This enables us to reserve the "TPU head" using a placement group with label selectors, retrieve its unique `ray.io/tpu-slice-name`, and then schedule the remaining workers on that slice in a separate placement group.
TODO: I need to add good tests and my manual testing method with a real workload in the comments.
Related issue number
#55162