[Train] Implement a JaxTrainer to support SPMD with TPUs #55207
Conversation
Summary of Changes
Hello @ryanaoleary, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
I've implemented a new JaxTrainer in Ray Train to enable distributed JAX training, specifically targeting SPMD workloads on Google TPUs. This involved significant enhancements to Ray's TPU accelerator management, including new APIs for detecting TPU topology and worker IDs, and integrating these details into Ray's node labeling system. I've also updated the ScalingConfig to allow users to specify TPU usage and topology, and introduced specialized placement group logic to correctly schedule multi-host TPU slices. Additionally, a mechanism to handle TPU lock files has been added to improve robustness.
Highlights
- **New JaxTrainer for TPU Workloads:** I've introduced a new `JaxTrainer` class within `ray.train.jax`, specifically designed to facilitate distributed JAX training on TPUs. This trainer leverages Ray's distributed capabilities to manage JAX workloads across multiple TPU devices.
- **Expanded Scaling Configuration for TPUs:** I've enhanced the `ScalingConfig` in both `ray.air.config` and `ray.train.v2.api.config` by adding `use_tpu` and `topology` parameters. These allow users to explicitly request TPU resources and specify the desired TPU topology, which is crucial for proper resource allocation and worker placement in SPMD scenarios.
- **Improved TPU Accelerator Management and Metadata Discovery:** Significant updates have been made to the `TPUAcceleratorManager` in `python/ray/_private/accelerators/tpu.py`. This includes adding an `SPMDHost` dataclass, functions to infer TPU pod types from topology, and a mechanism to fetch TPU metadata from remote hosts. Crucially, I've exposed previously private methods and added `get_current_node_tpu_topology` and `get_current_node_accelerator_labels` to provide detailed TPU-specific node information for scheduling.
- **Integration of TPU-Specific Ray Node Labels:** To support advanced TPU scheduling, I've integrated new TPU-specific node labels into Ray's core. New constants like `kLabelKeyTpuTopology`, `kLabelKeyTpuSliceName`, `kLabelKeyTpuWorkerId`, and `kLabelKeyTpuPodType` have been added to `src/ray/common/constants.h` and exposed in Python. These labels enable precise placement of TPU workers for SPMD execution.
- **Specialized Placement Group Creation for SPMD on TPUs:** I've implemented specialized placement group creation logic, `create_placement_group_with_spmd`, within `_backend_executor.py`. This method handles the complex task of reserving a head node in a multi-host TPU slice and then atomically scheduling the remaining workers to that specific slice, ensuring proper SPMD setup.
- **TPU Lock File Management Utility:** A new utility, `release_tpu_lock`, has been added to `python/ray/train/jax/config.py`. This function helps manage and release potential TPU lock files, which can prevent JAX processes from utilizing TPUs, especially after job restarts or failures.
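The scheduling idea behind the node-label highlights — pinning every worker of a multi-host slice to the same slice via labels — can be sketched with plain dictionaries. The label key strings and the helper name below are illustrative assumptions for this sketch, not the PR's exact constants:

```python
# Hypothetical sketch: build one label selector per SPMD worker so that all
# workers of a worker group are pinned to a single multi-host TPU slice.
# The "ray.io/..." label keys are assumed for illustration.

def build_slice_label_selectors(slice_name: str, num_workers: int) -> list:
    """Return one label selector dict per worker, all pinned to slice_name."""
    selectors = []
    for worker_id in range(num_workers):
        selectors.append(
            {
                "ray.io/tpu-slice-name": slice_name,      # assumed label key
                "ray.io/tpu-worker-id": str(worker_id),   # assumed label key
            }
        )
    return selectors


selectors = build_slice_label_selectors("tpu-slice-0", 4)
```

A list like this could then be passed as a per-bundle `bundle_label_selector` so the scheduler places each bundle on the matching TPU host.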
Code Review
This pull request introduces a JaxTrainer to support SPMD workloads on TPUs, which is a significant feature addition. The changes include new configurations in ScalingConfig, TPU-specific logic in the TPUAcceleratorManager, and a new JAX backend for Ray Train. The implementation looks mostly correct, but there are several critical issues that need to be addressed. These include a TypeError due to a function call with an incorrect signature in the JAX backend, another TypeError from incorrect list creation for placement group selectors, and a missing method call that would lead to an AttributeError. Additionally, there's a high-severity issue with the use of sudo in the code, which poses a security risk. I've also found an incorrect test case and some minor issues. Addressing these points will be crucial for the stability and security of this new feature.
Force-pushed from 8fac670 to e898ed9.
Commits:
- respect encapsulation and add tests
- Fix comments
- Remove leading underscores and update how topology is retrieved from GCE
- Replace tpu head label with tpu pod type
- JaxTrainer support with SPMD in V2
- Fix errors and remove release_tpu_lock logic
- Fix errors and move jax trainer to v2
- Move all JaxTrainer logic to V2 Train

All signed off by Ryan O'Leary <ryanaoleary@google.com>.
Force-pushed from e898ed9 to 76983cf.
cc: @matthewdeng @andrewsykim I moved all the code under V2 since this change adds new API fields to the
```python
placement_group = None
backend_config = self._train_run_context.backend_config

if getattr(backend_config, "use_tpu", False):
    try:
        placement_group = reserve_tpu_slice(
            num_workers=num_workers,
            resources_per_worker=resources_per_worker,
            topology=getattr(backend_config, "topology", None),
            accelerator_type=getattr(backend_config, "accelerator_type", None),
        )
    except Exception as e:
        return ControllerError(e)
```
This logic is a bit specific to TPUs so I don't know if it belongs in this generic Controller code.
I think we should aim to find a way to modularize this more, perhaps as part of a Callback that gets created/injected whenever use_tpu=True.
Done in a341722. I added a new `TPUReservationCallback` (a `ControllerCallback`) that gets called in `_start_worker_group` before creating the immutable `WorkerGroupContext`. The callback returns a `bundle_label_selector` to influence the scheduling of the `WorkerGroup`. Currently `TPUReservationCallback` is the only callback that ends up getting called, but this way the logic in the controller does not directly reference TPUs and is extensible to other backends.
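The shape of that hook can be sketched in plain Python. The base-class name, method name, and label key below are illustrative assumptions, not the PR's actual `ControllerCallback` API:

```python
# Hypothetical sketch of a controller callback that contributes per-worker
# label selectors before the worker group is created. Interface names are
# assumptions for illustration.

class ControllerCallback:
    def before_worker_group_start(self, num_workers: int) -> list:
        """Return a bundle_label_selector (one dict per worker), or []."""
        return []


class TPUReservationCallback(ControllerCallback):
    def __init__(self, slice_name: str):
        # Slice name obtained from the earlier head-node reservation step.
        self.slice_name = slice_name

    def before_worker_group_start(self, num_workers: int) -> list:
        # Pin every worker bundle to the reserved TPU slice.
        return [{"ray.io/tpu-slice-name": self.slice_name}] * num_workers


def collect_bundle_label_selector(callbacks, num_workers):
    # The controller stays TPU-agnostic: it merges whatever the registered
    # callbacks return, without referencing TPUs directly.
    for cb in callbacks:
        selector = cb.before_worker_group_start(num_workers)
        if selector:
            return selector
    return []
```

The controller only depends on the generic callback interface, which is what makes the approach extensible to other backends.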
```python
num_workers: int
resources_per_worker: Dict[str, float]
placement_strategy: str = "PACK"
placement_group: Optional[PlacementGroup] = None
```
Not sure if it's a good idea to pass in a created PlacementGroup. The divergence in PlacementGroup creation logic may lead to unexpected behavior depending on which path it goes down.
I was worried about the head PlacementGroup being released before the full slice was scheduled, causing a race condition. Thinking about it more I think that the PG shouldn't be removed until the JaxTrainer is cleaned up, so this shouldn't be a concern. I'll update the implementation to just return the slice name and pass a bundle_label_selector to the WorkerGroup.
Done in a341722.
python/ray/train/v2/api/config.py (outdated)
```
topology: [Experimental] If specified, Ray Train will launch the training
    coordinator and workers on nodes with the specified topology. Topology is
    auto-detected for TPUs and added as Ray node labels. This arg enables
    SPMD execution of the training workload.
```
Is topology a first-class Ray Core concept? We'd want to make sure it's easy to understand from the API what inputs this takes in and how it'll be used.
Also for TPU users how familiar are topology/accelerator? Would it be easier for the user to just specify the pod type directly?
I don't see topology used in Ray core at all, except to configure TPU env vars and node labels - but any users of multi-host TPUs should be familiar with the concept. The concept is also already introduced in KubeRay through the numOfHosts field: https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/tpu.html.
I think topology and accelerator type are the best top-level variables for users to specify, since currently in GKE these are the two values users configure when creating their GKE nodepool and when scheduling pods to it using the cloud.google.com/gke-tpu-accelerator and cloud.google.com/gke-tpu-topology nodeSelectors: https://cloud.google.com/kubernetes-engine/docs/how-to/tpus.
Topology is quite standard TPU concept. TPU type / Pod Type is in some cases not uniquely mapped to a topology.
Awesome, is it safe to say that this API would then be super intuitive for a TPU user? Is there any other grouping/organization that might be more natural to how a user thinks about setting up their workload?
```python
scaling_config=ScalingConfig(
    use_tpu=True,
    num_workers=4,
    topology="2x2x4",
    accelerator_type="TPU-V4",
    resources_per_worker={"TPU": 4},
    placement_strategy="SPREAD",
),
```
Yeah I think this top level API should be clear for TPU users - the only thing I can think of is that we could have num_workers, resources_per_worker and placement_strategy be auto-set based on the topology if not provided. For example, if we have a multi-host topology of 4x4 v6e we could automatically detect that num_workers should be 4, resources_per_worker should be TPU: 4 since that's the number of chips on each host, and placement_strategy should be SPREAD.
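That auto-detection idea can be sketched as a small helper that derives the scaling fields from the topology string. The helper, its name, and the fixed chips-per-host constant are illustrative assumptions (real logic would key chips-per-host off the accelerator type), not part of the PR:

```python
import math

# Hypothetical sketch: derive ScalingConfig-style defaults from a TPU
# topology string such as "2x2x4" or "4x4". Assumes 4 chips per host,
# which matches the 4x4 v6e example discussed above; production logic
# would look this up per accelerator type.
CHIPS_PER_HOST = 4

def defaults_from_topology(topology: str) -> dict:
    dims = [int(d) for d in topology.split("x")]
    total_chips = math.prod(dims)
    num_hosts = max(1, total_chips // CHIPS_PER_HOST)
    return {
        "num_workers": num_hosts,
        "resources_per_worker": {"TPU": min(total_chips, CHIPS_PER_HOST)},
        # Spread workers so each lands on a distinct TPU host.
        "placement_strategy": "SPREAD" if num_hosts > 1 else "PACK",
    }
```

For the "2x2x4" example above, this yields 4 workers with `{"TPU": 4}` each and a `SPREAD` strategy, matching the manually specified `ScalingConfig`.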
Validated the JaxTrainer manually using the following MaxText example:

Verified that TPU-specific Ray node labels are set:
python/ray/train/v2/jax/tpu_utils.py (outdated)
```python
head_placement_group = ray.util.placement_group(
    bundles=[{f"TPU-{pod_type}-head": 1}],
    bundle_label_selector=[head_label_selector],
)
```
Since we are only using this for finding the head host, we can simplify the logic to schedule an Actor rather than a Placement Group.
Do we know when the Ray Actor will get cleaned up? My concern is that the Ray Actor will go out of scope and release the head resource before the full slice schedules, causing a race condition. Conversely, if this does work, is it okay to create a bunch of Ray Actors that are never cleaned up, running on each TPU head? I wasn't sure which of the two, actors or placement groups, has lower overhead.
@MengjinYan do you know what is recommended from Core side?
I believe the lifecycle of Actors will be easier to track due to reference counting. We can create a local attribute in this Callback which will hold the head resource actor reference. If this Callback goes out of scope or if another job is launched, that reference would go away and get cleaned up.
A Ray actor is automatically cleaned up when all handles to it go out of scope. So as long as an actor handle exists, the actor will not be cleaned up.
I might have misunderstood this, but from the code it looks like after we create the placement group for the TPU head, we don't schedule anything onto it other than getting the slice ID from the head, and when the function returns we also don't return the placement group for further training task scheduling. Is that the expected behavior? If so, then with an actor, the actor would actually go out of scope and be released.
```python
"JAX_PLATFORMS": "tpu",
"ENABLE_PJRT_COMPATIBILITY": "true",
"TPU_SLICE_BUILDER_DUMP_CHIP_FORCE": "true",
"TPU_SLICE_BUILDER_DUMP_ICI": "true",
"XLA_FLAGS": "--xla_dump_to=/tmp/xla_dump_file --xla_dump_hlo_as_proto",
```
For my understanding, are these always needed? Should they be set up automatically as part of the JaxBackend?
I believe the requirements differ across TPU generations; I think they're currently only needed on TPU v6e. We could start passing the accelerator type into the JaxConfig and set the PJRT vars automatically for v6e? Or we could just leave it to the user to configure. My initial thought was to keep the JaxConfig as user-specified as possible, and then later add logic to autodetect fields like num_workers, placement_strategy, resources_per_worker, and env vars based on the topology and accelerator type.
I see. We can keep it manual here for now, but over time, as we discover patterns and boilerplate that we can abstract away from the user, we can fold this directly into the default JaxBackend logic to reduce the barrier to entry for users.
```
    per worker). Defaults to False. The number of TPUs reserved by each
    worker can be overridden with the ``resources_per_worker``
    argument. This arg enables SPMD execution of the training workload.
topology: [Experimental] If specified, Ray Train will launch the training
```
Is this specific to TPUs? Should it be tpu_topology instead?
I'm not super familiar with GPUs, but I think the field can probably be extended to set fields automatically in the Config (when left out) for GPUs too - so leaving it as topology might be fine. I don't have much of a preference either way though.
matthewdeng
left a comment
Very nice!
```python
if scaling_config.use_tpu and (
    num_workers > 1 or scaling_config.num_workers > 1
):
```
Going back on what I said before (oops), we can just check num_workers here.
Suggested change:
```python
if scaling_config.use_tpu and num_workers > 1:
```
done
python/ray/train/v2/api/config.py (outdated)
```python
from ray.tune.search.sample import Domain

SampleRange = Union["Domain", Dict[str, List]]
```
We can remove this.
done
@ryanaoleary don't forget to add
Force-pushed from 2f12fa5 to 1028bfe.
commit a86bb60df41987bfee65b227fcce69a7eee44b9e
Author: Justin Yu <justinvyu@anyscale.com>
Date: Tue Aug 19 08:58:10 2025 -0700
[core] Fix actor import error message for async actors (#55722)
When the Ray actor class fails to import upon actor creation, we create
a TemporaryActor in its place to emit an error message. However, for
async actors, the TemporaryActor creation fails to initialize due having
no async methods. This PR adds a dummy async method to handle this case.
```python
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "<string>", line 35, in <module>
File "/Users/justin/Developer/ray/python/ray/_private/auto_init_hook.py", line 22, in auto_init_wrapper
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/Users/justin/Developer/ray/python/ray/_private/client_mode_hook.py", line 104, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/Users/justin/Developer/ray/python/ray/_private/worker.py", line 2896, in get
values, debugger_breakpoint = worker.get_objects(object_refs, timeout=timeout)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/justin/Developer/ray/python/ray/_private/worker.py", line 970, in get_objects
raise value
ray.exceptions.ActorDiedError: The actor died because of an error raised in its creation task, ray::Foo.__init__() (pid=42078, ip=127.0.0.1, actor_id=7000b00899a3a8b1d05bbdc601000000, repr=<__main__.FunctionActorManager._create_fake_actor_class.<locals>.TemporaryActor object at 0x10732dc10>)
ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
class_name: TemporaryActor
actor_id: 7000b00899a3a8b1d05bbdc601000000
Failed to create actor. You set the async flag, but the actor does not have any coroutine functions.
(TemporaryActor pid=42078) The original cause of the RayTaskError (<class 'ray.exceptions.ActorDiedError'>) isn't serializable: cannot pickle 'google._upb._message.Descriptor' object. Overwriting the cause to a RayError.
```
---------
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
commit f53e38b119ab19c27db6f32d76710a3dd8c6e9c1
Author: tannerdwood <71387269+tannerdwood@users.noreply.github.com>
Date: Tue Aug 19 08:44:44 2025 -0700
[Core] Update DLAMI Information in aws.md (#55702)
Signed-off-by: Tanner Wood <tanwood@amazon.com>
Co-authored-by: Tanner Wood <tanwood@amazon.com>
commit c4482d2fc6d7956104c5b0208a7cc14120737652
Author: Ibrahim Rabbani <irabbani@anyscale.com>
Date: Tue Aug 19 07:47:57 2025 -0700
[core] Remove job submission code for using JobAgent on a random worker node. (#55718)
When a Job is submitted through the SDK/JobClient, the request goes to
the dashboard's JobHead.
The JobHead submits a request to a JobAgent which has a JobManager. The
JobManager creates a JobSupervisor actor which manages the lifecycle of
the job.
In #47147, the `RAY_JOB_AGENT_USE_HEAD_NODE_ONLY` feature flag to force
head node's JobAgent to be used for job submission. The flag was
intended to be a temporary kill switch if head_node only scheduling had
issues.
Now that #47147 has been merged for over a year, I'm cleaning up the
flag in this PR and making it the default (and only behavior).
---------
Signed-off-by: irabbani <irabbani@anyscale.com>
commit f797480b014262ffdf7b33a431fcbc34c0d95b2f
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Tue Aug 19 00:10:43 2025 -0700
[core] Correct bytes in flight when objects <5mb (#54349)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit be33b6fb411b21d2bb2cadfc8755a66c195d2272
Author: avigyabb <98926738+avigyabb@users.noreply.github.com>
Date: Mon Aug 18 21:41:43 2025 -0700
[Core] Bind runtime env agent and dashboard agent http server to specified ip instead of 0.0.0.0 (#55431)
Signed-off-by: avigyabb <avigyabb@stanford.edu>
Signed-off-by: avibasnet31 <avigyabb@anyscale.com>
Co-authored-by: avibasnet31 <avigyabb@anyscale.com>
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
commit 69bc6c1e8394ef5846bf3d3d36a7fd384441c5a1
Author: Ibrahim Rabbani <irabbani@anyscale.com>
Date: Mon Aug 18 21:38:58 2025 -0700
[core] ray.put returns an ObjectRef without an owner_address. (#55636)
Signed-off-by: irabbani <irabbani@anyscale.com>
Signed-off-by: Ibrahim Rabbani <israbbani@gmail.com>
Co-authored-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 28d1dc9fbdc57b3c33dcc244924e520fa158104b
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Mon Aug 18 21:36:12 2025 -0700
[Serve.llm] Support colocating local DP ranks in DPRankAssigner (#55720)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
commit 30c8122962dcb1285fd4324313770a53693ce863
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Aug 18 16:46:06 2025 -0700
[image] refactor apt package installation (#55701)
avoid reinstalling packages that are already installed in the base image
also rename the saved requirements file to `extra-test-requirements.txt`
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 6993ba79da529a44fb23b1717acac3d83aa5dcef
Author: Jeffrey Wang <jeffrey31415926@gmail.com>
Date: Mon Aug 18 16:19:27 2025 -0700
[data.llm] Adjust LLM engine timing logic (#55595)
Signed-off-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
Co-authored-by: jeffreyjeffreywang <jeffjeffreywang@gmail.com>
commit 7424ffbdc7b5df15141c66c23c1adafa36cd431b
Author: vincenthhan <46981434+BestVIncent@users.noreply.github.com>
Date: Tue Aug 19 07:18:28 2025 +0800
[llm] support custom s3 endpoint when downloading models from remote (#55458)
Signed-off-by: vincenthhan <vincenthhan@tencent.com>
Co-authored-by: vincenthhan <vincenthhan@tencent.com>
commit e9160b72338c4d682af2eb0249f442bd1ff4992d
Author: Qiaolin Yu <liin1211@outlook.com>
Date: Mon Aug 18 15:39:46 2025 -0700
[core] Not overriding accelerator id env vars when num_accelerators is 0 or not set (#54928)
commit fd3f23593de38fec41c8321da7c169b08eb768cc
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Mon Aug 18 17:32:25 2025 -0500
[core] Remove unnecessary dependency of raylet->gcs (#55710)
The raylet binary was depending on all of the `gcs/` directory for
absolutely no reason :(
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 3ea021227eaeb0404c42cf09015bc685eb097cfb
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Mon Aug 18 17:28:48 2025 -0500
[core] Separate targets for pubsub interfaces (#55681)
Move publisher & subscriber interfaces into their own header files &
build targets.
Update relevant callsites to use them.
Unfortunately, `reference_count_test` reaches into internal
implementation details of the publisher and this dependency was a little
tricky to break, so not touching it here.
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 1cb4c2c212e5a153e74d86f1e0d2e48942a19502
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Mon Aug 18 15:12:37 2025 -0700
[core] rename ray/telemetry to ray/observability (#55703)
As title. According to @edoakes, ray telemetry has a different meaning
in the ray eco-system. Observability directory will consists for
metrics, events and log related infra.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit 9128c40da7a8166bb7a9ca7025b01d8a7a5e38db
Author: Sven Mika <svenmika1977@gmail.com>
Date: Mon Aug 18 22:40:55 2025 +0200
[RLlib] Fix MetricsLogger/Stats throughput bugs. (#55696)
commit 01b9e5b1a6b913041b299d7cd262254cfc99503a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Aug 18 12:33:07 2025 -0700
[ci] release test: use rayci build id for image tags (#55619)
rather than using commit based tags.
this avoids runs across different runs on the same commit to crosstalk
to each other.
---------
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 6326b2c539d4337019dc5107b569e272fc8a8fcf
Author: Sagar Sumit <sagarsumit09@gmail.com>
Date: Tue Aug 19 00:34:29 2025 +0530
[core] Call `__ray_shutdown__` method during actor graceful shutdown (#54584)
This PR introduces a new `__ray_shutdown__ ` method mechanism for Ray
actors to perform deterministic resource cleanup before actor
termination. This addresses issue #53169 by providing a reliable
alternative to `__del__` methods for critical cleanup operations.
The new `__ray_shutdown__ ` method can be explicitly overriden and
provides:
- Deterministic execution: Called explicitly by Ray during actor
shutdown.
- Reliable timing: Executes at the exact right moment before process
termination.
- Optionality: Actors without the method continue to work normally.
Main changes:
1. `core_worker.cc` - Add cleanup call in Shutdown()
2. `_raylet.pyx` - Add callback registration
3. `worker.py` - Register callback when actor is created
Closes #53169
---------
Signed-off-by: Sagar Sumit <sagarsumit09@gmail.com>
commit ec4056ea67e4226fea2f11abaf4e16bf5a3aba14
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Mon Aug 18 13:53:47 2025 -0500
[ci] Add ability for users to include `.user.bazelrc` file (#55698)
I wanted a way to turn on `--incompatible_strict_action_env` by default
without having an untracked change in my `.bazelrc` constantly and
without needing to pass the `--config` flag all the time. This PR allows
users to define a `.user.bazelrc` file for such changes.
For example, to turn on `--incompatible_strict_action_env` by default,
I've added this file:
```
build --config=strict
test --config=strict
```
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 5afa2abcb65980e2ab558076b39e9a44bd2e3566
Author: Potato <tanxinyu@apache.org>
Date: Tue Aug 19 02:46:17 2025 +0800
[Data]Fix sort_benchmark url not found error (#55692)
The url is invalid as we changed the name for `sort.py` in
https://github.com/ray-project/ray/pull/49017
---------
Signed-off-by: Potato <tanxinyu@apache.org>
commit 81856dfad0ab26dffc5d9209ae297f8acd16ce9a
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Mon Aug 18 11:28:28 2025 -0700
[wheel] when `RAY_DISABLE_EXTRA_CPP=1`, do not build cpp stuff (#55697)
this gives us a way to safely skip the ray-cpp building parts when
building ray wheel.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit b95fc3e0757a89dea38f243c9a29f3768f82b98f
Author: Sampan S Nayak <sampansnayak2@gmail.com>
Date: Mon Aug 18 23:47:39 2025 +0530
[core] Add logic to convert TaskProfileEvent to RayEvent before sending to event aggregator (#55138)
As part of oneEvent effort, all individual task event objects (such as
task definition event, task execution event, etc) are being consolidated
under one type: RayEvent.
This pr adds the translation logic to convert the `TaskProfileEvent` ->`
rpc::events::RayEvent object` + tests to verify that the translation and
subsequent section of the `TaskEventBufferImpl` correctly deal with the
constructed RayEvent.
Signed-off-by: sampan <sampan@anyscale.com>
Signed-off-by: Sampan S Nayak <sampansnayak2@gmail.com>
Co-authored-by: sampan <sampan@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: Mengjin Yan <mengjinyan3@gmail.com>
commit 6c90c0de34f5b3f618db076c4f3197f78aefc8bf
Author: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com>
Date: Tue Aug 19 02:00:27 2025 +0800
[Data] explain API for dataset (#55482)
Introduce `explain()` for datasets, which outputs the logical plan and the
physical plan.
part of #55052
---------
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
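The explain() idea above can be illustrated with a toy plan tree. This is a sketch only: the `Plan` class, node names, and output format here are made up for illustration and are not Ray Data's actual implementation or output.

```python
# Illustrative sketch only: plan/node names are hypothetical, not Ray Data's.
class Plan:
    def __init__(self, name, child=None):
        self.name = name
        self.child = child

    def lines(self):
        # Render the plan chain as an indented list, root first.
        out, node, depth = [], self, 0
        while node is not None:
            out.append("  " * depth + node.name)
            node, depth = node.child, depth + 1
        return out


def explain(logical, physical):
    """Return the logical plan followed by the physical plan, explain()-style."""
    return "\n".join(
        ["-- Logical Plan --", *logical.lines(),
         "-- Physical Plan --", *physical.lines()]
    )


logical = Plan("Map", Plan("Read"))
physical = Plan("TaskPoolMapOperator[Map]", Plan("InputDataBuffer[Read]"))
print(explain(logical, physical))
```

The key design point is that one call surfaces both levels: the logical plan the user wrote and the physical operators the engine actually runs.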
commit a81c7c9fbd5ed70e8aaae5cc2f1bc3e284ec8723
Author: Timothy Seah <timothy.seah777@yahoo.com>
Date: Mon Aug 18 10:38:44 2025 -0700
[train][tune] Train Controller is always actor + fix tune integration to enable this (#55556)
In the past, we used `RUN_CONTROLLER_AS_ACTOR_ENV_VAR` to toggle whether
to run the controller as a separate actor (we want this in most cases)
or on the current actor (we wanted this in Tune so we can propagate
`ray.train.report` from Train to Tune using the `TuneReportCallback`).
However, in order to implement `get_all_reported_checkpoints`
(https://github.com/ray-project/ray/pull/54555), we need to pass the
Train Controller actor to all the Train Worker actors. This method
wouldn't work when using Train from Tune because the Train Controller
actor handle would be the Tune Trainable actor handle which does not
have the async `get_all_reported_checkpoints` method.
This PR gets rid of `RUN_CONTROLLER_AS_ACTOR_ENV_VAR` once and for all
by making all communication between Train and Tune happen through a
lightweight `ray.util.Queue` actor instead of forcing Train and Tune to
run in the same process.
---------
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
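The queue-relay pattern described above can be sketched in plain Python, using `queue.Queue` and a thread as stand-ins for the `ray.util.Queue` actor and the separate Tune process; the function names here are illustrative, not the actual Train/Tune internals.

```python
import queue
import threading

# Stand-in for the lightweight ray.util.Queue actor: Train workers push
# report dicts onto the queue, and the Tune side drains it from elsewhere
# instead of requiring the Train controller to run inside the Tune trainable.
reports = queue.Queue()


def train_worker(num_steps):
    # Plays the role of ray.train.report: publish metrics for each step.
    for step in range(num_steps):
        reports.put({"step": step, "loss": 1.0 / (step + 1)})
    reports.put(None)  # sentinel: training finished


def tune_listener(collected):
    # Plays the role of the Tune side consuming reports from the queue.
    while True:
        item = reports.get()
        if item is None:
            break
        collected.append(item)


collected = []
t = threading.Thread(target=tune_listener, args=(collected,))
t.start()
train_worker(3)
t.join()
```

Because the two sides only share the queue, the controller can always be its own actor, which is what makes passing its handle to all Train workers possible.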
commit 796858a91c9a98b7785fdf012096b4a3e5f22cca
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 19:10:31 2025 +0200
[RLlib] Set default to 'log_gradients=False' to stabilize tests (#55695)
Right now `log_gradients` is by default `True` and this appears to
destabilize tests (see #47717). This PR switches the default to `False`.
Closes #47717
---------
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
commit ae0d4fc04f7d56e77c080a24bf998a67a3e88631
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Mon Aug 18 09:58:37 2025 -0700
[Serve] Update test_deploy_2.py with get_application_url (#55665)
We remove the hardcoded URL within the test and use
`get_application_url()` instead.
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
commit be423b042d0370456a8abead58fd6502eeb6c6d4
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Mon Aug 18 09:37:47 2025 -0700
[ci] allowing spaces in append args field on depsets (3/4) (#55625)
- Allowing for spaces in append args (splitting append-arg flags before
appending)
- Adding a couple of unit tests
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Co-authored-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit faf06e09e55558fb36c72e91a5cf8a7e3da8b8c6
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Mon Aug 18 07:33:47 2025 -0700
[core] Follow-up to address comments of BaseException PR #55602 (#55690)
Address comments from #55602
- Moving the base exception and exception group tests into their own
file so they can use a shared fixture
- Adding comment for SystemExit and KeyboardInterrupt behavior
- Adding tests to test behavior if user code raises SystemExit or
KeyboardInterrupt
---------
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit e0d8e6f46a8734e16f28831941d937d5961c1d12
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 15:26:58 2025 +0200
[RLlib] - Fix `TensorType` (#55694)
commit 1e5094fd5cbfef1de738243b84436b94a7499304
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 15:13:05 2025 +0200
[RLlib - Offline RL] Fix bug in `return_iterator` in multi-learner settings. (#55693)
commit b830b8d3ee64f7c661d4bfa5fb0e7be99ff871a5
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Mon Aug 18 12:30:24 2025 +0200
[RLlib - Offline] Fix some bugs in the docs for IQL and CQL (#55614)
commit dde4dbad440ada233d5b3e13a990cf25c20ec60e
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Sun Aug 17 21:33:48 2025 -0700
[Serve.llm] Fix DPServer allocation to CPU node (#55688)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 7321aeed2957a5a71ccb34c2212cd8f4c63a9fab
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Sun Aug 17 18:34:27 2025 -0500
[core] Remove unnecessary publisher dependency from raylet (#55678)
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 6561061f79b31be4f7cecb20e34bdc92e374ef16
Author: Jason Li <57246540+JasonLi1909@users.noreply.github.com>
Date: Sun Aug 17 11:56:41 2025 -0700
Fixing Circular Import in ray.train.v2.lightning.lightning_utils (#55668)
Importing `RayTrainReportCallback` from
`ray.train.lightning._lightning_utils` in
`ray.train.v2.lightning.lightning_utils` causes a circular import in the
case that `ray.train.v2.lightning.lightning_utils` is loaded before
`ray.train.lightning`.
This PR removes the `ray.train.v2.lightning` module and migrates the
changes upstream to the original `RayTrainReportCallback` class.
---------
Signed-off-by: JasonLi1909 <jasli1909@gmail.com>
commit 03b07db82ab52c5886edd94885fa12d7c30b7b39
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Sat Aug 16 19:41:24 2025 -0700
[core] Fix test_failure on windows (#55687)
Mixing ray_start_regular and ray_start_regular_shared in the same file
can lead to unexpected behavior where cluster state unexpectedly
carries over into the setup for another test. Here on Windows,
*test_put_error1*, *test_put_error2*, and *test_version_mismatch* are
skipped, so *test_export_large_objects* runs directly after
*test_baseexception_actor_creation*, causing issues during its setup.
In a follow-up, we will create another test file for all BaseException-related
tests so they can use a shared cluster.
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 9ae08276c6c466557281dca28477e9ad1d374687
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Sat Aug 16 11:16:44 2025 -0700
[core] Update base exception group tests (#55684)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit a44df1655f3031860f3afd4cc81fc0dc6ab5d6f0
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Fri Aug 15 23:38:03 2025 -0700
[ci] release test: fix to use small for test init (#55677)
otherwise the permission is incorrect
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 418f56258e2085a3f370696930a04ae83e7e0103
Author: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
Date: Sat Aug 16 03:19:20 2025 +0200
[serve.llm] Add reset_prefix_cache remote method to llm server (#55658)
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
commit 628df247832fa0e51274a6d53ae750eb9b54a794
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Fri Aug 15 17:12:20 2025 -0700
[serve.llm] Handle push telemetry race conditions (#55558)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 4c6993ee347e3a4d1ff9a26fb3daddd9bf50783c
Author: Balaji Veeramani <balaji@anyscale.com>
Date: Fri Aug 15 16:51:13 2025 -0700
[Data] Decouple actor and node autoscaling (#55673)
Actor pool autoscaling and node autoscaling are currently tied together
in a single `Autoscaler` base class, even though they work mostly
independently. This coupling makes testing harder (you have to mock
unused dependencies), complicates the interface, and forces you to touch
unrelated code when extending one type of autoscaling.
This PR splits `Autoscaler` into `ActorAutoscaler` and
`ClusterAutoscaler` to simplify testing, reduce complexity, and make
future extensions easier.
Signed-off-by: Balaji Veeramani <bveeramani@berkeley.edu>
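The shape of the split described above might look roughly like this. This is a sketch under assumptions: the class names come from the PR, but the method names and the fake subclass are illustrative, not the exact Ray Data interfaces.

```python
from abc import ABC, abstractmethod


class ActorAutoscaler(ABC):
    """Scales an operator's actor pool, independent of the cluster."""

    @abstractmethod
    def try_scale_up_or_down(self) -> int:
        """Return the change in the number of actors (may be negative)."""


class ClusterAutoscaler(ABC):
    """Requests nodes from the cluster based on pending resource demand."""

    @abstractmethod
    def try_trigger_scaling(self) -> None:
        """Send a resource request to the cluster autoscaler if needed."""


# With the interfaces decoupled, each side can be tested with a trivial fake
# instead of mocking the other autoscaler's unused dependencies:
class FixedStepActorAutoscaler(ActorAutoscaler):
    def __init__(self, step):
        self.step = step

    def try_scale_up_or_down(self):
        return self.step
```

The design benefit is exactly what the PR describes: a test for actor-pool scaling no longer needs to stub out node-autoscaling machinery, and vice versa.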
commit 9fdea0314ef90cedc341285398bb51d79475b6fd
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Fri Aug 15 15:16:27 2025 -0700
[Serve.llm] Support multi-node data parallel with set_dp_master_info() (#55653)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
Signed-off-by: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Co-authored-by: kourosh hakhamaneshi <31483498+kouroshHakha@users.noreply.github.com>
commit 5fbeff61f889af7eddb7ca7b55ec6a6c8939bc2b
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Fri Aug 15 16:48:53 2025 -0500
[core] Unify test directory layout on `.../tests/` (#55652)
We currently have multiple different patterns for test files:
- `*_test.cc` in the same directory as the implementation.
- `test/*_test.cc` (with `BUILD.bazel` in the test dir or sometimes in
the parent dir).
- `tests/*_test.cc` (with `BUILD.bazel` in the test dir or sometimes in
the parent dir).
Unifying on:
- `tests/*_test.cc`
- `tests/BUILD.bazel` for test targets
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit fc967a55018cf85d7f73381985273f429d14cb81
Author: Jiajun Yao <jeromeyjj@gmail.com>
Date: Fri Aug 15 13:30:44 2025 -0700
[Core] Simplify get_event_aggregator_grpc_stub to not depend on webui_url (#55640)
Signed-off-by: Jiajun Yao <jeromeyjj@gmail.com>
commit d6dce722f0ff25a55a3b3a4749bd32821bcccbec
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Fri Aug 15 14:54:28 2025 -0500
[serve] Fix easy `ray._private` dependency (#55659)
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 10af9d897bbdaae4202580ba14dea1d6efcb525b
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Fri Aug 15 12:37:21 2025 -0700
[ci] raydepsets: generating llm lock files (4/4) (#55500)
- generating llm lock files with raydepsets
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit b819ed4add79492dcdc58d7df277bbd1d438f11b
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Fri Aug 15 12:07:22 2025 -0700
[core] Fix objects_valid with except from BaseException (#55602)
We would encounter a Ray check failure on `objects_valid` whenever a
function throws an exception that extends from `BaseException`
instead of `Exception`. Fix this by catching `BaseException`
instead of `Exception` wherever we are vulnerable to exceptions thrown from
user Python code. We still have to special-case `SystemExit` and
`KeyboardInterrupt` because we consider those critical errors
ourselves and treat them as worker shutdown or task cancellation signals,
respectively.
Closes https://github.com/ray-project/ray/issues/43411
Signed-off-by: dayshah <dhyey2019@gmail.com>
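The catching pattern can be sketched in pure Python. This is a simplified illustration of the idea, not the actual core-worker code; `run_user_code` and `WeirdError` are made-up names.

```python
def run_user_code(fn):
    """Run user code, capturing any error so the task's result slot stays valid."""
    try:
        return ("ok", fn())
    except (SystemExit, KeyboardInterrupt):
        # Special-cased: treated as worker-shutdown / task-cancellation
        # signals, so they propagate instead of becoming task errors.
        raise
    except BaseException as e:
        # Catching BaseException (not just Exception) means exotic user
        # errors still produce a valid error object instead of a check failure.
        return ("error", e)


class WeirdError(BaseException):
    """A user error deriving from BaseException rather than Exception."""


def bad():
    raise WeirdError("boom")


status, payload = run_user_code(bad)  # ("error", WeirdError("boom"))
```

A plain `except Exception` would have let `WeirdError` escape, which is the path that previously tripped the `objects_valid` check.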
commit 44e0aea628f1f221345aeddaafce3b82d91cf9fa
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Fri Aug 15 20:44:34 2025 +0200
[RLlib] Fix `ImportError` in Atari examples. (#54967)
Running Atari with RLlib results in the error described in #53836. This is
related to the version of `gymnasium` installed with
`ray[rllib]` and then later installing
`gymnasium[atari,accept-rom-license]`. Using `gymnasium==1.1.1` resolves
this error.
Closes #53836
---------
Signed-off-by: simonsays1980 <simon.zehnder@gmail.com>
commit 2cdb27e49d3c4935fe90236f9affa15b5696a42f
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Fri Aug 15 11:07:19 2025 -0700
[Serve] Update route prefix assignment for ReplicaBase.reconfigure() (#55657)
Update an assigned value that slipped through in #55407
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
commit 616b9a19b42305ba5602e4f3bcab81c1e19cf3a0
Author: Edward Oakes <ed.nmi.oakes@gmail.com>
Date: Fri Aug 15 13:05:36 2025 -0500
[core] Clean up `RayletIpcClientInterface` (#55651)
Splits out `raylet_ipc_client_interface.h` into its own target.
Sub-interfaces that use the client should only depend on this interface,
not the full `raylet_ipc_client` target.
This improves incremental builds. For example, now if
`raylet_ipc_client.{h,cc}` changes (including any of its transitive
dependencies), the core worker `store_provider` targets no longer need
to be recompiled. They'll only be recompiled if
`raylet_ipc_client_interface.h` changes, which should be much less
frequent.
I've also moved the `FakeRayletIpcClient` into the source tree.
---------
Signed-off-by: Edward Oakes <ed.nmi.oakes@gmail.com>
commit 8d6d9fa4c63e7d1e7ecd7f14347c1a565efe4d95
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Fri Aug 15 10:56:53 2025 -0700
[serve.llm] Correct Pyright lints for Ray Serve LLM examples (#55284)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 0b77c72a0133d407fb58a9114764e652a37e963c
Author: Justin Yu <justinvyu@anyscale.com>
Date: Fri Aug 15 10:48:54 2025 -0700
[data] Wrap batch index in a `BatchMetadata` class (#55643)
Wrap batch metadata in a dataclass that we can extend in the future.
Signed-off-by: Justin Yu <justinvyu@anyscale.com>
commit a39bc679bace4dfaa334c88572effbc5b952a59f
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Fri Aug 15 10:14:33 2025 -0700
[serve] pin the version of wrk used in serve ci base (#55650)
and clone with depth=1
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 20c84e6193d22d29f25cc36e76ea455417349562
Author: akyang-anyscale <alexyang@anyscale.com>
Date: Fri Aug 15 09:56:09 2025 -0700
[serve] Add model composition serve benchmarks (#55549)
Model composition is a common paradigm we should also track performance
for.
---------
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
commit c5a16768c71c354738fc4bef552bd4a58c6b3089
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Fri Aug 15 09:43:09 2025 -0700
[Serve] Update test_http_routes to use get_application_url (#55623)
Updates one of the serve tests, test_http_routes, so it can start using
get_application_url instead of hardcoded urls.
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
Signed-off-by: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
commit c7a7d41b4bbd7509b0cb7cc112fd5ac9af5e55af
Author: Aleksei Starikov <aleksei.starikov.ax@gmail.com>
Date: Fri Aug 15 18:42:41 2025 +0200
[serve] Add a function with a Warning to migrate constants that use `or` expression. (#55464)
In the `serve` package, some of the constants that are initialized from
environment variables are silently replaced with their default values
when set to `0`, even if a user set them to `0` explicitly. In
addition, they can also be set to negative values, which is likely
not expected.
The list of the constants:
```
PROXY_HEALTH_CHECK_TIMEOUT_S
PROXY_HEALTH_CHECK_PERIOD_S
PROXY_READY_CHECK_TIMEOUT_S
PROXY_MIN_DRAINING_PERIOD_S
--
RAY_SERVE_KV_TIMEOUT_S
```
It happens because of the `or value` structure.
This PR introduces:
- a temporary function `get_env_float_non_zero_with_warning` with a
`FutureWarning`. The function shows a warning in the following
format when it encounters an unexpected value:
```
FutureWarning: Got unexpected value `0.0` for `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` environment variable! Starting from version `2.50.0`, the environment variable will require a positive value. Setting `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` to `10.0`.
PROXY_HEALTH_CHECK_TIMEOUT_S = get_env_float_non_zero_with_warning(
-- or
FutureWarning: Got unexpected value `-1.0` for `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` environment variable! Starting from version `2.50.0`, the environment variable will require a positive value. Setting `RAY_SERVE_PROXY_HEALTH_CHECK_TIMEOUT_S` to `-1.0`.
PROXY_HEALTH_CHECK_TIMEOUT_S = get_env_float_non_zero_with_warning(
-- or
FutureWarning: Got unexpected value `0.0` for `RAY_SERVE_KV_TIMEOUT_S` environment variable! Starting from version `2.50.0`, the environment variable will require a positive value. Setting `RAY_SERVE_KV_TIMEOUT_S` to `None`.
RAY_SERVE_KV_TIMEOUT_S = get_env_float_non_zero_with_warning(
```
If the input value is positive, no warning will be emitted.
- `None` default value support for env variables (introduced for the
`RAY_SERVE_KV_TIMEOUT_S`)
- `todo` comment for removing the function: `todo: replace this function
with 'get_env_float_positive' for the '2.50.0' release.`
Closes #55454
---------
Signed-off-by: axreldable <aleksei.starikov.ax@gmail.com>
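The failure mode and the warning-based fix described above can be sketched as follows. The helper name comes from the PR, but its body and the `DEMO_TIMEOUT_S` variable are simplified assumptions for illustration, not Serve's actual constants module.

```python
import os
import warnings


def get_env_float_buggy(name, default):
    # The `or default` pattern: float("0") == 0.0 is falsy, so an explicit
    # user-set 0 is silently replaced by the default.
    return float(os.environ.get(name, default)) or default


def get_env_float_non_zero_with_warning(name, default):
    # Simplified sketch of the temporary helper: same result as today for
    # zero/negative values, but with a FutureWarning instead of silence.
    raw = os.environ.get(name)
    if raw is None:
        return default
    value = float(raw)
    if value > 0:
        return value
    fallback = default if value == 0 else value  # 0 -> default, negatives kept
    warnings.warn(
        f"Got unexpected value `{value}` for `{name}` environment variable! "
        f"A future version will require a positive value. "
        f"Setting `{name}` to `{fallback}`.",
        FutureWarning,
    )
    return fallback


os.environ["DEMO_TIMEOUT_S"] = "0"
assert get_env_float_buggy("DEMO_TIMEOUT_S", 10.0) == 10.0  # explicit 0 lost
```

The behavior is unchanged for now; the `FutureWarning` gives users a release cycle to fix non-positive settings before the stricter `get_env_float_positive` lands.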
commit de1494e57497b6c57037edf83044ee507fb80159
Author: akyang-anyscale <alexyang@anyscale.com>
Date: Fri Aug 15 09:34:30 2025 -0700
[serve] Refactor the router and handle (#55635)
Refactor Serve deployment handle and router.
---------
Signed-off-by: akyang-anyscale <alexyang@anyscale.com>
commit d95ef0c74138e5a529b5f4b0134177d5aa9bdee0
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Fri Aug 15 00:00:49 2025 -0700
[ci] release test: use rayci to perform test init (#55629)
so that rayci buildid can be populated
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit fe54c9554106b1e4b89c52833b6251143b0092e5
Author: Qiaolin Yu <liin1211@outlook.com>
Date: Thu Aug 14 22:02:24 2025 -0700
[ci] Add hook to clean the Ray address file before the test run starts (#54715)
Co-authored-by: Jiajun Yao <jeromeyjj@gmail.com>
commit 486935db5ede79b419623f29e2593c76a0df57c9
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Aug 14 19:25:41 2025 -0700
[core] add test rules for container tests (#55622)
The `core: container` test is pretty flaky on premerge and blocks PRs
from time to time. This PR adds a test rule to only run this test on a
change that touches `python/ray/runtime_env`.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit c62889c8d2c72e4e3466f31995c43d2f0189b10e
Author: goutamvenkat-anyscale <goutam@anyscale.com>
Date: Thu Aug 14 18:53:49 2025 -0700
[Train] - Bump up test size for test_data_integration (#55633)
Signed-off-by: Goutam V <goutam@anyscale.com>
commit c7c7e7c8fb99bd1081fe4949ccdff2614e6ce8ca
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Thu Aug 14 17:45:05 2025 -0700
[ci] upgrading uv binary and updating test (2/4) (#55626)
- upgrading uv from 0.7.20 -> 0.8.10 to gain parity with the uv version
used by the compile-llm-lock-files job
- updating the unit test
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit 69f421884419c8c39a363eeb6b459bd77b6f0017
Author: Doyoung Kim <34902420+landscapepainter@users.noreply.github.com>
Date: Thu Aug 14 17:35:01 2025 -0700
[Serve] Add route_prefix field to DeploymentVersion (#55407)
This PR adds `route_prefix` to the `DeploymentVersion` class to allow a robust
lightweight config update with `route_prefix`.
---------
Signed-off-by: doyoung <doyoung@anyscale.com>
commit f8ee5c9629f99c88af1e919a8ba2191a0c07f607
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 16:44:58 2025 -0700
[ci] pipe through `RAYCI_DISABLE_JAVA` for manylinux base image building (#55606)
so that when we do not need java, we can skip installing JDK in the
image.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 078d055ad2520b433db28ddc5e48a45bdc0d64a2
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Thu Aug 14 16:44:08 2025 -0700
[ci] raydepsets changing load to build (1/4) (#55627)
updating cli command from load to build
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
commit 21bc4528339420623c2f2a1958c7fb68b5dd8a8c
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Thu Aug 14 14:42:57 2025 -0700
[core] Fix ubsan for publisher_test (#55621)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 1c55991ce455632e1ab9839cb4c25f3e4ddc379c
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Thu Aug 14 14:10:44 2025 -0700
[core][otel] change+simplify the feature flag for open telemetry (#55592)
Change and simplify the feature flag that enables OpenTelemetry. This will
let us enable OpenTelemetry for the next Ray release version
without worrying about messing up previous Ray release versions.
Test:
- CI
Signed-off-by: Cuong Nguyen <can@anyscale.com>
commit fc4ace25a81cf68b71e21c00f1be2532d5c6c148
Author: Kevin H. Luu <kevin@anyscale.com>
Date: Thu Aug 14 13:59:45 2025 -0700
[release] Script to build custom BYOD image (#55577)
Add `custom_byod_build` as a python binary that the Buildkite jobs can
call to build & push custom BYOD images
---------
Signed-off-by: kevin <kevin@anyscale.com>
commit 61bc2e8139e21429d487b0824391c26dcd596cc3
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 12:56:37 2025 -0700
[ci] read gce credentials file from global config when building anyscale images (#55580)
rather than using the hard-coded filename
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
Signed-off-by: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
commit 49d336cb332da4cdfff894e95ea6f0189f1b05ff
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Thu Aug 14 11:53:36 2025 -0700
[Serve.llm] Improve PrefixCacheAffinityRouter text normalization compat (#55588)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
commit 37158a22a44edb10d499b53d1f38f00315234a14
Author: harshit-anyscale <harshit@anyscale.com>
Date: Fri Aug 15 00:21:29 2025 +0530
skip test task processor for windows (#55616)
- skipping the task processor test on Windows to unblock CI
Signed-off-by: harshit <harshit@anyscale.com>
commit 400ea7716c50afe006ab69a5398fa5d3c2e08373
Author: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Date: Thu Aug 14 11:46:59 2025 -0700
[serve.llm][docs] Documentation for prefix cache-aware router (#55218)
Signed-off-by: Seiji Eicher <seiji@anyscale.com>
Signed-off-by: Seiji Eicher <58963096+eicherseiji@users.noreply.github.com>
Co-authored-by: angelinalg <122562471+angelinalg@users.noreply.github.com>
commit 6d7234b1b54ebc8d77ed9a127ce02b9ff4f9854c
Author: coqian <cong.qian@anyscale.com>
Date: Thu Aug 14 11:06:05 2025 -0700
[Data] Update the export API to refresh the dataset and operator states (#55355)
This PR is a revert of
[#55333](https://github.com/ray-project/ray/pull/55333) and resolves
conflict by [#55163](https://github.com/ray-project/ray/pull/55163)
Original description:
Some frequently used metadata fields are missing in the export API
schema:
- For both dataset and operator: state, execution start and end time
These fields are important for us to observe the lifecycle of the
datasets and operators, and can be used to improve the accuracy of
reported metrics, such as throughput, which relies on the duration.
Summary of change:
- Add state, execution start and end time at the export API schema
- Add a new state enum `PENDING` for dataset and operator, to represent
the state when they are not running yet.
- Refresh the metadata whenever the state of the dataset/operator gets
updated. The event will always contain the latest snapshot of all
the metadata.
Signed-off-by: cong.qian <cong.qian@anyscale.com>
commit 6a9938a73ff6d39ee72dcb68667a52b0ba658e8b
Author: Mengjin Yan <mengjinyan3@gmail.com>
Date: Thu Aug 14 11:05:39 2025 -0700
[Core] Add Logic to Check Label Selector in PG Scheduling (#55599)
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
commit c4d990cafe01ce4f6caec38e814217310fcc0a1c
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 11:02:48 2025 -0700
[ci] add rayci build id tags for release test images (#55605)
in addition to current tags.
first step to migrate to use rayci build id tags to stop release test
jobs from cross-talking to each other
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit af41960a49e85863709ef36fb4968f0021d730b3
Author: Stephanie Wang <smwang@cs.washington.edu>
Date: Thu Aug 14 10:02:16 2025 -0700
[core][gpu-object] Add a user-facing call to wait for tensor to be freed (#55076)
This adds a call `ray.experimental.wait_tensor_freed` that allows user
code to check when a tensor that it put into Ray's GPU object store has
been freed. Unlike the normal Ray object store, the GPU object store is
just a Python data structure on the actor, which allows us to avoid
copying. This means that the actor can keep a reference to an object in
its store. The API call allows the actor to check when the object has
been freed from the store, so that it can safely write to the tensor
again.
Closes #52341.
---------
Signed-off-by: Stephanie wang <smwang@cs.washington.edu>
Signed-off-by: Stephanie Wang <smwang@cs.washington.edu>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
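The mechanism in miniature: a per-actor store that keeps objects by reference and lets callers block until an object is freed. This is a toy model using a `threading.Event`, with a bytearray standing in for a GPU tensor; it illustrates the semantics of `wait_tensor_freed`, not the real API or implementation.

```python
import threading


class InActorObjectStore:
    """Toy model of the per-actor GPU object store: objects are kept by
    reference (no copy), and callers can wait until an object is freed
    before reusing the underlying buffer."""

    def __init__(self):
        self._objects = {}
        self._freed_events = {}

    def put(self, obj_id, tensor):
        # Store a reference, not a copy: the caller still aliases the tensor.
        self._objects[obj_id] = tensor
        self._freed_events[obj_id] = threading.Event()

    def free(self, obj_id):
        del self._objects[obj_id]
        self._freed_events[obj_id].set()

    def wait_tensor_freed(self, obj_id, timeout=None):
        """Block until obj_id is freed; then writing to the tensor is safe."""
        return self._freed_events[obj_id].wait(timeout)


store = InActorObjectStore()
tensor = bytearray(8)        # stand-in for a GPU tensor
store.put("obj-1", tensor)
store.free("obj-1")
assert store.wait_tensor_freed("obj-1", timeout=1.0)
tensor[0] = 42               # safe to mutate again
```

Because the store avoids copies, the wait call is what tells the actor that no consumer can still observe the tensor, so overwriting it cannot corrupt an in-flight read.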
commit f0b0aadd65b3a842ed42ef870ac3067ea42f30af
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 10:01:39 2025 -0700
[image] add base-extra for aarch64 images (#55586)
for easier use on ray cluster hosters like anyscale.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit ea27578265182b3b721b0b6b5a9f2d6a49e6e61b
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 10:01:25 2025 -0700
[ci] remove unused `use_base_extra` (#55604)
added incorrectly in a past change
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 7518fd8be262c5f1bdc8246e0a3c5cc7db5d1bd6
Author: Jun-Hao Wan <ken89@kimo.com>
Date: Fri Aug 15 00:09:47 2025 +0800
[Doc][KubeRay] Add InteractiveMode description for `ray-job-quick-start.md` (#55570)
Signed-off-by: win5923 <ken89@kimo.com>
Signed-off-by: Kai-Hsun Chen <kaihsun@apache.org>
Co-authored-by: Kai-Hsun Chen <kaihsun@apache.org>
commit 6afaeda7dc7eb700076ae98b5b356568a293cde2
Author: simonsays1980 <simon.zehnder@gmail.com>
Date: Thu Aug 14 17:08:01 2025 +0200
[RLlib] Add docs for Implicit Q-Learning. (#55422)
commit 4b6dba34d50d647a7929b1e9079954511a69c759
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Thu Aug 14 00:59:20 2025 -0700
[ci] fix incorrect ml-baseextra depends_on (#55596)
to depends on the right wanda job
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit e4410d09cd0de2a7b2e6e507c12b92d2741cd6ea
Author: Nikhil G <nrghosh@users.noreply.github.com>
Date: Wed Aug 13 22:52:11 2025 -0700
[serve.llm] fix: improve error handling for invalid model_id (#55589)
Signed-off-by: Nikhil Ghosh <nikhil@anyscale.com>
commit 02340e1f402b8ebde104e92c9941b149e5555acb
Author: harshit-anyscale <harshit@anyscale.com>
Date: Thu Aug 14 10:21:53 2025 +0530
add support for async inference (#54824)
This PR aims to provide basic support for asynchronous inference in
Ray Serve.
The RFC can be found at: https://github.com/ray-project/ray/issues/54652
This PR doesn't contain all the implementation pieces, as having all the
code changes in a single PR would be very difficult to review. The missing
pieces are:
- implementation of failed and unprocessed task queue for the celery
task processor
- add more detailed and thorough tests for the same.
These missing pieces will be taken care of in the subsequent PRs.
---------
Signed-off-by: harshit <harshit@anyscale.com>
commit 4dd73213096635cf78a1a69db84f244bb05ec50f
Author: lkchen <github@lkchen.net>
Date: Wed Aug 13 21:39:54 2025 -0700
[data.llm] Add FAQ to doc, explain STRICT_PACK strategy used in data.llm (#55505)
Signed-off-by: Linkun <github@lkchen.net>
commit 15887001ded1eca621f6890952c5c2a90d4e58a8
Author: Joshua Lee <73967497+Sparks0219@users.noreply.github.com>
Date: Wed Aug 13 20:56:08 2025 -0700
[core] Store local_raylet_rpc_client in raylet_client_pool (#55490)
Signed-off-by: joshlee <joshlee@anyscale.com>
commit fd681ee6e3a74f08918eec34ea7a5d2f9b502f39
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Aug 13 20:36:49 2025 -0700
[ci] raydepsets: implementing build arg sets (2/2) (#55423)
1/2 here: https://github.com/ray-project/ray/pull/55408
- implementing get depset by name and optional build arg set
- adding unit tests
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Elliot Barnwell <elliot.barnwell@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit f677f564cc56c07e7c93d29c33e2f7314ef34fa1
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Wed Aug 13 19:42:02 2025 -0700
[core] Improve gcs publish perf + clean up publisher in general (#55560)
This PR is focused on two things: removing a lot of unnecessary copies
when publishing from the GCS and when subscribing to the GCS, and
cleaning up publisher-related code, e.g. publish functions took
callbacks that were always nullptr, always returned Status::OK, etc.
There are no actual functional changes in this PR.
Copy killing that matters:
https://github.com/ray-project/ray/blob/4e5f03e7a1d06b9da8f3a9329400d426055f8ea4/src/ray/gcs/gcs_server/pubsub_handler.cc#L49-L59
Every GCS publish will result in an extra copy here because the
`pubsub_reply` we create is heap allocated while the actual reply is
arena allocated, so the swap will result in a copy of everything every
time we publish to every subscriber.
Also, there were multiple extra copies of messages inside gcs_pub_sub.cc
when the PythonGcsPublisher publishes and when the PythonGcsSubscriber
gets messages.
---------
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 1699dc367f71ac05db8486ac70758090c37403a7
Author: Neil Girdhar <mistersheik@gmail.com>
Date: Wed Aug 13 21:33:45 2025 -0400
Suppress type error (#50994)
Signed-off-by: Neil Girdhar <mistersheik@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
commit ceaa4fb6f5db3189f77a1ed0f2c407de47ce4792
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Wed Aug 13 18:22:54 2025 -0700
[Serve.llm] Use DEFAULT_MAX_ONGOING_REQUESTS for DPServer (#55583)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 54ae92386d2b4600e1a9327b4f83c4c48742a412
Author: Timothy Seah <timothy.seah777@yahoo.com>
Date: Wed Aug 13 17:40:01 2025 -0700
[train] Change DEFAULT variables from strings to bools (#55581)
All of these constants are used as the default value of
[`env_bool`](https://github.com/ray-project/ray/blob/master/python/ray/_private/ray_constants.py#L41),
which returns a bool.
Technically this is a no-op since "1" evaluates to True anyway, but this
is misleading because "0" actually also evaluates to True.
Signed-off-by: Timothy Seah <tseah@anyscale.com>
Co-authored-by: Timothy Seah <tseah@anyscale.com>
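The pitfall this commit fixes can be seen with a minimal sketch of an `env_bool`-style helper (simplified for illustration; the real `ray_constants.env_bool` may differ in details):

```python
import os


def env_bool(key: str, default):
    # Simplified sketch: parse the env var if set, otherwise return the
    # caller-provided default as-is.
    if key in os.environ:
        return os.environ[key].lower() in ("1", "true")
    return default


# With string defaults, both "1" and "0" are truthy, which is the
# misleading behavior the commit removes by switching defaults to bools.
assert bool("1") is True
assert bool("0") is True   # a "0" string default still acts as True
assert env_bool("SOME_DEFINITELY_UNSET_FLAG", False) is False
```

With bool defaults, `if env_bool(...)` behaves as the default's name suggests, instead of always taking the truthy branch.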
commit 9838ad64d43dbd25b77acfd834500cd96f793e28
Author: yi wang <48236141+my-vegetable-has-exploded@users.noreply.github.com>
Date: Thu Aug 14 08:32:54 2025 +0800
[DOC][Tune] fix: remove extra space in tune documentation (#55125)
Signed-off-by: my-vegetable-has-exploded <wy1109468038@gmail.com>
Co-authored-by: matthewdeng <matt@anyscale.com>
commit 1216e15c32de9ab44cbc9c5532b0571c6499732f
Author: Elliot Barnwell <elliot.barnwell@anyscale.com>
Date: Wed Aug 13 17:06:31 2025 -0700
[ci] raydepsets: implementing build arg sets (1/2) (#55408)
- converting build arg sets into a dictionary instead of a list
- updating the naming convention for depsets with build_arg_sets (suffix:
`_${BUILD_ARG_SET}` for the depset name in the config)
- adding unit tests
---------
Signed-off-by: elliot-barn <elliot.barnwell@anyscale.com>
Signed-off-by: Elliot Barnwell <elliot.barnwell@anyscale.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit ecc4c93af0308ccf4b5e08135865766e9a1fbd30
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Aug 13 16:35:17 2025 -0700
[image] add base-extra layer (#55513)
this is the layer required to run on the anyscale cloud and for running in
ray release tests.
we have been sourcing this layer from a tarball in s3; this change
builds it from the source.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit 3e34885814e4da9a83123e22b042a7ee684074ad
Author: Kishanthan Thangarajah <kshanth2101@gmail.com>
Date: Wed Aug 13 19:21:05 2025 -0400
[serve] Support custom autoscaling at deployment level for ray serve (#55253)
This PR adds initial changes to support custom auto scaling with ray
serve. Two new classes (AutoscalingContext and AutoscalingPolicy) have
been introduced as per discussions in
https://docs.google.com/document/d/1KtMUDz1O3koihG6eh-QcUqudZjNAX3NsqqOMYh3BoWA/edit?usp=sharing.
Related RFC
https://github.com/ray-project/ray/issues/41135#issuecomment-3156717488
The changes will have two phases.
Phase 1 adds the required changes to support custom autoscaling at the
deployment level. Phase 2 extends the changes to support custom
autoscaling at the application level. This PR is part of Phase 1
(deployment-level custom autoscaling).
Related to #41135
---------
Signed-off-by: Kishanthan Thangarajah <kshanth2101@gmail.com>
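As a rough illustration of the deployment-level idea (these are not the actual Ray Serve interfaces; the real `AutoscalingContext` and `AutoscalingPolicy` fields may differ), a custom policy can be modeled as a function from observed metrics to a target replica count:

```python
from dataclasses import dataclass


@dataclass
class AutoscalingContext:
    # Assumed shape, for illustration only.
    current_num_replicas: int
    avg_ongoing_requests: float
    target_ongoing_requests: float


def queue_based_policy(ctx: AutoscalingContext) -> int:
    """Scale replicas proportionally to observed vs. target load."""
    if ctx.target_ongoing_requests <= 0:
        return ctx.current_num_replicas
    desired = ctx.current_num_replicas * (
        ctx.avg_ongoing_requests / ctx.target_ongoing_requests
    )
    return max(1, round(desired))


ctx = AutoscalingContext(current_num_replicas=2, avg_ongoing_requests=8.0,
                         target_ongoing_requests=2.0)
# 2 replicas observing 4x the target load scale out proportionally
```

Attaching such a policy per deployment (Phase 1) rather than per application (Phase 2) is what this PR enables.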
commit 2c7bd7d06930e5cc302a01c5baedef43911e3582
Author: Dhyey Shah <dhyey2019@gmail.com>
Date: Wed Aug 13 14:35:25 2025 -0700
[core][ci] Kill debug wheel step (#55571)
Signed-off-by: dayshah <dhyey2019@gmail.com>
commit 52bef607fd4349e70a1874fb2d6a8a9f6d447111
Author: Matvei Pashkovskii <matvei.pashkovskii@amd.com>
Date: Thu Aug 14 00:10:21 2025 +0300
[Serve.llm] Add LMCacheConnectorV1 support for kv_transfer_config (#54579)
Signed-off-by: Matvei Pashkovskii <mpashkov@amd.com>
Signed-off-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
Co-authored-by: Kourosh Hakhamaneshi <kourosh@anyscale.com>
commit 32304ab50a5f1c94504d2610a338fef1e84ecef7
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Aug 13 13:21:41 2025 -0700
[release test] remove "multi" test frequency (#55561)
not used anywhere any more
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit c47048e6ebf1b7a705cdb1be18b027889623e1a4
Author: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Date: Wed Aug 13 12:56:01 2025 -0700
[core][obsclean/02] de-static more internal ray metrics (#55537)
Ray core currently offers two APIs for defining internal metrics: a
static object-oriented (OO) API and a template/extern-based API. The OO
API is also used for defining custom metrics at the Ray application
level, and I personally find it easier to read. This series of PRs aims
to unify all metric definitions under the OO API.
---------
This PR migrates **all** metrics from static to runtime definition, as
part of the effort to eliminate all statically defined metrics.
Currently, the OO interface attempts to register a metric at the same
time its first value is recorded, due to the [C++ static initialization
order fiasco](https://en.cppreference.com/w/cpp/language/siof.html),
which is awkward and potentially inefficient. We can fix this by
removing all statically defined metrics.
Test:
- CI
---------
Signed-off-by: Cuong Nguyen <can@anyscale.com>
Signed-off-by: Cuong Nguyen <128072568+can-anyscale@users.noreply.github.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
commit 6ebd7d013933dfa990b11ffcad63cfd6f78db6cd
Author: iamjustinhsu <140442892+iamjustinhsu@users.noreply.github.com>
Date: Wed Aug 13 12:22:56 2025 -0700
[data] Sanitization of Dataset Metadata Export (#55379)
A couple of things that have been improved
- updating structs should have string keys
- More tests for bytes, bytearrays, dataclasses
---------
Signed-off-by: iamjustinhsu <jhsu@anyscale.com>
commit 3d44e3d17b56e993f1fd7407bdf1288c852c8c41
Author: Mengjin Yan <mengjinyan3@gmail.com>
Date: Wed Aug 13 11:57:54 2025 -0700
[Core][TaskEventFollowup/03] Improve the Target Http Endpoint in Aggregator Agent (#55529)
This PR improves the target http endpoint in `aggregator_agent.py`:
- Merge the address and port into one env var that specifies the target http endpoint.
- Set the default value of the endpoint to be empty; events are only sent out when the endpoint is specified.
- Update corresponding tests.
-----------
Signed-off-by: Mengjin Yan <mengjinyan3@gmail.com>
Signed-off-by: myan <myan@anyscale.com>
commit 8d810e2667fc728e45ca990ff7d7dc8547eae99b
Author: Alexey Kudinkin <ak@anyscale.com>
Date: Wed Aug 13 14:32:25 2025 -0400
[Data] Fixing `AutoscalingActorPool` to properly downscale upon completion of the execution (#55565)
In 2.48, a change introduced debouncing that disallows downscaling of the
Actor Pool for 30s after the latest upscaling, to give the Actor Pool
operator enough time to start utilizing the upscaled actors.
However, that affected the ability of the Actor Pool to downscale upon
completion of the execution: when an operator completes execution, it
should start downscaling immediately. This change addresses that.
---------
Signed-off-by: Alexey Kudinkin <ak@anyscale.com>
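The gating logic described above can be sketched as follows (hypothetical names; the actual `AutoscalingActorPool` implementation is more involved):

```python
# Cool-down window after an upscale during which downscaling is blocked,
# matching the 30s debounce described in the commit message.
DEBOUNCE_S = 30.0


def can_downscale(now: float, last_upscale_at: float,
                  execution_completed: bool) -> bool:
    if execution_completed:
        # The fix: a completed operator may downscale immediately,
        # regardless of how recently it upscaled.
        return True
    return now - last_upscale_at >= DEBOUNCE_S
```

Before the fix, the `execution_completed` short-circuit was effectively missing, so finished operators held on to actors for the remainder of the debounce window.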
commit 64feab4b01583023cec89bc2d199b0ff0de4c3cd
Author: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com>
Date: Wed Aug 13 18:09:39 2025 +0000
[Train] Implement a JaxTrainer to support SPMD with TPUs (#55207)
This PR builds off previous efforts to add a `JaxTrainer` and the
[ray-tpu package](https://github.com/AI-Hypercomputer/ray-tpu/tree/main)
to implement support for a `JaxTrainer` in Ray Train that supports SPMD
workloads with TPUs. Support for more types of workloads (i.e. better
support for CPU and GPU) can be added incrementally.
In order to support SPMD locality-aware scheduling at the TPU slice
level, we alter the `WorkerGroup` construction in V2 Ray Train to
optionally accept multiple placement group specs to apply to a range of
workers. This enables us to reserve the "TPU head" using a placement
group with label selectors, retrieve its unique `ray.io/tpu-slice-name`,
and then schedule the remaining workers on that slice in a separate
placement group.
---------
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Co-authored-by: Andrew Sy Kim <andrewsy@google.com>
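The worker-range-to-placement-group mapping described above can be sketched in isolation (a hypothetical helper; the real V2 `WorkerGroup` changes are more involved and schedule real Ray placement groups):

```python
def assign_workers_to_pgs(num_workers, pg_specs):
    """Map each worker rank to a placement-group spec.

    pg_specs: list of (spec_name, workers_in_spec) pairs, applied in order
    to consecutive rank ranges, mirroring the PR's idea of applying
    multiple placement group specs to ranges of workers.
    """
    assignment = {}
    rank = 0
    for name, count in pg_specs:
        for _ in range(count):
            assignment[rank] = name
            rank += 1
    assert rank == num_workers, "specs must cover all workers exactly"
    return assignment


# Rank 0 reserves the "TPU head" placement group; once its slice name is
# known, the remaining ranks are scheduled on that slice's placement group.
mapping = assign_workers_to_pgs(4, [("tpu-head-pg", 1), ("slice-pg", 3)])
```

The spec names here are illustrative only; in the PR the second group is constructed with a label selector on the `ray.io/tpu-slice-name` retrieved from the head worker.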
commit 6d318ce84ddeacf67dc0c66f6e2fb6f6a8fef2e4
Author: Rui Qiao <161574667+ruisearch42@users.noreply.github.com>
Date: Wed Aug 13 10:57:54 2025 -0700
[Serve.llm] Add missing data_parallel/__init__.py (#55573)
Signed-off-by: Rui Qiao <ruisearch42@gmail.com>
commit 3c1314afb82128f30e5a445462c7277717e62863
Author: William Lin <SolitaryThinker@users.noreply.github.com>
Date: Wed Aug 13 10:55:47 2025 -0700
[docs] Add documentation for using type hints in Ray Core (#55013)
---------
Signed-off-by: will.lin <will.lin@anyscale.com>
Signed-off-by: Richard Liaw <rliaw@berkeley.edu>
Co-authored-by: Richard Liaw <rliaw@berkeley.edu>
commit a24defd4c4773879a834762ba414d3c0cea9b1e9
Author: Lonnie Liu <95255098+aslonnie@users.noreply.github.com>
Date: Wed Aug 13 10:51:07 2025 -0700
[release test] remove release image build step from postmerge (#55564)
they should be always building from release test pipeline directly
we used to run release tests on postmerge; we are no longer doing that.
also add oss tag for those steps.
Signed-off-by: Lonnie Liu <lonnie@anyscale.com>
commit dda42b2d97768dbebdbaf766a7ed2e2e2372cc8b
Author: William Lin <SolitaryThinker@users.noreply.github.com>
Date: Wed Aug 13 08:36:19 2025 -0700
[core] Add return type to ActorClass.options (#55563)
Currently the following pattern throws many lint errors, as
`ActorDemoRay.options(name="demo_ray")` returns an instance of
`ActorOptionWrapper`, which confuses the IDE's static type checker:
```python
import ray
from ray import ObjectRef
from ray.actor import ActorProxy, ActorClass


class DemoRay:
    def __init__(self, init: int):
        self.init = init

    @ray.method
    def calculate(self, v1: int, v2: int) -> int:
        return self.init + v1 + v2


ActorDemoRay: ActorClass[DemoRay] = ray.remote(DemoRay)


def main():
    p: ActorProxy[DemoRay] = ActorDemoRay.options(name="demo_ray").remote(1)
    actor: ActorProxy[DemoRay] = ray.get_actor("demo_ray")
    a = actor.calculate.remote(1, 2)
    print(ray.get(a))
    return


if __name__ == "__main__":
    main()
```
This PR changes `ActorClass[T].options(...)` to return a new instance of
`ActorClass[T]` instead, allowing IDEs to correctly infer the type of
subsequent `.remote(...)` calls.
https://github.com/ray-project/ray/issues/54149
---------
Signed-off-by: will.lin <will.lin@anyscale.com>
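The typing fix can be illustrated with a stripped-down stand-in (simplified stubs, not Ray's actual classes): returning the same generic class from `options()` lets a static checker carry `T` through to the eventual `.remote()` call:

```python
from typing import Generic, TypeVar

T = TypeVar("T")


class ActorClass(Generic[T]):
    """Simplified stand-in for Ray's ActorClass, for illustration only."""

    def __init__(self, cls: type) -> None:
        self._cls = cls
        self._options: dict = {}

    def options(self, **opts) -> "ActorClass[T]":
        # Returning ActorClass[T] (rather than an untyped wrapper object)
        # lets a static type checker follow the generic parameter through
        # chained calls like Cls.options(...).remote(...).
        new: "ActorClass[T]" = ActorClass(self._cls)
        new._options = {**self._options, **opts}
        return new
```

Because each `options()` call returns a fresh `ActorClass[T]`, options also compose immutably across chained calls.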

Why are these changes needed?
This PR builds off previous efforts to add a `JaxTrainer` and the ray-tpu package to implement support for a `JaxTrainer` in Ray Train that supports SPMD workloads with TPUs. Support for more types of workloads (i.e. better support for CPU and GPU) can be added incrementally.
In order to support SPMD locality-aware scheduling at the TPU slice level, we alter the `WorkerGroup` construction in V2 Ray Train to optionally accept multiple placement group specs to apply to a range of workers. This enables us to reserve the "TPU head" using a placement group with label selectors, retrieve its unique `ray.io/tpu-slice-name`, and then schedule the remaining workers on that slice in a separate placement group.
TODO: I need to add good tests and my manual testing method with a real workload in the comments.
Related issue number
#55162