@andrewsykim
Member

Why are these changes needed?

Cherry-picks for v1.5.1

Related issue number

Checks

  • I've made sure the tests are passing.
  • Testing Strategy
    • Unit tests
    • Manual tests
    • This PR is not tested :(

400Ping and others added 25 commits October 29, 2025 18:51
ray-project#4141)

* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* [Fix] Fix e2e error

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* [Fix] fix according to rueian's comment

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* [Chore] fix ci error

Signed-off-by: 400Ping <fourhundredping@gmail.com>

* Update ray-operator/controllers/ray/raycluster_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com>
Signed-off-by: Ping <fourhundredping@gmail.com>

* Update ray-operator/controllers/ray/rayjob_controller.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com>
Signed-off-by: Ping <fourhundredping@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* Trigger CI

Signed-off-by: Future-Outlier <eric901201@gmail.com>

---------

Signed-off-by: 400Ping <fourhundredping@gmail.com>
Signed-off-by: Ping <fourhundredping@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
…i-slice (ray-project#4163)

* [Feature Enhancement] Set ordered replica index label to support multi-slice

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

* rename replica-id -> replica-name

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

* Separate replica index feature gate logic

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

* remove index arg in createWorkerPod

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>

---------

Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
…, CMD JSON args) (ray-project#4167)

* [ray-project#4166] improvement: Fix Dockerfile warnings (ENV format, CMD JSON args)

* extract the hostname from CMD

Signed-off-by: Neo Chien <6762509+cchung100m@users.noreply.github.com>

---------

Signed-off-by: Neo Chien <6762509+cchung100m@users.noreply.github.com>
Co-authored-by: cchung100m <cchung100m@users.noreply.github.com>
ray-project#4158)

* [Fix] Resolve int32 overflow by having the calculation in int64 and cap it if the count is over math.MaxInt32

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Test] Add unit tests for CalculateReadyReplicas

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Fix] Add a nosec comment to pass the Lint (pre-commit) test

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Refactor] Add CapInt64ToInt32 to replace #nosec directives

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Refactor] Rename function to SafeInt64ToInt32 and add underflow prevention (it also helps pass the lint test)

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Refactor] Remove the early return as SafeInt64ToInt32 handles the int32 overflow and underflow checking.

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

---------

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
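
For readers skimming this cherry-pick, the idea behind the commits above (do the arithmetic in a wide integer type, then cap the result to the int32 range) can be sketched roughly as follows. This is only an illustration in Python, not the Go code from the PR: `safe_int64_to_int32` and `capped_product` are hypothetical names that mirror the commit messages, and Python integers are arbitrary-precision, so the "wide" arithmetic comes for free here.

```python
INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def safe_int64_to_int32(value: int) -> int:
    """Saturate a wide integer into the int32 range instead of wrapping."""
    if value > INT32_MAX:
        return INT32_MAX  # cap on overflow
    if value < INT32_MIN:
        return INT32_MIN  # cap on underflow
    return value


def capped_product(a: int, b: int) -> int:
    """Multiply in wide precision, then clamp the result to int32.

    Hypothetical helper: it stands in for any count that is computed
    as a product of two int32 fields and stored back into an int32.
    """
    return safe_int64_to_int32(a * b)
```

With this shape the caller needs no separate early return for the overflow case, which appears to be what the last commit in the series relies on.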
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
…t#4181)

Signed-off-by: Future-Outlier <eric901201@gmail.com>
…y-project#4195)

* Make replicas configurable for kuberay-operator ray-project#4180

* Make replicas configurable for kuberay-operator ray-project#4180
* feat: check if raycluster status update in rayjob

* test: e2e test to check the rayjob raycluster status update
…4173)

Signed-off-by: alimaazamat <alima.azamat2003@gmail.com>
Signed-off-by: Spencer Peterson <spencerjp@google.com>
Fast follow to ray-project#4191

Signed-off-by: Spencer Peterson <spencerjp@google.com>
* Add support for Ray token auth

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

* add e2e test for Ray cluster auth

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

* address nits from Ruiean

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

* update RAY_auth_mode -> RAY_AUTH_MODE

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

* configure auth for Ray autoscaler

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

---------

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Bumps [js-yaml](https://github.com/nodeca/js-yaml) from 4.1.0 to 4.1.1.
- [Changelog](https://github.com/nodeca/js-yaml/blob/master/CHANGELOG.md)
- [Commits](nodeca/js-yaml@4.1.0...4.1.1)

---
updated-dependencies:
- dependency-name: js-yaml
  dependency-version: 4.1.1
  dependency-type: indirect
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
ray-project#4201)

* update minimum Ray version required for token authentication to 2.52.0

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

* update RayCluster auth e2e test to use Ray v2.52

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>

---------

Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
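
The commits above raise the minimum Ray version for token authentication to 2.52.0. A rough sketch of such a version gate is below. It is an assumption-laden illustration, not the operator's validation code: `parse_ray_version` and `supports_token_auth` are hypothetical names, and ignoring any suffix after the first three numeric components is a choice made for this sketch.

```python
import re
from typing import Tuple

# Minimum version taken from the commit message above.
MIN_RAY_VERSION_FOR_TOKEN_AUTH = (2, 52, 0)


def parse_ray_version(version: str) -> Tuple[int, int, int]:
    """Extract the leading MAJOR.MINOR.PATCH numbers from a version string."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized Ray version: {version!r}")
    major, minor, patch = (int(group) for group in match.groups())
    return (major, minor, patch)


def supports_token_auth(ray_version: str) -> bool:
    """Return True if the version meets the minimum for token auth."""
    # Tuples compare element by element, so (2, 9, 3) < (2, 52, 0).
    return parse_ray_version(ray_version) >= MIN_RAY_VERSION_FOR_TOKEN_AUTH
```

Comparing integer tuples rather than strings avoids the classic mistake where "2.9.3" sorts after "2.52.0" lexicographically.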
Signed-off-by: Andrew Sy Kim <andrewsy@google.com>
Signed-off-by: Future-Outlier <eric901201@gmail.com>
)

* dashboard client authentication support

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* support rayjob

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update to fix api server err

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* Rayjob sidecar mode auth token mode support

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* RayJob support k8s job mode

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* Address Andrew's advice

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* add todo x-ray-authorization comments

Signed-off-by: Future-Outlier <eric901201@gmail.com>

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
… verbs (ray-project#4202)

* Add authentication secret reconciliation support

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* update

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* fix flaky test

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* remove test fix

Signed-off-by: Rueian <rueiancsie@gmail.com>

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
justinyeh1995 and others added 9 commits November 19, 2025 20:50
…ay-project#4144)

* [Docs] Add the draft description about feature intro, configurations, and usecases

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Fix] Update the retry walk-through

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Doc] rewrite the first 2 sections

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Doc] Revise documentation wording and add Observing Retry Behavior section

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Fix] fix linting issue by running pre-commit run before committing

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Fix] fix linting errors in the Markdown linting

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Fix] Clean up the math equation

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* Update the math formula of Backoff calculation.

Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com>
Signed-off-by: JustinYeh <justinyeh1995@gmail.com>

* [Fix] Explicitly mentioned exponential backoff and removed the customization parts

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* [Docs] Clarify naming by replacing “APIServer” with “KubeRay APIServer”

Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com>
Signed-off-by: JustinYeh <justinyeh1995@gmail.com>

* [Docs] Rename retry-configuration.md to retry-behavior.md for accuracy

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

* Update Title to KubeRay APIServer Retry Behavior

Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com>
Signed-off-by: JustinYeh <justinyeh1995@gmail.com>

* [Docs] Add a note about the limitation of retry configuration

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>

---------

Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
Signed-off-by: JustinYeh <justinyeh1995@gmail.com>
Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com>
Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com>
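
The docs commits above describe the KubeRay APIServer retry behavior as exponential backoff. As a generic sketch of that technique (not the APIServer's code), each retry waits the initial delay multiplied by a growth factor per attempt, capped at a maximum. All parameter names and values below are hypothetical; the real defaults live in the retry-behavior document that the commits add.

```python
from typing import List


def backoff_delays(
    max_retries: int,
    initial_backoff: float,
    backoff_factor: float,
    max_backoff: float,
) -> List[float]:
    """Return the wait time (in seconds) before each retry attempt.

    delay(n) = min(initial_backoff * backoff_factor ** n, max_backoff)
    for n = 0 .. max_retries - 1.
    """
    return [
        min(initial_backoff * backoff_factor**attempt, max_backoff)
        for attempt in range(max_retries)
    ]


# Illustrative values only: four retries, starting at 0.5 s,
# doubling each time, never waiting longer than 3 s.
delays = backoff_delays(4, 0.5, 2.0, 3.0)
```

The cap matters in practice: without it, a long retry sequence would have unbounded waits between attempts.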
…via proxy (ray-project#4213)

* Support X-Ray-Authorization fallback header for accepting auth token in dashboard

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* remove todo comment

Signed-off-by: Future-Outlier <eric901201@gmail.com>

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
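
A rough sketch of what a "fallback header" lookup could look like, for reviewers who want the idea without opening the diff. Only the `X-Ray-Authorization` header name comes from the commit message above. Treating `Authorization` as the primary header, the `Bearer <token>` value format, and the function name `extract_auth_token` are all assumptions made for this sketch, not the behavior of Ray or KubeRay as confirmed by this PR.

```python
from typing import Mapping, Optional

# Assumed lookup order: standard header first, fallback header second.
_TOKEN_HEADERS = ("authorization", "x-ray-authorization")
_BEARER_PREFIX = "bearer "


def extract_auth_token(headers: Mapping[str, str]) -> Optional[str]:
    """Return the bearer token from the first usable auth header, if any."""
    # HTTP header names are case-insensitive, so normalize the keys.
    lowered = {name.lower(): value for name, value in headers.items()}
    for name in _TOKEN_HEADERS:
        value = lowered.get(name, "").strip()
        if value.lower().startswith(_BEARER_PREFIX):
            token = value[len(_BEARER_PREFIX):].strip()
            if token:
                return token
    return None
```

If the primary header is missing or malformed, the loop simply moves on to the fallback header, which is the property the commit title describes.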
Signed-off-by: fscnick <fscnick.dev@gmail.com>
…ect#4196)

* [RayCluster] Status includes head container status message

Signed-off-by: Spencer Peterson <spencerjp@google.com>

* lint

Signed-off-by: Spencer Peterson <spencerjp@google.com>

* [RayCluster] Containers not ready status reflects structured reason

Signed-off-by: Spencer Peterson <spencerjp@google.com>

* nit

Signed-off-by: Spencer Peterson <spencerjp@google.com>

---------

Signed-off-by: Spencer Peterson <spencerjp@google.com>
Signed-off-by: Ryan O'Leary <ryanaoleary@google.com>
…ter (ray-project#4215)

* [RayJob] light weight job submitter auth token support

Signed-off-by: Future-Outlier <eric901201@gmail.com>

* X-Ray-Authorization

Signed-off-by: Rueian <rueiancsie@gmail.com>

---------

Signed-off-by: Future-Outlier <eric901201@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Rueian <rueiancsie@gmail.com>
* feat: kubectl ray get token command

Signed-off-by: Rueian <rueiancsie@gmail.com>

* Update kubectl-plugin/pkg/cmd/get/get_token_test.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>

* Update kubectl-plugin/pkg/cmd/get/get_token.go

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>

* make sure the raycluster exists before getting the secret

Signed-off-by: Rueian <rueiancsie@gmail.com>

* better ux

Signed-off-by: Rueian <rueiancsie@gmail.com>

* Update kubectl-plugin/pkg/cmd/get/get_token.go

Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com>
Signed-off-by: Rueian <rueiancsie@gmail.com>

---------

Signed-off-by: Rueian <rueiancsie@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com>
Member

@Future-Outlier Future-Outlier left a comment

My plan for testing this branch:
  • all combinations from #4203
  • with and without the Kubernetes proxy in KubeRay

@Future-Outlier
Member

Future-Outlier commented Nov 21, 2025

My test when using the Kubernetes proxy in KubeRay

args: "-leader-election-namespace default -use-kubernetes-proxy"
branch: this one
image: rayproject/ray:2.52.0.9527a5-extra-py310-cpu

HTTP mode, k8s job mode, cluster selector, and sidecar mode
(screenshot)

light weight job submitter (k8s mode)
(screenshot)

rayservice
(2 screenshots)

My test when not using the Kubernetes proxy in KubeRay

(6 screenshots)

my example

apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-sample-http-mode-v6
spec:
  # submissionMode specifies how RayJob submits the Ray job to the RayCluster.
  # The default value is "K8sJobMode", meaning RayJob will submit the Ray job via a submitter Kubernetes Job.
  # The alternative value is "HTTPMode", indicating that KubeRay will submit the Ray job by sending an HTTP request to the RayCluster.
  submissionMode: "HTTPMode"
  entrypoint: python /home/ray/samples/sample_code.py
  # shutdownAfterJobFinishes specifies whether the RayCluster should be deleted after the RayJob finishes. Default is false.
  # shutdownAfterJobFinishes: false

  # ttlSecondsAfterFinished specifies the number of seconds after which the RayCluster will be deleted after the RayJob finishes.
  # ttlSecondsAfterFinished: 10

  # activeDeadlineSeconds is the duration in seconds that the RayJob may be active before
  # KubeRay actively tries to terminate the RayJob; value must be positive integer.
  # activeDeadlineSeconds: 120

  # RuntimeEnvYAML represents the runtime environment configuration provided as a multi-line YAML string.
  # See https://docs.ray.io/en/latest/ray-core/handling-dependencies.html for details.
  # (New in KubeRay version 1.0.)
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
      - pendulum==2.1.2
    env_vars:
      counter_name: "test_counter"

  # Suspend specifies whether the RayJob controller should create a RayCluster instance.
  # If a job is applied with the suspend field set to true, the RayCluster will not be created and we will wait for the transition to false.
  # If the RayCluster is already created, it will be deleted. In the case of transition to false, a new RayCluster will be created.
  # suspend: false

  # rayClusterSpec specifies the RayCluster instance to be created by the RayJob controller.
  rayClusterSpec:
    rayVersion: "2.52.0" # should match the Ray version in the image of the containers
    authOptions:
      mode: "token"
    # Ray head pod template
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            # image: rayproject/ray:nightly-py311-cpu
            image: rayproject/ray:2.52.0.9527a5-extra-py310-cpu
            ports:
            - containerPort: 6379
              name: gcs-server
            - containerPort: 8265 # Ray dashboard
              name: dashboard
            - containerPort: 10001
              name: client
            resources:
              requests:
                cpu: "2"
            volumeMounts:
            - mountPath: /home/ray/samples
              name: code-sample
          volumes:
          # You set volumes at the Pod level, then mount them into containers inside that Pod
          - name: code-sample
            configMap:
              # Provide the name of the ConfigMap you want to mount.
              name: ray-job-code-sample
              # An array of keys from the ConfigMap to create as files
              items:
              - key: sample_code.py
                path: sample_code.py
    workerGroupSpecs:
    # the number of Pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name; here it is small-group, but it can also describe the group's function
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            # image: rayproject/ray:nightly-py311-cpu
            image: rayproject/ray:2.52.0.9527a5-extra-py310-cpu
            resources:
              requests:
                cpu: "2"

  # SubmitterPodTemplate is the template for the pod that will run the `ray job submit` command against the RayCluster.
  # If SubmitterPodTemplate is specified, the first container is assumed to be the submitter container.
  # submitterPodTemplate:
  #   spec:
  #     restartPolicy: Never
  #     containers:
  #     - name: my-custom-rayjob-submitter-pod
  #       image: rayproject/ray:2.46.0
  #       # If Command is not specified, the correct command will be supplied at runtime using the RayJob spec `entrypoint` field.
  #       # Specifying Command is not recommended.
  #       # command: ["sh", "-c", "ray job submit --address=http://$RAY_DASHBOARD_ADDRESS --submission-id=$RAY_JOB_SUBMISSION_ID -- echo hello world"]


######################Ray code sample#################################
# this sample is from https://docs.ray.io/en/latest/cluster/job-submission.html#quick-start-example
# it is mounted into the container and executed to show the Ray job at work
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: ray-job-code-sample
data:
  sample_code.py: |
    import ray
    import os
    import requests

    ray.init()

    @ray.remote
    class Counter:
        def __init__(self):
            # Used to verify runtimeEnv
            self.name = os.getenv("counter_name")
            assert self.name == "test_counter"
            self.counter = 0

        def inc(self):
            self.counter += 1

        def get_counter(self):
            return "{} got {}".format(self.name, self.counter)

    counter = Counter.remote()

    for _ in range(5):
        ray.get(counter.inc.remote())
        print(ray.get(counter.get_counter.remote()))

    # Verify that the correct runtime env was used for the job.
    assert requests.__version__ == "2.26.0"
---
apiVersion: ray.io/v1
kind: RayJob
metadata:
  name: rayjob-use-existing-raycluster-auth-token
spec:
  clusterSelector:
    ray.io/cluster: rayjob-sample-spn4v
  entrypoint: python -c "import ray; ray.init(); print(ray.cluster_resources())"
  runtimeEnvYAML: |
    pip:
      - requests==2.26.0
---
# Make sure to increase resource requests and limits before using this example in production.
# For examples with more realistic resource configuration, see
# ray-cluster.complete.large.yaml and
# ray-cluster.autoscaler.large.yaml.
apiVersion: ray.io/v1
kind: RayService
metadata:
  name: ray-service-auth-token-3
spec:
  # serveConfigV2 takes a yaml multi-line scalar, which should be a Ray Serve multi-application config. See https://docs.ray.io/en/latest/serve/multi-app.html.
  serveConfigV2: |
    applications:
      - name: fruit_app
        import_path: fruit.deployment_graph
        route_prefix: /fruit
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: MangoStand
            num_replicas: 2
            max_replicas_per_node: 1
            user_config:
              price: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: OrangeStand
            num_replicas: 1
            user_config:
              price: 2
            ray_actor_options:
              num_cpus: 0.1
          - name: PearStand
            num_replicas: 1
            user_config:
              price: 1
            ray_actor_options:
              num_cpus: 0.1
          - name: FruitMarket
            num_replicas: 1
            ray_actor_options:
              num_cpus: 0.1
      - name: math_app
        import_path: conditional_dag.serve_dag
        route_prefix: /calc
        runtime_env:
          working_dir: "https://github.com/ray-project/test_dag/archive/78b4a5da38796123d9f9ffff59bab2792a043e95.zip"
        deployments:
          - name: Adder
            num_replicas: 1
            user_config:
              increment: 3
            ray_actor_options:
              num_cpus: 0.1
          - name: Multiplier
            num_replicas: 1
            user_config:
              factor: 5
            ray_actor_options:
              num_cpus: 0.1
          - name: Router
            num_replicas: 1
  rayClusterConfig:
    rayVersion: '2.52.0' # should match the Ray version in the image of the containers
    enableInTreeAutoscaling: true
    authOptions:
      mode: token
    ######################headGroupSpecs#################################
    # Ray head pod template.
    headGroupSpec:
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-head
            image: rayproject/ray:2.52.0.9527a5-py312-cpu
            resources:
              requests:
                cpu: 3
                memory: 4Gi      # Increased from 2Gi
    workerGroupSpecs:
    # the number of Pod replicas in this worker group
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name; here it is small-group, but it can also describe the group's function
      groupName: small-group
      # The `rayStartParams` are used to configure the `ray start` command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of `rayStartParams` in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in `rayStartParams`.
      rayStartParams: {}
      #pod template
      template:
        spec:
          containers:
          - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name' or '123-abc')
            image: rayproject/ray:2.52.0.9527a5-py312-cpu
            resources:
              requests:
                cpu: 3
                memory: 6Gi      # Increased from 2Gi

Collaborator

@rueian rueian left a comment

LGTM

@andrewsykim andrewsykim merged commit f68857e into ray-project:release-1.5 Nov 21, 2025
27 checks passed