[APIServer][Docs] Add user guide for retry behavior & configuration #4144
… and usecases Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
…docs/3883-add-apiserver-rety-to-doc
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
…docs/3883-add-apiserver-rety-to-doc
…docs/3883-add-apiserver-rety-to-doc
…docs/3883-add-apiserver-rety-to-doc
…ection Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
cc @machichima @dentiny - Would appreciate your reviews. Thank you!
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
…docs/3883-add-apiserver-rety-to-doc
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
cc @CheyuWu
```go
const (
	HTTPClientDefaultMaxRetry       = 5 // Increase retries from 3 to 5
	HTTPClientDefaultBackoffFactor  = float64(2)
	HTTPClientDefaultInitBackoff    = 2 * time.Second // Longer backoff makes timing visible
	HTTPClientDefaultMaxBackoff     = 20 * time.Second
	HTTPClientDefaultOverallTimeout = 120 * time.Second // Longer timeout to allow more retries
)
```
It seems we currently have no way to configure this without modifying the code. In that case, could we omit the configuration part and just document the default behavior?
cc @Future-Outlier @rueian for some advice on this
After discussing offline, we can just document the default behavior here.
Sounds good. I will remove the customization part.
Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com>
## Default Retry Behavior

The APIServer automatically retries for these HTTP status codes:
Maybe we can explicitly mention that we use exponential backoff when retrying on these transient errors.
Sounds good. I will add this part into the paragraph.
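For readers following this thread, here is a minimal Go sketch of the exponential backoff being discussed. It uses the default values quoted in the pull request overview later in this conversation (3 retries, 500ms initial backoff, 2.0 backoff factor, 10s max backoff). The `retryConfig` type, the `backoffDelay` helper, and the formula `min(init * factor^attempt, max)` are assumptions made for illustration, not code taken from the KubeRay source.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// retryConfig is a hypothetical container for the retry settings;
// the field names are illustrative only.
type retryConfig struct {
	MaxRetry      int
	InitBackoff   time.Duration
	BackoffFactor float64
	MaxBackoff    time.Duration
}

// defaultRetryConfig returns the default values quoted in this thread.
func defaultRetryConfig() retryConfig {
	return retryConfig{
		MaxRetry:      3,
		InitBackoff:   500 * time.Millisecond,
		BackoffFactor: 2.0,
		MaxBackoff:    10 * time.Second,
	}
}

// backoffDelay returns the wait before retry number attempt (0-based),
// assuming delay = min(InitBackoff * BackoffFactor^attempt, MaxBackoff).
func backoffDelay(attempt int, cfg retryConfig) time.Duration {
	d := float64(cfg.InitBackoff) * math.Pow(cfg.BackoffFactor, float64(attempt))
	if d > float64(cfg.MaxBackoff) {
		return cfg.MaxBackoff
	}
	return time.Duration(d)
}

func main() {
	cfg := defaultRetryConfig()
	for attempt := 0; attempt < cfg.MaxRetry; attempt++ {
		fmt.Printf("retry %d: wait %v\n", attempt+1, backoffDelay(attempt, cfg))
	}
}
```

Under these assumptions the three waits are 500ms, 1s, and 2s, so the 10s cap is never reached with the default retry count.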
…ization parts Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
554a988 to 7640567
kenchung285 left a comment
Since the document now only describes the retry behavior, without the configuration part, we should rename the file.
Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com>
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
Pull Request Overview
This PR adds documentation for the automatic retry behavior in the KubeRay APIServer V2, which was introduced in previous PRs (#3551 and #3946). The documentation describes the default retry mechanism, including which HTTP status codes trigger retries and the exponential backoff configuration.
Key Changes:
- Added comprehensive documentation of the APIServer's automatic retry behavior for transient failures
- Documented the exponential backoff algorithm with default configuration values (3 retries, 500ms initial backoff, 2.0 backoff factor, 10s max backoff, 30s overall timeout)
- Listed the specific HTTP status codes (408, 429, 500, 502, 503, 504) that trigger automatic retries
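The status-code rule in the last bullet can be sketched in Go as a simple lookup. The six codes come from the overview above; the `retryableStatus` map and the `isRetryable` helper are hypothetical names used only for illustration and may not match how the KubeRay APIServer implements the check.

```go
package main

import (
	"fmt"
	"net/http"
)

// retryableStatus holds the status codes listed in the overview.
// The variable name is hypothetical.
var retryableStatus = map[int]bool{
	http.StatusRequestTimeout:      true, // 408
	http.StatusTooManyRequests:     true, // 429
	http.StatusInternalServerError: true, // 500
	http.StatusBadGateway:          true, // 502
	http.StatusServiceUnavailable:  true, // 503
	http.StatusGatewayTimeout:      true, // 504
}

// isRetryable reports whether a response with the given status code
// would be retried under the documented rule.
func isRetryable(code int) bool {
	return retryableStatus[code]
}

func main() {
	for _, code := range []int{200, 404, 429, 503} {
		fmt.Printf("%d retryable: %t\n", code, isRetryable(code))
	}
}
```

A map lookup keeps the rule easy to audit against the documented list; codes outside the list, such as 404, fall through to `false` and are returned to the caller without a retry.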
apiserversdk/docs/retry-behavior.md
Outdated
@@ -0,0 +1,31 @@
# APIServer Retry Behavior

By default, the KubeRay APIServer automatically retries failed requests to the Kubernetes API when transient errors occur.
Copilot AI, Nov 16, 2025
[nitpick] The phrase "By default" suggests that the retry behavior can be configured or disabled, but based on the code in proxy.go, the retry configuration is hardcoded and cannot be customized by users. Consider either:
- Removing "By default" and rephrasing to: "The KubeRay APIServer automatically retries failed requests..."
- Adding a note that this behavior is currently not user-configurable
This would set accurate expectations for users reading the documentation.
@copilot open a new pull request to apply changes based on this feedback
Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com>
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com>
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
Pull Request Overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.
kenchung285 left a comment
LGTM, thanks!
…docs/3883-add-apiserver-rety-to-doc
* [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted (#4141) * [Bug] Sidecar mode shouldn't restart head pod when head pod is deleted Signed-off-by: 400Ping <fourhundredping@gmail.com> * [Fix] Fix e2e error Signed-off-by: 400Ping <fourhundredping@gmail.com> * [Fix] fix according to rueian's comment Signed-off-by: 400Ping <fourhundredping@gmail.com> * [Chore] fix ci error Signed-off-by: 400Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/raycluster_controller.go Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * Update ray-operator/controllers/ray/rayjob_controller.go Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * Trigger CI Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> * fix: dashboard build for kuberay 1.5.0 (#4161) Signed-off-by: Future-Outlier <eric901201@gmail.com> * [Feature Enhancement] Set ordered replica index label to support multi-slice (#4163) * [Feature Enhancement] Set ordered replica index label to support multi-slice Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * rename replica-id -> replica-name Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * Separate replica index feature gate logic Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * remove index arg in createWorkerPod Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> --------- Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * update stale feature gate comments (#4174) Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * [RayCluster] Add more 
context why we don't recreate head Pod for RayJob (#4175) Signed-off-by: Kai-Hsun Chen <khchen@x.ai> * feature: Remove empty resource list initialization. (#4168) Fixes #4142. * [Dockerfile] [KubeRay Dashboard]: Fix Dockerfile warnings (ENV format, CMD JSON args) (#4167) * [#4166] improvement: Fix Dockerfile warnings (ENV format, CMD JSON args) * extract the hostname from CMD Signed-off-by: Neo Chien <6762509+cchung100m@users.noreply.github.com> --------- Signed-off-by: Neo Chien <6762509+cchung100m@users.noreply.github.com> Co-authored-by: cchung100m <cchung100m@users.noreply.github.com> * [Fix] Resolve int32 overflow by having the calculation in int64 and c… (#4158) * [Fix] Resolve int32 overflow by having the calculation in int64 and cap it if the count is over math.MaxInt32 Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Test] Add unit tests for CalculateReadyReplicas Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Add a nosec comment to pass the Lint (pre-commit) test Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Refactor] Add CapInt64ToInt32 to replace #nosec directives Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Refactor] Rename function to SafeInt64ToInt32 and add a underflowing prevention (it also help pass the lint test) Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Refactor] Remove the early return as SafeInt64ToInt32 handles the int32 overflow and underflow checking. 
Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> --------- Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * Add RayService incremental upgrade sample for guide (#4164) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * Edit RayCluster example config for label selectors (#4151) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * [RayJob] update light weight submitter image from quay.io (#4181) Signed-off-by: Future-Outlier <eric901201@gmail.com> * [flaky] RayJob fails when head Pod is deleted when job is running (#4182) Signed-off-by: Future-Outlier <eric901201@gmail.com> * [CI] Pin Docker api version to avoid API version mismatch (#4188) Signed-off-by: win5923 <ken89@kimo.com> * Make replicas configurable for kuberay-operator #4180 (#4195) * Make replicas configurable for kuberay-operator #4180 * Make replicas configurable for kuberay-operator #4180 * [Fix] rayjob update raycluster status (#4192) * feat: check if raycluster status update in rayjob * test: e2e test to check the rayjob raycluster status update * fix: dashboard http client tests discovered and passing (#4173) Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> * [RayJob] Lift cluster status while initializing (#4191) Signed-off-by: Spencer Peterson <spencerjp@google.com> * [RayJob] Remove updateJobStatus call (#4198) Fast follow to #4191 Signed-off-by: Spencer Peterson <spencerjp@google.com> * Add support for Ray token auth (#4179) * Add support for Ray token auth Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * add e2e test for Ray cluster auth Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * address nits from Ruiean Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * update RAY_auth_mode -> RAY_AUTH_MODE Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * configure auth for Ray autoscaler Signed-off-by: Andrew Sy Kim <andrewsy@google.com> --------- Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * Bump js-yaml from 4.1.0 to 4.1.1 in /dashboard (#4194) Bumps 
[js-yaml](https://github.com/nodeca/js-yaml) from 4.1.0 to 4.1.1. - [Changelog](https://github.com/nodeca/js-yaml/blob/master/CHANGELOG.md) - [Commits](nodeca/js-yaml@4.1.0...4.1.1) --- updated-dependencies: - dependency-name: js-yaml dependency-version: 4.1.1 dependency-type: indirect ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> * update minimum Ray version required for token authentication to 2.52.0 (#4201) * update minimum Ray version required for token authentication to 2.52.0 Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * update RayCluster auth e2e test to use Ray v2.52 Signed-off-by: Andrew Sy Kim <andrewsy@google.com> --------- Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * add samples for RayCluster token auth (#4200) Signed-off-by: Andrew Sy Kim <andrewsy@google.com> * update (#4208) Signed-off-by: Future-Outlier <eric901201@gmail.com> * [RayJob] Add token authentication support for All mode (#4210) * dashboard client authentication support Signed-off-by: Future-Outlier <eric901201@gmail.com> * support rayjob Signed-off-by: Future-Outlier <eric901201@gmail.com> * update to fix api serverr err Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * updarte Signed-off-by: Future-Outlier <eric901201@gmail.com> * Rayjob sidecar mode auth token mode support Signed-off-by: Future-Outlier <eric901201@gmail.com> * RayJob support k8s job mode Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * Address Andrew's advice Signed-off-by: Future-Outlier <eric901201@gmail.com> * add todo x-ray-authorization comments Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier 
<eric901201@gmail.com> * [RayCluster] Enable Secret informer watch/list and remove unused RBAC verbs (#4202) * Add authentication secret reconciliation support Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * update Signed-off-by: Future-Outlier <eric901201@gmail.com> * fix flaky test Signed-off-by: Future-Outlier <eric901201@gmail.com> * remove test fix Signed-off-by: Rueian <rueiancsie@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * [APIServer][Docs] Add user guide for retry behavior & configuration (#4144) * [Docs] Add the draft description about feature intro, configurations, and usecases Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Update the retry walk-through Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Doc] rewrite the first 2 sections Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Doc] Revise documentation wording and add Observing Retry Behavior section Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] fix linting issue by running pre-commit run berfore commiting Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] fix linting errors in the Markdown linting Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Fix] Clean up the math equation Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * Update the math formula of Backoff calculation. 
Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> * [Fix] Explicitly mentioned exponential backoff and removed the customization parts Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * [Docs] Clarify naming by replacing “APIServer” with “KubeRay APIServer” Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> * [Docs] Rename retry-configuration.md to retry-behavior.md for accuracy Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> * Update Title to KubeRay APIServer Retry Behavior Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> * [Docs] Add a note about the limitation of retry configuration Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> --------- Signed-off-by: justinyeh1995 <justinyeh1995@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> * Support X-Ray-Authorization fallback header for accepting auth token via proxy (#4213) * Support X-Ray-Authorization fallback header for accepting auth token in dashboard Signed-off-by: Future-Outlier <eric901201@gmail.com> * remove todo comment Signed-off-by: Future-Outlier <eric901201@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> * [RayCluster] make auth token secret name consistency (#4216) Signed-off-by: fscnick <fscnick.dev@gmail.com> * [RayCluster] Status includes head containter status message (#4196) * [RayCluster] Status includes head containter status message Signed-off-by: Spencer Peterson <spencerjp@google.com> * lint Signed-off-by: Spencer Peterson <spencerjp@google.com> * [RayCluster] Containers not ready status reflects structured reason Signed-off-by: Spencer Peterson <spencerjp@google.com> * nit Signed-off-by: Spencer Peterson 
<spencerjp@google.com> --------- Signed-off-by: Spencer Peterson <spencerjp@google.com> * Remove erroneous call in applyServeTargetCapacity (#4212) Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> * [RayJob] Add token authentication support for light weight job submitter (#4215) * [RayJob] light weight job submitter auth token support Signed-off-by: Future-Outlier <eric901201@gmail.com> * X-Ray-Authorization Signed-off-by: Rueian <rueiancsie@gmail.com> --------- Signed-off-by: Future-Outlier <eric901201@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Rueian <rueiancsie@gmail.com> * feat: kubectl ray get token command (#4218) * feat: kubectl ray get token command Signed-off-by: Rueian <rueiancsie@gmail.com> * Update kubectl-plugin/pkg/cmd/get/get_token_test.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Rueian <rueiancsie@gmail.com> * Update kubectl-plugin/pkg/cmd/get/get_token.go Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Signed-off-by: Rueian <rueiancsie@gmail.com> * make sure the raycluster exists before getting the secret Signed-off-by: Rueian <rueiancsie@gmail.com> * better ux Signed-off-by: Rueian <rueiancsie@gmail.com> * Update kubectl-plugin/pkg/cmd/get/get_token.go Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Signed-off-by: Rueian <rueiancsie@gmail.com> --------- Signed-off-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> --------- Signed-off-by: 400Ping <fourhundredping@gmail.com> Signed-off-by: Ping <fourhundredping@gmail.com> Signed-off-by: Future-Outlier <eric901201@gmail.com> Signed-off-by: Ryan O'Leary <ryanaoleary@google.com> Signed-off-by: Andrew Sy Kim <andrewsy@google.com> Signed-off-by: Kai-Hsun Chen <khchen@x.ai> Signed-off-by: Neo Chien <6762509+cchung100m@users.noreply.github.com> Signed-off-by: justinyeh1995 
<justinyeh1995@gmail.com> Signed-off-by: win5923 <ken89@kimo.com> Signed-off-by: alimaazamat <alima.azamat2003@gmail.com> Signed-off-by: Spencer Peterson <spencerjp@google.com> Signed-off-by: dependabot[bot] <support@github.com> Signed-off-by: Rueian <rueiancsie@gmail.com> Signed-off-by: JustinYeh <justinyeh1995@gmail.com> Signed-off-by: fscnick <fscnick.dev@gmail.com> Co-authored-by: Ping <fourhundredping@gmail.com> Co-authored-by: Han-Ju Chen (Future-Outlier) <eric901201@gmail.com> Co-authored-by: Ryan O'Leary <113500783+ryanaoleary@users.noreply.github.com> Co-authored-by: Kai-Hsun Chen <kaihsun@anyscale.com> Co-authored-by: Kavish <141061817+kash2104@users.noreply.github.com> Co-authored-by: Neo Chien <6762509+cchung100m@users.noreply.github.com> Co-authored-by: cchung100m <cchung100m@users.noreply.github.com> Co-authored-by: JustinYeh <justinyeh1995@gmail.com> Co-authored-by: Jun-Hao Wan <ken89@kimo.com> Co-authored-by: Divyam Raj <41264059+divyamraj18@users.noreply.github.com> Co-authored-by: Nary Yeh <60069744+machichima@users.noreply.github.com> Co-authored-by: Alima Azamat <92766804+alimaazamat@users.noreply.github.com> Co-authored-by: Spencer Peterson <spencerjp@google.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Rueian <rueiancsie@gmail.com> Co-authored-by: Cheng-Yeh Chung <kenchung285@gmail.com> Co-authored-by: fscnick <6858627+fscnick@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Why are these changes needed?
This PR addresses the need for documentation related to the new automatic retry feature introduced to the APIServer SDK V2 client in PRs #3551 and #3946. Currently, there is no guide for users on how to configure this essential retry functionality.
Related issue number
Closes #3883
Checks