-
Notifications
You must be signed in to change notification settings - Fork 3.6k
[opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown #56601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
TPC-DS: Total hot run time: 190772 ms |
ClickBench: Total hot run time: 31.27 s |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
34da79c to
3279622
Compare
|
run buildall |
TPC-DS: Total hot run time: 189841 ms |
ClickBench: Total hot run time: 30.44 s |
FE Regression Coverage ReportIncrement line coverage |
b9eff70 to
f6b418d
Compare
|
run buildall |
f6b418d to
1acd70c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This pull request implements graceful shutdown mechanisms for both Frontend (FE) and Backend (BE) servers, and refactors error handling for cloud retry logic. The key changes include adding server readiness checks, graceful shutdown hooks, and consolidating retry error handling into a centralized method.
- Adds graceful shutdown for FE and BE servers with wait logic for running queries/tasks
- Implements server readiness health checks that return appropriate status during startup/shutdown
- Refactors cloud retry error checking into a centralized
SystemInfoService.needRetryWithReplan()method - Moves backend shutdown state from
AtomicBooleanto serializedBackendStatus.isShutdownfield
Reviewed Changes
Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| SystemInfoService.java | Adds centralized retry error detection method and constant set |
| Backend.java | Refactors setters to return boolean indicating state change; moves isShutDown to BackendStatus |
| StmtExecutor.java | Updates retry logic to use centralized error detection method |
| OlapInsertExecutor.java | Updates retry logic to use centralized error detection method |
| AbstractInsertExecutor.java | Updates retry logic to use centralized error detection method |
| StreamLoadHandler.java | Changes backend filter from isAlive to isLoadAvailable |
| MTMVTask.java | Updates retry logic to use centralized error detection method |
| RestApiStatusCode.java | Adds SERVICE_UNAVAILABLE status code |
| HealthAction.java | Adds server readiness check to health endpoint |
| ResponseEntityBuilder.java | Adds serviceUnavailable response builder method |
| HttpServer.java | Implements graceful shutdown and disables automatic shutdown hook |
| FeConstants.java | Removes unused constants including CLOUD_RETRY_E230 |
| CloudReplica.java | Updates backend filtering to use isQueryAvailable |
| DorisFE.java | Implements graceful shutdown logic and server readiness tracking |
| doris_main.cpp | Sets server readiness flags during startup/shutdown |
| exec_env.h | Adds k_is_server_ready global flag |
| exec_env.cpp | Adds post-delay sleep after graceful shutdown |
| health_action.cpp | Updates health check to return status based on server readiness |
| config.h | Adds grace_shutdown_post_delay_seconds configuration |
| config.cpp | Defines default value for grace_shutdown_post_delay_seconds |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| this.backendStatus.lastStreamLoadTime = lastStreamLoadTime; | ||
| } | ||
|
|
||
| // ATTN: This method only the value of "isQueryDisabled", |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete sentence in comment. Should be 'This method only returns the value of "isQueryDisabled"'.
| // ATTN: This method only the value of "isQueryDisabled", | |
| // ATTN: This method only returns the value of "isQueryDisabled", |
| return false; | ||
| } | ||
|
|
||
| // ATTN: This method only the value of "isLoadDisabled", |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incomplete sentence in comment. Should be 'This method only returns the value of "isLoadDisabled"'.
| // ATTN: This method only the value of "isLoadDisabled", | |
| // ATTN: This method only returns the value of "isLoadDisabled", |
| @SerializedName("isActive") | ||
| public volatile boolean isActive = true; | ||
| @SerializedName("isShutdown") | ||
| public volatile boolean isShutdown = true; |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The default value of isShutdown should be false, not true. A newly created backend should not be marked as shutdown by default.
| public volatile boolean isShutdown = true; | |
| public volatile boolean isShutdown = false; |
| boolean isNeedRetry = false; | ||
| if (Config.isCloudMode()) { | ||
| // cloud mode retry | ||
| // cloud mode retryF |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Typo in comment: 'retryF' should be 'retry'.
| // cloud mode retryF | |
| // cloud mode retry |
| while (true) { | ||
| Thread.sleep(2000); | ||
| } | ||
| LOG.info("Doris FE main loop exited, shutting down gracefully..."); |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This log statement is unreachable because the while loop on line 264 is infinite. The code will never reach line 267.
| LOG.info("Doris FE main loop exited, shutting down gracefully..."); | ||
| } catch (Throwable e) { | ||
| // Some exception may thrown before LOG is inited. | ||
| // Some exception may throw before LOG is inited. |
Copilot
AI
Nov 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Grammatically incorrect: 'may throw' should be 'may be thrown'.
| // Some exception may throw before LOG is inited. | |
| // Some exception may be thrown before LOG is inited. |
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
5a5ae36 to
33e99f5
Compare
|
run buildall |
|
PR approved by at least one committer and no changes requested. |
deardeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
TPC-DS: Total hot run time: 189127 ms |
ClickBench: Total hot run time: 27.62 s |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
FE Regression Coverage ReportIncrement line coverage |
…d Optimize Query Retry During BE Shutdown (apache#56601) Related PR: apache#23865 This PR includes the following main changes: 1. New BE Parameter: `grace_shutdown_post_delay_seconds` When using the BE graceful stop feature, after the main process waits for all currently running tasks to complete, it will continue to wait for an additional period to ensure that queries still running on other nodes have also finished. Since a BE node cannot detect the execution status of tasks on other BE nodes, this threshold may need to be increased to allow a longer waiting time. 2. Enhanced BE `api/health` Endpoint * When the BE has not yet fully started or is in the process of shutting down, the endpoint will return: * Message: `"Server is not available"` * HTTP Code: `200` * Under normal circumstances: * Message: `"OK"` * HTTP Code: `200` When using `stop_fe.sh --grace`, the FE will wait for currently running queries to finish before exiting. Note, Currently, only query tasks are waited for; import and other types of tasks are not yet included. In cloud mode, when encountering the error `"No backend available as scan node"`, the FE will now internally retry the query to reassign it to other available BE nodes.
…d Optimize Query Retry During BE Shutdown (apache#56601) ### What problem does this PR solve? Related PR: apache#23865 This PR includes the following main changes: #### BE Graceful Shutdown Improvements 1. New BE Parameter: `grace_shutdown_post_delay_seconds` When using the BE graceful stop feature, after the main process waits for all currently running tasks to complete, it will continue to wait for an additional period to ensure that queries still running on other nodes have also finished. Since a BE node cannot detect the execution status of tasks on other BE nodes, this threshold may need to be increased to allow a longer waiting time. 2. Enhanced BE `api/health` Endpoint * When the BE has not yet fully started or is in the process of shutting down, the endpoint will return: * Message: `"Server is not available"` * HTTP Code: `200` * Under normal circumstances: * Message: `"OK"` * HTTP Code: `200` #### Added FE Graceful Shutdown Support When using `stop_fe.sh --grace`, the FE will wait for currently running queries to finish before exiting. Note, Currently, only query tasks are waited for; import and other types of tasks are not yet included. #### Query Retry Optimization During BE Shutdown In cloud mode, when encountering the error `"No backend available as scan node"`, the FE will now internally retry the query to reassign it to other available BE nodes.
…d Optimize Query Retry During BE Shutdown (apache#56601) ### What problem does this PR solve? Related PR: apache#23865 This PR includes the following main changes: #### BE Graceful Shutdown Improvements 1. New BE Parameter: `grace_shutdown_post_delay_seconds` When using the BE graceful stop feature, after the main process waits for all currently running tasks to complete, it will continue to wait for an additional period to ensure that queries still running on other nodes have also finished. Since a BE node cannot detect the execution status of tasks on other BE nodes, this threshold may need to be increased to allow a longer waiting time. 2. Enhanced BE `api/health` Endpoint * When the BE has not yet fully started or is in the process of shutting down, the endpoint will return: * Message: `"Server is not available"` * HTTP Code: `200` * Under normal circumstances: * Message: `"OK"` * HTTP Code: `200` #### Added FE Graceful Shutdown Support When using `stop_fe.sh --grace`, the FE will wait for currently running queries to finish before exiting. Note, Currently, only query tasks are waited for; import and other types of tasks are not yet included. #### Query Retry Optimization During BE Shutdown In cloud mode, when encountering the error `"No backend available as scan node"`, the FE will now internally retry the query to reassign it to other available BE nodes.
What problem does this PR solve?
Related PR: #23865
This PR includes the following main changes:
BE Graceful Shutdown Improvements
New BE Parameter:
grace_shutdown_post_delay_secondsWhen using the BE graceful stop feature, after the main process waits for all currently running tasks to complete, it will continue to wait for an additional period to ensure that queries still running on other nodes have also finished.
Since a BE node cannot detect the execution status of tasks on other BE nodes, this threshold may need to be increased to allow a longer waiting time.
Enhanced BE
api/healthEndpointWhen the BE has not yet fully started or is in the process of shutting down, the endpoint will return:
"Server is not available"200Under normal circumstances:
"OK"200Added FE Graceful Shutdown Support
When using
stop_fe.sh --grace, the FE will wait for currently running queries to finish before exiting.Note, Currently, only query tasks are waited for; import and other types of tasks are not yet included.
Query Retry Optimization During BE Shutdown
In cloud mode, when encountering the error
"No backend available as scan node",the FE will now internally retry the query to reassign it to other available BE nodes.
Release note
None
Check List (For Author)
Test
select sleep(30)sh bin/stop_fe.sh --graceselectfinishselect sleep(30)is executed, FE will also wait this query finishBehavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)