@morningman
Contributor

Backport of #56601

…d Optimize Query Retry During BE Shutdown (apache#56601)

Related PR: apache#23865

This PR includes the following main changes:

1. New BE Parameter: `grace_shutdown_post_delay_seconds`

With the BE graceful stop feature, the main process first waits for all
of its currently running tasks to complete, then waits for an additional
period so that queries still running on other nodes can also finish.
Because a BE node cannot observe the execution status of tasks on other
BE nodes, this value may need to be increased to allow a longer wait.
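The two-phase wait described above can be sketched roughly as follows. This is an illustrative Python model only, not the BE implementation: `graceful_stop`, `running_local_tasks`, and `poll_interval` are hypothetical names, and `post_delay_seconds` stands in for the value of `grace_shutdown_post_delay_seconds`.

```python
import time
from typing import Callable


def graceful_stop(
    running_local_tasks: Callable[[], int],
    post_delay_seconds: float,
    poll_interval: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Illustrative two-phase shutdown wait (not the actual BE code)."""
    # Phase 1: wait until this node has no running tasks of its own.
    while running_local_tasks() > 0:
        sleep(poll_interval)
    # Phase 2: tasks on other nodes are invisible from here, so wait a
    # fixed extra period before exiting.
    if post_delay_seconds > 0:
        sleep(post_delay_seconds)
```

Because the second phase is a fixed delay rather than a check, a value that is too small can still cut off queries running elsewhere, which is why the description suggests raising it when needed.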

2. Enhanced BE `api/health` Endpoint

* When the BE has not yet fully started or is shutting down, the
  endpoint returns:

  * Message: `"Server is not available"`
  * HTTP Code: `200`

* Under normal circumstances:

  * Message: `"OK"`
  * HTTP Code: `200`
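Since both cases return HTTP `200`, a client has to look at the message rather than the status code. A minimal sketch of such a check, assuming the message text appears somewhere in the response body; the helper names, the default port `8040`, and the exact URL layout are assumptions for illustration and should be adjusted to the actual deployment:

```python
import urllib.request

NOT_AVAILABLE = "Server is not available"


def is_be_available(body: str) -> bool:
    """Decide availability from the response body, not the status code."""
    return NOT_AVAILABLE not in body


def check_be_health(host: str, port: int = 8040, timeout: float = 3.0) -> bool:
    # Port and URL layout are assumptions, not taken from the PR text.
    url = f"http://{host}:{port}/api/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            body = resp.read().decode("utf-8", errors="replace")
    except OSError:
        # Connection refused, timeout, or HTTP error: treat as unavailable.
        return False
    return is_be_available(body)
```

A probe that only tests for a `200` status would treat a starting or stopping BE as healthy, so matching on the message is the safer choice here.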

3. FE Graceful Stop

When using `stop_fe.sh --grace`, the FE waits for currently running
queries to finish before exiting.

Note: currently, only query tasks are waited for; import and other types
of tasks are not yet included.

4. Query Retry in Cloud Mode

In cloud mode, when the error `"No backend available as scan node"`
occurs, the FE now retries the query internally to reassign it to other
available BE nodes.
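The retry behavior can be pictured with the following sketch. It is a simplified Python model, not the FE code: `run_with_retry`, `run_query`, and `max_retries` are hypothetical, and the assumption is that each call to `run_query` re-plans the query against the backends available at that moment.

```python
from typing import Callable

NO_BACKEND_ERROR = "No backend available as scan node"


def run_with_retry(run_query: Callable[[], str], max_retries: int = 3) -> str:
    """Retry only when the scan-node error is raised; re-raise anything else."""
    for attempt in range(max_retries + 1):
        try:
            return run_query()
        except RuntimeError as err:
            # Give up on unrelated errors or when retries are exhausted.
            if NO_BACKEND_ERROR not in str(err) or attempt == max_retries:
                raise
    raise ValueError("max_retries must be >= 0")
```

Limiting the retry to this one error keeps genuine query failures visible to the client instead of masking them behind repeated attempts.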
@morningman morningman requested a review from morrySnow as a code owner November 7, 2025 06:36
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

morrySnow
morrySnow previously approved these changes Nov 7, 2025
@morrySnow morrySnow changed the title from "branch-3.1: [opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown (#56601)" to "branch-3.1: [opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown #56601" Nov 7, 2025
@morningman
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 55.00% (11/20) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 45.86% (12944/28222) |
| Line Coverage | 36.70% (115789/315513) |
| Region Coverage | 34.21% (66023/192991) |
| Branch Coverage | 31.23% (34674/111012) |

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 63.64% (14/22) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 76.32% (21222/27805) |
| Line Coverage | 69.57% (219508/315531) |
| Region Coverage | 67.49% (130978/194070) |
| Branch Coverage | 61.11% (68225/111640) |

@morrySnow morrySnow merged commit 6e2eecd into apache:branch-3.1 Nov 8, 2025
20 of 22 checks passed