@morningman
Contributor

…d Optimize Query Retry During BE Shutdown (apache#56601)

### What problem does this PR solve?

Related PR: apache#23865

This PR includes the following main changes:

#### BE Graceful Shutdown Improvements

1. New BE Parameter: `grace_shutdown_post_delay_seconds`

With the BE graceful stop feature, the main process first waits for all
locally running tasks to complete. It then waits for an additional
period, controlled by this parameter, so that queries still running on
other nodes can also finish.
Because a BE node cannot detect the execution status of tasks on other BE
nodes, this value may need to be increased to allow a longer wait.
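
The two-phase wait described above can be sketched as follows. This is only an illustration in Python, not the BE's C++ implementation: `graceful_shutdown`, `running_task_count`, and `poll_interval` are hypothetical names, and only the idea of a post-drain delay (here `post_delay_seconds`) comes from this PR.

```python
import time
from typing import Callable, List


def graceful_shutdown(
    running_task_count: Callable[[], int],  # hypothetical: number of local tasks still running
    post_delay_seconds: float,              # plays the role of grace_shutdown_post_delay_seconds
    poll_interval: float = 0.1,             # hypothetical polling period
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Return the ordered phases the shutdown went through."""
    phases = []

    # Phase 1: wait until every locally running task has finished.
    while running_task_count() > 0:
        sleep(poll_interval)
    phases.append("local_tasks_drained")

    # Phase 2: the node cannot see task state on other nodes, so it
    # simply waits a fixed extra period before exiting.
    if post_delay_seconds > 0:
        sleep(post_delay_seconds)
    phases.append("post_delay_elapsed")

    return phases
```

Because phase 2 is a blind wait, a larger delay trades a slower shutdown for a lower chance of cutting off work that other nodes still depend on.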

2. Enhanced BE `api/health` Endpoint

   * When the BE has not yet fully started or is shutting down, the
     endpoint returns:

     * Message: `"Server is not available"`
     * HTTP Code: `200`

   * Under normal circumstances:

     * Message: `"OK"`
     * HTTP Code: `200`
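
Since both states are listed with HTTP code `200`, a caller probing this endpoint would need to look at the message rather than the status code alone. A minimal sketch of that check, assuming the caller has already extracted the message from the response (how it is encoded in the body is not stated here, and `be_is_available` is a hypothetical helper, not part of Doris):

```python
def be_is_available(http_code: int, message: str) -> bool:
    """Decide availability from the health response.

    Both the ready and the not-ready state are described as returning
    HTTP 200, so only the message distinguishes them.
    """
    return http_code == 200 and message.strip() == "OK"
```

For example, `be_is_available(200, "Server is not available")` evaluates to `False`, while `be_is_available(200, "OK")` evaluates to `True`.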

#### Added FE Graceful Shutdown Support

When using `stop_fe.sh --grace`, the FE will wait for currently running
queries to finish before exiting.

Note: currently, only query tasks are waited for; import and other types
of tasks are not yet included.

#### Query Retry Optimization During BE Shutdown

In cloud mode, when a query fails with the error
`"No backend available as scan node"`, the FE now retries it internally
so that it is reassigned to other available BE nodes.
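
The retry behavior can be pictured with the sketch below. It is an illustration in Python rather than the FE's Java code: `QueryError`, `run_with_retry`, and the `max_retries` limit are hypothetical, and only the error text is taken from this description.

```python
from typing import Callable

# Error text quoted in this PR description.
RETRYABLE_ERROR = "No backend available as scan node"


class QueryError(Exception):
    """Hypothetical exception type standing in for a failed query."""


def run_with_retry(execute: Callable[[], str], max_retries: int = 3) -> str:
    """Run `execute`, retrying only when no backend was available."""
    attempt = 0
    while True:
        try:
            # Each call is assumed to plan the query again, so it can
            # land on a different set of BE nodes.
            return execute()
        except QueryError as err:
            retryable = RETRYABLE_ERROR in str(err)
            if not retryable or attempt >= max_retries:
                raise  # other errors, or retries exhausted
            attempt += 1
```

Any other error is raised immediately, so only the shutdown-related failure is hidden from the client.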
@morningman morningman requested a review from yiguolei as a code owner November 29, 2025 01:58
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@morningman
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 55.00% (11/20) 🎉

Increment coverage report
Complete coverage report

| Category | Coverage |
| --- | --- |
| Function Coverage | 52.69% (18210/34560) |
| Line Coverage | 38.08% (165785/435357) |
| Region Coverage | 33.08% (128642/388857) |
| Branch Coverage | 33.93% (55457/163463) |
