[opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown #56601

morningman · 2025-09-28T16:20:37Z

What problem does this PR solve?

Related PR: #23865

This PR includes the following main changes:

BE Graceful Shutdown Improvements

New BE Parameter: grace_shutdown_post_delay_seconds

When using the BE graceful stop feature, the main process first waits for all currently running tasks to complete, then waits for an additional period (grace_shutdown_post_delay_seconds) so that queries still running on other nodes can also finish.
Because a BE node cannot observe the execution status of tasks on other BE nodes, this value may need to be increased to allow a longer wait.
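
The two-phase wait described above can be sketched as follows. This is only an illustration under stated assumptions: the BE is written in C++, and the class name, method name, and parameters here are hypothetical, not Doris code. The sketch polls a local task counter and then sleeps for a fixed post delay.

```java
import java.util.function.IntSupplier;

// Hypothetical sketch; none of these names come from the Doris code base.
final class GracefulStopSketch {

    // Sleeps without propagating InterruptedException; returns false if interrupted.
    private static boolean sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
            return true;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /**
     * Phase 1: poll until the local task count reaches zero or maxWaitMillis is used up.
     * Phase 2: sleep postDelayMillis so that queries coordinated elsewhere can finish.
     * Returns the total number of milliseconds this method asked to sleep.
     */
    static long waitThenPostDelay(IntSupplier runningTasks, long pollMillis,
                                  long maxWaitMillis, long postDelayMillis) {
        long waited = 0;
        while (runningTasks.getAsInt() > 0 && waited < maxWaitMillis) {
            if (!sleepQuietly(pollMillis)) {
                break;
            }
            waited += pollMillis;
        }
        // The post delay is unconditional: the local node cannot see remote task state.
        sleepQuietly(postDelayMillis);
        return waited + postDelayMillis;
    }
}
```

The post delay is applied even when the local counter is already zero, which mirrors the reasoning in the description: local idleness says nothing about tasks on other nodes.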
Enhanced BE api/health Endpoint
- When the BE has not yet fully started or is in the process of shutting down, the endpoint will return:
  - Message: "Server is not available"
  - HTTP Code: 200
- Under normal circumstances:
  - Message: "OK"
  - HTTP Code: 200
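
Because the description lists HTTP 200 in both cases, a client polling this endpoint would need to look at the message rather than the status code alone. A minimal sketch of such a check, assuming the response body contains the message text verbatim (the exact body format is not stated here, and the class below is hypothetical):

```java
// Hypothetical client-side helper; not part of Doris.
final class BeHealthProbe {
    // Message text taken from the PR description.
    static final String NOT_AVAILABLE = "Server is not available";

    /** Returns true only when the endpoint answered 200 and reported "OK". */
    static boolean isReady(int httpCode, String body) {
        if (httpCode != 200 || body == null) {
            return false;
        }
        // Assumption: the body embeds the message as plain text somewhere.
        return body.contains("OK") && !body.contains(NOT_AVAILABLE);
    }
}
```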

Added FE Graceful Shutdown Support

When using stop_fe.sh --grace, the FE will wait for currently running queries to finish before exiting.

Note: Currently, only query tasks are waited for; import and other types of tasks are not yet included.
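
One way to picture the FE-side wait is a counter of in-flight queries that the shutdown path drains. The sketch below is an assumption-laden illustration, not the code in DorisFE.java: the class, its methods, and the timeout parameter are all hypothetical. Queries that register while the drain is in progress are also waited for, matching step 4 of the manual test.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of draining in-flight queries before exit.
final class QueryDrainSketch {
    private final AtomicInteger running = new AtomicInteger();

    void queryStarted() {
        running.incrementAndGet();
    }

    void queryFinished() {
        running.decrementAndGet();
    }

    /** Polls until no query is running; returns false on timeout or interrupt. */
    boolean awaitDrained(long pollMillis, long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (running.get() > 0) {
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```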

Query Retry Optimization During BE Shutdown

In cloud mode, when encountering the error "No backend available as scan node",
the FE will now internally retry the query to reassign it to other available BE nodes.
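
The retry decision can be sketched as a substring match against a set of retryable messages, followed by a bounded retry loop. This is a hypothetical illustration: only the error text is taken from the description above, while the class, the method names, and the attempt limit are assumptions and do not reproduce the FE implementation.

```java
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch; only the error text comes from the PR description.
final class ReplanRetrySketch {
    static final Set<String> RETRYABLE_MESSAGES =
            Set.of("No backend available as scan node");

    /** Returns true if the error message contains any retryable fragment. */
    static boolean needRetry(String errMsg) {
        if (errMsg == null) {
            return false;
        }
        for (String fragment : RETRYABLE_MESSAGES) {
            if (errMsg.contains(fragment)) {
                return true;
            }
        }
        return false;
    }

    /** Re-runs the attempt on retryable errors, up to maxAttempts times in total. */
    static <T> T runWithRetry(Supplier<T> attempt, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int i = 0; i < maxAttempts; i++) {
            try {
                // In a real planner, each attempt would pick backends again.
                return attempt.get();
            } catch (RuntimeException e) {
                if (!needRetry(e.getMessage())) {
                    throw e;
                }
                last = e;
            }
        }
        throw last;
    }
}
```

Non-retryable errors are rethrown immediately, so only the shutdown-related failure triggers another attempt.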

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
  1. Execute select sleep(30).
  2. Execute sh bin/stop_fe.sh --grace.
  3. The FE process exits after the select finishes.
  4. If another select sleep(30) is executed during shutdown, the FE also waits for that query to finish.
  5. Use JMeter to run concurrent queries and stop one BE; there should be no failures.
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Thearas · 2025-09-28T16:20:42Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

morningman · 2025-10-01T00:45:43Z

run buildall

doris-robot · 2025-10-01T01:27:53Z

TPC-DS: Total hot run time: 190772 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 9300273d01fbff391170d4ad52c89a86dc50a3c8, data reload: false

query1	1062	432	401	401
query2	6566	1681	1654	1654
query3	6763	227	223	223
query4	26005	23670	23326	23326
query5	5071	646	506	506
query6	350	259	246	246
query7	4655	493	305	305
query8	310	271	267	267
query9	8758	2593	2593	2593
query10	534	353	288	288
query11	15413	15099	14778	14778
query12	188	115	115	115
query13	1670	551	420	420
query14	11594	9460	9334	9334
query15	210	205	183	183
query16	7660	670	511	511
query17	1446	729	635	635
query18	2044	461	375	375
query19	238	219	201	201
query20	145	131	130	130
query21	268	134	123	123
query22	4732	4622	4516	4516
query23	34881	34308	34064	34064
query24	8688	2532	2554	2532
query25	600	524	467	467
query26	1259	292	168	168
query27	3259	530	374	374
query28	4406	2279	2209	2209
query29	800	613	504	504
query30	303	237	219	219
query31	960	904	770	770
query32	82	74	80	74
query33	585	439	348	348
query34	824	926	570	570
query35	843	859	807	807
query36	1005	1079	907	907
query37	119	106	89	89
query38	3495	3571	3504	3504
query39	1490	1410	1437	1410
query40	222	129	119	119
query41	60	60	64	60
query42	119	110	110	110
query43	497	502	470	470
query44	1359	845	835	835
query45	185	178	171	171
query46	842	1007	636	636
query47	1751	1802	1733	1733
query48	391	413	316	316
query49	749	516	411	411
query50	647	689	415	415
query51	3879	3905	3898	3898
query52	112	108	102	102
query53	230	267	198	198
query54	610	593	540	540
query55	94	84	90	84
query56	319	339	290	290
query57	1176	1211	1112	1112
query58	291	282	278	278
query59	2562	2644	2680	2644
query60	356	367	324	324
query61	153	154	158	154
query62	801	735	716	716
query63	228	226	209	209
query64	4419	1142	848	848
query65	4042	3988	4010	3988
query66	1088	432	342	342
query67	15684	15105	15062	15062
query68	5410	902	610	610
query69	551	335	295	295
query70	1427	1336	1311	1311
query71	481	335	323	323
query72	5921	4992	4872	4872
query73	536	577	361	361
query74	8874	8916	8698	8698
query75	3689	3333	2851	2851
query76	3196	1145	791	791
query77	792	421	343	343
query78	9436	9810	8949	8949
query79	2005	816	606	606
query80	617	561	503	503
query81	500	270	223	223
query82	432	164	131	131
query83	279	278	252	252
query84	331	103	91	91
query85	900	480	420	420
query86	335	323	308	308
query87	3771	3822	3677	3677
query88	3810	2297	2257	2257
query89	396	340	305	305
query90	2003	224	246	224
query91	170	168	129	129
query92	90	73	64	64
query93	1483	993	654	654
query94	696	444	335	335
query95	395	324	317	317
query96	489	586	296	296
query97	2975	2997	2853	2853
query98	253	220	209	209
query99	1416	1431	1303	1303
Total cold run time: 275806 ms
Total hot run time: 190772 ms

doris-robot · 2025-10-01T01:33:25Z

ClickBench: Total hot run time: 31.27 s

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 9300273d01fbff391170d4ad52c89a86dc50a3c8, data reload: false

query1	0.06	0.05	0.05
query2	0.09	0.06	0.06
query3	0.26	0.08	0.09
query4	1.61	0.12	0.12
query5	0.29	0.27	0.26
query6	1.21	0.67	0.65
query7	0.03	0.03	0.04
query8	0.06	0.05	0.05
query9	0.63	0.54	0.53
query10	0.60	0.57	0.57
query11	0.17	0.14	0.12
query12	0.16	0.12	0.13
query13	0.63	0.62	0.62
query14	1.02	1.02	1.04
query15	0.86	0.87	0.85
query16	0.40	0.39	0.40
query17	1.04	1.07	1.07
query18	0.23	0.20	0.20
query19	1.99	1.85	1.89
query20	0.02	0.02	0.02
query21	15.44	0.94	0.59
query22	0.76	1.25	0.68
query23	14.84	1.37	0.60
query24	7.05	1.40	1.49
query25	0.49	0.25	0.10
query26	0.61	0.16	0.14
query27	0.07	0.06	0.06
query28	9.71	1.34	0.93
query29	12.64	3.91	3.31
query30	0.31	0.16	0.14
query31	2.83	0.59	0.40
query32	3.26	0.56	0.49
query33	3.15	3.11	3.12
query34	16.10	5.46	4.84
query35	4.92	4.91	4.90
query36	0.70	0.51	0.52
query37	0.10	0.07	0.08
query38	0.08	0.05	0.05
query39	0.04	0.03	0.04
query40	0.19	0.17	0.16
query41	0.09	0.03	0.04
query42	0.05	0.04	0.03
query43	0.05	0.04	0.04
Total cold run time: 104.84 s
Total hot run time: 31.27 s

hello-stephen · 2025-10-01T02:02:28Z

FE UT Coverage Report

Increment line coverage 27.78% (5/18) 🎉
Increment coverage report
Complete coverage report

hello-stephen · 2025-10-01T03:18:24Z

FE Regression Coverage Report

Increment line coverage 77.78% (14/18) 🎉
Increment coverage report
Complete coverage report

morningman · 2025-10-07T00:55:14Z

run buildall

doris-robot · 2025-10-07T02:38:44Z

TPC-DS: Total hot run time: 189841 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 2e514e3001f513e3c1cb9175d11199038ea0d444, data reload: false

query1	1112	421	411	411
query2	6580	1722	1694	1694
query3	6751	226	225	225
query4	26240	23959	23216	23216
query5	4825	658	495	495
query6	330	243	214	214
query7	4646	511	304	304
query8	325	283	262	262
query9	8742	2589	2566	2566
query10	469	339	272	272
query11	15491	15087	14825	14825
query12	184	119	112	112
query13	1676	575	427	427
query14	11032	9445	9443	9443
query15	223	199	183	183
query16	7775	676	513	513
query17	1199	735	612	612
query18	2100	478	360	360
query19	225	223	190	190
query20	140	143	135	135
query21	227	145	116	116
query22	4788	4899	4539	4539
query23	34817	34163	33492	33492
query24	8449	2392	2428	2392
query25	576	531	446	446
query26	1241	270	160	160
query27	2766	497	359	359
query28	4371	2187	2130	2130
query29	770	616	476	476
query30	302	222	199	199
query31	930	828	740	740
query32	81	76	72	72
query33	581	375	326	326
query34	817	854	518	518
query35	830	852	765	765
query36	1001	1036	913	913
query37	125	126	87	87
query38	3580	3543	3532	3532
query39	1468	1434	1409	1409
query40	225	134	122	122
query41	75	65	65	65
query42	127	117	115	115
query43	499	503	475	475
query44	1395	836	815	815
query45	188	184	180	180
query46	873	997	652	652
query47	1768	1828	1732	1732
query48	404	440	331	331
query49	805	532	421	421
query50	666	685	416	416
query51	4032	3949	3892	3892
query52	118	118	103	103
query53	246	274	207	207
query54	625	602	547	547
query55	92	87	82	82
query56	344	337	325	325
query57	1175	1200	1150	1150
query58	326	279	282	279
query59	2567	2631	2540	2540
query60	352	347	337	337
query61	162	153	157	153
query62	805	731	653	653
query63	230	196	193	193
query64	4418	1254	853	853
query65	4082	4009	3994	3994
query66	1109	439	336	336
query67	15477	15325	15143	15143
query68	9381	949	592	592
query69	483	335	305	305
query70	1400	1346	1338	1338
query71	495	346	329	329
query72	5754	4957	4967	4957
query73	701	583	358	358
query74	8939	9139	8744	8744
query75	4585	3331	2847	2847
query76	4040	1228	735	735
query77	1001	415	325	325
query78	9587	9828	8911	8911
query79	3326	841	584	584
query80	701	587	532	532
query81	497	261	230	230
query82	280	160	131	131
query83	305	271	253	253
query84	300	121	100	100
query85	893	487	438	438
query86	348	329	321	321
query87	3813	3721	3631	3631
query88	2887	2244	2228	2228
query89	432	338	295	295
query90	2143	226	223	223
query91	159	169	138	138
query92	88	66	67	66
query93	2237	995	644	644
query94	695	428	360	360
query95	457	328	326	326
query96	489	585	276	276
query97	2957	2987	2874	2874
query98	241	218	213	213
query99	1471	1425	1296	1296
Total cold run time: 282055 ms
Total hot run time: 189841 ms

doris-robot · 2025-10-07T02:44:15Z

ClickBench: Total hot run time: 30.44 s

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 2e514e3001f513e3c1cb9175d11199038ea0d444, data reload: false

query1	0.05	0.05	0.06
query2	0.09	0.05	0.05
query3	0.25	0.09	0.08
query4	1.61	0.12	0.12
query5	0.29	0.26	0.25
query6	1.18	0.67	0.65
query7	0.03	0.03	0.02
query8	0.06	0.05	0.04
query9	0.63	0.55	0.52
query10	0.58	0.57	0.57
query11	0.17	0.12	0.11
query12	0.16	0.12	0.13
query13	0.65	0.62	0.62
query14	1.04	1.05	1.04
query15	0.90	0.87	0.88
query16	0.40	0.42	0.40
query17	1.05	1.08	1.09
query18	0.22	0.21	0.21
query19	1.99	1.88	1.83
query20	0.01	0.01	0.02
query21	15.42	0.96	0.58
query22	0.78	1.16	0.66
query23	14.99	1.40	0.63
query24	7.23	1.45	0.40
query25	0.31	0.24	0.11
query26	0.72	0.17	0.14
query27	0.07	0.05	0.05
query28	9.07	1.44	0.93
query29	12.71	4.02	3.30
query30	0.28	0.14	0.10
query31	2.83	0.60	0.40
query32	3.23	0.57	0.49
query33	3.11	3.09	3.10
query34	16.12	5.58	4.98
query35	5.06	5.06	5.02
query36	0.72	0.56	0.52
query37	0.11	0.08	0.08
query38	0.08	0.05	0.05
query39	0.04	0.03	0.04
query40	0.18	0.15	0.15
query41	0.09	0.04	0.03
query42	0.04	0.03	0.03
query43	0.04	0.04	0.03
Total cold run time: 104.59 s
Total hot run time: 30.44 s

hello-stephen · 2025-10-07T05:25:17Z

FE Regression Coverage Report

Increment line coverage 55.17% (16/29) 🎉
Increment coverage report
Complete coverage report

morningman · 2025-10-20T08:29:12Z

run buildall

Copilot

Pull Request Overview

This pull request implements graceful shutdown mechanisms for both Frontend (FE) and Backend (BE) servers, and refactors error handling for cloud retry logic. The key changes include adding server readiness checks, graceful shutdown hooks, and consolidating retry error handling into a centralized method.

- Adds graceful shutdown for FE and BE servers with wait logic for running queries/tasks
- Implements server readiness health checks that return appropriate status during startup/shutdown
- Refactors cloud retry error checking into a centralized SystemInfoService.needRetryWithReplan() method
- Moves backend shutdown state from AtomicBoolean to serialized BackendStatus.isShutdown field

Reviewed Changes

Copilot reviewed 20 out of 20 changed files in this pull request and generated 6 comments.

File	Description
SystemInfoService.java	Adds centralized retry error detection method and constant set
Backend.java	Refactors setters to return boolean indicating state change; moves isShutDown to BackendStatus
StmtExecutor.java	Updates retry logic to use centralized error detection method
OlapInsertExecutor.java	Updates retry logic to use centralized error detection method
AbstractInsertExecutor.java	Updates retry logic to use centralized error detection method
StreamLoadHandler.java	Changes backend filter from isAlive to isLoadAvailable
MTMVTask.java	Updates retry logic to use centralized error detection method
RestApiStatusCode.java	Adds SERVICE_UNAVAILABLE status code
HealthAction.java	Adds server readiness check to health endpoint
ResponseEntityBuilder.java	Adds serviceUnavailable response builder method
HttpServer.java	Implements graceful shutdown and disables automatic shutdown hook
FeConstants.java	Removes unused constants including CLOUD_RETRY_E230
CloudReplica.java	Updates backend filtering to use isQueryAvailable
DorisFE.java	Implements graceful shutdown logic and server readiness tracking
doris_main.cpp	Sets server readiness flags during startup/shutdown
exec_env.h	Adds k_is_server_ready global flag
exec_env.cpp	Adds post-delay sleep after graceful shutdown
health_action.cpp	Updates health check to return status based on server readiness
config.h	Adds grace_shutdown_post_delay_seconds configuration
config.cpp	Defines default value for grace_shutdown_post_delay_seconds

Copilot · 2025-11-05T04:03:10Z

fe/fe-core/src/main/java/org/apache/doris/system/Backend.java

        this.backendStatus.lastStreamLoadTime = lastStreamLoadTime;
    }

+    // ATTN: This method only the value of "isQueryDisabled",


Incomplete sentence in comment. Should be 'This method only returns the value of "isQueryDisabled"'.

Suggested change

-    // ATTN: This method only the value of "isQueryDisabled",
+    // ATTN: This method only returns the value of "isQueryDisabled",

Copilot · 2025-11-05T04:03:11Z

fe/fe-core/src/main/java/org/apache/doris/system/Backend.java

+        return false;
    }

+    // ATTN: This method only the value of "isLoadDisabled",


Incomplete sentence in comment. Should be 'This method only returns the value of "isLoadDisabled"'.

Suggested change

-    // ATTN: This method only the value of "isLoadDisabled",
+    // ATTN: This method only returns the value of "isLoadDisabled",

Copilot · 2025-11-05T04:03:11Z

fe/fe-core/src/main/java/org/apache/doris/system/Backend.java

        @SerializedName("isActive")
        public volatile boolean isActive = true;
+        @SerializedName("isShutdown")
+        public volatile boolean isShutdown = true;


The default value of isShutdown should be false, not true. A newly created backend should not be marked as shutdown by default.

Suggested change

-        public volatile boolean isShutdown = true;
+        public volatile boolean isShutdown = false;

Copilot · 2025-11-05T04:03:11Z

fe/fe-core/src/main/java/org/apache/doris/qe/StmtExecutor.java

                boolean isNeedRetry = false;
                if (Config.isCloudMode()) {
-                    // cloud mode retry
+                    // cloud mode retryF


Typo in comment: 'retryF' should be 'retry'.

Suggested change

-                    // cloud mode retryF
+                    // cloud mode retry

Copilot · 2025-11-05T04:03:12Z

fe/fe-core/src/main/java/org/apache/doris/DorisFE.java

            while (true) {
                Thread.sleep(2000);
            }
+            LOG.info("Doris FE main loop exited, shutting down gracefully...");


This log statement is unreachable because the while loop on line 264 is infinite. The code will never reach line 267.
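
A common way to make a statement after the main loop reachable is to loop on a condition that a shutdown path can flip. The sketch below illustrates that general pattern only; it is not the change made in this PR, and the class and method names are hypothetical.

```java
import java.util.function.BooleanSupplier;

// Hypothetical sketch of a main loop that can exit normally.
final class MainLoopSketch {

    /** Sleeps in a loop while keepRunning is true; returns the iteration count. */
    static int runUntilStopped(BooleanSupplier keepRunning, long sleepMillis) {
        int iterations = 0;
        while (keepRunning.getAsBoolean()) {
            iterations++;
            try {
                Thread.sleep(sleepMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        // Reachable, because the loop above can complete normally.
        System.out.println("main loop exited, shutting down gracefully...");
        return iterations;
    }
}
```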

Copilot · 2025-11-05T04:03:12Z

fe/fe-core/src/main/java/org/apache/doris/DorisFE.java

+            LOG.info("Doris FE main loop exited, shutting down gracefully...");
        } catch (Throwable e) {
-            // Some exception may thrown before LOG is inited.
+            // Some exception may throw before LOG is inited.


Grammatically incorrect: 'may throw' should be 'may be thrown'.

Suggested change

-            // Some exception may throw before LOG is inited.
+            // Some exception may be thrown before LOG is inited.

morningman · 2025-11-05T04:12:41Z

run buildall

github-actions · 2025-11-05T04:27:55Z

PR approved by at least one committer and no changes requested.

morningman · 2025-11-07T02:47:20Z

run buildall

github-actions · 2025-11-07T03:17:42Z

PR approved by at least one committer and no changes requested.

deardeng

LGTM

doris-robot · 2025-11-07T03:31:56Z

TPC-DS: Total hot run time: 189127 ms

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 33e99f524332d3449b56f4095135e982b27cb5c4, data reload: false

query1	1024	406	390	390
query2	6596	1705	1714	1705
query3	6751	223	218	218
query4	26618	23312	23056	23056
query5	4422	604	459	459
query6	343	230	231	230
query7	4644	498	298	298
query8	301	266	250	250
query9	8685	2582	2618	2582
query10	499	343	288	288
query11	16039	15137	14951	14951
query12	187	119	115	115
query13	1688	574	454	454
query14	11900	9341	9489	9341
query15	214	192	178	178
query16	7705	691	531	531
query17	1679	772	607	607
query18	2042	484	358	358
query19	227	225	194	194
query20	157	128	135	128
query21	239	144	121	121
query22	4713	4828	4546	4546
query23	34747	34309	33384	33384
query24	9187	2522	2577	2522
query25	626	524	494	494
query26	1279	287	173	173
query27	3092	511	364	364
query28	4537	2238	2222	2222
query29	829	665	522	522
query30	311	238	196	196
query31	1019	861	757	757
query32	88	81	72	72
query33	599	403	329	329
query34	822	888	546	546
query35	830	846	796	796
query36	979	1024	933	933
query37	129	112	93	93
query38	3671	3619	3528	3528
query39	1493	1418	1399	1399
query40	221	134	120	120
query41	65	61	63	61
query42	118	110	114	110
query43	480	496	470	470
query44	1210	778	748	748
query45	183	181	173	173
query46	881	994	634	634
query47	1749	1839	1724	1724
query48	412	430	321	321
query49	763	505	413	413
query50	642	686	416	416
query51	3893	3965	3825	3825
query52	116	108	102	102
query53	238	268	199	199
query54	317	305	287	287
query55	88	83	86	83
query56	320	336	316	316
query57	1193	1195	1108	1108
query58	297	276	274	274
query59	2559	2721	2475	2475
query60	354	364	337	337
query61	194	189	188	188
query62	794	729	703	703
query63	225	194	195	194
query64	4636	1248	852	852
query65	4064	3979	3944	3944
query66	1092	425	319	319
query67	15660	15330	15132	15132
query68	8552	875	597	597
query69	488	318	282	282
query70	1341	1369	1186	1186
query71	518	333	306	306
query72	5752	4910	4962	4910
query73	677	588	362	362
query74	9233	9129	8691	8691
query75	4036	3290	2824	2824
query76	3802	1149	755	755
query77	808	387	308	308
query78	9482	9564	8930	8930
query79	2528	855	602	602
query80	700	556	497	497
query81	499	263	228	228
query82	466	158	133	133
query83	295	267	245	245
query84	307	107	95	95
query85	910	485	450	450
query86	385	309	283	283
query87	3777	3832	3614	3614
query88	3577	2285	2271	2271
query89	384	320	285	285
query90	1959	210	218	210
query91	164	167	135	135
query92	84	69	63	63
query93	1965	979	655	655
query94	699	457	343	343
query95	406	317	320	317
query96	491	577	295	295
query97	2931	2985	2862	2862
query98	233	224	223	223
query99	1450	1401	1312	1312
Total cold run time: 282574 ms
Total hot run time: 189127 ms

doris-robot · 2025-11-07T03:37:02Z

ClickBench: Total hot run time: 27.62 s

machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 33e99f524332d3449b56f4095135e982b27cb5c4, data reload: false

query1	0.05	0.04	0.05
query2	0.08	0.04	0.04
query3	0.26	0.08	0.08
query4	1.61	0.12	0.12
query5	0.27	0.26	0.26
query6	1.18	0.64	0.64
query7	0.04	0.03	0.03
query8	0.05	0.04	0.04
query9	0.58	0.53	0.52
query10	0.58	0.58	0.58
query11	0.15	0.11	0.12
query12	0.15	0.12	0.12
query13	0.62	0.61	0.60
query14	1.00	1.00	1.01
query15	0.84	0.84	0.84
query16	0.40	0.38	0.38
query17	1.06	1.06	1.02
query18	0.21	0.20	0.19
query19	1.84	1.83	1.77
query20	0.02	0.02	0.02
query21	15.46	0.20	0.13
query22	5.12	0.07	0.05
query23	15.66	0.25	0.10
query24	2.93	0.59	0.50
query25	0.07	0.06	0.07
query26	0.13	0.12	0.12
query27	0.06	0.05	0.05
query28	4.24	1.15	0.93
query29	12.65	3.94	3.23
query30	0.28	0.15	0.11
query31	2.81	0.58	0.39
query32	3.24	0.55	0.48
query33	3.02	3.03	3.07
query34	15.78	5.18	4.59
query35	4.59	4.59	4.58
query36	0.68	0.51	0.50
query37	0.10	0.07	0.07
query38	0.07	0.04	0.04
query39	0.04	0.03	0.03
query40	0.17	0.15	0.14
query41	0.09	0.03	0.02
query42	0.03	0.03	0.03
query43	0.04	0.04	0.04
Total cold run time: 98.25 s
Total hot run time: 27.62 s

hello-stephen · 2025-11-07T04:09:45Z

BE UT Coverage Report

Increment line coverage 55.00% (11/20) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	52.78% (18225/34532)
Line Coverage	38.13% (165765/434725)
Region Coverage	33.14% (128978/389193)
Branch Coverage	33.87% (55316/163328)

hello-stephen · 2025-11-07T05:27:28Z

BE Regression && UT Coverage Report

Increment line coverage 63.64% (14/22) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.53% (24280/33946)
Line Coverage	58.01% (252617/435445)
Region Coverage	53.33% (210492/394714)
Branch Coverage	54.66% (89872/164411)

hello-stephen · 2025-11-07T05:41:48Z

FE Regression Coverage Report

Increment line coverage 38.27% (31/81) 🎉
Increment coverage report
Complete coverage report

morningman requested review from dataroaring, gavinchou and w41ter as code owners September 28, 2025 16:20

morningman added dev/3.0.x dev/3.1.x dev/4.0.x and removed dev/3.0.x labels Sep 29, 2025

morningman force-pushed the be_shutdown_notice branch from 34da79c to 3279622 Compare October 6, 2025 23:00

morningman force-pushed the be_shutdown_notice branch 2 times, most recently from b9eff70 to f6b418d Compare October 10, 2025 15:18

morningman force-pushed the be_shutdown_notice branch from f6b418d to 1acd70c Compare November 5, 2025 03:51

morningman requested a review from Copilot November 5, 2025 03:52

Copilot AI reviewed Nov 5, 2025

morningman changed the title ~~[fix](scheduler) disable query and load when BE is shutting down~~ [opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown Nov 5, 2025

yiguolei previously approved these changes Nov 5, 2025

github-actions bot added approved Indicates a PR has been approved by one committer. reviewed labels Nov 5, 2025

morningman added 2 commits November 7, 2025 10:36

add test

6498376

3

33e99f5

morningman force-pushed the be_shutdown_notice branch from 5a5ae36 to 33e99f5 Compare November 7, 2025 02:36

gavinchou approved these changes Nov 7, 2025

github-actions bot added the approved Indicates a PR has been approved by one committer. label Nov 7, 2025

deardeng approved these changes Nov 7, 2025

CalvinKirs approved these changes Nov 7, 2025

morningman merged commit bbded3a into apache:master Nov 7, 2025
28 of 30 checks passed

github-actions bot added dev/3.1.x-conflict dev/4.0.x-conflict labels Nov 7, 2025

morningman mentioned this pull request Nov 7, 2025

branch-3.1: [opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown #56601 #57805

Merged

morrySnow pushed a commit that referenced this pull request Nov 8, 2025

branch-3.1: [opt](scheduler) Improve Graceful Shutdown Behavior for B…

6e2eecd

…E and FE, and Optimize Query Retry During BE Shutdown #56601 (#57805) bp #56601

morrySnow added dev/3.1.3-merged and removed dev/3.1.x dev/3.1.x-conflict labels Nov 8, 2025

morningman mentioned this pull request Nov 14, 2025

[fix](grace-stop) remove heartbeat condition #58019

Merged

morningman added a commit that referenced this pull request Nov 15, 2025

[fix](grace-stop) remove heartbeat condition (#58019)

915b337

Followup #56601

github-actions bot pushed a commit that referenced this pull request Nov 15, 2025

[fix](grace-stop) remove heartbeat condition (#58019)

9330178

Followup #56601

morningman added a commit to morningman/doris that referenced this pull request Nov 29, 2025

[fix](grace-stop) remove heartbeat condition (apache#58019)

1128939

Followup apache#56601

morningman mentioned this pull request Nov 29, 2025

branch-4.0: [opt](scheduler) Improve Graceful Shutdown Behavior for BE and FE, and Optimize Query Retry During BE Shutdown #56601 #58019 #58526

Open
