
Conversation

@xy720
Member

@xy720 xy720 commented Sep 7, 2025

What problem does this PR solve?

I found that when there is a large amount of garbage (about 90,000 partitions) in the recycle bin, the FE's table lock can be held for a long time by the DynamicPartitionScheduler thread. The stack looks like this:

"recycle bin" #28 daemon prio=5 os_prio=0 cpu=73880509.81ms elapsed=96569.50s allocated=9212M defined_classes=9 tid=0x00007f0b545c1800 nid=0x2f4540 runnable  [0x00007f0b251fd000]
   java.lang.Thread.State: RUNNABLE
        at org.apache.doris.catalog.CatalogRecycleBin.getSameNamePartitionIdListToErase(CatalogRecycleBin.java:539)
        - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin)
        at org.apache.doris.catalog.CatalogRecycleBin.erasePartitionWithSameName(CatalogRecycleBin.java:556)
        - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin)
        at org.apache.doris.catalog.CatalogRecycleBin.erasePartition(CatalogRecycleBin.java:510)
        - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin)
        at org.apache.doris.catalog.CatalogRecycleBin.runAfterCatalogReady(CatalogRecycleBin.java:1012)
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58)
        at org.apache.doris.common.util.Daemon.run(Daemon.java:119)

   Locked ownable synchronizers:
        - None

"DynamicPartitionScheduler" #41 daemon prio=5 os_prio=0 cpu=115405.50ms elapsed=87942.53s allocated=16637M defined_classes=96 tid=0x00007f0b545cc800 nid=0x2f4545 waiting for monitor entry  [0x00007f0b247fe000]
   java.lang.Thread.State: BLOCKED (on object monitor)
        at org.apache.doris.catalog.CatalogRecycleBin.recyclePartition(CatalogRecycleBin.java:187)
        - waiting to lock <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin)
        at org.apache.doris.catalog.OlapTable.dropPartition(OlapTable.java:1164)
        at org.apache.doris.catalog.OlapTable.dropPartition(OlapTable.java:1207)
        at org.apache.doris.datasource.InternalCatalog.dropPartitionWithoutCheck(InternalCatalog.java:1895)
        at org.apache.doris.datasource.InternalCatalog.dropPartition(InternalCatalog.java:1884)
        at org.apache.doris.catalog.Env.dropPartition(Env.java:3212)
        at org.apache.doris.clone.DynamicPartitionScheduler.executeDynamicPartition(DynamicPartitionScheduler.java:605)
        at org.apache.doris.clone.DynamicPartitionScheduler.runAfterCatalogReady(DynamicPartitionScheduler.java:729)
        at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58)
        at org.apache.doris.clone.DynamicPartitionScheduler.run(DynamicPartitionScheduler.java:688)

The DynamicPartitionScheduler thread is waiting for the CatalogRecycleBin thread while itself holding the table write lock.
In the FE log, you can see that the CatalogRecycleBin thread is doing something heavy, costing almost 5~10 minutes on every run:

fe.log.20250907-2:2025-09-07 04:15:50,740 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 375503ms
fe.log.20250907-2:2025-09-07 04:23:14,109 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 413369ms
fe.log.20250907-2:2025-09-07 04:30:01,187 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 377077ms
fe.log.20250907-2:2025-09-07 04:38:22,769 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 471581ms
fe.log.20250907-2:2025-09-07 04:45:42,552 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 409782ms
fe.log.20250907-2:2025-09-07 04:54:30,825 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 498272ms
fe.log.20250907-2:2025-09-07 05:01:36,311 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 395485ms

The most costly task of the CatalogRecycleBin thread is erasing partitions with the same name:

2025-09-07 04:16:20,884 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62638463] name: p_20190511160000_20190511170000 from table[32976073] from db[682022]
2025-09-07 04:16:20,994 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62640651] name: p_20190430160000_20190430170000 from table[32976073] from db[682022]
2025-09-07 04:16:21,438 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[60264769] name: p_20190517210000_20190517220000 from table[32976073] from db[682022]
2025-09-07 04:16:21,787 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62651922] name: p_20190510150000_20190510160000 from table[32976073] from db[682022]
2025-09-07 04:16:21,893 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[59222503] name: p_20190527080000_20190527090000 from table[32976073] from db[682022]
2025-09-07 04:16:22,204 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62656398] name: p_20190511090000_20190511100000 from table[32976073] from db[682022]
2025-09-07 04:16:22,430 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[59228497] name: p_20190518120000_20190518130000 from table[32976073] from db[682022]
2025-09-07 04:16:22,493 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62658335] name: p_20190512170000_20190512180000 from table[32976073] from db[682022]
...

This may lead to the whole FE hanging, because the table lock is needed by many other threads.
(Screenshot attached: Clipboard_Screenshot_1757283600)

This commit mainly optimizes the logic of recycling same-name metadata, adding caches to reduce the time complexity.
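
A rough sketch of the approach (the field and method names below are illustrative, not necessarily the exact ones in the patch): keep an index from (tableId, partitionName) to the recycled partition ids, maintain it whenever a partition enters or leaves the recycle bin, and let the same-name erase path consult this index instead of scanning every recycled partition.

// Illustrative sketch only; the real field names in CatalogRecycleBin may differ.
// "Pair" stands for whatever pair/composite-key type the codebase uses.
private final Map<Pair<Long, String>, Set<Long>> tableIdAndPartitionNameToIds = new HashMap<>();

private void addToSameNameIndex(long tableId, String partitionName, long partitionId) {
    tableIdAndPartitionNameToIds
            .computeIfAbsent(Pair.of(tableId, partitionName), k -> new HashSet<>())
            .add(partitionId);
}

private void removeFromSameNameIndex(long tableId, String partitionName, long partitionId) {
    tableIdAndPartitionNameToIds.computeIfPresent(Pair.of(tableId, partitionName), (k, ids) -> {
        ids.remove(partitionId);
        return ids.isEmpty() ? null : ids;
    });
}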

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Sep 7, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@xy720 xy720 changed the title [enhancement](recycle bin) reduce [enhancement](recycle bin) optimize the recycle bin to reduce the potential of FE hang Sep 7, 2025
}
idToRecycleTime.put(table.getId(), recycleTime);
idToTable.put(table.getId(), tableInfo);
String key = dbId + "_" + table.getName();
Contributor

change all + "_" + expression to function calls
to implement the concatenation rules with functions, including db table and partitions..

Member Author

I just realized that using dbId + "_" + tableName is a bad idea, because table/partition names may contain "_".
So I used a Map<Pair<Long, String>, Set> structure instead.
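
For example (a sketch, not the exact code in the PR), the ambiguity and the composite-key fix look like this:

// Two different (dbId, tableName) pairs can concatenate to the same string key,
// because names may contain '_':
String k1 = 12L + "_" + "t";   // "12_t"
String k2 = 1L + "_" + "2_t";  // "12_t" -- collision
// A composite key avoids the ambiguity entirely (dbId and table come from the surrounding snippet):
Map<Pair<Long, String>, Set<Long>> dbIdAndTableNameToIds = new HashMap<>();
dbIdAndTableNameToIds
        .computeIfAbsent(Pair.of(dbId, table.getName()), k -> new HashSet<>())
        .add(table.getId());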

@xy720
Member Author

xy720 commented Sep 8, 2025

run buildall

@doris-robot

TPC-H: Total hot run time: 35042 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit c48a5adfe7fa5f33c9c338d3d5a8f424f0fd1fbf, data reload: false

------ Round 1 ----------------------------------
q1	17602	5206	5287	5206
q2	1995	340	214	214
q3	10368	1369	757	757
q4	10602	1050	558	558
q5	9690	2457	2404	2404
q6	216	174	164	164
q7	964	775	619	619
q8	9361	1436	1196	1196
q9	7333	5100	5135	5100
q10	6966	2437	1978	1978
q11	489	317	282	282
q12	366	360	223	223
q13	17781	3683	3050	3050
q14	245	244	216	216
q15	599	503	481	481
q16	1005	998	942	942
q17	621	873	391	391
q18	7717	7188	7069	7069
q19	1104	960	577	577
q20	352	349	235	235
q21	4144	3254	2393	2393
q22	1109	1062	987	987
Total cold run time: 110629 ms
Total hot run time: 35042 ms

----- Round 2, with runtime_filter_mode=off -----
q1	5139	5126	5094	5094
q2	244	333	229	229
q3	2275	2701	2338	2338
q4	1349	1807	1387	1387
q5	4283	4201	4166	4166
q6	214	169	133	133
q7	1941	1861	1683	1683
q8	2541	2449	2421	2421
q9	6890	6845	6807	6807
q10	2943	3169	2741	2741
q11	594	506	507	506
q12	666	744	597	597
q13	3272	3661	3081	3081
q14	268	292	270	270
q15	517	474	481	474
q16	1063	1061	1019	1019
q17	1134	1489	1401	1401
q18	7280	7175	6921	6921
q19	784	815	859	815
q20	1924	1959	1849	1849
q21	4897	4305	4354	4305
q22	1114	1037	1010	1010
Total cold run time: 51332 ms
Total hot run time: 49247 ms

@doris-robot

TPC-DS: Total hot run time: 189783 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit c48a5adfe7fa5f33c9c338d3d5a8f424f0fd1fbf, data reload: false

query1	1082	431	438	431
query2	2360	1755	1762	1755
query3	6651	228	223	223
query4	25809	23522	22810	22810
query5	2180	638	558	558
query6	335	267	246	246
query7	4619	530	310	310
query8	310	261	259	259
query9	8532	2902	2947	2902
query10	499	362	312	312
query11	15928	14994	14816	14816
query12	172	120	121	120
query13	1692	558	441	441
query14	11104	9167	9196	9167
query15	228	191	185	185
query16	5920	672	512	512
query17	1656	766	638	638
query18	2056	438	366	366
query19	217	200	171	171
query20	129	130	121	121
query21	196	132	112	112
query22	4077	4217	4032	4032
query23	34078	33111	33182	33111
query24	7793	2392	2415	2392
query25	559	517	437	437
query26	872	277	169	169
query27	2719	514	359	359
query28	4379	2270	2261	2261
query29	797	606	506	506
query30	342	221	197	197
query31	902	816	737	737
query32	91	85	77	77
query33	567	393	366	366
query34	801	854	526	526
query35	831	853	761	761
query36	1003	1001	921	921
query37	129	115	98	98
query38	3517	3537	3457	3457
query39	1536	1448	1430	1430
query40	227	136	130	130
query41	68	67	64	64
query42	133	117	115	115
query43	489	496	462	462
query44	1361	889	915	889
query45	184	181	174	174
query46	895	1024	650	650
query47	1791	1850	1738	1738
query48	405	427	320	320
query49	850	516	421	421
query50	656	700	433	433
query51	3970	3964	4021	3964
query52	122	121	106	106
query53	250	284	202	202
query54	617	602	538	538
query55	92	93	95	93
query56	360	330	322	322
query57	1199	1193	1111	1111
query58	298	287	281	281
query59	2585	2634	2549	2549
query60	364	368	355	355
query61	165	153	166	153
query62	877	762	642	642
query63	235	202	209	202
query64	3581	1167	840	840
query65	4036	3921	4067	3921
query66	1127	444	358	358
query67	15534	15392	15091	15091
query68	5031	953	594	594
query69	511	327	294	294
query70	1370	1277	1328	1277
query71	495	357	325	325
query72	5904	5498	5465	5465
query73	715	717	368	368
query74	9157	9252	9009	9009
query75	3410	3255	2797	2797
query76	3430	1148	754	754
query77	549	404	345	345
query78	9755	9619	8907	8907
query79	3187	829	604	604
query80	1132	603	533	533
query81	510	267	223	223
query82	764	163	134	134
query83	375	264	279	264
query84	260	107	100	100
query85	994	473	432	432
query86	488	318	312	312
query87	3754	3780	3644	3644
query88	3853	2265	2244	2244
query89	410	340	321	321
query90	1960	241	233	233
query91	167	174	135	135
query92	96	77	71	71
query93	2721	1022	626	626
query94	726	422	333	333
query95	392	350	337	337
query96	501	590	285	285
query97	2965	2978	2885	2885
query98	243	219	214	214
query99	1373	1440	1302	1302
Total cold run time: 266337 ms
Total hot run time: 189783 ms

@doris-robot

ClickBench: Total hot run time: 30.1 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit c48a5adfe7fa5f33c9c338d3d5a8f424f0fd1fbf, data reload: false

query1	0.06	0.05	0.05
query2	0.10	0.06	0.05
query3	0.25	0.08	0.08
query4	1.61	0.11	0.12
query5	0.28	0.25	0.25
query6	1.17	0.66	0.65
query7	0.03	0.03	0.03
query8	0.06	0.05	0.04
query9	0.62	0.52	0.53
query10	0.58	0.57	0.57
query11	0.16	0.11	0.13
query12	0.16	0.13	0.12
query13	0.63	0.63	0.62
query14	1.04	1.05	1.06
query15	0.88	0.85	0.87
query16	0.40	0.41	0.43
query17	1.03	1.04	1.06
query18	0.21	0.20	0.20
query19	1.96	1.87	1.86
query20	0.02	0.01	0.01
query21	15.42	0.96	0.58
query22	0.77	1.26	0.78
query23	14.79	1.42	0.62
query24	7.14	0.64	0.74
query25	0.49	0.27	0.13
query26	0.65	0.17	0.13
query27	0.08	0.05	0.06
query28	9.59	0.96	0.44
query29	12.59	3.98	3.35
query30	0.29	0.13	0.11
query31	2.84	0.59	0.38
query32	3.24	0.56	0.49
query33	3.07	3.16	3.11
query34	16.10	5.43	4.88
query35	4.91	4.94	4.91
query36	0.71	0.51	0.50
query37	0.11	0.08	0.07
query38	0.06	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.14	0.14
query41	0.08	0.03	0.03
query42	0.04	0.04	0.03
query43	0.05	0.04	0.04
Total cold run time: 104.49 s
Total hot run time: 30.1 s

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 33.04% (38/115) 🎉
Increment coverage report
Complete coverage report

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 80.87% (93/115) 🎉
Increment coverage report
Complete coverage report

@xy720
Member Author

xy720 commented Jan 6, 2026

run buildall

@xy720
Member Author

xy720 commented Jan 6, 2026

run buildall

@doris-robot

TPC-H: Total hot run time: 31881 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 3f96d78527fbe6c77bb0e764d197d905dacef676, data reload: false

------ Round 1 ----------------------------------
q1	17695	4269	4035	4035
q2	2074	338	242	242
q3	10131	1252	709	709
q4	10232	880	311	311
q5	7552	2052	1879	1879
q6	194	174	136	136
q7	947	778	671	671
q8	9325	1433	1180	1180
q9	4951	4658	4633	4633
q10	6809	1785	1432	1432
q11	537	302	259	259
q12	743	766	581	581
q13	17780	3945	3062	3062
q14	285	305	273	273
q15	589	510	505	505
q16	659	676	633	633
q17	663	819	532	532
q18	6778	6324	6926	6324
q19	1668	1016	594	594
q20	452	386	275	275
q21	3260	2606	2616	2606
q22	1138	1085	1009	1009
Total cold run time: 104462 ms
Total hot run time: 31881 ms

----- Round 2, with runtime_filter_mode=off -----
q1	4304	4282	4204	4204
q2	345	435	314	314
q3	2190	2867	2362	2362
q4	1455	1794	1435	1435
q5	4572	4371	4360	4360
q6	214	176	132	132
q7	1963	1950	1775	1775
q8	2550	2364	2365	2364
q9	7143	7185	7111	7111
q10	2429	2668	2258	2258
q11	545	478	432	432
q12	651	686	574	574
q13	3327	3743	3060	3060
q14	279	281	258	258
q15	527	491	482	482
q16	624	630	629	629
q17	1082	1207	1296	1207
q18	7476	7367	7104	7104
q19	819	753	813	753
q20	1914	1946	1778	1778
q21	4468	4244	4076	4076
q22	1067	1011	987	987
Total cold run time: 49944 ms
Total hot run time: 47655 ms

@doris-robot

TPC-DS: Total hot run time: 171974 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 3f96d78527fbe6c77bb0e764d197d905dacef676, data reload: false

query5	4890	571	448	448
query6	341	226	217	217
query7	4257	456	259	259
query8	327	248	213	213
query9	8765	2685	2686	2685
query10	529	386	320	320
query11	15247	15208	14796	14796
query12	181	120	123	120
query13	1269	480	402	402
query14	6250	2960	2810	2810
query14_1	2672	2761	2652	2652
query15	210	193	174	174
query16	1031	475	367	367
query17	1069	681	591	591
query18	2583	446	336	336
query19	226	227	217	217
query20	128	117	123	117
query21	217	139	122	122
query22	4058	3917	3943	3917
query23	15944	15485	15417	15417
query23_1	15335	15400	15454	15400
query24	7389	1557	1163	1163
query24_1	1167	1175	1189	1175
query25	560	481	426	426
query26	1252	269	169	169
query27	2760	451	293	293
query28	4550	2149	2133	2133
query29	817	563	465	465
query30	314	239	212	212
query31	826	636	557	557
query32	84	73	69	69
query33	548	344	295	295
query34	921	867	531	531
query35	753	818	704	704
query36	829	901	820	820
query37	121	91	76	76
query38	2698	2684	2590	2590
query39	766	752	728	728
query39_1	704	700	721	700
query40	219	130	112	112
query41	68	62	63	62
query42	108	99	101	99
query43	441	442	407	407
query44	1310	719	718	718
query45	182	179	172	172
query46	846	950	577	577
query47	1445	1440	1364	1364
query48	306	315	238	238
query49	605	402	321	321
query50	625	267	199	199
query51	3788	3786	3845	3786
query52	106	110	94	94
query53	301	332	272	272
query54	277	250	253	250
query55	74	75	69	69
query56	282	277	283	277
query57	986	1007	924	924
query58	276	244	248	244
query59	2091	2197	2080	2080
query60	317	310	291	291
query61	164	158	159	158
query62	383	358	310	310
query63	312	266	273	266
query64	4944	1303	994	994
query65	3793	3720	3722	3720
query66	1420	431	296	296
query67	15529	14937	14598	14598
query68	6804	972	695	695
query69	487	343	310	310
query70	1026	979	866	866
query71	384	307	275	275
query72	6111	3411	3472	3411
query73	782	725	301	301
query74	8867	8726	8537	8537
query75	2822	2794	2469	2469
query76	3958	1061	644	644
query77	521	359	290	290
query78	9682	9844	9085	9085
query79	1620	857	586	586
query80	644	567	462	462
query81	514	258	224	224
query82	239	146	111	111
query83	273	251	230	230
query84	260	121	108	108
query85	899	518	451	451
query86	373	319	320	319
query87	2848	2830	2735	2735
query88	3233	2231	2225	2225
query89	389	352	315	315
query90	2038	144	149	144
query91	171	169	138	138
query92	75	68	61	61
query93	1331	878	527	527
query94	573	326	284	284
query95	565	316	301	301
query96	588	468	198	198
query97	2276	2337	2249	2249
query98	242	203	201	201
query99	591	569	531	531
Total cold run time: 253967 ms
Total hot run time: 171974 ms

@doris-robot

ClickBench: Total hot run time: 26.89 s
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/clickbench-tools
ClickBench test result on commit 3f96d78527fbe6c77bb0e764d197d905dacef676, data reload: false

query1	0.06	0.05	0.04
query2	0.12	0.05	0.05
query3	0.26	0.08	0.08
query4	1.61	0.11	0.10
query5	0.26	0.27	0.28
query6	1.15	0.67	0.64
query7	0.03	0.03	0.02
query8	0.06	0.04	0.04
query9	0.56	0.50	0.50
query10	0.54	0.56	0.54
query11	0.15	0.10	0.10
query12	0.14	0.11	0.10
query13	0.61	0.58	0.59
query14	0.95	0.94	0.94
query15	0.79	0.77	0.76
query16	0.39	0.40	0.40
query17	1.06	1.06	1.01
query18	0.22	0.21	0.22
query19	1.96	1.79	1.86
query20	0.02	0.01	0.01
query21	15.52	0.25	0.14
query22	4.86	0.08	0.05
query23	15.67	0.28	0.09
query24	1.01	0.40	0.68
query25	0.13	0.09	0.05
query26	0.16	0.14	0.14
query27	0.08	0.06	0.07
query28	4.80	1.06	0.88
query29	12.59	3.90	3.13
query30	0.28	0.13	0.12
query31	2.83	0.64	0.39
query32	3.23	0.55	0.44
query33	2.97	3.02	3.06
query34	16.54	5.13	4.49
query35	4.44	4.47	4.47
query36	0.67	0.50	0.49
query37	0.11	0.06	0.06
query38	0.08	0.04	0.04
query39	0.04	0.03	0.03
query40	0.18	0.14	0.13
query41	0.09	0.03	0.02
query42	0.04	0.04	0.03
query43	0.04	0.04	0.04
Total cold run time: 97.3 s
Total hot run time: 26.89 s

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 74.40% (93/125) 🎉
Increment coverage report
Complete coverage report

@xy720
Member Author

xy720 commented Jan 7, 2026

run feut

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 62.40% (78/125) 🎉
Increment coverage report
Complete coverage report

@xy720
Member Author

xy720 commented Jan 7, 2026

run p0

@hello-stephen
Contributor

FE Regression Coverage Report

Increment line coverage 74.40% (93/125) 🎉
Increment coverage report
Complete coverage report

Contributor

@dataroaring dataroaring left a comment

LGTM

@github-actions github-actions bot added the approved Indicates a PR has been approved by one committer. label Jan 7, 2026
@github-actions
Contributor

github-actions bot commented Jan 7, 2026

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Jan 7, 2026

PR approved by anyone and no changes requested.

Contributor

@dataroaring dataroaring left a comment

PR #55753 Code Review: Recycle Bin Optimization

Overview

This PR addresses a critical performance issue where FE can hang when there are ~90,000 partitions in the recycle bin. The erasePartitionWithSameName method was taking 5-10 minutes while holding a synchronized lock on CatalogRecycleBin, blocking DynamicPartitionScheduler which holds the table write lock.

Solution: Add caches to track same-name metadata, reducing O(n) scans to O(1) lookups.
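
As an illustration of the lookup side (again with illustrative names and an illustrative signature), the same-name erase path becomes a direct index lookup rather than a scan over all recycled partitions:

// Illustrative sketch: the same-name lookup consults a per-(tableId, partitionName)
// index instead of iterating every entry in idToPartition.
private synchronized List<Long> getSameNamePartitionIdListToErase(long tableId, String partitionName) {
    Set<Long> ids = tableIdAndPartitionNameToIds.get(Pair.of(tableId, partitionName));
    return ids == null ? Collections.emptyList() : new ArrayList<>(ids);
}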


✅ Strengths

  1. Correct algorithmic fix: Changed from linear scans (getSameNamePartitionIdListToErase) to cache-based lookups. This is the right approach.

  2. Good key handling: After reviewer feedback, changed from error-prone dbId + "_" + tableName to Pair<Long, String> — correct since table names can contain underscores.

  3. Complete cache maintenance: Caches are updated in all 14+ places: recycleDatabase, recycleTable, recyclePartition, all erase/recover/replay methods, and readFieldsWithGson.

  4. Unit tests: Added 839 lines of tests with 74.4% increment coverage.


⚠️ Issues and Concerns

1. Potential NPE in sorting (Medium Severity)

// CatalogRecycleBin.java:573
private synchronized List<Long> getIdListToEraseByRecycleTime(List<Long> ids, int maxTrashNum) {
    // ...
    ids.sort((x, y) -> Long.compare(idToRecycleTime.get(y), idToRecycleTime.get(x)));

If an ID exists in the cache but was already removed from idToRecycleTime (race condition or bug), this will throw NPE.

Suggested fix:

ids.sort((x, y) -> {
    Long xTime = idToRecycleTime.get(x);
    Long yTime = idToRecycleTime.get(y);
    if (xTime == null || yTime == null) {
        return (xTime == null) ? 1 : -1; // Push nulls to end
    }
    return Long.compare(yTime, xTime);
});

2. Test-only production code modification (Low Severity)

// CatalogRecycleBin.java:318
private synchronized boolean isExpireMinLatency(long id, long currentTimeMs) {
    return (currentTimeMs - idToRecycleTime.get(id)) > minEraseLatency || FeConstants.runningUnitTest;
}

This changes production behavior based on a test flag. Consider using dependency injection for time or a test subclass.
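
A minimal sketch of the injection idea (names here are hypothetical, not from this PR): the daemon reads the current time through an injectable supplier, so tests can control time without a production-side flag.

// Hypothetical sketch using java.util.function.LongSupplier.
private LongSupplier clock = System::currentTimeMillis;

@VisibleForTesting
void setClock(LongSupplier clock) {
    this.clock = clock;
}

private synchronized boolean isExpireMinLatency(long id) {
    return (clock.getAsLong() - idToRecycleTime.get(id)) > minEraseLatency;
}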

3. ConcurrentHashMap inside synchronized methods (Minor)

Map<String, Set<Long>> dbNameToIds = new ConcurrentHashMap<>();

All access is already within synchronized methods, so ConcurrentHashMap is redundant. Regular HashMap would suffice and be consistent with idToDatabase, idToTable, etc.

4. Code duplication in cleanup pattern

The same computeIfPresent cleanup pattern appears 15+ times:

dbNameToIds.computeIfPresent(dbName, (k, v) -> {
    v.remove(dbId);
    return v.isEmpty() ? null : v;
});

Consider extracting a helper method:

private <K> void removeFromCacheSet(Map<K, Set<Long>> cache, K key, Long id) {
    cache.computeIfPresent(key, (k, v) -> {
        v.remove(id);
        return v.isEmpty() ? null : v;
    });
}
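
With such a helper, each cleanup site collapses to a one-liner, e.g. (using the names from the snippet above):

removeFromCacheSet(dbNameToIds, dbName, dbId);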

5. Missing @VisibleForTesting annotation

// CatalogRecycleBin.java:1617
public synchronized void clearAll() {

This test-only method should be annotated or made package-private.


📊 Performance Analysis

Before | After
O(n) per erase cycle, where n = total partitions | O(m), where m = same-name partitions
5-10 minutes for 90k partitions | Expected: sub-second

The fix correctly addresses the time complexity issue.


✅ Correctness Verification

  • Cache updated on recycle (add)
  • Cache updated on erase (remove)
  • Cache updated on recover (remove)
  • Cache updated on replay (remove)
  • Cache rebuilt on FE restart (readFieldsWithGson)
  • Cache not persisted (correct, can be rebuilt)

Verdict

Approve with minor changes. The core optimization is correct and addresses a real production issue. The suggestions above are improvements but not blockers.

Priority fixes before merge:

  1. Add null-safety to the sorting comparator in getIdListToEraseByRecycleTime
  2. Consider removing the || FeConstants.runningUnitTest hack

Contributor

@deardeng deardeng left a comment

LGTM

Contributor

@lide-reed lide-reed left a comment

LGTM

Contributor

@dataroaring dataroaring left a comment

LGTM

@dataroaring dataroaring added the usercase Important user case type label label Jan 8, 2026
@dataroaring dataroaring merged commit d535f2b into apache:master Jan 8, 2026
36 of 42 checks passed
github-actions bot pushed a commit that referenced this pull request Jan 8, 2026
…ential of FE hang (#55753)

yiguolei pushed a commit that referenced this pull request Jan 12, 2026
…duce the potential of FE hang #55753 (#59699)

Cherry-picked from #55753

Co-authored-by: xy720 <22125576+xy720@users.noreply.github.com>
zzzxl1993 pushed a commit to zzzxl1993/doris that referenced this pull request Jan 13, 2026
…ential of FE hang (apache#55753)


Labels

approved Indicates a PR has been approved by one committer. dev/4.0.3-merged reviewed usercase Important user case type label
