-
Notifications
You must be signed in to change notification settings - Fork 3.7k
[enhancement](recycle bin) optimize the recycle bin to reduce the potential of FE hang #55753
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
| } | ||
| idToRecycleTime.put(table.getId(), recycleTime); | ||
| idToTable.put(table.getId(), tableInfo); | ||
| String key = dbId + "_" + table.getName(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change all + "_" + expression to function calls
to implement the concatenation rules with functions, including db table and partitions..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized using dbId + "_" + tableName is a bad idea because tablet/partition name may contains "_".
So I used a Map<Pair<Long, String>, Set> structure instead.
6340cdc to
c48a5ad
Compare
|
run buildall |
TPC-H: Total hot run time: 35042 ms |
TPC-DS: Total hot run time: 189783 ms |
ClickBench: Total hot run time: 30.1 s |
FE UT Coverage ReportIncrement line coverage |
FE Regression Coverage ReportIncrement line coverage |
|
run buildall |
|
run buildall |
TPC-H: Total hot run time: 31881 ms |
TPC-DS: Total hot run time: 171974 ms |
ClickBench: Total hot run time: 26.89 s |
FE Regression Coverage ReportIncrement line coverage |
|
run feut |
FE UT Coverage ReportIncrement line coverage |
|
run p0 |
FE Regression Coverage ReportIncrement line coverage |
dataroaring
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
PR approved by at least one committer and no changes requested. |
|
PR approved by anyone and no changes requested. |
dataroaring
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR #55753 Code Review: Recycle Bin Optimization
Overview
This PR addresses a critical performance issue where FE can hang when there are ~90,000 partitions in the recycle bin. The erasePartitionWithSameName method was taking 5-10 minutes while holding a synchronized lock on CatalogRecycleBin, blocking DynamicPartitionScheduler which holds the table write lock.
Solution: Add caches to track same-name metadata, reducing O(n) scans to O(1) lookups.
✅ Strengths
-
Correct algorithmic fix: Changed from linear scans (
getSameNamePartitionIdListToErase) to cache-based lookups. This is the right approach. -
Good key handling: After reviewer feedback, changed from error-prone
dbId + "_" + tableNametoPair<Long, String>— correct since table names can contain underscores. -
Complete cache maintenance: Caches are updated in all 14+ places:
recycleDatabase,recycleTable,recyclePartition, all erase/recover/replay methods, andreadFieldsWithGson. -
Unit tests: Added 839 lines of tests with 74.4% increment coverage.
⚠️ Issues and Concerns
1. Potential NPE in sorting (Medium Severity)
// CatalogRecycleBin.java:573
private synchronized List<Long> getIdListToEraseByRecycleTime(List<Long> ids, int maxTrashNum) {
// ...
ids.sort((x, y) -> Long.compare(idToRecycleTime.get(y), idToRecycleTime.get(x)));If an ID exists in the cache but was already removed from idToRecycleTime (race condition or bug), this will throw NPE.
Suggested fix:
ids.sort((x, y) -> {
Long xTime = idToRecycleTime.get(x);
Long yTime = idToRecycleTime.get(y);
if (xTime == null || yTime == null) {
return (xTime == null) ? 1 : -1; // Push nulls to end
}
return Long.compare(yTime, xTime);
});2. Test-only production code modification (Low Severity)
// CatalogRecycleBin.java:318
private synchronized boolean isExpireMinLatency(long id, long currentTimeMs) {
return (currentTimeMs - idToRecycleTime.get(id)) > minEraseLatency || FeConstants.runningUnitTest;
}This changes production behavior based on a test flag. Consider using dependency injection for time or a test subclass.
3. ConcurrentHashMap inside synchronized methods (Minor)
Map<String, Set<Long>> dbNameToIds = new ConcurrentHashMap<>();All access is already within synchronized methods, so ConcurrentHashMap is redundant. Regular HashMap would suffice and be consistent with idToDatabase, idToTable, etc.
4. Code duplication in cleanup pattern
The same computeIfPresent cleanup pattern appears 15+ times:
dbNameToIds.computeIfPresent(dbName, (k, v) -> {
v.remove(dbId);
return v.isEmpty() ? null : v;
});Consider extracting a helper method:
private <K> void removeFromCacheSet(Map<K, Set<Long>> cache, K key, Long id) {
cache.computeIfPresent(key, (k, v) -> {
v.remove(id);
return v.isEmpty() ? null : v;
});
}5. Missing @VisibleForTesting annotation
// CatalogRecycleBin.java:1617
public synchronized void clearAll() {This test-only method should be annotated or made package-private.
📊 Performance Analysis
| Before | After |
|---|---|
| O(n) per erase cycle where n = total partitions | O(m) where m = same-name partitions |
| 5-10 minutes for 90k partitions | Expected: sub-second |
The fix correctly addresses the time complexity issue.
✅ Correctness Verification
- Cache updated on recycle (add)
- Cache updated on erase (remove)
- Cache updated on recover (remove)
- Cache updated on replay (remove)
- Cache rebuilt on FE restart (
readFieldsWithGson) - Cache not persisted (correct, can be rebuilt)
Verdict
Approve with minor changes. The core optimization is correct and addresses a real production issue. The suggestions above are improvements but not blockers.
Priority fixes before merge:
- Add null-safety to the sorting comparator in
getIdListToEraseByRecycleTime - Consider removing the
|| FeConstants.runningUnitTesthack
deardeng
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
fe/fe-core/src/main/java/org/apache/doris/catalog/CatalogRecycleBin.java
Show resolved
Hide resolved
lide-reed
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
dataroaring
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
…ential of FE hang (#55753) ### What problem does this PR solve? I found when there are large amount of garbage(about 90000 partitions) in recycle bin, the Fe's table lock will be hold for long time by DynamicPartitionScheduler thread, the stack is like: ``` "recycle bin" #28 daemon prio=5 os_prio=0 cpu=73880509.81ms elapsed=96569.50s allocated=9212M defined_classes=9 tid=0x00007f0b545c1800 nid=0x2f4540 runnable [0x00007f0b251fd000] java.lang.Thread.State: RUNNABLE at org.apache.doris.catalog.CatalogRecycleBin.getSameNamePartitionIdListToErase(CatalogRecycleBin.java:539) - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.CatalogRecycleBin.erasePartitionWithSameName(CatalogRecycleBin.java:556) - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.CatalogRecycleBin.erasePartition(CatalogRecycleBin.java:510) - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.CatalogRecycleBin.runAfterCatalogReady(CatalogRecycleBin.java:1012) at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) at org.apache.doris.common.util.Daemon.run(Daemon.java:119) Locked ownable synchronizers: - None "DynamicPartitionScheduler" #41 daemon prio=5 os_prio=0 cpu=115405.50ms elapsed=87942.53s allocated=16637M defined_classes=96 tid=0x00007f0b545cc800 nid=0x2f4545 waiting for monitor entry [0x00007f0b247fe000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.doris.catalog.CatalogRecycleBin.recyclePartition(CatalogRecycleBin.java:187) - waiting to lock <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.OlapTable.dropPartition(OlapTable.java:1164) at org.apache.doris.catalog.OlapTable.dropPartition(OlapTable.java:1207) at org.apache.doris.datasource.InternalCatalog.dropPartitionWithoutCheck(InternalCatalog.java:1895) at org.apache.doris.datasource.InternalCatalog.dropPartition(InternalCatalog.java:1884) at org.apache.doris.catalog.Env.dropPartition(Env.java:3212) at org.apache.doris.clone.DynamicPartitionScheduler.executeDynamicPartition(DynamicPartitionScheduler.java:605) at org.apache.doris.clone.DynamicPartitionScheduler.runAfterCatalogReady(DynamicPartitionScheduler.java:729) at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) at org.apache.doris.clone.DynamicPartitionScheduler.run(DynamicPartitionScheduler.java:688) ``` The DynamicPartitionScheduler thread is waiting the CatalogRecycleBin thread while the table write lock is holding by itself . In Fe log, you can see the CatalogRecycleBin thread is running something big and cost almost 5~10 mins every run: ``` fe.log.20250907-2:2025-09-07 04:15:50,740 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 375503ms fe.log.20250907-2:2025-09-07 04:23:14,109 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 413369ms fe.log.20250907-2:2025-09-07 04:30:01,187 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 377077ms fe.log.20250907-2:2025-09-07 04:38:22,769 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 471581ms fe.log.20250907-2:2025-09-07 04:45:42,552 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 409782ms fe.log.20250907-2:2025-09-07 04:54:30,825 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 498272ms fe.log.20250907-2:2025-09-07 05:01:36,311 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 395485ms ``` The most costly task of the CatalogRecycleBin thread is erasing the partition with same name: ``` 2025-09-07 04:16:20,884 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62638463] name: p_2019051116000 0_20190511170000 from table[32976073] from db[682022] 2025-09-07 04:16:20,994 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62640651] name: p_2019043016000 0_20190430170000 from table[32976073] from db[682022] 2025-09-07 04:16:21,438 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[60264769] name: p_2019051721000 0_20190517220000 from table[32976073] from db[682022] 2025-09-07 04:16:21,787 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62651922] name: p_2019051015000 0_20190510160000 from table[32976073] from db[682022] 2025-09-07 04:16:21,893 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[59222503] name: p_2019052708000 0_20190527090000 from table[32976073] from db[682022] 2025-09-07 04:16:22,204 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62656398] name: p_2019051109000 0_20190511100000 from table[32976073] from db[682022] 2025-09-07 04:16:22,430 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[59228497] name: p_2019051812000 0_20190518130000 from table[32976073] from db[682022] 2025-09-07 04:16:22,493 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62658335] name: p_2019051217000 0_20190512180000 from table[32976073] from db[682022] ... ``` This may leads to whole Fe hang because the table lock is used for many threads. <img width="1230" height="438" alt="Clipboard_Screenshot_1757283600" src="https://github.com/user-attachments/assets/59ec8707-82f8-4daf-8dae-b9ebea2b2959" /> This commit mainly optimize the logic of recycling the same name meta, adding caches to reduce the time complexity. ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [x] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
…ential of FE hang (apache#55753) ### What problem does this PR solve? I found when there are large amount of garbage(about 90000 partitions) in recycle bin, the Fe's table lock will be hold for long time by DynamicPartitionScheduler thread, the stack is like: ``` "recycle bin" apache#28 daemon prio=5 os_prio=0 cpu=73880509.81ms elapsed=96569.50s allocated=9212M defined_classes=9 tid=0x00007f0b545c1800 nid=0x2f4540 runnable [0x00007f0b251fd000] java.lang.Thread.State: RUNNABLE at org.apache.doris.catalog.CatalogRecycleBin.getSameNamePartitionIdListToErase(CatalogRecycleBin.java:539) - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.CatalogRecycleBin.erasePartitionWithSameName(CatalogRecycleBin.java:556) - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.CatalogRecycleBin.erasePartition(CatalogRecycleBin.java:510) - locked <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.CatalogRecycleBin.runAfterCatalogReady(CatalogRecycleBin.java:1012) at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) at org.apache.doris.common.util.Daemon.run(Daemon.java:119) Locked ownable synchronizers: - None "DynamicPartitionScheduler" apache#41 daemon prio=5 os_prio=0 cpu=115405.50ms elapsed=87942.53s allocated=16637M defined_classes=96 tid=0x00007f0b545cc800 nid=0x2f4545 waiting for monitor entry [0x00007f0b247fe000] java.lang.Thread.State: BLOCKED (on object monitor) at org.apache.doris.catalog.CatalogRecycleBin.recyclePartition(CatalogRecycleBin.java:187) - waiting to lock <0x000000020d6d6130> (a org.apache.doris.catalog.CatalogRecycleBin) at org.apache.doris.catalog.OlapTable.dropPartition(OlapTable.java:1164) at org.apache.doris.catalog.OlapTable.dropPartition(OlapTable.java:1207) at org.apache.doris.datasource.InternalCatalog.dropPartitionWithoutCheck(InternalCatalog.java:1895) at org.apache.doris.datasource.InternalCatalog.dropPartition(InternalCatalog.java:1884) at org.apache.doris.catalog.Env.dropPartition(Env.java:3212) at org.apache.doris.clone.DynamicPartitionScheduler.executeDynamicPartition(DynamicPartitionScheduler.java:605) at org.apache.doris.clone.DynamicPartitionScheduler.runAfterCatalogReady(DynamicPartitionScheduler.java:729) at org.apache.doris.common.util.MasterDaemon.runOneCycle(MasterDaemon.java:58) at org.apache.doris.clone.DynamicPartitionScheduler.run(DynamicPartitionScheduler.java:688) ``` The DynamicPartitionScheduler thread is waiting the CatalogRecycleBin thread while the table write lock is holding by itself . In Fe log, you can see the CatalogRecycleBin thread is running something big and cost almost 5~10 mins every run: ``` fe.log.20250907-2:2025-09-07 04:15:50,740 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 375503ms fe.log.20250907-2:2025-09-07 04:23:14,109 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 413369ms fe.log.20250907-2:2025-09-07 04:30:01,187 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 377077ms fe.log.20250907-2:2025-09-07 04:38:22,769 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 471581ms fe.log.20250907-2:2025-09-07 04:45:42,552 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 409782ms fe.log.20250907-2:2025-09-07 04:54:30,825 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 498272ms fe.log.20250907-2:2025-09-07 05:01:36,311 INFO (recycle bin|28) [CatalogRecycleBin.erasePartition():516] erasePartition eraseNum: 0 cost: 395485ms ``` The most costly task of the CatalogRecycleBin thread is erasing the partition with same name: ``` 2025-09-07 04:16:20,884 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62638463] name: p_2019051116000 0_20190511170000 from table[32976073] from db[682022] 2025-09-07 04:16:20,994 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62640651] name: p_2019043016000 0_20190430170000 from table[32976073] from db[682022] 2025-09-07 04:16:21,438 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[60264769] name: p_2019051721000 0_20190517220000 from table[32976073] from db[682022] 2025-09-07 04:16:21,787 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62651922] name: p_2019051015000 0_20190510160000 from table[32976073] from db[682022] 2025-09-07 04:16:21,893 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[59222503] name: p_2019052708000 0_20190527090000 from table[32976073] from db[682022] 2025-09-07 04:16:22,204 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62656398] name: p_2019051109000 0_20190511100000 from table[32976073] from db[682022] 2025-09-07 04:16:22,430 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[59228497] name: p_2019051812000 0_20190518130000 from table[32976073] from db[682022] 2025-09-07 04:16:22,493 INFO (recycle bin|28) [CatalogRecycleBin.erasePartitionWithSameName():569] erase partition[62658335] name: p_2019051217000 0_20190512180000 from table[32976073] from db[682022] ... ``` This may leads to whole Fe hang because the table lock is used for many threads. <img width="1230" height="438" alt="Clipboard_Screenshot_1757283600" src="https://github.com/user-attachments/assets/59ec8707-82f8-4daf-8dae-b9ebea2b2959" /> This commit mainly optimize the logic of recycling the same name meta, adding caches to reduce the time complexity. ### Release note None ### Check List (For Author) - Test <!-- At least one of them must be included. --> - [ ] Regression test - [ ] Unit Test - [ ] Manual test (add detailed scripts or steps below) - [ ] No need to test or manual test. Explain why: - [ ] This is a refactor/code format and no logic has been changed. - [ ] Previous test can cover this change. - [ ] No code files have been changed. - [ ] Other reason <!-- Add your reason? --> - Behavior changed: - [x] No. - [ ] Yes. <!-- Explain the behavior change --> - Does this need documentation? - [ ] No. - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 --> ### Check List (For Reviewer who merge this PR) - [ ] Confirm the release note - [ ] Confirm test cases - [ ] Confirm document - [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
What problem does this PR solve?
I found when there are large amount of garbage(about 90000 partitions) in recycle bin, the Fe's table lock will be hold for long time by DynamicPartitionScheduler thread, the stack is like:
The DynamicPartitionScheduler thread is waiting the CatalogRecycleBin thread while the table write lock is holding by itself .
In Fe log, you can see the CatalogRecycleBin thread is running something big and cost almost 5~10 mins every run:
The most costly task of the CatalogRecycleBin thread is erasing the partition with same name:
This may leads to whole Fe hang because the table lock is used for many threads.

This commit mainly optimize the logic of recycling the same name meta, adding caches to reduce the time complexity.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)