@freemandealer
Contributor

@freemandealer freemandealer commented Mar 25, 2025

The in-memory LRU queue is not persisted, so it is lost when the system restarts. The cache must then be rebuilt by re-scanning the disk directory, which causes the following issues:

  1. After a restart, the loading order depends on directory traversal, so the original eviction order is not preserved.
  2. If the system enters resource limit mode after a restart, it may mistakenly delete hot data that users access frequently.
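
To make the two issues concrete, here is a minimal Python sketch (not Doris code). The `OrderedDict` queue and the block names are hypothetical, and a reverse-sorted listing merely stands in for whatever order directory traversal happens to return.

```python
from collections import OrderedDict

# Hypothetical in-memory LRU: the first key is the coldest (evicted first).
lru = OrderedDict()
for block in ["b3", "b1", "b2"]:  # blocks enter the cache in this order
    lru[block] = True
lru.move_to_end("b3")  # b3 is read again, so it becomes the hottest entry

order_before_restart = list(lru)  # coldest first

# After a restart only the files on disk remain. The reverse-sorted names are
# an assumed stand-in for an arbitrary directory traversal order.
order_after_restart = sorted(order_before_restart, reverse=True)

# The first entry of each list is the next eviction victim.
victim_before = order_before_restart[0]  # the cold block b1
victim_after = order_after_restart[0]    # the hot block b3
print(victim_before, victim_after)
```

In this toy case the rebuilt queue would evict the hottest block first, which is the failure mode described in issue 2.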

In this commit, we periodically dump the LRU queue information to disk and rebuild the LRU queue upon restart. Because the LRU queue may be large, we dump only its tail end (the part that will be evicted first); the number of entries to dump is set in the config.
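
A minimal Python sketch of the dump-and-rebuild idea described above, not the actual Doris implementation. The JSON dump format, the function names `dump_lru_tail` and `restore_lru`, and the `max_entries` parameter are assumptions for illustration; the real dump format and config name are not given in this description.

```python
import json
import os
import tempfile
from collections import OrderedDict


def dump_lru_tail(lru, path, max_entries):
    """Write up to max_entries of the coldest keys to path, coldest first."""
    tail = list(lru)[:max_entries]
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as f:
        json.dump(tail, f)
    os.replace(tmp_path, path)  # rename into place so a partial dump is never read


def restore_lru(path, blocks_on_disk):
    """Rebuild the queue: dumped keys first (in dumped order), then the rest."""
    try:
        with open(path) as f:
            dumped = json.load(f)
    except FileNotFoundError:
        dumped = []  # no dump yet: fall back to traversal order only
    on_disk = set(blocks_on_disk)
    lru = OrderedDict()
    for key in dumped:
        if key in on_disk:  # skip entries whose file no longer exists
            lru[key] = True
    for key in blocks_on_disk:  # blocks missing from the dump keep traversal order
        if key not in lru:
            lru[key] = True
    return lru


# Usage: the queue is coldest first; only the two coldest entries are dumped.
lru = OrderedDict((k, True) for k in ["cold1", "cold2", "warm", "hot"])
with tempfile.TemporaryDirectory() as d:
    dump_path = os.path.join(d, "lru_dump.json")
    dump_lru_tail(lru, dump_path, max_entries=2)
    # Simulated restart: the directory scan returns the blocks in another order.
    restored = restore_lru(dump_path, ["hot", "warm", "cold2", "cold1"])
restored_order = list(restored)
print(restored_order)
```

Blocks that are absent from the dump are placed after the dumped tail, on the assumption that anything not among the coldest entries should be evicted later.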

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@Thearas
Contributor

Thearas commented Mar 25, 2025

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@freemandealer freemandealer marked this pull request as ready for review May 15, 2025 16:33
@freemandealer freemandealer requested a review from gavinchou as a code owner May 15, 2025 16:33
@freemandealer
Contributor Author

run buildall

@freemandealer freemandealer changed the title [WIP] persist LRU POC [enhancement](cloud) persist LRU information for filecache May 15, 2025
@freemandealer
Contributor Author

freemandealer commented May 15, 2025

This PR is still being optimized, but comments are welcome.
I am resolving the conflicts.
I am also adding regression cases for this PR.
See the TODOs for more.

The following features are not included in this PR:

  1. file cache microbenchmark tests
  2. get block info directly from LRUQueue (queue refactor)
  3. dynamic dump interval policy
  4. shadow queue accuracy check facilities

@freemandealer
Contributor Author

run buildall

@freemandealer freemandealer changed the title [enhancement](cloud) persist LRU information for filecache [RFC][enhancement](cloud) persist LRU information for filecache May 16, 2025
@freemandealer
Contributor Author

run buildall

@hello-stephen
Contributor

Cloud UT Coverage Report

Increment line coverage 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 83.20% (1124/1351)
Line Coverage 66.71% (19342/28996)
Region Coverage 66.50% (9588/14418)
Branch Coverage 56.38% (5203/9228)

@dataroaring dataroaring added the usercase (Important user case type label) label Jun 25, 2025
@freemandealer freemandealer changed the title [RFC][enhancement](cloud) persist LRU information for filecache [enhancement](cloud) persist LRU information for filecache Jun 26, 2025
@freemandealer
Contributor Author

run buildall

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 🎉
Increment coverage report
Complete coverage report

@doris-robot

TPC-H: Total hot run time: 34139 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit fe841a1bd04b075c8c29f9ab1cc471a94e8fc36c, data reload: false

------ Round 1 ----------------------------------
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	17629	5199	5109	5109
q2	1977	290	184	184
q3	10529	1353	699	699
q4	10269	1032	511	511
q5	8017	2410	2357	2357
q6	188	161	130	130
q7	901	760	597	597
q8	9309	1299	1104	1104
q9	6939	5115	5089	5089
q10	6900	2380	1963	1963
q11	492	296	307	296
q12	353	347	208	208
q13	17777	3690	3064	3064
q14	222	224	214	214
q15	554	488	494	488
q16	425	424	395	395
q17	622	930	376	376
q18	7512	7216	7261	7216
q19	1218	973	556	556
q20	334	349	221	221
q21	3975	3167	2374	2374
q22	1056	1026	988	988
Total cold run time: 107198 ms
Total hot run time: 34139 ms

----- Round 2, with runtime_filter_mode=off -----
query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
q1	5209	5138	5114	5114
q2	252	331	217	217
q3	2209	2720	2311	2311
q4	1372	1788	1331	1331
q5	4311	4434	4440	4434
q6	224	165	121	121
q7	1987	1914	1777	1777
q8	2653	2708	2603	2603
q9	7148	7157	7154	7154
q10	3088	3250	2792	2792
q11	571	511	507	507
q12	684	818	635	635
q13	3549	3839	3265	3265
q14	288	314	264	264
q15	521	476	475	475
q16	443	503	450	450
q17	1160	1548	1325	1325
q18	7464	7182	7235	7182
q19	797	841	911	841
q20	1937	1955	1838	1838
q21	4829	4464	4301	4301
q22	1077	1048	1000	1000
Total cold run time: 51773 ms
Total hot run time: 49937 ms
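
The report does not label its columns. The short Python check below uses the Round 1 rows copied from the report and assumes the columns are: query, cold run, two hot runs, and the faster of the two hot runs. That reading is inferred from the arithmetic, not stated by the bot; under it, the reported totals are the sums of the cold column and of the best-hot column.

```python
# Round 1 rows copied from the report above.
# Assumed columns: query, cold, hot run 1, hot run 2, best hot (all in ms).
ROUND_1 = """
q1 17629 5199 5109 5109
q2 1977 290 184 184
q3 10529 1353 699 699
q4 10269 1032 511 511
q5 8017 2410 2357 2357
q6 188 161 130 130
q7 901 760 597 597
q8 9309 1299 1104 1104
q9 6939 5115 5089 5089
q10 6900 2380 1963 1963
q11 492 296 307 296
q12 353 347 208 208
q13 17777 3690 3064 3064
q14 222 224 214 214
q15 554 488 494 488
q16 425 424 395 395
q17 622 930 376 376
q18 7512 7216 7261 7216
q19 1218 973 556 556
q20 334 349 221 221
q21 3975 3167 2374 2374
q22 1056 1026 988 988
"""

rows = [line.split() for line in ROUND_1.strip().splitlines()]

# Total cold run time: sum of the first numeric column.
cold_total = sum(int(r[1]) for r in rows)

# Best hot time per query: the faster of the two hot runs.
best_hot = [min(int(r[2]), int(r[3])) for r in rows]
hot_total = sum(best_hot)

# The last column of every row matches the computed best hot time.
last_column_is_best_hot = all(int(r[4]) == b for r, b in zip(rows, best_hot))

print(cold_total, hot_total, last_column_is_best_hot)  # 107198 34139 True
```

Both sums match the "Total cold run time" and "Total hot run time" lines reported for Round 1.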

@doris-robot

TPC-DS: Total hot run time: 184748 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit fe841a1bd04b075c8c29f9ab1cc471a94e8fc36c, data reload: false

query	cold (ms)	hot 1 (ms)	hot 2 (ms)	best hot (ms)
query1	1003	395	386	386
query2	6513	1687	1685	1685
query3	6740	214	210	210
query4	26471	23475	22980	22980
query5	4338	569	418	418
query6	303	201	190	190
query7	4623	505	288	288
query8	272	207	214	207
query9	8564	2632	2621	2621
query10	502	325	274	274
query11	15308	15119	14875	14875
query12	151	105	107	105
query13	1644	522	390	390
query14	8524	5670	5667	5667
query15	212	199	170	170
query16	7291	633	480	480
query17	1212	738	602	602
query18	1985	402	324	324
query19	197	193	166	166
query20	121	118	111	111
query21	216	127	109	109
query22	4171	4101	3943	3943
query23	34065	33114	33137	33114
query24	8499	2386	2375	2375
query25	634	468	390	390
query26	1233	274	148	148
query27	2766	510	332	332
query28	4292	2138	2107	2107
query29	781	573	425	425
query30	281	216	189	189
query31	889	836	776	776
query32	117	62	60	60
query33	550	368	305	305
query34	778	840	512	512
query35	790	828	744	744
query36	941	941	899	899
query37	111	97	72	72
query38	4194	4128	4080	4080
query39	1502	1441	1422	1422
query40	205	115	102	102
query41	56	56	52	52
query42	123	103	108	103
query43	512	524	496	496
query44	1289	819	805	805
query45	182	167	165	165
query46	836	994	616	616
query47	1748	1844	1745	1745
query48	371	413	301	301
query49	815	511	393	393
query50	647	695	396	396
query51	4134	4159	4123	4123
query52	109	104	103	103
query53	224	245	193	193
query54	571	558	516	516
query55	85	77	82	77
query56	312	304	270	270
query57	1174	1193	1139	1139
query58	260	256	263	256
query59	2645	2729	2578	2578
query60	317	305	292	292
query61	122	120	122	120
query62	784	718	641	641
query63	226	185	190	185
query64	4384	1034	657	657
query65	4281	4137	4180	4137
query66	1143	413	313	313
query67	15997	15882	15502	15502
query68	8572	929	521	521
query69	469	301	265	265
query70	1189	1090	1114	1090
query71	447	313	303	303
query72	5245	4722	4718	4718
query73	717	587	346	346
query74	9194	9139	8942	8942
query75	3943	3179	2663	2663
query76	3652	1133	710	710
query77	789	380	292	292
query78	10028	10126	9359	9359
query79	2408	823	581	581
query80	615	517	425	425
query81	470	259	222	222
query82	484	124	96	96
query83	295	244	233	233
query84	289	102	85	85
query85	792	437	306	306
query86	329	301	287	287
query87	4435	4461	4395	4395
query88	2868	2249	2265	2249
query89	390	316	281	281
query90	1926	200	197	197
query91	134	141	108	108
query92	158	59	56	56
query93	1274	947	583	583
query94	672	397	297	297
query95	369	287	284	284
query96	490	562	275	275
query97	2722	2779	2660	2660
query98	237	204	196	196
query99	1437	1433	1296	1296
Total cold run time: 273375 ms
Total hot run time: 184748 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 78.01% (525/673) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.19% (15430/26979)
Line Coverage 46.24% (140108/303026)
Region Coverage 45.54% (70970/155858)
Branch Coverage 40.28% (37440/92944)

@freemandealer
Contributor Author

run cloud_p0

@freemandealer
Contributor Author

run feut

@freemandealer
Contributor Author

run external

@hello-stephen
Contributor

FE UT Coverage Report

Increment line coverage 🎉
Increment coverage report
Complete coverage report

@github-actions github-actions bot added the approved (Indicates a PR has been approved by one committer.) label Jul 4, 2025
@github-actions
Contributor

github-actions bot commented Jul 4, 2025

PR approved by at least one committer and no changes requested.

@github-actions
Contributor

github-actions bot commented Jul 4, 2025

PR approved by anyone and no changes requested.

@gavinchou gavinchou merged commit e74a5a1 into apache:master Jul 4, 2025
27 of 28 checks passed
@gavinchou gavinchou changed the title [enhancement](cloud) persist LRU information for filecache [enhancement](cloud) Persist LRU information for file cache Jul 4, 2025
freemandealer added a commit to freemandealer/doris that referenced this pull request Jul 4, 2025
…9456)

When the system restarts, the LRU queue in memory is lost due to lack of
persistence. This requires re-scanning the disk directory to load data,
leading to the following issues:
1. The loading order after restart depends on directory traversal, and
the original eviction order cannot be preserved.
2. If the system enters resource limit mode after restart, it may
mistakenly delete frequently accessed hot data by users.

In this commit, we periodically dump the LRU queue information to disk
and rebuild the LRU queue upon restart. Considering that the LRU content
may be extensive, we only dump the tail end (the part that will be
evicted first) of the LRU queue, with the specific quantity configured
by the config.
freemandealer added a commit to freemandealer/doris that referenced this pull request Jul 4, 2025
…9456)
freemandealer added a commit to freemandealer/doris that referenced this pull request Jul 29, 2025
…che#49456 apache#53969)

to make osx happy

Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>

Labels

approved (Indicates a PR has been approved by one committer.), dev/3.1.0-merged, filecache, reviewed, usercase (Important user case type label)

9 participants