[enhancement](cloud) Persist LRU information for file cache (pick #49456 and following fixes) #52807
…9456) When the system restarts, the in-memory LRU queue is lost because it is not persisted. The disk directory must then be re-scanned to load data, which leads to the following issues:
1. The loading order after restart depends on directory traversal, so the original eviction order cannot be preserved.
2. If the system enters resource limit mode after restart, it may mistakenly delete hot data that users access frequently.

In this commit, we periodically dump the LRU queue information to disk and rebuild the LRU queue upon restart. Because the LRU content may be extensive, we only dump the tail end of the LRU queue (the part that will be evicted first); the number of entries dumped is set by a config option.
run buildall
TPC-H: Total hot run time: 39918 ms
TPC-DS: Total hot run time: 197295 ms
ClickBench: Total hot run time: 31.96 s
run feut
PR approved by at least one committer and no changes requested.
PR approved by anyone and no changes requested.
### What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
1. Fix a use-after-free crash in the LRU queue (ASan trace below).
2. Fix extra LRU queue info being recorded when the 'need_to_move' flag is unset.
3. Use a concurrent queue to record queue change info for thread safety.
```
ERROR: AddressSanitizer: heap-use-after-free on address 0x603005548c40 at pc 0x55f28e8c4785 bp 0x7f603582e1f0 sp 0x7f603582e1e8
READ of size 8 at 0x603005548c40 thread T201
#0 0x55f28e8c4784 in std::_Head_base<0ul, doris::io::CacheLRULog*, false>::_Head_base<doris::io::CacheLRULog*>(doris::io::CacheLRULog*&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/tuple:190:17
#1 0x55f28e8c4784 in std::_Tuple_impl<0ul, doris::io::CacheLRULog*, std::default_delete<doris::io::CacheLRULog>>::_Tuple_impl(std::_Tuple_impl<0ul, doris::io::CacheLRULog*, std::default_delete<doris::io::CacheLRULog>>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/tuple:292:2
#2 0x55f28e8c4784 in std::tuple<doris::io::CacheLRULog*, std::default_delete<doris::io::CacheLRULog>>::tuple(std::tuple<doris::io::CacheLRULog*, std::default_delete<doris::io::CacheLRULog>>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/tuple:1079:17
#3 0x55f28e8c4784 in std::__uniq_ptr_impl<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>::__uniq_ptr_impl(std::__uniq_ptr_impl<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:162:9
#4 0x55f28e8c4784 in std::__uniq_ptr_data<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>, true, true>::__uniq_ptr_data(std::__uniq_ptr_data<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>, true, true>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:211:7
#5 0x55f28e8c4784 in std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>::unique_ptr(std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/unique_ptr.h:327:7
#6 0x55f28e8c4784 in doris::io::LRUQueueRecorder::replay_queue_event(doris::io::FileCacheType) /root/doris/be/src/io/cache/lru_queue_recorder.cpp:40:20
#7 0x55f28e82d620 in doris::io::BlockFileCache::run_background_lru_log_replay() /root/doris/be/src/io/cache/block_file_cache.cpp:2242:24
#8 0x55f2cdc2720f in execute_native_thread_routine /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:82:18
#9 0x7f61f1842608 in start_thread /build/glibc-SzIz7B/glibc-2.31/nptl/pthread_create.c:477:8
#10 0x7f61f1aef132 in __clone /build/glibc-SzIz7B/glibc-2.31/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:95
0x603005548c40 is located 16 bytes inside of 24-byte region [0x603005548c30,0x603005548c48)
freed by thread T201 here:
#0 0x55f28e51680d in operator delete(void*) (/home/work/unlimit_teamcity/TeamCity/Agents/20250708205944agent_172.16.0.48_1/work/60183217f6ee2a9c/output/be/lib/doris_be+0x3975a80d) (BuildId: 8b6ba6101e736655)
#1 0x55f28e8c3ce0 in std::__cxx11::list<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>, std::allocator<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>::pop_front() /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_list.h:1198:15
#2 0x55f28e8c3ce0 in doris::io::LRUQueueRecorder::replay_queue_event(doris::io::FileCacheType) /root/doris/be/src/io/cache/lru_queue_recorder.cpp:41:19
#3 0x55f28e82d620 in doris::io::BlockFileCache::run_background_lru_log_replay() /root/doris/be/src/io/cache/block_file_cache.cpp:2242:24
#4 0x55f2cdc2720f in execute_native_thread_routine /data/gcc-11.1.0/build/x86_64-pc-linux-gnu/libstdc++-v3/src/c++11/../../../../../libstdc++-v3/src/c++11/thread.cc:82:18
previously allocated by thread T607 (CumuCompactionT) here:
#0 0x55f28e515fad in operator new(unsigned long) (/home/work/unlimit_teamcity/TeamCity/Agents/20250708205944agent_172.16.0.48_1/work/60183217f6ee2a9c/output/be/lib/doris_be+0x39759fad) (BuildId: 8b6ba6101e736655)
#1 0x55f28e8c660d in __gnu_cxx::new_allocator<std::_List_node<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>::allocate(unsigned long, void const*) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/ext/new_allocator.h:121:27
#2 0x55f28e8c660d in std::allocator<std::_List_node<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>::allocate(unsigned long) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/allocator.h:173:32
#3 0x55f28e8c660d in std::allocator_traits<std::allocator<std::_List_node<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>>::allocate(std::allocator<std::_List_node<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>&, unsigned long) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/alloc_traits.h:460:20
#4 0x55f28e8c660d in std::__cxx11::_List_base<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>, std::allocator<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>::_M_get_node() /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_list.h:442:16
#5 0x55f28e8c660d in std::_List_node<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>* std::__cxx11::list<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>, std::allocator<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>::_M_create_node<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>(std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_list.h:634:21
#6 0x55f28e8c660d in void std::__cxx11::list<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>, std::allocator<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>::_M_insert<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>(std::_List_iterator<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>, std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_list.h:1911:18
#7 0x55f28e8c3522 in std::__cxx11::list<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>, std::allocator<std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>>>::push_back(std::unique_ptr<doris::io::CacheLRULog, std::default_delete<doris::io::CacheLRULog>>&&) /var/local/ldb-toolchain/bin/../lib/gcc/x86_64-linux-gnu/11/../../../../include/c++/11/bits/stl_list.h:1217:15
#8 0x55f28e8c3522 in doris::io::LRUQueueRecorder::record_queue_event(doris::io::FileCacheType, doris::io::CacheLRULogType, doris::io::UInt128Wrapper, unsigned long, unsigned long) /root/doris/be/src/io/cache/lru_queue_recorder.cpp:29:15
#9 0x55f28e82f09b in doris::io::BlockFileCache::use_cell(doris::io::BlockFileCache::FileBlockCell const&, std::__cxx11::list<std::shared_ptr<doris::io::FileBlock>, std::allocator<std::shared_ptr<doris::io::FileBlock>>>*, bool, std::lock_guard<std::mutex>&) /root/doris/be/src/io/cache/block_file_cache.cpp:380:20
#10 0x55f28e833d1b in doris::io::BlockFileCache::get_impl[abi:cxx11](doris::io::UInt128Wrapper const&, doris::io::CacheContext const&, doris::io::FileBlock::Range const&, std::lock_guard<std::mutex>&) /root/doris/be/src/io/cache/block_file_cache.cpp:572:13
#11 0x55f28e83b4ef in doris::io::BlockFileCache::get_or_set(doris::io::UInt128Wrapper const&, unsigned long, unsigned long, doris::io::CacheContext&) /root/doris/be/src/io/cache/block_file_cache.cpp:762:27
#12 0x55f28e7ffcee in doris::io::CachedRemoteFileReader::read_at_impl(unsigned long, doris::Slice, unsigned long*, doris::io::IOContext const*) /root/doris/be/src/io/cache/cached_remote_file_reader.cpp:191:21
#13 0x55f28e7f8017 in doris::io::FileReader::read_at(unsigned long, doris::Slice, unsigned long*, doris::io::IOContext const*) /root/doris/be/src/io/fs/file_reader.cpp:34:17
```
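The trace shows one thread appending to a `std::list` in `record_queue_event` while another thread moves from the front and calls `pop_front` in `replay_queue_event`. The sketch below illustrates the general remedy named in the summary (a queue that is safe for concurrent producers and consumers). It is an assumption-laden stand-in: `LruLog` and `LogQueue` are hypothetical names, and a plain mutex is used here because the PR text does not say which concurrent queue implementation it adopts.

```cpp
#include <cstdint>
#include <deque>
#include <memory>
#include <mutex>
#include <utility>

// Hypothetical log record; field names are illustrative only.
struct LruLog {
    int type;
    std::uint64_t offset;
};

// Mutex-guarded stand-in for a concurrent queue.
class LogQueue {
public:
    // Called by producer threads (for example, readers touching the cache).
    void push(std::unique_ptr<LruLog> log) {
        std::lock_guard<std::mutex> guard(_mutex);
        _logs.push_back(std::move(log));
    }

    // Called by the replay thread. The oldest record is moved out and removed
    // inside one critical section, so no other thread can modify the container
    // between the two steps. Returns nullptr when the queue is empty.
    std::unique_ptr<LruLog> try_pop() {
        std::lock_guard<std::mutex> guard(_mutex);
        if (_logs.empty()) {
            return nullptr;
        }
        std::unique_ptr<LruLog> log = std::move(_logs.front());
        _logs.pop_front();
        return log;
    }

private:
    std::mutex _mutex;
    std::deque<std::unique_ptr<LruLog>> _logs;
};
```

The caller owns the returned record outright, so nothing it reads can be freed by a later `push` or `try_pop` on the queue.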
### Release note
None
### Check List (For Author)
- Test <!-- At least one of them must be included. -->
    - [ ] Regression test
    - [ ] Unit Test
    - [ ] Manual test (add detailed scripts or steps below)
    - [ ] No need to test or manual test. Explain why:
        - [ ] This is a refactor/code format and no logic has been changed.
        - [x] Previous test can cover this change.
        - [ ] No code files have been changed.
        - [ ] Other reason <!-- Add your reason? -->
- Behavior changed:
    - [x] No.
    - [ ] Yes. <!-- Explain the behavior change -->
- Does this need documentation?
    - [x] No.
    - [ ] Yes. <!-- Add document PR link here. eg: apache/doris-website#1214 -->
### Check List (For Reviewer who merge this PR)
- [ ] Confirm the release note
- [ ] Confirm test cases
- [ ] Confirm document
- [ ] Add branch pick label <!-- Add branch pick label that this PR should merge into -->
---------
Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
Duplicate elements may occur during the transition between TTL and normal states:
- Replay and dump operations are periodic, so there can be a time window in which the addition of an element to a queue is captured but its removal from the previous queue is not.
- The log queue used for record operations is a concurrent queue that does not guarantee ordering.

Both causes are currently unavoidable, so fallback logic is needed to remove duplicates when an element is added repeatedly.
Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
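One way to picture the duplicate-removal fallback from the commit message above is a queue that drops any stale entry for a key before appending it again. This is only a sketch under assumptions: `ShadowQueue` and its string keys are hypothetical, it covers repeated adds to a single queue, and the PR's actual handling (including duplicates across the TTL and normal queues) is not shown in this page.

```cpp
#include <cstddef>
#include <iterator>
#include <list>
#include <string>
#include <unordered_map>

// Hypothetical queue keyed by a string, for illustration only.
class ShadowQueue {
public:
    // Append key at the most-recently-used end. If the key is already
    // present (a repeated add), the stale entry is erased first.
    void add(const std::string& key) {
        auto it = _index.find(key);
        if (it != _index.end()) {
            _order.erase(it->second);
        }
        _order.push_back(key);
        _index[key] = std::prev(_order.end());
    }

    // Remove key if present; a missing key is ignored.
    void remove(const std::string& key) {
        auto it = _index.find(key);
        if (it == _index.end()) {
            return;
        }
        _order.erase(it->second);
        _index.erase(it);
    }

    std::size_t size() const { return _order.size(); }
    const std::string& front() const { return _order.front(); }
    const std::string& back() const { return _order.back(); }

private:
    std::list<std::string> _order;
    std::unordered_map<std::string, std::list<std::string>::iterator> _index;
};
```

Because `std::list` iterators stay valid when other elements are inserted or erased, the index lets a repeated add find and remove the old position in constant time on average.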
run buildall
PR approved by at least one committer and no changes requested.
TPC-H: Total hot run time: 39904 ms
TPC-DS: Total hot run time: 192739 ms
ClickBench: Total hot run time: 31.72 s
run buildall
Fix the incorrect dumper file name and use a directory to contain all the test dump files, keeping the tests clean.
Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
Force-pushed: 00ba857 to 8c0ebdf
run buildall
TPC-H: Total hot run time: 39796 ms
PR approved by at least one committer and no changes requested.
TPC-DS: Total hot run time: 192028 ms
ClickBench: Total hot run time: 31.53 s
run buildall
TPC-H: Total hot run time: 39637 ms
TPC-DS: Total hot run time: 192095 ms
ClickBench: Total hot run time: 31.36 s
run nonConcurrent
pick #49456