Search before asking
- I had searched in the issues and found no similar issues.
Version
master
What's Wrong?
Location: be/src/io/cache/fs_file_cache_storage.cpp in FSFileCacheStorage::load_cache_info_into_memory() (around line 880)
Description:
In the cache loading logic, we calculate consistency between RocksDB metadata and filesystem using:
double difference_ratio =
        (static_cast<double>(estimated_file_count) - static_cast<double>(db_block_count)) /
        static_cast<double>(estimated_file_count);
This formula assumes estimated_file_count >= db_block_count, where:
- estimated_file_count = directory_size / 1MB (an upper-bound assumption)
- db_block_count = the actual number of cache blocks loaded from RocksDB
However, in data lake scenarios with many small files (<1MB), the estimate becomes an underestimate, so estimated_file_count < db_block_count and difference_ratio goes negative.
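For illustration, here is a minimal standalone sketch (with made-up file sizes and counts, not taken from the actual BE code) showing how the current formula goes negative:

#include <cstdint>
#include <iostream>

int main() {
    // Hypothetical data lake scenario: 1000 cache blocks of ~64 KB each.
    // The directory holds ~64 MB, so the 1 MB-per-file heuristic estimates
    // only 64 files, while RocksDB actually recorded 1000 block entries.
    uint64_t estimated_file_count = (1000ULL * 64 * 1024) / (1024 * 1024); // 64
    uint64_t db_block_count = 1000;

    double difference_ratio =
            (static_cast<double>(estimated_file_count) - static_cast<double>(db_block_count)) /
            static_cast<double>(estimated_file_count);
    std::cout << difference_ratio << std::endl; // prints -14.625
}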
Impact:
- Inaccurate metric: negative ratios do not reflect the actual magnitude of the discrepancy
- Wrong decisions: the check may incorrectly skip the filesystem reload when difference_ratio is negative but still below the threshold
What You Expected?
Suggested fix:
Use absolute value to measure the discrepancy magnitude:
double difference_ratio =
        std::abs(static_cast<double>(estimated_file_count) - static_cast<double>(db_block_count)) /
        static_cast<double>(estimated_file_count);
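For reference, a standalone sketch of the suggested computation (the helper name and the guard against a zero estimate are additions for illustration only, not part of the existing function):

#include <cmath>
#include <cstdint>

// Hypothetical helper: measure only the magnitude of the mismatch between
// the filesystem-based estimate and the RocksDB block count.
double compute_difference_ratio(uint64_t estimated_file_count, uint64_t db_block_count) {
    if (estimated_file_count == 0) {
        // No files on disk; any RocksDB entries mean the metadata is fully stale.
        return db_block_count == 0 ? 0.0 : 1.0;
    }
    return std::abs(static_cast<double>(estimated_file_count) -
                    static_cast<double>(db_block_count)) /
           static_cast<double>(estimated_file_count);
}

With the small-file example above (64 vs. 1000), this returns 14.625 instead of -14.625, so the discrepancy is reported as a genuine, large mismatch rather than a negative value that slips under the threshold.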
How to Reproduce?
No response
Anything Else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct