HBASE-28004 Persistent cache map can get corrupt if crash happens midway through the write #5341
Conversation
} catch (IOException e1) {
  LOG.debug("Check for key {} failed. Removing it from map.", keyEntry.getKey());
  backingMap.remove(keyEntry.getKey());
  prefetchCompleted.remove(keyEntry.getKey().getHfileName());
Does removing the whole file when one entry is missing make sense? Can we do some % check here instead, so that we only remove it if we are missing a certain % of the blocks for the file?
By "prefetchCompleted", we mean files that had all blocks cached. If we fail to validate a single block for a given file, we can't guarantee the given file is fully cached anymore, so the PrefetchExecutor should run for this file again. I don't think this is a big deal, though, as the PrefetchExecutor itself always tries to read the block first from the cache, and if it misses it, then it goes to the file system.
In the current implementation, performance will be significantly compromised if we rerun the prefetch even when 99% of the blocks are cached, because during the prefetch, if a block is present in the cache, we retrieve the entire block from the bucket cache, uncompress it, and then promptly discard it.
Options are:
- Fix the prefetch to avoid loading the block from the file cache if we find it in the backingMap.
- Or, an HFile could be dropped from the prefetch-completed list on the basis of a percentage: if the percentage of its blocks still cached falls below a certain threshold (x%), the prefetch process would be triggered again for it. Users would also have the flexibility to set this threshold to 100% if they wish to rerun the prefetch when even a single block is absent (a rough sketch of such a check follows below).
I would prefer option 1 if possible.
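For reference, a rough sketch of what the option 2 threshold check could look like; the configuration key and the per-file counters are hypothetical and are not part of this PR:
// Hypothetical sketch of option 2 (none of these names exist in the PR): re-run the
// prefetch for a file only when the fraction of its blocks still cached falls below
// a configurable threshold.
double threshold = conf.getDouble("hbase.bucketcache.prefetch.recache.threshold", 1.0); // hypothetical key
long totalBlocks = totalBlocksPerFile.getOrDefault(hfileName, 0L);    // hypothetical bookkeeping
long cachedBlocks = cachedBlocksPerFile.getOrDefault(hfileName, 0L);  // hypothetical bookkeeping
if (totalBlocks > 0 && (double) cachedBlocks / totalBlocks < threshold) {
  // Too many blocks missing: forget that this file was fully prefetched so the
  // PrefetchExecutor runs for it again on the next open.
  prefetchCompleted.remove(hfileName);
}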
I believe option #1 is feasible. I'm testing a solution based on that approach and will update this PR afterwards.
int totalKeysOriginally = backingMap.size();
for (Map.Entry<BlockCacheKey, BucketEntry> keyEntry : backingMap.entrySet()) {
  try {
    ((FileIOEngine) ioEngine).checkCacheTime(keyEntry.getValue());
It looks like we are going to check every entry; if we have a 30GB backing map for a 2TB bucket cache, that could take a long time. Did we test this with a 2TB bucket cache to determine how long it takes to get the RS live?
If it is not performant, we should make this optional and verify the blocks when they are read, or during the eviction scan.
I have thought about it, as performance is not optimal (in my tests, a 190GB cache with 2M blocks took around 105s to complete validation). The problem is that the BucketAllocator needs to configure its indexes of buckets and offsets for the cache file according to the offset and length of each block in the backing map. If we don't do this validation, the backing map might be inconsistent, with more than one block at the same offset; the bucket allocator initialisation would then fail and we would not be able to access the blocks in the cache.
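To make that inconsistency concrete, here is a simplified illustration (not the actual BucketAllocator rebuild code) of how two backing map entries claiming the same cache-file offset could be detected; the surrounding method and error handling are assumed:
// Simplified illustration (not the actual BucketAllocator code): while rebuilding
// allocator state from the persisted backing map, two entries claiming the same
// offset in the cache file would mean the map is inconsistent with the cache.
Set<Long> seenOffsets = new HashSet<>();
for (Map.Entry<BlockCacheKey, BucketEntry> e : backingMap.entrySet()) {
  long offset = e.getValue().offset();   // assumed accessor for the entry's offset in the cache file
  if (!seenOffsets.add(offset)) {
    throw new IOException("Duplicate cache offset " + offset + " for block " + e.getKey()
      + "; persisted backing map is inconsistent with the cache file");
  }
}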
This will regress with size. Can we list the scenarios in which we can go inconsistent, like a system error or something similar?
On RS-induced aborts we sync the backing map when shutting down, so there is no need to validate the blocks when restarting and reloading the cache. I've done a few tests where I induced an RS abort after HDFS/ZK errors during a write load that updates the cache, and on restart this validation wasn't needed (the backing map was consistent with the cache state).
So I think the validation would happen only in cases where the RS crashes abruptly, from a JVM fatal error or an OS sigkill situation. For the record, I've tested crashing an RS with a 1.6TB cache, and the validation then took around 35 minutes to complete. This is not great, since the RS is useless during this time. One thought here is to allow bucket cache initialisation in the background, so that reads would go to the FS in the meantime, but I would rather work on that in a separate jira/PR.
@ankitsinghal, these changes add an extra check that searches for blocks in the bucket cache backing map first while prefetching; if we do find a block there, we skip both the cache and the file system reads altogether and just increase the offset to check for the next block.
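As a minimal sketch of that skip logic, assuming the prefetch loop has access to the bucket cache's backingMap; the helper names are hypothetical and the actual change in this PR may differ:
// Sketch only: skip blocks that the backing map already tracks, instead of reading
// them back from the cache or the file system during prefetch.
BlockCacheKey key = new BlockCacheKey(hfileName, offset);
if (backingMap.containsKey(key)) {
  // Already cached: do not touch the cache file or the FS, just move to the next block.
  offset += cachedBlockSize(key);                     // hypothetical helper returning the block's on-disk size
} else {
  HFileBlock block = readBlockFromFileSystem(offset); // hypothetical helper: block is not cached yet
  cache.cacheBlock(key, block);
  offset += block.getOnDiskSizeWithHeader();
}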
cacheConf.getBlockCache().ifPresent(bc -> {
  if (bc instanceof BucketCache) {
    ((BucketCache) bc).fileCacheCompleted(path.getName());
  }
IIUC, the persistent cache is only for the L2 cache. I think we need to get the instance of the L2 cache here when the blockcache is an instance of CombinedBlockCache.
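One possible shape for that check, assuming CombinedBlockCache exposes its sub-caches via getBlockCaches() with the L2 cache at index 1; this is a sketch only, not necessarily how the PR resolves the comment:
// Sketch: unwrap CombinedBlockCache so the completion callback reaches the L2 (bucket) cache.
cacheConf.getBlockCache().ifPresent(bc -> {
  BlockCache l2 = bc;
  if (bc instanceof CombinedBlockCache) {
    // Assumed layout: getBlockCaches() returns [L1, L2].
    l2 = ((CombinedBlockCache) bc).getBlockCaches()[1];
  }
  if (l2 instanceof BucketCache) {
    ((BucketCache) l2).fileCacheCompleted(path.getName());
  }
});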
LGTM
…way through the write (#5341) Signed-off-by: Ankit Singhal <ankit@apache.org> Reviewed-by: Rahul Agarkar <rahul.agarkar@gmail.com>
HBASE-27686 added a background thread for periodically saving the cache index map, together with a list of completed cached files, so that we can recover the cache state in case of a crash or restart. The problem is that the cache index can become a few GB large (a sample case with 1.6TB of used bucket cache maps to between 8GB and 10GB of index data), and these writes take a few seconds to complete, so any RS crash is very likely to leave a corrupt index file that can't be recovered when the RS starts again. Worse, since we store the list of cached files in a separate file, this also leads to cache inconsistencies: files remain on the list of cached files even though we have no cache index for them, so after a restart they are never actually cached again and every read for them ends up going to the FS.
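For context, a common pattern to make such a persistence write crash-safe is to write the serialized index to a temporary file and atomically rename it over the live one, so a reader never observes a partially written file. The sketch below (using java.nio, with persistencePath and serializedIndex as assumed placeholders) illustrates that general pattern and is not necessarily the exact approach this PR takes:
// General java.nio pattern: write the serialized cache index to a temp file, force it to
// disk, then atomically rename it over the live file so a crash mid-write can never leave
// a half-written index behind.
Path liveFile = Paths.get(persistencePath);                // persistencePath: assumed index file location
Path tmpFile = Paths.get(persistencePath + ".tmp");
try (FileChannel ch = FileChannel.open(tmpFile, StandardOpenOption.CREATE,
    StandardOpenOption.WRITE, StandardOpenOption.TRUNCATE_EXISTING)) {
  ch.write(ByteBuffer.wrap(serializedIndex));              // serializedIndex: assumed bytes of the cache map
  ch.force(true);                                          // flush to stable storage before the swap
}
// On POSIX file systems ATOMIC_MOVE atomically replaces an existing target.
Files.move(tmpFile, liveFile, StandardCopyOption.ATOMIC_MOVE);
The rename is a metadata-only operation, so even a crash between the write and the rename leaves the previous, fully valid index file in place.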
This task aims to refactor the cache persistence as follows: