HBASE-25869 WAL value compression #3244
Conversation
Clean up the commit message.
I broke trunk on an unrelated issue last night. Rebased. Removed some precommit noise that resulted.
Force-pushed from ca44233 to b8ff445.
So we will only compress the value? As we do batching when writing WAL entries out, is it possible to compress when flushing? The data will be larger and compression may perform better. The structure of a WAL file would then be multiple compressed blocks. I think for AsyncFSWAL this is easy, as we buffer all the entries in memory and flush all of the in-memory data out when flush is called. For FSHLog it will be a bit harder, as we rely on the DFSOutputStream to flush if the data is too big before we actually call hflush.
This is an enhancement to existing WAL compression. As you know, existing WAL compression already compresses the other components of WAL entries, but not the value. This patch adds support for compressing values too.
This is not possible for two reasons:
Way back in the distant past our WAL format was based on Hadoop's SequenceFile, which supported both record-by-record and block-based compression, where the blocks would contain multiple records. I don't remember why we moved away from it, but I imagine it was because if there are corruptions of the WAL, a record-by-record codec is able to skip over the corrupt record and we lose only that record (or as many records as are actually corrupt), whereas with a block format we would lose the whole block and all of the edits contained within it, especially if compression or encryption is enabled.
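To make the record-by-record approach concrete, here is a minimal illustrative sketch of per-record value compression and decompression using java.util.zip. This is not the patch's actual WALCellCodec code (which plugs into HBase's Compression.Algorithm support); the class and method names here are hypothetical.

import java.io.ByteArrayOutputStream;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class PerRecordValueCompression {

  // Compress one cell value independently of its neighbors, so a corrupt
  // record can be skipped without losing a whole block of edits.
  static byte[] compressValue(byte[] value) {
    Deflater deflater = new Deflater(Deflater.BEST_SPEED);
    deflater.setInput(value);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream(value.length);
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  // Decompress a single record's value; in a real codec the uncompressed
  // length would be read from the record header.
  static byte[] decompressValue(byte[] compressed, int uncompressedLength) throws Exception {
    Inflater inflater = new Inflater();
    inflater.setInput(compressed);
    byte[] value = new byte[uncompressedLength];
    int off = 0;
    while (off < uncompressedLength && !inflater.finished()) {
      off += inflater.inflate(value, off, uncompressedLength - off);
    }
    inflater.end();
    return value;
  }
}

Because each value is compressed independently, a reader that hits a corrupt record can skip it and continue, which is the property discussed above.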
@bharathv I wrote a simple bounded delegating input stream impl to avoid the unnecessary copy at decompression time. Rebased on master.
I am redoing the microbenchmarks with the latest patch and will update here soon. The improvements have unlocked an IO performance gain from the compression.
Before, SNAPPY+copy had a 10x loss; now SNAPPY+no-copy has a 10x gain. I double-checked these findings and they seem correct. I will re-run the benchmark when measuring the other codec types for comparison. Microbenchmarks are collected with this change.
SNAPPY or ZSTD are recommended; all other options are provided for comparison. (LZMA is included as a sanity check that an expensive algorithm really is expensive.) When using SNAPPY or ZSTD we derive a performance benefit due to reduced IO for the large values in the test case. Microbenchmarks are collected with this change.
Holy guacamole! Is this because of the reduced disk IO with compressed values? I'm glad we regained all the lost performance by eliding the copy. Edit: Just saw your last comment: "When using SNAPPY or ZSTD we derive a performance benefit due to reduced IO for the large values in the test case." Sweeeeet!
+1 with one pending suggestion and green QA.
 * is finished.
 */
@InterfaceAudience.Private
public class BoundedDelegatingInputStream extends DelegatingInputStream {
Can we merge this and DelegatingInputStream? I don't think we need both of them separately.
public class BoundedDelegatingInputStream extends FilterInputStream {
  // ... bounding methods ...
  setDelegate() {}
}
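For illustration, a minimal sketch of the merged class suggested above, assuming a FilterInputStream base with a resettable delegate and a read bound; this is not the committed implementation.

import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;

public class BoundedDelegatingInputStream extends FilterInputStream {

  private long limit;   // bytes that may be read from the current delegate
  private long pos;     // bytes read so far

  public BoundedDelegatingInputStream(InputStream in, long limit) {
    super(in);
    this.limit = limit;
  }

  // Reuse the same wrapper across records instead of allocating a new one.
  public void setDelegate(InputStream in, long limit) {
    this.in = in;
    this.limit = limit;
    this.pos = 0;
  }

  @Override
  public int read() throws IOException {
    if (pos >= limit) {
      return -1;
    }
    int result = in.read();
    if (result >= 0) {
      pos++;
    }
    return result;
  }

  @Override
  public int read(byte[] b, int off, int len) throws IOException {
    if (pos >= limit) {
      return -1;
    }
    int bounded = (int) Math.min(len, limit - pos);
    int count = in.read(b, off, bounded);
    if (count > 0) {
      pos += count;
    }
    return count;
  }

  @Override
  public int available() throws IOException {
    return (int) Math.min(in.available(), limit - pos);
  }
}

Resetting one wrapper per reader with setDelegate avoids allocating a new stream, or copying the compressed bytes into a temporary buffer, for every value.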
I thought we could keep them both.
💔 -1 overall
This message was automatically generated.
The unit test failures are not related. It looks like master has recently been unstable, related to RS groups.
That failure looks suspicious. Let me grab the logs before you merge/close the PR.
I have what's available.
How can it be related? That test neither enables WAL compression nor WAL value compression.
Test log is full of:
Passes on my dev branch:
Pushed a fix for whitespace and javadoc issues introduced in the last change. No additional changes are anticipated from this point.
Pardon, Andrew. I did not mean for "suspicious" to imply "caused by this changeset," merely that I was interested in looking at it further. All the PR build artifacts are purged from Jenkins after the PR is closed.
Thanks. Unless there is an objection, I will merge this tonight or tomorrow morning, to master and then to branch-2 (for the future 2.5.0).
WAL storage can be expensive, especially if the cell values represented in the edits are large, consisting of blobs or significant lengths of text. Such WALs might need to be kept around for a fairly long time to satisfy replication constraints on a space-limited (or space-contended) filesystem.
We have a custom dictionary compression scheme for cell metadata that is engaged when WAL compression is enabled in site configuration. This is fine for that application, where we can expect the universe of values and their lengths in the custom dictionaries to be constrained. For arbitrary cell values it is better to use one of the available compression codecs, which are suitable for arbitrary albeit compressible data.
Signed-off-by: Bharath Vissapragada <bharathv@apache.org>
Signed-off-by: Duo Zhang <zhangduo@apache.org>
Signed-off-by: Nick Dimiduk <ndimiduk@apache.org>
Microbenchmark Results
Site configuration used:
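The configuration block itself did not survive extraction. As a hedged reconstruction only, the relevant hbase-site.xml properties would look roughly like the following; the value-compression property names and values are assumptions to be checked against the final patch and reference guide, with the compression type varied per run (snappy, zstd, deflate, lzma, and so on).

hbase.regionserver.wal.enablecompression = true
hbase.regionserver.wal.value.enablecompression = true
hbase.regionserver.wal.value.compression.type = snappy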
Loader: IntegrationTestLoadCommonCrawl
Input: s3n://commoncrawl/crawl-data/CC-MAIN-2021-10/segments/1614178347293.1/warc/CC-MAIN-20210224165708-20210224195708-00000.warc.gz
SNAPPY or ZSTD at level 1 are recommended; all other options are provided for comparison.
Microbenchmarks are collected with this change.
Statistics are collected over the lifetime of the regionserver and are dumped at the end of the test, at shutdown. Statistics are updated under synchronization, but this is done in a way that excludes that overhead from measurement. The normal patch does not contain either the instrumentation or the synchronization point. Nanoseconds are converted to milliseconds for the table.
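A rough sketch of that measurement pattern (hypothetical names; the instrumentation was a temporary change and is not part of the final patch): the timed region ends before the synchronized update, so lock overhead is excluded from each sample, and nanoseconds are converted to milliseconds only when reporting.

public class ValueCompressionStats {
  private long count;
  private double meanNanos;
  private double m2;   // running sum of squared deviations (Welford's method)

  // Fold one sample into the shared statistics under synchronization.
  public void record(long elapsedNanos) {
    synchronized (this) {
      count++;
      double delta = elapsedNanos - meanNanos;
      meanNanos += delta / count;
      m2 += delta * (elapsedNanos - meanNanos);
    }
  }

  // Convert nanoseconds to milliseconds only at report time.
  public synchronized String report() {
    double stddevNanos = count > 1 ? Math.sqrt(m2 / (count - 1)) : 0.0;
    return String.format("mean=%.3f ms stddev=%.3f ms over %d samples",
        meanNanos / 1_000_000.0, stddevNanos / 1_000_000.0, count);
  }

  // At the call site, the timed region excludes the synchronized update.
  public static void example(ValueCompressionStats stats) {
    long start = System.nanoTime();
    // ... compress and write the value ...
    long elapsed = System.nanoTime() - start;
    stats.record(elapsed);   // synchronization happens after timing stops
  }
}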
Compression enabled, value compression enabled, v1 patch, Deflate (best speed): 1,209,947,515 (76.4%); 12.694 ms (stddev 8.48)