Epic: pageserver image layer compression #5431
Labels: c/storage/pageserver (Component: storage: pageserver), t/feature (Issue type: feature, for new features or requests)

Comments
jcsp added the t/feature and c/storage/pageserver labels on Oct 2, 2023
jcsp changed the title from "Epic: pageserver compression" to "Epic: pageserver image layer compression" on Apr 15, 2024
arpad-m added a commit that referenced this issue on May 23, 2024:
We'd like to reserve some bits in the length field of image layers for future usage (compression). This PR is based on the assumption that we don't have any blobs that require more than 28 bits (3 bytes + 4 bits) to store the length. As a preparation, before turning this into an error, we first emit warnings when the assumption is violated, since warnings are less disruptive than errors. A metric would be even less disruptive (log messages are slower; if we have a LOT of such large blobs, it would take a lot of time to print them). At the same time, such 256 MiB blobs will likely occupy an entire layer file, as they are larger than our target size. For layer files we already log something, so there shouldn't be a large increase in overhead. Part of #5431
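For illustration, a minimal sketch of the kind of check described above, assuming a 28-bit length limit; the constant and function names are hypothetical, not the pageserver's actual code:

```
/// Hypothetical sketch: the largest blob length that fits into 28 bits,
/// leaving the top bits of a 4-byte length header free for future use
/// (e.g. a compression marker).
const MAX_SUPPORTED_BLOB_LEN: usize = (1 << 28) - 1; // just under 256 MiB

/// Warn (rather than error) when a blob exceeds the reserved length range,
/// since a warning is less disruptive while the assumption is being validated.
fn check_blob_len(len: usize) {
    if len > MAX_SUPPORTED_BLOB_LEN {
        eprintln!(
            "blob of size {len} exceeds the 28-bit length limit reserved for future compression bits"
        );
    }
}
```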
arpad-m added a commit that referenced this issue on Jul 2, 2024:
Add support for reading and writing zstd-compressed blobs for use in image layer generation, possibly also useful one day for delta layers. Reading them is unconditional, while writing is controlled by the `image_compression` config variable, allowing for experiments. For the on-disk format, we re-use some of the bitpatterns we currently keep reserved for blobs larger than 256 MiB. This assumes that we have never written any such large blobs to image layers. After the preparation in #7852, we are now unable to read or write blobs larger than 256 MiB. A non-goal of this PR is to come up with good heuristics for when to compress a blob; this is left for future work. Parts of the PR were inspired by #7091. cc #7879. Part of #5431
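A write-side sketch of the idea, using the `zstd` crate; the `COMPRESSED_BIT` constant and `encode_image` function are assumptions for illustration, and the real on-disk header layout may assign the reserved bits differently:

```
use std::io;

/// Hypothetical marker bit in the 4-byte length header; the real on-disk
/// bit assignment may differ.
const COMPRESSED_BIT: u32 = 1 << 28;

/// Encode an image blob for an image layer: when the `image_compression`
/// setting is enabled, compress with zstd and set a header bit that was
/// previously reserved for blobs larger than 256 MiB.
fn encode_image(image: &[u8], image_compression: bool) -> io::Result<(u32, Vec<u8>)> {
    if image_compression {
        let compressed = zstd::encode_all(image, 0)?; // 0 = zstd's default level
        Ok((compressed.len() as u32 | COMPRESSED_BIT, compressed))
    } else {
        Ok((image.len() as u32, image.to_vec()))
    }
}
```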
arpad-m added a commit that referenced this issue on Jul 3, 2024:
…8238) PR #8106 was created with the assumption that no blob is larger than 256 MiB. Due to #7852 we check for *writes* of blobs larger than that limit, but we didn't check for *reads* of such large blobs: in theory, we could be reading such blobs every day and simply never happen to write them. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we have enabled it. Part of #5431
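A complementary read-side sketch under the same hypothetical header layout as above; whether the reserved bits are interpreted as compression depends on the config, as described in the commit message (function and constant names are again illustrative only):

```
use std::io;

/// Hypothetical marker bit, mirroring the write-side sketch above.
const COMPRESSED_BIT: u32 = 1 << 28;

/// Decode a blob read from an image layer. The compressed-blob bit is only
/// honoured when compression is enabled in the config; otherwise a blob with
/// the reserved bits set triggers a warning instead of being misread silently.
fn decode_blob(header: u32, payload: &[u8], compression_enabled: bool) -> io::Result<Vec<u8>> {
    if (header & COMPRESSED_BIT) != 0 {
        if compression_enabled {
            return zstd::decode_all(payload);
        }
        eprintln!("read of blob with reserved length bits set while compression is disabled");
    }
    Ok(payload.to_vec())
}
```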
arpad-m added a commit that referenced this issue on Jul 4, 2024:
As per @koivunej's request in #8238 (comment), use a runtime param instead of monomorphizing the function based on the value. Part of #5431
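The shape of that change, sketched with hypothetical names rather than the actual pageserver API: the compression setting becomes an ordinary runtime argument instead of a const generic that monomorphizes the writer.

```
#[derive(Clone, Copy)]
enum ImageCompression {
    Disabled,
    Zstd { level: i8 },
}

// Before (sketch): one monomorphized copy of the function per value.
// fn write_blob<const COMPRESSED: bool>(buf: &mut Vec<u8>, blob: &[u8]) { ... }

/// After (sketch): a single function taking the setting as a runtime parameter.
fn write_blob(buf: &mut Vec<u8>, blob: &[u8], compression: ImageCompression) {
    match compression {
        ImageCompression::Disabled => buf.extend_from_slice(blob),
        ImageCompression::Zstd { level } => {
            let compressed = zstd::encode_all(blob, level as i32).expect("zstd encoding failed");
            buf.extend_from_slice(&compressed);
        }
    }
}
```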
arpad-m added a commit that referenced this issue on Jul 4, 2024:
Adds a find-large-objects subcommand to the scrubber to allow listing layer objects larger than a specific size. To be used like:

```
AWS_PROFILE=dev REGION=us-east-2 BUCKET=neon-dev-storage-us-east-2 cargo run -p storage_scrubber -- find-large-objects --min-size 250000000 --ignore-deltas
```

Part of #5431
arpad-m added a commit that referenced this issue on Jul 4, 2024:
This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making handling of the setting easier. It also adds a specific setting for *disabled* compression with the continued ability to read compressed data, giving us the option to more easily back out of a compression rollout should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment), inspired by Christian's review in #8238 (review). Part of #5431
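A sketch of what the flattened setting might look like; the exact variants and field types of the real `ImageCompressionAlgorithm` may differ:

```
/// Sketch of the flattened config setting: no `Option<_>` wrapper, with an
/// explicit variant for disabled compression.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ImageCompressionAlgorithm {
    /// Do not write compressed blobs, but keep reading compressed data,
    /// so an ongoing compression rollout can be backed out.
    Disabled,
    /// Compress new image blobs with zstd at an optional level.
    Zstd { level: Option<i8> },
}
```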
arpad-m added a commit that referenced this issue on Jul 5, 2024:
Improve parsing of the `ImageCompressionAlgorithm` enum to allow level customization like `zstd(1)`, as strum only uses `Default::default()`, i.e. `None`, as the level. Part of #5431
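A hand-rolled parser sketch that accepts both `zstd` and `zstd(<level>)`; this is illustrative only and not the strum-based production code:

```
use std::str::FromStr;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ImageCompressionAlgorithm {
    Disabled,
    Zstd { level: Option<i8> },
}

impl FromStr for ImageCompressionAlgorithm {
    type Err = String;

    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s {
            "disabled" => Ok(Self::Disabled),
            "zstd" => Ok(Self::Zstd { level: None }),
            _ => {
                // Accept e.g. "zstd(1)" with an explicit level.
                let level = s
                    .strip_prefix("zstd(")
                    .and_then(|rest| rest.strip_suffix(')'))
                    .and_then(|lvl| lvl.parse::<i8>().ok())
                    .ok_or_else(|| format!("invalid compression algorithm: {s}"))?;
                Ok(Self::Zstd { level: Some(level) })
            }
        }
    }
}

// Example: "zstd(1)".parse::<ImageCompressionAlgorithm>() yields Zstd { level: Some(1) }.
```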
arpad-m added a commit that referenced this issue on Jul 5, 2024:
The find-large-objects scrubber subcommand is quite fast if you run it in an environment with low latency to the S3 bucket (say, an EC2 instance in the same region). However, the higher the latency gets, the slower the command becomes. Therefore, add a concurrency param and parallelize the command. This doesn't change that general relationship, but at least lets us do multiple requests in parallel and therefore hopefully finish faster.

Running with a concurrency of 64 (the default):

```
2024-07-05T17:30:22.882959Z INFO lazy_load_identity [...]
[...]
2024-07-05T17:30:28.289853Z INFO Scanned 500 shards. [...]
```

With a concurrency of 1, simulating the state before this PR:

```
2024-07-05T17:31:43.375153Z INFO lazy_load_identity [...]
[...]
2024-07-05T17:33:51.987092Z INFO Scanned 500 shards. [...]
```

In other words, the time to list 500 shards drops from 2:08 minutes to 6 seconds.

Follow-up of #8257, part of #5431
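A sketch of the parallelization pattern using `buffer_unordered` from the `futures` crate; `scan_shard` is a stand-in for the scrubber's real per-shard listing, not its actual API:

```
use futures::stream::{self, StreamExt};

/// Stand-in for the per-shard S3 listing the scrubber performs.
async fn scan_shard(shard: u32) -> usize {
    // ... issue listing requests, count large objects ...
    shard as usize
}

/// Scan all shards with up to `concurrency` requests in flight at once.
async fn scan_shards(shards: Vec<u32>, concurrency: usize) -> usize {
    stream::iter(shards)
        .map(scan_shard)
        .buffer_unordered(concurrency)
        .fold(0, |acc, n| async move { acc + n })
        .await
}
```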
VladLazar pushed commits that referenced this issue on Jul 8, 2024
arpad-m added a commit that referenced this issue on Jul 10, 2024:
Removes the `ImageCompressionAlgorithm::DisabledNoDecompress` variant. We now assume any blob with the specific bits set is actually a compressed blob. The `ImageCompressionAlgorithm::Disabled` variant still remains and is the new default. Reverts large parts of #8238, as originally intended in that PR. Part of #5431
arpad-m added a commit that referenced this issue on Jul 11, 2024
arpad-m added a commit that referenced this issue on Jul 12, 2024
skyzh pushed commits that referenced this issue on Jul 15, 2024
arpad-m added a commit that referenced this issue on Jul 18, 2024
problame pushed a commit that referenced this issue on Jul 22, 2024
arpad-m added a commit that referenced this issue on Jul 30, 2024:
If compression is enabled, we currently try compressing each image larger than a specific size; if the compressed version is smaller, we write that one, otherwise we use the uncompressed image. However, this can be a wasteful process if a substantial number of images don't compress well. The compression metrics added in #8420, `pageserver_compression_image_in_bytes_total` and `pageserver_compression_image_out_bytes_total`, are well suited to answering how space-efficient the overall compression process is end-to-end, which helps one decide whether to enable it or not. To answer the question of how much waste there is in terms of trial compression, i.e. CPU time, we add two metrics:

* one about the images that have been trial-compressed (considered), and
* one about the images where the compressed image has actually been written (chosen).

There are different ways of weighting them; for example, one could look at the count or at the compressed data. But the main contributor to compression CPU usage is the amount of data processed, so we weight the images by their *uncompressed* size. In other words, the two metrics are:

* `pageserver_compression_image_in_bytes_considered`
* `pageserver_compression_image_in_bytes_chosen`

Part of #5431
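A sketch of the trial-compression bookkeeping these metrics describe, using plain atomic counters as stand-ins for the pageserver's metrics registry:

```
use std::io;
use std::sync::atomic::{AtomicU64, Ordering};

/// Stand-ins for pageserver_compression_image_in_bytes_considered / _chosen;
/// both are weighted by the *uncompressed* image size, since that is what
/// drives the CPU cost of trial compression.
static CONSIDERED_BYTES: AtomicU64 = AtomicU64::new(0);
static CHOSEN_BYTES: AtomicU64 = AtomicU64::new(0);

/// Trial-compress an image and keep whichever representation is smaller.
fn maybe_compress(image: &[u8]) -> io::Result<Vec<u8>> {
    CONSIDERED_BYTES.fetch_add(image.len() as u64, Ordering::Relaxed);
    let compressed = zstd::encode_all(image, 0)?;
    if compressed.len() < image.len() {
        CHOSEN_BYTES.fetch_add(image.len() as u64, Ordering::Relaxed);
        Ok(compressed)
    } else {
        Ok(image.to_vec())
    }
}
```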
arpad-m added a commit that referenced this issue on Aug 5, 2024
From @Bodobolero's benchmarks: add lz4 support for comparison.
We talked about this in the call and agreed that, until further investigation identifies compression as the culprit, we will not spend developer time on this.
I think this can be closed now.
Background
We may substantially decrease the capacity & bandwidth footprint of tenants by compressing data in their image layers.
There are many possible implementations, from compressing whole layer files as streams, to introducing some chunked format and decompressing a chunk at a time, to simply compressing individual pages.
Compressing individual pages in image layers is by far the simplest thing to do, and should have a high payoff as:
Compressing deltas is a harder problem (individual deltas are likely too small to usefully compress), and is left as a possible future change.
Implementation
There is a preliminary version here: #7091, which demonstrates that per-page compression in image layers may be added as a relatively lightweight code change.
To get this ready for production, there is more work to do:
PRs/issues
ImageCompressionAlgorithm
#8281
Rollout