Only support compressed reads if the compression setting is present #8238
Conversation
I'd like most of the changes of this PR to be temporary: that is, eventually the compression setting should disappear and we assume any blob larger than 256 MiB must be a compressed blob. That change can happen months in the future, however.
3000 tests run: 2885 passed, 0 failed, 115 skipped (full report)
Code coverage* (full report)
* collected from Rust tests only
The comment gets automatically updated with the latest test results.
4220e2f at 2024-07-02T17:57:34.054Z :recycle:
if compression_bits > BYTE_UNCOMPRESSED {
    warn!("reading key above future limit ({len} bytes)");
}
Doesn't this mean that we are reading a previously compressed blob as uncompressed?
Shouldn't we also or alternatively check if the read bytes start with the zstd fourcc/magic?
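For reference, the zstd frame magic is the little-endian constant 0xFD2FB528, so such a check could be sketched roughly like this (a hypothetical helper for illustration, not something this PR adds):

const ZSTD_MAGIC: [u8; 4] = [0x28, 0xB5, 0x2F, 0xFD];

// Hypothetical helper: does the payload start with a zstd frame header?
fn starts_with_zstd_magic(buf: &[u8]) -> bool {
    buf.starts_with(&ZSTD_MAGIC)
}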
Anyway, I am having a hard time parsing this compression_bits. I get that here it can be the u32 after anding it with 0x0f or 0x7f, which means -- aha, it is never masked.
Related note: on line L332R341 the LEN_COMPRESSION_BIT_MASK is used as the literal 0xf0:
-assert_eq!(len_buf[0] & 0xf0, 0);
+assert_eq!(len_buf[0] & LEN_COMPRESSION_BIT_MASK, 0);
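For illustration, extracting the compression bits with that mask, instead of comparing the raw byte, would look roughly like this sketch (the constant value is the 0xf0 shown above):

const LEN_COMPRESSION_BIT_MASK: u8 = 0xf0;

// Sketch: keep only the compression nibble of the first length byte.
fn compression_bits(first_len_byte: u8) -> u8 {
    first_len_byte & LEN_COMPRESSION_BIT_MASK
}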
Ok... So perhaps I now understand. Possible compression_bits are:
match compression_bits >> 4 {
0 => /* image layer written before the compression support or small value? */,
1..8 => /* reserved */,
8 => /* uncompressed blob */,
9 => /* zstd */,
10..=15 => /* undefined or written before compression support and too large, which we warn here? */
_ => unreachable!("u4"),
}
If this is correctly understood, then okay, maybe... The compression_bits > BYTE_UNCOMPRESSED just looks so off; in my mind a bitfield doesn't support ordered comparison. It'd be nice to have enums and matches for these. Err, nope, that cannot be correct.
Did you test that this warning is produced with some hand-crafted image file?
EDIT: rust snippet had 1..8 and 10..=15 wrong way around, possibly.
Doesn't this mean that we are reading a previously compressed blob as uncompressed?
If the compression setting is disabled, yes. This has the consequence that we can't turn off compression easily any more, but I think it's okay to have it for a while, after which point we'll (mostly) revert this PR.
It'd be nice to have enums and matches for these
match is not good for that, as FOOBAR => is equivalent to a variable capture.
Is my match correct? Did you test if this is hit? Will that be used as a success criteria for the compression support? If so, what is the plan to read all image layers?
Other questions remain, magic/fourcc instead of reserving more bits?
Is my match correct?
Almost correct; the 1..8 range doesn't have the highest bit set, so it's an indicator for small uncompressed values.
match compression_bits >> 4 {
0..8 => /* small, uncompressed value below 128 bytes */,
8 => /* uncompressed blob */,
9 => /* zstd */,
10..=15 => /* reserved or written before compression support and too large, which we warn here */
_ => unreachable!("u4"),
}
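Put differently, a decoder for that top nibble could look roughly like this sketch (illustrative names only; BlobKind and classify are not the actual pageserver identifiers):

enum BlobKind {
    SmallUncompressed, // the header byte itself is the length (< 128 bytes)
    Uncompressed,      // 4-byte length header, uncompressed payload
    Zstd,              // 4-byte length header, zstd-compressed payload
    Reserved,          // reserved, or a pre-compression blob above the limit
}

fn classify(first_header_byte: u8) -> BlobKind {
    match first_header_byte >> 4 {
        0..=7 => BlobKind::SmallUncompressed,
        8 => BlobKind::Uncompressed,
        9 => BlobKind::Zstd,
        10..=15 => BlobKind::Reserved,
        _ => unreachable!("u4"),
    }
}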
Will that be used as a success criteria for the compression support?
What do you mean, can you expand?
If so, what is the plan to read all image layers?
??
magic/fourcc instead of reserving more bits?
What do you mean by that? I do not think we should autodetect zstd magic here.
Christian wants this to be merged as-is, doing that now.
success criteria
plan to read all image layers
Christian mentioned these in #8238 (comment). I was similarly wondering what the next steps are.
magic/fourcc
What do you mean by that? I do not think we should autodetect zstd magic here.
zstd always starts the compressed bytes with the same 4 magic bytes. I was wondering whether we should use that knowledge instead of awkwardly reserving bits, as I had no idea what the plan for the next step was. But yeah, it seems there is a plan after all.
For future reference, the slack thread where I proposed to merge asap is https://neondb.slack.com/archives/C074JQU7NER/p1720021276202079
Side question: why is PageServerConf::image_compression an Option? Like with EvictionPolicy, a simple ImageCompressionAlgorithm::Disabled could have been designed as the #[default].
To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config.
This also means that we can't back out of compression once we enabled it.
I'm confused by this.
IMO once we have written the first compressed blob to S3, we'll have to support decompressing layers with compressed blobs into perpetuity. The PageServerConf::image_compression config flag should only control whether we write new blobs with compression or not, not affect reading of existing blobs.
So, I guess I'm rejecting the fundamental idea behind this PR. Or I'm missing something.
That's what the option type is for, to indicate None. But of course it can also be inlined.
Yes, this PR is only meant for a temporary period until compression is rolled out everywhere; at that point, I'd like to revert most of it. Of course, for the transition period, and maybe also beyond, it might make sense to have the ability to easily revert the config setting. The risk is that we might be reading 256 MiB large blobs in production right now, and on deploy of current main we would yield corrupted data. This PR is just to avoid that situation. Maybe both of your suggestions could be combined, and one could change the enum to:

enum ImageCompressionAlgorithm {
DisabledNoDecompress,
Disabled,
Zstd { level: Option<i8> },
}
How does that sound?
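For illustration, reads and writes might branch on such a flattened setting roughly like this sketch (the helper names are made up, not the actual pageserver API):

// Write path: only the Zstd variant causes new blobs to be compressed.
fn should_compress_new_blobs(setting: &ImageCompressionAlgorithm) -> bool {
    matches!(setting, ImageCompressionAlgorithm::Zstd { .. })
}

// Read path: DisabledNoDecompress keeps the decompression code dead at runtime,
// while plain Disabled stops writing compressed blobs but can still read them.
fn may_decompress_on_read(setting: &ImageCompressionAlgorithm) -> bool {
    !matches!(setting, ImageCompressionAlgorithm::DisabledNoDecompress)
}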
Ok, thanks for clarifying. I think it has to go like so:
But let's take that discussion into Slack. Regarding this PR, I now understand its importance and we should definitely merge it before the code from #8106 hits prod, because that would mean doing (4) before invariant (3) is established.
Approving this; the decompress code needs to be dead at runtime until we've established invariant (3), see my previous post.
As per @koivunej's request in #8238 (comment), use a runtime param instead of monomorphizing the function based on the value. Part of #5431
This flattens the compression algorithm setting, removing the `Option<_>` wrapping layer and making handling of the setting easier. It also adds a specific setting for *disabled* compression with the continued ability to read compressed data, giving us the option to more easily back out of a compression rollout, should the need arise, which was one of the limitations of #8238. Implements my suggestion from #8238 (comment), inspired by Christian's review in #8238 (review). Part of #5431
…8238) PR #8106 was created with the assumption that no blob is larger than `256 MiB`. Due to #7852 we have checking for *writes* of blobs larger than that limit, but we didn't have checking for *reads* of such large blobs: in theory, we could be reading these blobs every day but we just don't happen to write the blobs for some reason. Therefore, we now add a warning for *reads* of such large blobs as well. To make deploying compression less dangerous, we therefore only assume a blob is compressed if the compression setting is present in the config. This also means that we can't back out of compression once we enabled it. Part of #5431
Removes the `ImageCompressionAlgorithm::DisabledNoDecompress` variant. We now assume any blob with the specific bits set is actually a compressed blob. The `ImageCompressionAlgorithm::Disabled` variant still remains and is the new default. Reverts large parts of #8238 , as originally intended in that PR. Part of #5431