
Combined/Grouped Records to Maximise Compression #13107

Open
Haravikk opened this issue Feb 15, 2022 · 12 comments
Labels: Type: Feature (Feature request or new feature)

Comments

@Haravikk

Haravikk commented Feb 15, 2022

Describe the feature you would like to see added to OpenZFS

I would like to see the ability to store "combined" records, whereby multiple smaller records are stored as if they were a single larger one, in order to make more effective use of filesystem compression in a similar manner to ZVOLs.

How will this feature improve OpenZFS?

It will allow datasets to achieve higher compression ratios when storing smaller records that compress less well on their own, but compress well when treated as a single large "block".

It may also allow for improved performance in raidz configurations, as grouping records can result in a wider grouped record that can be more efficiently split across disks. This improvement would be most visible when dealing with smaller records that currently result in sub-optimal writes to these devices.

Additional context

The inspiration for this request came after I migrated the contents of a plain ZFS dataset to a ZVOL while attempting to debug an unrelated issue. While my compressratio on the plain dataset was an okay 1.19, on the ZVOL the exact same content achieved a ratio of nearly 2.0 with all else being equal.

This makes sense, as ZVOLs effectively have a large minimum "record" size owing to their volblocksize (in my case 128k), whereas in an ordinary dataset records can be as small as the minimum physical block size determined by the ashift value (4k in my case). Most compression algorithms achieve only limited savings on small amounts of data, and tend to work a lot better the more compressible data you can give them, which is exactly what is happening here.
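To illustrate the effect in isolation (this is a toy stand-in, using zlib rather than ZFS's lz4/zstd/gzip code paths, and the exact numbers depend entirely on the data), the following program compares compressing thirty-two similar 4k records individually against compressing them as one 128k group:

```c
/*
 * Toy demo, not ZFS code: measures how much better similar small records
 * compress when grouped into one larger buffer. Build with: cc demo.c -lz
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

#define NRECS   32
#define RECSIZE (4 * 1024)      /* 32 x 4k records = one 128k "group" */

/* Fill a record with XML-like text that differs slightly per record. */
static void
fill_record(unsigned char *buf, int idx)
{
        size_t off = 0;

        while (off < RECSIZE) {
                char line[128];
                int n = snprintf(line, sizeof (line),
                    "<record id=\"%d-%zu\"><name>item</name>"
                    "<value>42</value></record>\n", idx, off);
                size_t copy = ((size_t)n > RECSIZE - off) ?
                    RECSIZE - off : (size_t)n;
                memcpy(buf + off, line, copy);
                off += copy;
        }
}

/* Return the zlib-compressed size of a buffer. */
static uLongf
compressed_size(const unsigned char *src, uLong len)
{
        uLongf dlen = compressBound(len);
        unsigned char *dst = malloc(dlen);

        if (dst == NULL || compress(dst, &dlen, src, len) != Z_OK)
                dlen = 0;
        free(dst);
        return (dlen);
}

int
main(void)
{
        unsigned char *all = malloc(NRECS * RECSIZE);
        uLongf separate = 0;

        if (all == NULL)
                return (1);
        for (int i = 0; i < NRECS; i++) {
                fill_record(all + (size_t)i * RECSIZE, i);
                separate += compressed_size(all + (size_t)i * RECSIZE, RECSIZE);
        }
        printf("32 x 4k compressed individually: %lu bytes\n",
            (unsigned long)separate);
        printf("1 x 128k compressed as a group:  %lu bytes\n",
            (unsigned long)compressed_size(all, NRECS * RECSIZE));
        free(all);
        return (0);
}
```

The point is simply that the same bytes, handed to the compressor as one piece, give it more shared context to exploit, which is what a ZVOL's fixed volblocksize effectively does today.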

I would propose that the feature work something like the following (a rough sketch of the pieces involved follows the list):

  1. With an appropriate setting enabled (e.g- grouprecordsize=128K), ZFS will delay final writes to disk so that smaller records can be grouped together into a larger record of the specified size (up to, but no larger than recordsize). The final combined record size may still be smaller than the target, as it is only a best effort using available written data.
  2. Another setting may determine whether this will apply to data, metadata or both (e.g- grouprecordtype=all|data|metadata). The default would be both, but this allows tuning for performance vs. size on metadata records.
  3. Once enough small records are buffered they will be written out as a single combined/group record as if they were one larger record, and compressed accordingly as a group (rather than individually).
  4. Metadata for the records will point to the location of the group record and an offset within it where the individual specific record can be found.
  5. To read a single sub-record within a group, the entire group record is retrieved and decompressed so that the smaller record can be retrieved (same as happens for a small file within a ZVOL block, except that the awareness of the sub-record is handled by ZFS itself, rather than a secondary filesystem).
  6. As old individual sub-records are freed this will create "holes" within group records. Once the amount freed exceeds some limit (e.g- grouprecordrewrite=64K) then all of its remaining individual records would be queued for writing as if they had been updated, allowing the old group record to be freed once a new one has been written out.
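To make this a little more concrete, here is a rough sketch of what a group record's layout might carry; every name below (the grouprec_ prefix included) is hypothetical, and nothing of the sort exists in OpenZFS today:

```c
/*
 * Hypothetical sketch only: a "group record" as described in steps 1-6,
 * i.e. a header listing the sub-records packed into the group, followed
 * by their concatenated payloads, compressed (and encrypted) as one unit.
 */
#include <stdint.h>

#define GROUPREC_MAX_SUBRECS    64      /* arbitrary illustrative limit */

typedef struct grouprec_subrec {
        uint64_t        gs_object;      /* object (file) the sub-record belongs to */
        uint64_t        gs_blkid;       /* logical block id within that object */
        uint32_t        gs_offset;      /* byte offset inside the group payload */
        uint32_t        gs_length;      /* uncompressed length of the sub-record */
} grouprec_subrec_t;

typedef struct grouprec_header {
        uint32_t        gh_nsubrecs;    /* live sub-records in this group */
        uint32_t        gh_freed_bytes; /* "holes" left by freed sub-records,
                                         * compared against grouprecordrewrite
                                         * (step 6) to trigger a rewrite */
        grouprec_subrec_t gh_subrecs[GROUPREC_MAX_SUBRECS];
        /* compressed payload of up to grouprecordsize bytes follows */
} grouprec_header_t;
```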

In essence the idea is to give ZVOL-like storage performance in ordinary datasets, with some of the same basic caveats; i.e- additional data needs to be read to access a single small record, and more data may need to be written. However, in the latter case, by only rewriting when old records become too fragmented, rewrites should be less common than they are for ZVOLs, so some of the reduced write performance would be mitigated.

While this feature should not be enabled by default for a variety of reasons, properly tuned its performance shouldn't be any worse than a ZVOL's, and in some cases write performance should be better than a ZVOL's, since small writes would cause less unnecessary copying (compared to an actual ZVOL, where ZFS has no awareness of the contents) and there would be no need for an entire secondary filesystem to be involved.

In fact it may be possible to leverage existing zvol code to implement this, i.e- a grouped record would simply be a zvol block, and extracting data would function the same as reading only part of the block. The main differences would be that writing may prefer to create new blocks rather than updating old ones (though this is not a requirement) and there would need to be a special metadata flag or format so that metadata can be stored in zvol blocks, and reference locations within other blocks.

@Haravikk added the Type: Feature label on Feb 15, 2022
@GregorKopka
Contributor

Default recordsize (zfs internal logical file block size) for filesystems is 128k, and ZFS already works exactly like you're asking for on reads or writes that are smaller than the recordsize a file was created with.

The way compression works in zfs is identical between data stored in a file and a zvol. A filesystem dataset (containing only one file holding the same data as a zvol) showing a worse compression ratio is to be expected, as a filesystem has additional metadata (directory, permissions, modification time, ...) that is factored into the compression ratio of the whole dataset.

Please close, as this asks for existing functionality.

@Haravikk
Author

Haravikk commented Feb 4, 2024

> Default recordsize (zfs internal logical file block size) for filesystems is 128k, and ZFS already works exactly like you're asking for on reads or writes that are smaller than the recordsize a file was created with.

No it doesn't; recordsize is the maximum size of a single record. Smaller records can be stored at a smaller size, but they are still single, individual records (either a complete file, or part of a single file), each compressed and stored in isolation; there is no grouping or shared compression between them.

ZVOLs can see much greater compression gains because ZFS is almost always given a full volblocksize of data to compress, with no awareness of what's inside (it could be part of a single file, or many small files).

So this is in no way existing functionality, except in the sense that ZVOLs already do this using an entire secondary filesystem on top (with additional inefficiencies of its own).

I think perhaps you're confusing the fact that records can be smaller than recordsize with them being grouped together; being able to store four 32k records in the same space as a single 128k record, for example, is not the same as grouping them as one record – ZFS still handles them as four separate records, each of which was compressed, and encrypted, separately.

This proposal is that the four 32k records would be combined into a single 128k record, then compressed, encrypted etc. as a 128k unit. This means that reading one of them back requires loading the entire 128k (or smaller, after compression) record and extracting part of it, so it still involves an extra step, as a ZVOL does, but without an entirely separate filesystem on top that has to be mounted separately, and with greater awareness of when a grouped record needs to be rewritten (a ZVOL block may be full of holes that ZFS is unaware of and can't do anything about, and is always written out as a complete new unit, whereas a grouped record wouldn't always need to be if only some of the sub-records changed).
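Purely as an illustration of that read path, here is a minimal sketch, assuming a hypothetical helper that fetches and decompresses (and decrypts) the whole group record; none of this is existing OpenZFS code:

```c
/*
 * Hypothetical read path: to return one sub-record, fetch and decompress
 * the whole group record, then copy out only the requested slice.
 */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Assumed helper, named purely for illustration. */
extern int group_record_read_decompressed(uint64_t group_id,
    void *buf, size_t buflen);

int
read_subrecord(uint64_t group_id, uint32_t offset, uint32_t length, void *out)
{
        size_t groupsize = 128 * 1024;  /* worst case: a full 128k group */
        void *group_buf = malloc(groupsize);
        int err;

        if (group_buf == NULL)
                return (-1);

        /* 1. Read and decompress the entire group record. */
        err = group_record_read_decompressed(group_id, group_buf, groupsize);

        /* 2. Copy out just the sub-record the caller asked for. */
        if (err == 0)
                memcpy(out, (char *)group_buf + offset, length);

        free(group_buf);
        return (err);
}
```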

@GregorKopka
Contributor

The logical block size for compression to work on is set using recordsize for filesystems (new files inherit this on creation) and volblocksize for zvols (at creation time of the volume, can not be changed afterwards). Only the last block of a file can be smaller than the recordsize inherited from the filesystem at file creation.

A partial record (smaller than recordsize) write to a file will do a read/modify/write cycle on the affected record; this is also true for the last (potentially only partly filled) record of a file.

Compression for data (regardless of it being stored in volumes or files) uses the same codepath.
What you propose is already there.

Please read up on how ZFS stores data on-disk (https://www.giis.co.in/Zfs_ondiskformat.pdf could help) before beginning to think about in what way I might be confused.

@Haravikk
Author

Haravikk commented Feb 4, 2024

> A partial record (smaller than recordsize) write to a file will do a read/modify/write cycle on the affected record; this is also true for the last (potentially only partly filled) record of a file.

Again, this is not what the proposal is requesting; what you're describing is the read/modify/write cycle of a single discrete record (up to recordsize), i.e- when you access part of a file you read the corresponding record, but that's not what I'm asking for here at all. I'm not sure how else to explain it to make this clearer to you; I know how ZFS handles files.

I've already set out the reasoning for this proposal as clearly as I can in the proposal itself; volblocksize is an effective minimum record size (except in rare cases where the volume block isn't "complete" yet), whereas recordsize is a maximum record size. This is why compression performance can vary: ZVOLs containing a lot of smaller files are able to compress in much larger "blocks", as multiple files can be contained in a single volblocksize of space.

To try and make this clearer: let's say you set volblocksize to 1M, which means 1 megabyte of data will be stored in most volume blocks. While that data could be a 1 megabyte chunk from a single file (within the secondary filesystem), it could also be sixteen different 64 kilobyte files. If those same sixteen files were stored in a ZFS dataset they would each consist of a single 64 kilobyte record (plus metadata), and each record would be individually compressed, meaning the maximum amount of data available for the compression algorithm to work with is 64 kilobytes.

By comparison, those same sixteen 64 kilobyte files stored in a 1 megabyte volume block are compressed as a 1 megabyte chunk of data (since ZFS treats this as a single record) meaning the compression algorithm has up to 1 megabyte of data to work with. Since compression algorithms typically work better the more (compressible) data they receive, this will lead to much bigger savings. For example, if those 64 kilobyte files are all related XML files, there will be substantial compression savings to make around the structure text (XML tags) that the files have in common, which isn't possible when they're compressed individually.

Now this works inconsistently for volume blocks, since ZFS has no control over the contents; each block could contain large or small files, a mix of compressible and incompressible data, etc. But by creating "group" records ZFS has total control, so it could avoid grouping records for which there is no benefit, and larger records could be ignored entirely.

@amotin
Member

amotin commented Feb 4, 2024

As I understand it, you are asking for the ability to store multiple unrelated logical blocks in one physical block. As a result, multiple logical blocks would receive almost identical block pointers, including the same DVAs, checksums, etc., but different offsets within that physical block. It does not look impossibly difficult for writing and reading, but it becomes much more problematic on delete -- you'd need some reference counter that can be modified each time a logical block is freed, so that the physical block can be freed once all of its logical blocks are freed. In many cases you may end up with partial blocks that can not be freed, which would kill any space benefits you would likely get from better compression. In the case of a partial logical block rewrite you would have to create a new physical block for it and leave a hole in the previous physical block, since you may not be able to rewrite the old physical block in place, and you can not update all the logical block pointers to the new location, since ZFS does not have back pointers to know what file(s) use the specific physical block.
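To make the delete problem concrete, a minimal sketch of the kind of per-group bookkeeping this would imply (all names are hypothetical; nothing like this exists in OpenZFS):

```c
/*
 * Hypothetical per-group bookkeeping: the shared physical block can only
 * be reclaimed once every logical block packed into it has been freed.
 */
#include <stdbool.h>
#include <stdint.h>

typedef struct group_phys_state {
        uint32_t        gp_live_subrecs; /* logical blocks still referencing the group */
        uint32_t        gp_freed_bytes;  /* space already turned into "holes" */
} group_phys_state_t;

/*
 * Called when one logical (sub-)record inside the group is freed.
 * Returns true once the whole physical block can be reclaimed.
 */
static bool
group_subrec_free(group_phys_state_t *gp, uint32_t length)
{
        gp->gp_freed_bytes += length;
        return (--gp->gp_live_subrecs == 0);
}
```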

@Haravikk
Author

Haravikk commented Feb 4, 2024

> It does not look impossibly difficult for writing and reading, but it becomes much more problematic on delete -- you'd need some reference counter that can be modified each time a logical block is freed, so that the physical block can be freed once all of its logical blocks are freed.

I suggested a grouprecordrewrite property in the proposal as a way to set when a group record should be re-created, basically once it contains more than a certain amount of freed space (though maybe a percentage would be better?).

It's much the same problem ZVOLs experience, except that in that case ZFS has no knowledge of the contents, so it doesn't know if a volume block is fully utilised or full of unused holes; it only really knows when the entire block is freed (TRIM'ed?), at which point it can discard it (unless a snapshot references it). So as the ZVOL's own guest filesystem becomes fragmented, ZFS suffers the same; it just doesn't know it.

In the group record case ZFS can at least be aware of the holes, so it can do something about them when they become wasteful, and we should be able to tune accordingly (make it more or less aggressive about replacing underutilised records).
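A minimal sketch of that check, using the grouprecordrewrite threshold proposed above (the property, the percentage variant and the function are all hypothetical):

```c
/*
 * Hypothetical rewrite trigger: once the holes in a group record exceed
 * the grouprecordrewrite threshold (or some fraction of the group), its
 * surviving sub-records are queued to be rewritten into a fresh group.
 */
#include <stdbool.h>
#include <stdint.h>

static bool
group_needs_rewrite(uint32_t freed_bytes, uint32_t group_size,
    uint32_t rewrite_threshold_bytes, uint32_t rewrite_threshold_pct)
{
        /* e.g. grouprecordrewrite=64K -> rewrite_threshold_bytes = 65536 */
        if (freed_bytes >= rewrite_threshold_bytes)
                return (true);
        /* or, alternatively, a percentage-based limit */
        return ((uint64_t)freed_bytes * 100 >=
            (uint64_t)group_size * rewrite_threshold_pct);
}
```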

> In the case of a partial logical block rewrite you would have to create a new physical block for it and leave a hole in the previous physical block, since you may not be able to rewrite the old physical block in place

That's the idea: a new group record would be written out for any data in the old group record that's being retired, alongside any new writes taking place at the same time (basically the old sub-records are grouped with new ones to create a new group record).

Since ZFS currently doesn't support block pointer rewrite, this will likely mean recreating the individual records as well. In this sense this re-create/defragment operation behaves like a new copy of everything, and the fragmented old record(s) are tidied up as the references expire, same as if you copied manually.

Over time a lot of fragmentation could result in leftover group records holding a noticeable amount of space they don't need, but this is true of ZVOLs now, and is why I wouldn't suggest this become a feature enabled by default. It's intended more for datasets where content is known to contain a high volume of generally compressible small files (either exclusively, or in a broad mix).

There is also the possibility of adding defragmentation of group records to scrubs (similar to #15335) to perform rewriting of less fragmented group records periodically to reclaim space. I don't think it's critical that this be added immediately, but it's a good long term feature to clean up the holes.

Several further comments between @GregorKopka and @Haravikk were marked as off-topic (and one as abuse) and are hidden.

@openzfs openzfs locked as too heated and limited conversation to collaborators Feb 5, 2024
@openzfs openzfs unlocked this conversation Mar 22, 2024