Combined/Grouped Records to Maximise Compression #13107
Comments
The way compression works in ZFS is identical between data stored in a file and in a zvol. A filesystem dataset (containing only one file holding the same data as a zvol) showing a worse compression ratio is to be expected, as a filesystem has additional metadata (directories, permissions, modification times, ...) that is factored into the compression ratio of the whole dataset. Please close, as this asks for existing functionality.
No it doesn't; ZVOLs can have much greater compression gains because ZFS is almost always compressing a full `volblocksize` worth of data at once. So this is in no way existing functionality, except in the sense that ZVOLs already do this using an entire secondary filesystem on top (with additional inefficiencies of its own).

I think perhaps you're confusing the fact that records can be smaller than `recordsize` with what is being proposed. Take, for example, four 32k records that would currently be compressed and written out individually: this proposal is that the four 32k records would be combined into a single 128k record, then compressed, encrypted etc. as a 128k unit. This means that to read one of them back requires loading the entire 128k (or smaller, after compression) record and extracting part of it, so it still involves an extra step, as a ZVOL does, but without an entirely separate filesystem on top that has to be mounted separately, and with greater awareness of when a grouped record needs to be rewritten: a ZVOL block may be full of holes that ZFS is unaware of and can do nothing about, and is always written out as a complete new unit, whereas a grouped record wouldn't always need to be if only some of the sub-records changed.
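To illustrate the read path being described, here is a minimal Python sketch (all names are hypothetical and nothing here reflects actual OpenZFS code): four 32k sub-records are compressed as one 128k group, and reading one of them back means decompressing the whole group and slicing out the wanted range.

```python
import zlib

SUB_RECORD = 32 * 1024           # hypothetical sub-record size
GROUP_SIZE = 4 * SUB_RECORD      # 128 KiB group record

# Four unrelated 32 KiB sub-records, combined into one group buffer
# and stored as a single compressed unit.
sub_records = [bytes([i]) * SUB_RECORD for i in range(4)]
group_plain = b"".join(sub_records)
group_on_disk = zlib.compress(group_plain)

def read_sub_record(index: int) -> bytes:
    """Reading one sub-record requires decompressing the whole group first."""
    whole = zlib.decompress(group_on_disk)
    return whole[index * SUB_RECORD:(index + 1) * SUB_RECORD]

assert read_sub_record(2) == sub_records[2]
print(f"group stored as {len(group_on_disk)} bytes for {GROUP_SIZE} logical bytes")
```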
The logical block size for compression to work on is set using `recordsize` for filesystems and `volblocksize` for volumes. A partial record (smaller than `recordsize`) only occurs when a file is small enough to fit in a single block. Compression for data (regardless of it being stored in volumes or files) uses the same codepath. Please read up on how ZFS stores data on-disk.
Again, this is not what the proposal is requesting; what you're describing is the read/modify/write cycle of a single discrete record (up to `recordsize` in size). I've already set out the reasoning for this proposal as clearly as I can in the proposal itself.

To try to make this clearer, let's say you store sixteen 64 kilobyte files in an ordinary filesystem dataset; each file is written as its own record, so the compression algorithm never has more than 64 kilobytes of data to work on at a time. By comparison, those same sixteen 64 kilobyte files stored in a 1 megabyte volume block are compressed as a 1 megabyte chunk of data (since ZFS treats this as a single record), meaning the compression algorithm has up to 1 megabyte of data to work with. Since compression algorithms typically work better the more (compressible) data they receive, this will lead to much bigger savings. For example, if those 64 kilobyte files are all related XML files, there will be substantial compression savings to be made around the structural text (XML tags) that the files have in common, which isn't possible when they're compressed individually.

Now this works inconsistently for volume blocks, since ZFS has no control over the contents; each block could contain large or small files, a mix of compressible and incompressible data etc. But by creating "group" records ZFS has total control, so it could avoid grouping records for which there is no benefit, and larger records can be ignored entirely.
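To see why grouping can pay off, here is a rough, self-contained Python illustration using `lzma` as a stand-in compressor (it is not one of ZFS's compressors, and the synthetic "related XML files" are deliberately similar to one another, so the exact numbers mean nothing; only the gap between the two ratios matters): compressing sixteen ~64 kilobyte files one at a time cannot exploit what they have in common, while compressing them as one ~1 megabyte chunk can.

```python
import lzma
import random

random.seed(0)

def make_base() -> bytes:
    """A ~64 KiB XML-ish document used as the template for the 'related' files."""
    rows = "".join(
        f"<record id='{i:05d}'><name>item-{i}</name>"
        f"<value>{i * 37 % 9973}</value><status>ok</status></record>"
        for i in range(600)
    )
    return ("<export>" + rows + "</export>").encode()

BASE = make_base()

def related_file() -> bytes:
    """Same structure as the template, with a couple of hundred differing bytes."""
    data = bytearray(BASE)
    for _ in range(200):
        data[random.randrange(len(data))] = random.choice(b"0123456789")
    return bytes(data)

files = [related_file() for _ in range(16)]          # sixteen ~64 KiB "files"
total = sum(len(f) for f in files)

# Each file compressed on its own, roughly how individual <=64k records behave.
individual = sum(len(lzma.compress(f)) for f in files)

# The same files compressed as one ~1 MiB chunk, roughly how a large volume block behaves.
grouped = len(lzma.compress(b"".join(files)))

print(f"logical size: {total} bytes")
print(f"per-file:     {individual} bytes ({total / individual:.1f}x)")
print(f"grouped:      {grouped} bytes ({total / grouped:.1f}x)")
```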
As I understand it, you are asking for the ability to store multiple unrelated logical blocks in one physical block. As a result, multiple logical blocks would receive almost identical block pointers, including the same DVAs, checksums, etc., but different offsets within that physical block. It does not look impossibly difficult for writing and reading, but it becomes much more problematic on delete -- you'd need some reference counter that can be modified each time a logical block is freed, so the physical block can be freed once all of its logical blocks are freed. In many cases you may end up with many partial blocks that cannot be freed, which would kill any space benefit you would likely get from better compression. In the case of a partial logical block rewrite you would have to create a new physical block for it and leave a hole in the previous physical block, since you may not be able to rewrite the old physical block in place, and you cannot update all the logical block pointers to the new location since ZFS does not have back pointers to know which file(s) use a specific physical block.
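A tiny Python sketch of the bookkeeping this implies (purely illustrative; these structures are invented for the example and are not OpenZFS code): several logical block pointers share one physical group block via the same DVA plus an offset, and a reference count decides when the shared physical block can actually be freed.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class PhysicalBlock:
    dva: Tuple[int, int]          # (vdev, offset) of the allocated group block
    psize: int                    # allocated size on disk
    refcount: int = 0             # logical blocks still pointing at us
    live_bytes: int = 0           # how much of the block is still referenced

@dataclass
class LogicalBlockPointer:
    physical: PhysicalBlock       # the shared DVA/checksum live here
    offset: int                   # where this logical block starts inside it
    lsize: int                    # logical size of this block

def add_logical(phys: PhysicalBlock, offset: int, lsize: int) -> LogicalBlockPointer:
    phys.refcount += 1
    phys.live_bytes += lsize
    return LogicalBlockPointer(phys, offset, lsize)

def free_logical(bp: LogicalBlockPointer) -> bool:
    """Freeing a logical block only frees the physical block on the last reference;
    until then its bytes are a 'hole' that still occupies space on disk."""
    phys = bp.physical
    phys.refcount -= 1
    phys.live_bytes -= bp.lsize
    return phys.refcount == 0     # True => caller may free the group block

# Usage: four 32 KiB logical blocks sharing one 128 KiB physical block.
group = PhysicalBlock(dva=(0, 0x100000), psize=128 * 1024)
bps = [add_logical(group, i * 32 * 1024, 32 * 1024) for i in range(4)]
print([free_logical(bp) for bp in bps])   # only the last free returns True
```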
I suggested a `grouprecordrewrite` threshold for exactly this reason. It's much the same problem ZVOLs experience, except that in that case ZFS has no knowledge of the contents, so it doesn't know whether a volume block is fully utilised or is full of unused holes; it only really knows when the entire block is freed (TRIM'ed?), at which point it can discard it (unless a snapshot references it). So as the ZVOL's own guest filesystem becomes fragmented, ZFS suffers the same, it just doesn't know it. In the group record case ZFS can at least be aware of the holes, so it can do something about them when they become wasteful, and we should be able to tune accordingly (make it more or less aggressive about replacing underutilised records).
That's the idea: a new group record would be written out for any data in the old group record that's being retired, alongside any new writes taking place at the same time (basically the old sub-records are grouped with new ones to create a new group record). Since ZFS currently doesn't support block pointer rewrite, this will likely mean recreating the individual records as well. In this sense the re-create/defragment operation behaves like a new copy of everything, and the fragmented old record(s) are tidied up as the references expire, same as if you had copied manually. Over time a lot of fragmentation could result in leftover group records holding a noticeable amount of space they don't need, but this is true of ZVOLs now, and is why I wouldn't suggest this become a feature enabled by default. It's intended more for datasets whose content is known to contain a high volume of generally compressible small files (either exclusively, or in a broad mix). There is also the possibility of adding defragmentation of group records to scrubs (similar to #15335) to periodically rewrite less fragmented group records and reclaim space. I don't think it's critical that this be added immediately, but it's a good long-term feature to clean up the holes.
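A short Python sketch of the retirement rule described here (all names hypothetical; this is only a model of the idea, not an implementation): once enough sub-records have been freed that a group record's live data drops below a rewrite threshold, whatever remains is queued to be written out again alongside new writes, after which the old group record can be freed.

```python
# Hypothetical sketch of the "rewrite when too fragmented" rule; names such as
# REWRITE_THRESHOLD are invented for illustration only.

REWRITE_THRESHOLD = 64 * 1024        # cf. the proposed grouprecordrewrite=64K

class GroupRecord:
    """Tracks which sub-records inside a group record are still referenced."""
    def __init__(self, sub_records):
        self.live = dict(sub_records)          # sub-record id -> live size

    def live_bytes(self):
        return sum(self.live.values())

def free_sub_record(group, sub_id, rewrite_queue):
    """Free one sub-record; once the group's live data falls below the threshold,
    queue the survivors to be written out again as part of a new group record."""
    group.live.pop(sub_id, None)
    if 0 < group.live_bytes() < REWRITE_THRESHOLD:
        rewrite_queue.extend(group.live)       # survivors rejoin the write queue
        # the old group record itself is freed once the survivors land elsewhere

# Usage: a 128k group record holding four 32k sub-records.
group = GroupRecord({"a": 32 * 1024, "b": 32 * 1024, "c": 32 * 1024, "d": 32 * 1024})
queue = []
free_sub_record(group, "a", queue)   # 96k live -> nothing queued yet
free_sub_record(group, "b", queue)   # 64k live -> still at the threshold
free_sub_record(group, "c", queue)   # 32k live -> below threshold, "d" is queued
print(queue)                         # ['d']
```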
Describe the feature you would like to see added to OpenZFS
I would like to see the ability to store "combined" records, whereby multiple smaller records are stored as if they were a single larger one, in order to make more effective use of filesystem compression in a similar manner to ZVOLs.
How will this feature improve OpenZFS?
It will allow datasets to achieve higher compression ratios when storing smaller records that compress less well on their own, but compress well when treated as a single large "block".
It may also allow for improved performance in `raidz` configurations, as grouping records can result in a wider grouped record that can be more efficiently split across disks. This improvement would be most visible when dealing with smaller records that currently result in sub-optimal writes to these devices.

Additional context
The inspiration for this request came after I migrated the contents of a plain ZFS dataset to a ZVOL while attempting to debug an unrelated issue. While my `compressratio` on the plain dataset was an okay 1.19, on the ZVOL the exact same content achieved a ratio of nearly 2.0, with all else being equal.

This makes sense, as ZVOLs effectively have a large minimum "record" size owing to their `volblocksize` (in my case 128k), whereas in an ordinary dataset records can be as small as the minimum physical block size determined by the `ashift` value (4k in my case). Most compression algorithms can only achieve limited savings on smaller amounts of data, and they tend to work a lot better the more compressible data you can give them, and that is the case here.

I would propose that the feature works something like the following:
- When enabled for a dataset (e.g- `grouprecordsize=128K`), ZFS will delay final writes to disk so that smaller records can be grouped together into a larger record of the specified size (up to, but no larger than, `recordsize`). The final combined record size may still be smaller than the target, as it is only a best effort using available written data (a rough sketch of this write path follows below).
- The types of records eligible for grouping could be selected (e.g- `grouprecordtype=all|data|metadata`). The default would be `all` (i.e- both data and metadata), but this allows tuning for performance vs. size on metadata records.
- When the useful contents of a group record fall below a threshold (e.g- `grouprecordrewrite=64K`) then all of its remaining individual records would be queued for writing as if they had been updated, allowing the old group record to be freed once a new one has been written out.

In essence the idea is to give ZVOL-like storage performance in ordinary datasets, with some of the same basic caveats; i.e- additional data needs to be read to access a single small record, and more data may need to be written. However, in the latter case, by only rewriting when old records become too fragmented this should be less common than it is for ZVOLs, so some of the reduced write performance would be mitigated.
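Here is a minimal sketch of the grouping write path from the list above, assuming the hypothetical `grouprecordsize` property (Python is used purely as pseudocode for the packing step; none of this reflects how the DMU actually schedules writes): small dirty records are buffered and packed, best effort, into group buffers no larger than the group record size, and each group would then be compressed and written out as a single record.

```python
GROUP_RECORD_SIZE = 128 * 1024       # cf. the proposed grouprecordsize=128K

def flush_grouped(pending):
    """pending: list of (record_id, data) small dirty records awaiting writeout.
    Packs them, best effort, into groups of at most GROUP_RECORD_SIZE bytes."""
    groups, current, used = [], [], 0
    for record_id, data in pending:
        if current and used + len(data) > GROUP_RECORD_SIZE:
            groups.append(current)                    # this group is full, emit it
            current, used = [], 0
        current.append((record_id, used, len(data)))  # (id, offset in group, length)
        used += len(data)
    if current:
        groups.append(current)                        # the last group may be smaller
    return groups   # each group would be compressed and written as one record

# Usage: twelve 16 KiB dirty records pack into two group records (8 + 4 sub-records).
pending = [(i, b"x" * 16 * 1024) for i in range(12)]
print([len(g) for g in flush_grouped(pending)])       # -> [8, 4]
```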
While this feature should not be enabled by default for a variety of reasons, properly tuned its performance shouldn't be any worse than a ZVOL would be, and in some cases write performance should be better than a ZVOL, since there should be less unnecessary copying as a result of small writes (compared to an actual ZVOL where ZFS has no awareness of the contents), and there would be no need for an entire secondary filesystem to be involved.
In fact it may be possible to leverage existing zvol code to implement this, i.e- a grouped record would simply be a zvol block, and extracting data would function the same as reading only part of the block. The main differences would be that writing may prefer to create new blocks rather than updating old ones (though this is not a requirement) and there would need to be a special metadata flag or format so that metadata can be stored in zvol blocks, and reference locations within other blocks.