Extent tree v2 #25
@josefbacik, @morbidrsa: any chance I could take a peek at this design doc? I'm quite interested in reading more about this...
I, for one, welcome our new non-bookending data extent overlords. It makes dedupe much simpler, and eliminates an entire use case for defrag (where people run defrag to get its wasted-block-discarding side-effect).
...and non-shared case? Do we figure out block-by-block whether the blocks freed by this write were the last references to those blocks in the data extent? If they were, do we update all the other references to the remaining shared parts of the extent? e.g.
Or do we just fall back to the old behavior when the data extent is shared (i.e. waste the blocks)?

Edit: step 3 can't delete the blocks, but it could update all the references and split the data extent into 3 pieces with different numbers of references. This spends more work in the present to eliminate the need to build a per-block extent reference count map in the future, i.e. all the blocks in any extent are always referenced the same number of times. This might also make things like prealloc and nodatacow faster, as they currently have to build these maps.
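The split-into-three-pieces idea in the comment above can be sketched in a few lines. This is a hypothetical model, not btrfs code; the tuple layout and the rule that a piece vanishes when its refcount hits zero are assumptions for illustration.

```python
# Hypothetical sketch of "split on partial free": when a range inside a
# shared data extent loses one reference, split the extent record into
# pieces whose per-block reference counts are uniform.

def split_on_partial_free(extent, freed_start, freed_end):
    """extent is (start, end, refcount); returns the surviving pieces.

    The middle piece drops one reference; if that reaches zero its blocks
    can go back to the allocator, otherwise it survives as a separate
    extent record with its own lower refcount.
    """
    start, end, refs = extent
    pieces = []
    if start < freed_start:
        pieces.append((start, freed_start, refs))          # untouched head
    if refs - 1 > 0:
        pieces.append((freed_start, freed_end, refs - 1))  # still shared
    if freed_end < end:
        pieces.append((freed_end, end, refs))              # untouched tail
    return pieces
```

With two references, freeing the middle of a 3-block extent yields three pieces; with one reference, the middle piece disappears entirely.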
@Zygo In your example we'd free that 1m-2m area. Here's the step by step of how it would work.
Now say you snapshotted subB before you wrote, then yes, we'd still point at the original full extent via subC and subD, but that's expected because they really would point at it.

@Conan-Kudo I'll dig it out at some point, but it's not as fleshed out as this; it's more a general proposal. Basically I've watered down what he had into my stripe tree description, while leaving him the spots in the metadata to expand for his own uses in the future.
OK, so what does the data extent look like after step 3b? Currently it would look like:
References are currently tracked by bytenr, so when we modified subB we didn't have to change anything in subA.
Would it look like this after step 3b instead:
So we solve the wasted space problem because we can clearly see at step 6 that the extent ref count of the 1m..2m extent drops to 0, and we remove the data extent from the extent tree. But wouldn't this require updating every reference to the data extent whenever a partial reference to the extent is freed? What am I missing here? As I understand it, after step 5 we have:
so OK that's fine, as long as we are pruning the bookend extents as we go. Not very interesting. What if we added step 7 at the end:
Will that force an update of subB, subC, and subD to match subA's new extent boundaries?
This got confusing with all the offsets, and it looks very different from what we do now, so I'll write it your way.
Before, you would never have extent references overlapping in the extent tree. I think that's what you're missing here: I would allow extent ranges to overlap, because every file extent will reference the exact length of the real extent ref they hold in the extent tree. This means you could end up with something wonky like this [0-65m]. If I drop [1m-5m] we immediately get that space freed (well, "immediately", it'll be pinned of course).

This does make freeing waaaaay more complicated of course, because if we do something like [0-4m] and we free [0-4m] we have to go and see if we have any references in that range. I will handle this by using our range locking to lock the extent we're messing with. So I'll lock [0-4m], then look to see if I have any overlaps. If there's any range that gets freed it'll get added to the pinned thing and we'll carry on. There will clearly be a very robust self-test written for this code, because there will be plenty of dark corners.
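The overlap check during freeing described above can be modeled like this. Everything here is illustrative: plain (start, end) intervals stand in for extent items, and locking/pinning is left out.

```python
# Minimal model of the overlap scan: when a reference to `dropped` is
# removed, only the sub-ranges not covered by any remaining overlapping
# reference can actually be freed (pinned until commit in the real design).

def freeable_ranges(dropped, remaining_refs):
    """Return the parts of `dropped` (start, end) that no interval in
    `remaining_refs` overlaps."""
    start, end = dropped
    # Keep only refs that actually intersect the dropped range, sorted.
    events = sorted(r for r in remaining_refs if r[1] > start and r[0] < end)
    free, cur = [], start
    for s, e in events:
        if s > cur:
            free.append((cur, s))   # gap before this reference is freeable
        cur = max(cur, e)           # skip past the referenced region
    if cur < end:
        free.append((cur, end))
    return free
```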
Ah, indeed that is the missing piece. Then for free space reclaim we just look at the bytenr range covered by the data extent that hit 0 references, and free any blocks that are not overlapped by another data extent, i.e. we don't need to look at any subvol trees, because everything we need to know about which blocks are referenced is available in the extent tree (though possibly up to 128M-4K before the freed item, so it could be a few hundred pages away in Dave-Chinner-style test cases). Thanks, I think I see how it works now.
Block group tree (I forgot to add this to my design doc, I will edit it once I'm at a real computer). We load all of the block groups on mount, which we don't necessarily have to do, but changing that could be annoying. At any rate, all of the block groups are indexed by bytenr, so they're spread all over the disk. This makes mounting very large file systems kind of slow. Fix this by adding a new tree dedicated to the block groups; it'll simply have block group items in it and they'll look exactly like they do today. This way they'll all be closely located to each other and we'll be able to mount a lot faster.
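For a rough sense of why packing block group items into a dedicated tree helps mount time, here is some back-of-envelope arithmetic. The item and leaf sizes are ballpark assumptions, not exact on-disk numbers.

```python
# Back-of-envelope: why a dedicated block group tree mounts faster.
# Assumed sizes: 24-byte block group item data plus ~25 bytes of leaf
# item header/key overhead.
LEAF_SIZE = 16 * 1024        # 16 KiB nodesize
BG_ITEM_SIZE = 24 + 25
BG_SIZE = 1 << 30            # 1 GiB block groups
FS_SIZE = 100 << 40          # 100 TiB filesystem

n_bgs = FS_SIZE // BG_SIZE
# Today each block group item sits in the extent tree surrounded by
# unrelated extent items, so loading them costs roughly one random leaf
# read apiece.
reads_scattered = n_bgs
# Packed densely into their own tree, hundreds of items share each leaf.
items_per_leaf = LEAF_SIZE // BG_ITEM_SIZE
reads_packed = -(-n_bgs // items_per_leaf)   # ceiling division
```

With these assumptions a 100 TiB filesystem goes from ~100k scattered leaf reads to a few hundred sequential ones.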
Not anymore for fsync, it's very limited now since it only affects the slow and less common path.
What's the likely conversion method from v1 to v2? Would it be more or less complex than free space cache v1 to v2? Right now we see a wide range of mount time for initial v2 creation, from seconds to many hours. I'm wondering about the scalability of extent tree conversion, but maybe it's too soon to tell.
@cmurf it will be an offline conversion tool, the changes are too complicated to do online on mount like what we did with the free space tree.
I've edited my original comment to indicate what the plan is now, after thinking about it for a month and starting the work. The highlights are
Who is @lorddoskias?
The wrong fucking person, my bad. I meant @morbidrsa.
New snapshot tracking tree and drop trees

One of the main problems I have been trying to solve with extent tree v2 is the explosion of delayed refs for our normal operations. Reworking relocation was a big part of that, tracking all metadata in the extent tree was another, but there's another less well highlighted problem, which is how we push down references when blocks are shared. This is described in my original plan to only track shared metadata blocks, but this still opens us up to a lot of delayed refs for any arbitrary modification.

The original COW B-tree paper had this mechanism where we pushed down references at every COW, and this is how btrfs has fundamentally operated since the beginning. If we COW block A and it is shared, we must add a reference for all of its children before continuing. With a 16k nodesize this can be upwards of 500 references per level. This means for a 3-level tree (the size of the fs tree for a clean Fedora install) we'll push 1000 references for every path the first time through after a snapshot. This tends not to be too big of a deal in practice because it's mostly an up-front cost: the first modification sucks, every subsequent modification sucks a little less. However, the larger the tree, the more suckage and the longer the suckage lasts.

This is the other fundamental problem with ENOSPC tracking in btrfs: any particular modification is potentially unbounded. We could certainly worst-case reserve 500 items worth of reservation for every level in a tree, but that would be overkill most of the time and put a lot of stress on the ENOSPC infrastructure. Instead we want to bound the modifications to the items we're modifying. If we could push one item per level we'd have a maximum of 8 delayed refs for every search, which is easy to account for.

Enter the snapshot tree and drop trees. Instead of tracking references per metadata block, we'll keep track of who no longer references each metadata block.
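The per-level reference counts mentioned above work out roughly like this. The 33-byte key pointer size corresponds to a disk key (17 bytes) plus block pointer and generation (8 bytes each), but treat the exact numbers as assumptions for illustration.

```python
# Rough arithmetic behind "500 references per level" / "1000 per path".
NODESIZE = 16 * 1024
KEY_PTR_SIZE = 33                           # key (17) + blockptr (8) + gen (8)
ptrs_per_node = NODESIZE // KEY_PTR_SIZE    # ~496 children per internal node

def refs_pushed_per_path(levels):
    # COWing a shared path pushes one reference per child of every COWed
    # internal node; a tree of `levels` levels has (levels - 1) internal
    # levels above the leaf. Data refs pushed by the leaf are ignored here.
    return (levels - 1) * ptrs_per_node
```

For the 3-level tree in the example, that lands just under 1000 pushed references for the first path COWed after a snapshot.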
We will then use the snapshot tree to determine who theoretically refers to a particular block. The snapshot tree will have entries like this
On snapshot we will copy the root like we always do, and insert the snapshot item with the source tree objectid, the destination tree objectid, and the generation (transid).

For example, consider subvolume A that has had a snapshot taken at generation 9 by subvolume B and a snapshot at generation 20 by subvolume C. Our snapshot items will be A->B gen 9, A->C gen 20. For a block owned by A with a generation of 7, we will know that it is potentially pointed at by A, B, and C. This is where the drop trees come into play. When we COW this block we will add a drop entry for our given volume. The drop trees will look like this
With this we eliminate two things
We will use this for data as well: we will track drops of the data extents via the drop tree.

The side effect of this is that we can no longer do backreference lookup. This is actually OK, and makes things like qgroups insanely cheap. We use backreference lookups to determine which roots point at specific blocks. Now we can do that with zero tree lookups; we can calculate it from an in-memory cache of the snapshot dependency graph. If you want to know which current roots point at an arbitrary block, you simply take the list of theoretical references and subtract any objectids found in the drop tree, and you have the current list of references. Qgroups become much less search intensive and more scalable to many snapshots.
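The "theoretical references minus drops" calculation described above can be sketched as follows. The item shapes are made up (the on-disk entries in the comment were elided), and the generation handling for chained snapshots is simplified compared to what a real implementation would need.

```python
# Sketch: resolve "who references this block" from the snapshot and drop
# trees alone, with no backreference items.

def theoretical_refs(owner, block_gen, snapshot_items):
    """snapshot_items: (src_root, dst_root, snapshot_generation) tuples.
    A snapshot taken at or after the block's generation may see the block;
    snapshots of snapshots are picked up by iterating to a fixpoint."""
    refs = {owner}
    changed = True
    while changed:
        changed = False
        for src, dst, snap_gen in snapshot_items:
            if src in refs and snap_gen >= block_gen and dst not in refs:
                refs.add(dst)
                changed = True
    return refs

def current_refs(owner, block_gen, snapshot_items, drops):
    """drops: roots that recorded a drop entry (they COWed the block)."""
    return theoretical_refs(owner, block_gen, snapshot_items) - set(drops)
```

With the A->B gen 9, A->C gen 20 example: a block owned by A at generation 7 is theoretically referenced by all three, and subtracting A's drop entry leaves B and C.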
I made a series of videos to explain all of the changes so far in case reading this is confusing. Unfortunately I recorded two of them before I realized my cursor doesn't show up, so sometimes I do things like say "This block" and you have no idea what I'm pointing at. Sorry about that. Happy to make more videos to clear up other confusing aspects of the extent tree v2 changes. https://www.youtube.com/watch?v=TH1Ho19dqP0
Regarding the remap tree, aren't you concerned about slowing down all the reads because of the extra logical->physical lookup? |
Yeah for sure, but there's a bunch of upsides that will more than pay for the cost. The worst case is
Now for what we gain.
This is a file system, there's always a cost, but the perf cost is likely paid for by no longer incurring the perf penalty of unsharing blocks in the old scheme. Even if it didn't, all the other upsides are definitely worth the cost.
I was meaning to comment more on this earlier, but haven't had the time to really dig into it.
It's the same average cost as looking up an (inode, offset) pair while reading a file through a subvol, and looking up the bytenr for csums. Both are O(log(n)) lookups with roughly the same item size. For a given file, the key density for (inode, offset) pairs, logical addresses, and physical addresses are different, so some reads will have negligible extra costs (when the logical addresses are contiguous, the next csum/reloc item will be in memory 99.6..99.8% of the time), while other reads will have much higher worst-case costs (when the logical addresses are heavily fragmented, the next extent in the file will require hitting the disk for O(log(n)) csum/reloc tree pages), but again this extra cost is roughly the same as the csum lookup cost, so however bad it gets, its upper bound is a doubling of the cost that is there already for datasum files. It is an extra cost for read-heavy workloads, but right now write-heavy workloads stop the filesystem more or less dead for more or less unbounded periods of time.

I do have a few concerns about the "relocate everything" cases, where every extent in the filesystem ends up in the remap tree. ZBD devices are AIUI balancing block groups every time they want to delete anything, and adding or upgrading disks in a RAID array requires relocating significant chunks (up to 100%) of the filesystem. If there is nothing putting pressure on the remap tree to shrink, and logical addresses aren't modified, they're going to get more and more fragmented over time, the remap tree will keep getting bigger, and logically sequential addresses are going to point to more random physical addresses.

Maybe some background thread could come along later, rewrite the logical addresses, and reduce the remap tree size (either deleting or coalescing nodes)? It could be very lazy, to the point of not doing any work at all if the disk is never idle.
This would require figuring out how to do relocation in tiny increments that don't blow up delayed refs, but presumably it's easier to do that after the reloc and snapshot trees are in place. Does it make sense to make the remap tree separate from the extent tree? If one common case is "remap tree is empty" and another common case is "remap tree contains every extent", we could put the physical location remap items in the extent tree. I think I can talk myself out of this idea though--we'd have to wade through backref items to get to the reloc items, and maybe that means the tree is always full of 50% of data we don't want for the particular operation we are doing (i.e. you'll often want either the backrefs or the physical locations, not both), which would make separate trees better.
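The extra logical-to-physical hop being discussed can be modeled as a search over sorted remap items; a Python bisect stands in for the O(log n) tree lookup. Names and layout are illustrative, not btrfs structures.

```python
import bisect

# Each remap item covers a contiguous relocated range:
#   (old_start, length, new_physical_start)
# `starts` is the sorted list of old_start values for bisecting.

def remap(logical, starts, items):
    i = bisect.bisect_right(starts, logical) - 1
    if i >= 0:
        start, length, phys = items[i]
        if start <= logical < start + length:
            return phys + (logical - start)
    return logical  # unremapped addresses map through unchanged
```

This is the lookup that would sit on every read path, hence the concern about its cost relative to the existing csum lookup.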
Heh, I'd love to have 30-second commit times. While balance is running, 40% of the machine time is spent in commits running longer than 60 seconds each. I have about 30 events per machine-day where commits take 2-6 minutes. Add another order of magnitude for putting metadata on spinning rust. The commit times are now the biggest single barrier to rolling out btrfs for new use cases. Crashes and hangs are gone, we can live without raid5, the raid1 life cycle is crude but tolerable, but a lot of applications assume the server has crashed if a write takes more than 2 minutes. Can't run any of those on btrfs yet, because btrfs doesn't have bounded latency.
...which reminds me (and this is straying a bit from the topic): delayed refs processing on big, old, spinning filesystems is something like 40% deleting extent items and csums, most of that time waiting to read random metadata pages because no two logically adjacent blocks are physically anywhere near each other after a filesystem has aged a while. I wonder if there are gains to be had by sorting those, or flushing all the delete operations out to a tree, so we get the commit over with quickly and unblock applications. A background thread can clean up and fully delete stuff after the transaction commits. |
Well, we no longer have extent trees for non-data block groups. The other thing is we have to stuff the remap tree in the system chunks in order to make sure we can read it before we start trying to read in the block groups.

As for the "relocate everything" case, yes, if you don't change anything it kind of sucks, but as block groups are freed their remap items are deleted. And then for multiple remaps we simply update the new location for every remapped item. For example, we remap the block group 1gib-2gib. Say there's a single extent, 1gib-1.5gib; we insert a remap item for the new location, but we also add a remap backref with the new location as the key.objectid. That way if we remap the block group that holds the new location for another block group, we simply go and update the existing remap items with the new locations, so we don't have multiple steps of indirection, we only have one.

Finally, we have to remember that the remap items only refer to allocated area in the block group. This doesn't refer to actual extents. So if we have 500 1mib extents in that 1gib-1.5gib range, we're only going to get 1 remap item for that whole range. You're only going to get a remap item for whatever you actually relocate. It can certainly be more than the number of extents, but it can be less as well. Additionally, in the worst case we're talking a max of ~6mib to describe all the remapped items in the block group, and we only need 3mib of that in memory. The plan is to load this entire tree into a mapping at mount time, so the remap operation is purely CPU/memory overhead, no going to disk to look up our new location.
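The "only one level of indirection" behavior described above (rewriting an existing remap item via its backref instead of chaining remaps) can be sketched with plain dicts standing in for the remap tree and its backrefs. All names here are illustrative.

```python
# Toy model: fwd maps original location -> current location; back is the
# "remap backref", keyed by the current location, so a later relocation of
# that range can find and rewrite the original item in place.

class RemapTable:
    def __init__(self):
        self.fwd = {}   # origin -> newest location
        self.back = {}  # newest location -> origin

    def remap(self, old, new):
        # If `old` is itself the target of an earlier remap, update that
        # earlier item, so lookups never have to follow two hops.
        origin = self.back.pop(old, old)
        self.fwd[origin] = new
        self.back[new] = origin

    def resolve(self, loc):
        return self.fwd.get(loc, loc)
```

Relocating a range twice leaves a single-hop mapping from the original address straight to the newest one.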
Keep in mind that we're no longer tracking metadata references in the extent tree, so the only "deletions" for metadata blocks will be adding drop items for any shared metadata blocks, so significantly fewer delayed metadata operations per transaction, which will help everybody. For checksums we could delay checksum removal, though it makes the ENOSPC tracking annoying because we'd have to transfer the ENOSPC reservation from the delayed ref to the actor on the csum tree. But that's the only annoying part; we could easily just use an extent_io_tree to mark csums that need to be deleted and post-process that as a last step in the delayed ref running.
OK so that's different from how ZBD/ZNS seem to use relocation. They want to do garbage collection. So if half of 500 extents are deleted and garbage collected, they'd be looking at say 125-249 reloc items? 125 for packing half of the remaining extents into the spaces between the other half; 249 for squashing all the extents up against the first one. Or a simpler example with 10 extents:
OK, so 188GB in memory for a 50TB filesystem...hope that's pageable? :'O I mean it obviously works in simple relocation cases (like you have 5 raid devices and you want 6) where you just want to pick up a block group and drop it somewhere else without changing anything about its structure, even the free space inside it. The garbage-collection use cases don't seem to fit in that model, though. They seem better off with the traditional relocation, or maybe an enhanced version of defrag that coalesces the physical extents while moving them.
I was thinking of pushing them all the way out of the transaction. Have something like the log tree on disk, called the delete tree, and all it does is record which extents and csums have hit zero references and need to be removed, so we can get the transaction over with. Then btrfs-cleaner or similar comes along and does the actual removal (i.e. go fetch the extent and csum tree pages, remove items from them, repeat) later on, without blocking anyone, because it can simply stop consuming extents from the delete tree at any time. Maybe if the filesystem starts getting really full we go back to deleting csums in-transaction to get some space freed up, but if used space is trending downward (as we'd expect it to be if we have a huge pile of csums queued up) then there's no particular rush. As I write this, I open an ssh window on a host, run
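The delete-tree idea above amounts to a persistent queue that a cleaner drains in bounded batches. A toy in-memory model (the class and method names are made up for illustration):

```python
from collections import deque

# Record zero-ref extents/csums cheaply during the transaction, then let a
# cleaner pay the expensive random-I/O cost later, a bounded batch at a time.

class DeleteTree:
    def __init__(self):
        self.pending = deque()

    def log_deletion(self, bytenr, length):
        # cheap sequential append at commit time
        self.pending.append((bytenr, length))

    def cleaner_pass(self, budget, do_delete):
        # the real extent/csum item removal happens here, bounded per pass
        done = 0
        while self.pending and done < budget:
            do_delete(*self.pending.popleft())
            done += 1
        return done
```

The key property is that `cleaner_pass` can stop at any point, so no single transaction is on the hook for the whole backlog.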
I'm going to set the csum thing aside and focus on the remap tree for now. You make a good point for the 50tb case. This wouldn't work out exactly like that, since you're filling block groups as you relocate them, and if you write 50tb and then delete every other 4k block you get what you deserve, but it certainly highlights the drawbacks of this solution.

Let's talk about what we want relocation to do, then. It's essentially garbage collection. In its current state it uses the extent tree as it exists, with all of its backreferences, to find all of the blocks in a block group, copy them somewhere else, and then go back and update all of the pointers to point at the new location. That means if you have 1000 snapshots that share a block, it must update 1000 paths to that block to point at the newly relocated block. This currently has to be done in a single transaction, and because we're doing the btrfs_inc_ref() as we walk down we're talking anywhere from 500k to 4 million delayed refs in a single transaction. This is where your 45 minute transaction commit shows up. This also means we only continue to share this one block and every block underneath it. If it is a leaf then we've just added 8k new blocks so we could "share" this one relocated block. For every level up we're saving 1000 blocks, but it's still a pretty large space explosion.

Can I make the current scheme less shitty? Sure, I can add stuff to track where I am in the merge process. This allows us to rate-limit the amount of delayed refs we generate per transaction. However, relocation is still a function of how many paths to a shared block we have to update. So the relocation itself is still going to take all eternity, and take longer for every snapshot we have. Can I make it better WRT the number of snapshots? Maybe. I could do the inverse of what relocation does today.
Instead of walking down the reloc tree and then relinking a path for a snapshot to the new block, I could simply build a tree that matches the original snapshot and then just swap the root to point at the reloc root. This only really works if the snapshot doesn't change, though. So it helps my stupid case of 1000 snapshots, because likely only 1 root is being modified and the rest are staying static, but in reality this isn't how things work: every root is going to have slightly updated stuff, so keeping state and swapping whole trees is super fragile and likely to break in a horrifying way.

Finally, snapshot aware defrag is a similar problem, just limited to data. It is bound by the number of referencing trees, and we're exploding usage for every operation to update the tree. And all of these still operate with the original extent tree. With extent tree v2 it is impossible to figure out who owns discrete blocks simply from the bytenr. If we read the actual block we can determine who references us, which means we have to walk all trees looking for a block in a particular block group. If we wanted to make old-fashioned relocation still work I wouldn't be able to change the extent tree like I want to.

Which brings us to why we want relocation in the first place. Well, because we divide up metadata and data allocations into specific block group types. What if we stopped doing that? We could certainly do away with block groups completely, simply have a device tree that matches stripes, and then we can only operate on entire devices instead of individual chunks. This is not as horrible as it sounds; in the end all we really want to accomplish is having blocks striped in the way we configure, so the extra granularity adds flexibility but also headaches because we can't mix things. But we have zoned support, which does actually have a certain granularity that it wants to operate on: the zone size.
So we can't just remove the notion of block groups, we have to keep them around. And we need to be able to garbage collect, so we have to keep some sort of GC method around.

Snapshot aware defrag is hugely problematic given how btrfs is designed. Since all of the metadata is in one tree, we can actually end up with a huge amount of unsharing between snapshots with just basic things like atime updates. This means that we can't easily whole-hog replace extent mappings for an inode, because we could have modified unrelated metadata that shares paths with the extent items, and that prevents us from doing easy things like swapping whole subtrees into a snapshot. We could do something like per-inode extent trees. This would be good for large files with lots of extents, but you don't want to waste a 16kib leaf for every inode on the file system, so you still have to contend with the unsharing of lots of snapshots.

The real thing that hurts us here is concurrent modification of the trees. If we're unsharing random things we have no choice but to unshare loads of paths to splice in new locations. We could solve this by disabling concurrent modifications of file system trees while we move shit around. This would certainly work nicely from a complexity standpoint, because then we can simply move shit around for one tree, and then update all the other trees to point at the new paths. In our 1000 snapshots example this means we're updating 1008 blocks, which is pretty nice. What's not nice is that nobody can write to any of these snapshots while we're doing relocation. This is like fs-freeze, only worse, because the time we're frozen is bound by the number of snapshots we have.

Which brings us back to the mapping tree. It's even worse than you highlight, because again we are forced to stick these in the system chunks, since we have to bootstrap their location via the superblock at mount time.
So we can't actually store 188gib of remap items; we can store as much as the system chunks we can have, which is in the worst case 14, so 14gib total for storing the block group and remap trees. Which, even with no relocation, suddenly makes our maximum FS size around 390 petabytes instead of 8 exabytes. With remap items we'll cut down on the number of block group items we can have, and we could make it so you weren't allowed to have more than a couple of TiB (in the worst case of course).

I could split the block group and remap trees into per-type trees, then only stick the metadata block group and metadata remap trees in the system chunks, and the data block groups and data remap trees in the metadata areas. Now we're only limited to 393 pb of metadata, or a few TiB of metadata in the worst case. This is an easier pill to swallow I think, but still kind of awkward.

Another thing I could do would be to limit the size of the remap items. Instead of letting us remap at sectorsize granularity, we limit ourselves to a specific chunk size, say 8mib. This turns our worst case remapping into 6kib per block group, so our theoretical 50tib fs only needs 340mib to describe the entire file system. This wouldn't allow us to compress down as nicely, and in fact wouldn't let us relocate this 50tib fs because we wouldn't be able to find 8mib chunks, but this may be an acceptable trade.

This is a long post to say that I've thought long and hard about this problem. Relocation just sucks; there's no good approach that works well. The remap tree can be made to be less awful in the worst case, and allows me to still make all these other changes that I want to make. With my current plan I eliminate unbounded modifications and delayed ref generation. If I were to leave relocation in place like it is, I would have to leave the extent tree like it is, which means leaving ref count pushing like it is, which leaves us in the current state WRT delayed ref updates for shared blocks.
I think making our garbage collection simpler but more coarse is a fine price to pay for what we're going to get in return.
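The 8 MiB chunk-granularity arithmetic above roughly checks out; the per-item size here is an assumed ~48 bytes (key plus header plus data), so treat the exact totals as illustrative.

```python
# Worst-case remap metadata under coarse (8 MiB) remap granularity.
CHUNK = 8 << 20             # 8 MiB remap granularity
BG = 1 << 30                # 1 GiB block group
ITEM = 48                   # assumed bytes per remap item
FS = 50 << 40               # 50 TiB filesystem

items_per_bg = BG // CHUNK           # 128 items for a fully remapped bg
per_bg_bytes = items_per_bg * ITEM   # ~6 KiB per block group
total = (FS // BG) * per_bg_bytes    # worst case for the whole filesystem
```

That lands at ~300 MiB with these assumed sizes, the same order as the ~340mib figure quoted above.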
My original plan was to stuff the remap tree and block group tree in the system chunks, and then simply cow any blocks in a system chunk for any relocation of the system chunk instead of doing the remap thing for system chunks. Clearly this has all the drawbacks I listed above, we're limited on system chunks. However I still need this property, so I'm going to add another chunk type that will hold the remap and block group tree, and give it the same properties. This way we can have as large of a remap tree as we want and not have to worry about it. It'll be
That has some disturbing implications for dedupe, but at the same time maybe there's a better way to dedupe? Dedupe is in a sense like a single-extent relocation, except that instead of copying data from point A to point B, we claim credit for a copy someone else previously did, update all the references to A so they are now shared references to B, then free the space at A. Right now dedupe metadata updates are even worse than the snapshot relocation case. Snapshots can share some interior nodes along the paths between extents and roots, while dedupe has to permanently unshare all nodes along every path as well. It's possible to optimize that a little since all the shared leaf nodes get exactly the same modification--but the best case in terms of iops and space consumed is still equivalent to "one-extent balance." Ideally, we do dedupe by reading data blocks the same way scrub does (now), which gives us some important benefits:
When we find a duplicate data block, we have to figure out what files the block belongs to, so we can update data pointers. We also have to know which of two duplicate extents to keep, e.g. we want to keep the one with more existing references, or the one with the longest run of contiguous duplicate blocks, or whatever metric we're using for the extent cost model. Both of those require a bytenr to list_of_forward_refs map. btrfs currently provides

The remap tree (assuming all scaling issues are resolved) could make dedupe possible without touching subvols at all. A dedupe agent can find identical data blocks and remap them without knowing or caring which files they belong to. The gotcha is that AFAICT the remap tree doesn't handle overlapping/shared references at all, and even if it did, it's not scalable to millions of duplicate files (or maybe it is, but its performance is equivalent to the csum tree?). If the filesystem doesn't provide a reverse map (like

To be clear, I hate it because I'm going to have to do a lot of work if I want bees to work with extent tree v2, not because extent tree v2 is not a good idea for btrfs. It's probably possible to build a logical->(inode,offset) map and keep it up to date incrementally with tolerable effort (
Oops, sorry @Zygo: it'll no longer be possible to say "who owns this METADATA block"; we'll still track all the same extent information per inode. So all of the normal stuff we can do will work, and with the snapshot tree we'll be able to very cheaply map a shared extent to all of the referencing subvolumes. The only thing we're losing with extent tree v2 is being able to easily pinpoint which metadata blocks are allocated where, and who owns them. We'll have to do that by walking trees.
Oh OK, that's much better. Though I did get a few good ideas from the exercise of figuring out "how could bees work without

It would be really handy if there was an ioctl for dedupe that could change data pointers in shared metadata blocks without unsharing them (or changing length, since that might require inserting new items). I had been thinking of doing something along these lines for extent tree v1 (where we can use the backreferences and walk upward from the data extent) but never found the time to try it (all the time I did find got burned trying to understand how balance does what it does). If we need a top-down tree walk to work, bees could provide a map of millions of duplicate (bytenr, length, generation) pairs at a time to amortize the cost (the generation is there to detect modifications--if the extent at bytenr doesn't match by the time the tree walk gets to it, then dedupe is skipped). It would basically be a two-phase algorithm: phase one uses memory on a hash table to discover duplicate extents, phase two loads the duplicate extent map into memory where the hash table was, and walks the file tree again, replacing any extent on the list as it goes. Two-phase dedupe algorithms suck, but it would be worth it if we can keep shared metadata pages shared.
That might still be a challenge for ZBD/ZNS garbage collection, but metadata is a much simpler case: all blocks are the same size, there are many fewer of them than data, and there's a steady stream of new ones. Garbage collection could lazily insert new metadata writes between old metadata block copies, and it would only need one remap item to cover the beginning of the block group to the write pointer. Or something like that--ZBD/ZNS isn't really my bailiwick.
Some notes from today's btrfs developer meeting.
So there would be no conversion at all? That would mean we could never get rid of extent tree v1, right?
I don't want to write off the conversion tool idea yet. We have the infrastructure in place in convert, and even with the advanced features like extent sharing and whatnot, it's still doable in the same way: writing new metadata into the holes and switching the superblock at the end. In some sense I think this would be even easier, as both versions would use the same code for modifying trees; the difference is in the logical layer of the extent tree.
I have a requirement that would allow something not possible with v1. We've discussed it elsewhere, so I'll repeat it here too: subvolume groups. Right now we have one global space where extents can be shared, either by reflink or as a snapshot. The idea is to add more such extent-sharing domains. The toplevel one would always be there. Extent sharing cannot cross a domain, which is the only visible difference from what we have now. A followup feature is encryption, which works on the extent-sharing domain and allows doing things differently in each group: assigning different keys, full/partial/no metadata encryption. Based on the discussion, this should be possible with v2. |
I typed a whole update and then closed the tab, I hate my life. Here's an update of the problems we've been encountering with the extent tree v2 design https://www.youtube.com/watch?v=BT-FwntK7RI tl;dr |
This would be extremely appealing for Fedora Linux, as we're actively trying to come up with a strategy for seamless full disk encryption that doesn't require reformatting and reinstalling. If v2 gives us the flexibility to pull this off with strong encryption all the way down to metadata, then this would be utterly fantastic. Of course, this would also include being able to do "blind" send/receive (that is, send/receive without decrypting the volume or the data for the send stream). |
Yeah, that's independent and covered by Omar's patches. |
Which ones are those? I haven't seen anything land recently for this... |
Extent tree v2 changes

Per-bg roots are now simply N global roots

The original plan was to allocate a new csum, extent, and drop root for every block group. This would make lock contention better, but at the cost of really exploding the number of roots we would potentially have to update per transaction commit. We could solve this partly by increasing the default BG size, but this isn't possible for zoned block groups. Instead we will pre-allocate N roots at mkfs time. This can be user tunable, but will basically pick NR_CPUS as a default. Block groups will then point at their set of csum, extent, and drop roots on creation. So if we have 4 CPUs, we'll have 4 csum roots, 4 extent roots, and 4 drop roots, and then block groups will point at one of these 4 IDs for their set of global roots. This solves the lock contention problem without drastically exploding the number of roots that could be modified during a transaction.

This also allows us to have global roots for other things in the future. For example, if we wanted to have per-subvolume csum roots we could potentially do this by adding a field to the root to point at its own personal global root id set.

We need to keep bookended extents

Unfortunately my scheme to do overlapped extents and free space as required will have to go. The overall plan works fine, but if you throw qgroups into the mix it suddenly becomes a nightmare. At that point we can't really tell if a data extent is shared or exclusive, because there could be overlapping ranges, so we would have to go to the extent tree to check for overlapping ranges at every given point. We will have to keep the current scheme as it is to make qgroups not suck. A possible future option is to post-process these extents in order to free up space, but that'll be outside the scope of this change. 
Open questions

The snapshot+drop root tree is a nice solution for tracking changes, because it only requires 1 tree modification per level of a shared tree, and it allows us to easily figure out what is shared where, sort of.

Drop tree pros
Drop tree cons
Chris’s proposal is to essentially dump the delayed refs tree to disk. Our most recent compromise was to do both things: have the snapshot+drop tree, and then have a garbage collection tree to handle the “cons” cases above after the fact. However the cons for drop tree+snapshot tree are very wonky and don’t really fit well into any particular mental model. You just have to know that these are problems, code around them, and hope they don’t hurt us in the long run.

Alternative proposal

We do what Chris suggested early on, and simply push our delayed ref stuff to an on-disk tree. This has the upside of keeping our reference counting the way it has always been, and is relatively straightforward to keep track of. It has all the downsides of our current approach; however, we can make those downsides less painful. Part of what makes our current approach so painful is that every delayed ref must be run in the transaction it was generated in. Because of this we can get bad cases where we’ve completely blown out the transaction commit time because we have to run the delayed refs. However, if we simply put this work into an on-disk tree, we can post-process the extent tree updates at any point, allowing transaction commits to happen at any time.

Furthermore we can take our most expensive operations, btrfs_inc_ref() and btrfs_dec_ref(), and turn them into single items in the delayed ref tree. This doesn’t mean we avoid the cost associated with btrfs_inc_ref() and btrfs_dec_ref(), but it does mean we can delay it to a point where we may have less churn on the file system in general. Meaning we will still at some point have to actually do the work of btrfs_inc_ref() and btrfs_dec_ref() in the future, but we could end up cancelling operations out and end up with less actual churn on the extent trees.

Pros
Cons
What do I want?

I want all the core btrfs developers to read the above and pick a side, or suggest alternatives. At our meeting next week I want us to decide on what we think the best way forward is, and then I’ll go down that path and see if there’s anything we didn’t think about that may change our minds. |
This is similar to how xfs is doing its internal split into allocation groups. One question I have is how BGs are going to be tied to their respective set of roots - perhaps a new item type would be necessary, whose body would include the 3 block pointers to the respective roots? Giving an example of a possible structure for those items would be useful. Do you think it will be useful to tune the global roots count after the fs is created? I envision it can be done at the cost of added complexity, i.e. if you have a fs on one machine and then upgrade the CPUs, would we want to scale the existing fs accordingly? This might require balance in order to bring things into, well, balance :)
Can you expand a bit more on this or it's just "thinking out loud"?
Would this post-processing require on-disk format changes if so it might make sense to think about it harder now so as down the line it can be enabled with less pain/format-breaking changes.
To me this reads as "this needs to be very explicitly documented so that anyone who does changes to relating code should be well aware of the pitfalls and be able to mitigate them.
Doesn't this open a whole new can of worms, because we now need to implement logic to decide when is a good time to process the delayed refs tree and apply its changes to the extent tree? What this delayed refs tree becomes is really another type of log tree, no? I assume its changes would need to happen within a transaction, so say in T1 (trans 1) we create a bunch of delayed refs + other operations, and we write the delref tree on-disk without applying it to the extent tree (what drives the decision to write the tree as-is rather than process it in this transaction, i.e. do we base the decision on some threshold?). Some time later a T2 is about to be committed: do we keep on appending to the delref tree or do we run everything in T2? Another route we could go is perhaps to have the cleaner thread open a transaction to reconcile the extent tree/delref tree; however, if in the meantime we have other operations in progress they will naturally open a handle to the transaction running the delref/extent tree reconciliation, meaning they could possibly bear the cost of previous changes to the fs.
|
Yes so the idea is that we create N number of global roots at mkfs time, but at any point in the future you can easily add new sets. I'm going to re-use the chunk_objectid to point at the "global root id", since that is always set to BTRFS_FIRST_CHUNK_TREE_OBJECTID currently. So we allocate a block group, we pick whatever "global root id" we want next, and set it in this item. Then any bytenr that falls inside that block group just looks up the block group, gets the "global root id" from the block group, and then uses the appropriate helper, so if we want the extent root we do something like
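A toy model of that lookup, with hypothetical names and structures (the original code snippet is not preserved here, and this is not the actual kernel helper): the block group an address falls in carries a global root id, which indexes into the N pre-allocated sets of roots.

```python
# Illustrative model of the "global root id" lookup described above.
# Names, sizes, and structures are hypothetical, not kernel code.

BG_SIZE = 1 << 30  # pretend every block group is 1GiB

# N == 4 pre-allocated sets of global roots (think NR_CPUS == 4).
global_roots = {i: {"csum": f"csum_{i}",
                    "extent": f"extent_{i}",
                    "drop": f"drop_{i}"} for i in range(4)}

# Each block group records the global root id it was assigned at
# creation (here: round-robin by position on disk).
block_groups = {bg * BG_SIZE: bg % 4 for bg in range(8)}

def lookup_extent_root(bytenr):
    """Map a logical address to its block group, read the global
    root id from the block group, and pick the matching extent root."""
    bg_start = (bytenr // BG_SIZE) * BG_SIZE
    root_id = block_groups[bg_start]
    return global_roots[root_id]["extent"]

print(lookup_extent_root(5 * BG_SIZE + 4096))  # extent_1
```

Adding more global root sets later just means more ids that new block groups can be assigned, which is why rebalancing across them falls out naturally.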
So if after the fact we want to add more global roots, well then that's just more global root id's we can assign when we make block groups. If we want to re-balance everything across more block groups then we can just balance to naturally spread everything across the new block groups.
Just thinking out loud, kdave mentioned possibly wanting to do this, having the root id thing makes it easy to make these changes in the future.
I'm still thinking on this, it's kind of an ugly situation and would require changing the file extent items and the extent items themselves. I don't think we could even accomplish this with postprocessing without exploding metadata usage if there were snapshots, so I don't think this is a realistic option.
Of course, this change would require a heavy amount of documentation and commenting everywhere so it's clear what the rules are. This is a con, since it would completely change everything about how btrfs works WRT space tracking, and would take a while for everybody to get used to it.
Right, so it is essentially a form of log tree, but because we'll run all references through this tree we don't have any deadline of when we need to apply the changes. We could potentially have 100's of transactions of extent reference changes queued up in here, and we would just apply them in the background at our leisure. The only time we would absolutely need to process everything is at ENOSPC time.
That's the point of this change: to make it so there are far fewer operations that absolutely must be completed in the transaction the modification occurred in. This has the drawback of echoing out into other transactions, so we could be doing work long after the user has stopped writing, but the changes would be discrete and thus we'd incur less latency at transaction commit time, as we won't have operations that we must complete before committing the transaction; they just become their own operation inside of a trans handle. |
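The "cancelling operations out" upside of queuing refs across transactions can be sketched with a toy model (hypothetical structures, not actual btrfs code): if inc and dec items for the same extent accumulate in the on-disk delayed ref tree before background processing runs, merging them first means fewer real extent tree updates.

```python
from collections import defaultdict

def merge_delayed_refs(ref_items):
    """Sum up +1/-1 reference deltas per extent bytenr, as queued in
    a hypothetical on-disk delayed ref tree across transactions."""
    deltas = defaultdict(int)
    for bytenr, delta in ref_items:
        deltas[bytenr] += delta
    # Only non-zero deltas require touching the extent tree at all.
    return {b: d for b, d in deltas.items() if d != 0}

# A ref was added and then dropped for bytenr 8192 before the
# background processing ever ran, so its updates cancel out and
# only one extent tree update remains (for bytenr 12288).
queued = [(8192, +1), (12288, +1), (8192, -1)]
print(merge_delayed_refs(queued))  # {12288: 1}
```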
I've found that this issue is getting a bit hard to tell what the actual design is. To help facilitate better discussion, I've made a branch of the documentation that I will submit (in its finalized form of course) once the code is merged. This way you can easily see the current state of the design, and then we can continue the discussions here https://github.com/josefbacik/btrfs-dev-docs/blob/extent-tree-v2/extent-tree-v2.md |
Removing bookended blocks is possible already. Dedupers on btrfs already have to deal with bookends when an extent has partially unique, partially duplicate data. The entire extent must be destroyed, so the deduper must create a temporary copy of the unique data, and then all the data in the extent is duplicate, so we can dedupe all of it (or, depending on implementer preference, don't dedupe any part of the extent--but definitely not go half way and create a bookended extent). The same idea can be used to make a garbage-collection tool, though it would obviously be more efficient if we had an ioctl that said "copy this list of blocks into a new extent and map them on top of the old extents" in the kernel instead of doing it all from userspace.

Obviously it's less IO to be able to split up bookended extents in-place without moving their data around; however, it's usually quite desirable to defrag the data at the same time, since the bookending will be leaving tiny holes everywhere otherwise. Might as well kill two birds with one stone and make an integrated autodefrag + garbage collection tool.

Ideally we'd have a way to relocate data in a shared metadata page without making the page unshared (i.e. we figure out when we're making the same change in every snapshot, so we only have to change the roots of the subvol tree, not the leaves, but we'd still have to change everything referenced by metadata page bytenr in the extent tree). That would make snapshots a lot less space-intensive if we can figure out a way to do it. There'd still be a ton of iops and I don't see a sane way to get rid of that. |
We could do this with a snapshot-aware defrag; the problem is that with the current design (and extent tree v2) we simply can't get around the unsharing of a massive amount of metadata. Say we have 1000 snapshots and we want to defrag a single inode, and everything fits nicely onto one leaf: we're going to unshare 1000 paths to that inode to update the file extents for all paths. That just sucks.

One thing we could possibly do is de-couple the file extents from the fs tree itself. We would want to limit this to large files that span more than one leaf worth of extents, but this would solve the problem nicely. Then we can just fuck with the file extent tree for the inode, and all 1000 inodes just get the new thing via the new extent items without unsharing the entire fs trees.

This wouldn't be great for a variety of other reasons. First we'd need a way to find the file extent root, which means a massive explosion of the number of trees we'd have to track. Maybe we do something fun like have a global root for file extent roots, but we can't get around the fact that there would be a lot of root items that need to be updated every transaction. We can't stuff the bytenr in the inode, because then we're back at the same problem: we have to update 1000 paths to point at the new root bytenr. Maybe I'm overly fearful of updating 1000 inodes' file extent root items in a transaction; we could probably add the cost to the trans handle that does the work, but somebody is going to end up paying the cost somewhere.

And all this gains us is the ability to clean up bookend extents and have snapshot-aware defrag. These are definitely big upsides, but for a certain relatively large btrfs user we wouldn't really care, and the latency induced could be a bummer. 
Clearly I'm still chewing on the problem, maybe I'll have a better plan in the future, but I think I've bitten a bit much off with the current set of extent-tree-v2 changes, and I'd rather not try to tackle something I don't have a clear path forward on. |
Also I should note we could probably go ahead and implement snapshot aware defrag to do essentially what relocation does today. So if a user were so inclined to do snapshot-aware defrag, and was willing to eat the cost of the unsharing of the data, then they could get it done with an ioctl. But I'd want to put a giant message that says "btw this could suuuuuuuuuuuck if you have a lot of snapshots" on our tools. |
Right, but do we have to unshare all pages in all of those paths? Obviously the roots are all different so we need 1000 new pages there (and delete 1000 because they were all unshared to begin with), but if the leaves all end up identical, can we keep the leaves and some interior nodes shared, and only update say 1002 pages (one leaf that stays shared, one interior that stays shared, 1000 roots that weren't shared before and still aren't now)? But then all the extent items might have a backref pointing to the metadata page. If these extents all use inline extent backrefs then they're OK (as long as the tree lookup still points to the new page), but if they're using shared block backrefs then we have to update them all--worst case a few hundred extent tree items and their parents. |
Hmm we could maybe do this, it would be akin to what relocation does, so it wouldn't unshare anything that is currently shared. We may fuck up if a snapshot gets changed in the middle of our operation and we can't link in the shared block anymore, we would either simply not defrag that inode, or have to loop through again to see if there was anybody that still needed to be updated. I'll think about this some more. |
bees ends up being a more or less perfect metadata unsharer in practice--dedupe touches just enough of every page in a snapshot to force unsharing on all of them. I am more than a little familiar with the worst case suck, because it's the state my larger snapshot-heavy filesystems are in all the time. In theory we don't have to unshare anything in the dedupe case, but userspace doesn't have the interfaces needed to avoid it. I've had "figure out how to make dedupe work more like balance" on my TODO list for years. I've also had "figure out how to make defrag and garbage collection that sucks less than the current stone tools" on the same TODO list...
My thinking here is that userspace defrag/garbage-collect would say "please relocate [list of extent, offset, length tuples] contiguously", so the kernel would be looping through backrefs to every metadata leaf page referencing the extents. After one iteration of the loop, the kernel might run the whole loop again to see if any new references to the extents popped up (new snapshots, reflink copies, whatever). This wouldn't treat shared or unshared pages differently, so if a page gets unshared in the middle then it's handled the same way as if it was unshared from the beginning. |
I had a couple of other "while we're making big incompatible on-disk changes" requests which I wrote while you were writing "please no more this is too much churn already" ;) but well here they are:
|
Thanks for implementing extent tree v2.
|
This should not happen and is not relevant for extent tree v2, it's either a hardware problem that does not properly flush the data or a bug. |
I'd like to ask something that it's hopefully a bit more relevant: right now enabling quotas with lots of snapshots hangs the whole system for minutes, is it a problem extent tree v2 is going to alleviate? |
Yes, there's a lot of reasons for this, and I'm addressing all of them (hopefully?)
|
Thanks very much for all improvements. |
It looks like squota won't be the magic bullet we were hoping for. |
Squotas still works in this case, it just creates this awkward thing where the usage will be attributed to the deleted subvolume and you won't be able to see that. This is the example
This is what we're referring to. Because we're not live tracking shared accounting, and we only attribute the usage to the original root, you can end up "losing track" of your actual usage. Now of course you can du -h /my/snap and see that's where the space is, we don't lie about actual usage. But if you list the usage from squotas it's only going to tell you what is actually accounted to the quota for that subvolume. As for extent tree v2, yes I'm actively working on it. With the scope of the project the design has had to change while I was developing it and discovering flaws in certain areas. I hope to be code complete in the next couple of months. |
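A toy model of that accounting gap (illustrative only, not the actual squota implementation): extents are charged to the subvolume that first allocated them, so after the original subvolume is deleted, space kept alive by a snapshot is still attributed to the now-gone original and doesn't show up in any live subvolume's listing.

```python
def squota_usage(extents, live_subvols):
    """Simple squota-style accounting model: every extent is charged
    to its original owner root, even if that root no longer exists."""
    usage = {}
    for owner, size in extents:
        usage[owner] = usage.get(owner, 0) + size
    # Usage attributed to deleted subvolumes is effectively invisible
    # in a per-subvolume quota listing, even though du would find it.
    hidden = sum(s for o, s in usage.items() if o not in live_subvols)
    return usage, hidden

# Subvol 256 wrote 1GiB, was snapshotted as 257, then deleted.
# The snapshot keeps the data alive, but it is still charged to 256.
extents = [(256, 1 << 30)]
usage, hidden = squota_usage(extents, live_subvols={257})
print(usage, hidden)  # {256: 1073741824} 1073741824
```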
Existing extent tree
The existing extent tree has a lot of big advantages, but some other disadvantages.
Advantages
However there are some disadvantages that are starting to become more and more of a problem. We also have some other things that are made much worse by the fact that we didn't design a system with the set of features we have today, so I would like to take this opportunity to rethink a few things related to the extent tree in order to set ourselves up better for the future.
Problems that I want to solve
finish_ordered_extent latencies go up because of the threads getting held up in lock contention on the csum tree. Latencies in finish_ordered_extent can cause latencies on anything that needs to wait for ordered extents, like the ENOSPC flushing code or fsync.

Uh, so how does changing the extent tree fix all these problems?
Firstly, it doesn't, at least not all of the problems. I'm wrapping all of these changes under the label "extent tree v2" because it's the biggest part, but it's actually going to be a series of on disk changes. However I do not want to do these things piecemeal because it'll hamper adoption. There needs to be a clear line in the sand where you choose "extent tree v2" and you get all these disk format changes moving forward. This will drastically simplify our testing matrix, as right now we have a lot of "features" that require toggling on and thus we have a lot of different features that can be mixed and matched. This will be a lot of changes, but it will make our lives easier going forward. My plan is to make the following specific changes
This all sounds great, are there any downsides?
Scrub
The biggest downside that I can think of right now is scrub. Right now scrub is relatively straightforward: it marks a block group as read only, finds all the extents in that block group, and reads them, because you can find the owner easily, and bam, you're done. With the extent tree reference counting no longer including information about cowonly trees and non-shared fs blocks, we'd have to read all of the trees by searching down them. This means that we would have to keep a cache of fs tree blocks that we've read in order to make sure we don't scrub the same trees over and over again. This also makes it tricky because we can no longer just mark a block group read only while we do our work.
This isn't a huge drawback, data would still work the same, and we can simply search the commit roots and skip any blocks that are younger than the generation that we started the scrub with. It will be more memory to track where we've been, but the overall complexity should be relatively minimal.
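The commit-root walk with a generation cutoff could look like this toy model (hypothetical node structure, not kernel code): any block whose generation is newer than the generation the scrub started with was written after the scrub began, so it is skipped rather than descended into.

```python
def scrub_tree(node, scrub_gen, scrubbed):
    """Recursively scrub a tree from its commit root, skipping any
    block written after the scrub began (generation > scrub_gen)."""
    if node is None or node["gen"] > scrub_gen:
        return
    scrubbed.append(node["bytenr"])
    for child in node.get("children", []):
        scrub_tree(child, scrub_gen, scrubbed)

# One child was rewritten (gen 55) after the scrub started at gen 50,
# so it is skipped; new writes are verified by checksums on write.
root = {"bytenr": 100, "gen": 50, "children": [
    {"bytenr": 200, "gen": 49, "children": []},
    {"bytenr": 300, "gen": 55, "children": []},
]}
out = []
scrub_tree(root, 50, out)
print(out)  # [100, 200]
```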
Backreferences
We would no longer be able to look up who owns arbitrary metadata blocks. I don't believe this to be a big issue because
I think the only thing this affects is something like scrub, but that's because before we'd read arbitrary locations and use the reference to know which tree it belonged to. With the new scheme we'd be searching down from the tree root, so we'd know which tree we were in, and thus could print the correct information.
The specifics
A block group tree
As indicated in a comment below, our block group items are spread all over the disk, which makes mount times a big problem with really large drives. Fix this by making a tree to hold all the block group items. That's it; it's pretty straightforward, no other real changes here.
Per-block group trees
This will be easy enough, we know the logical offset of the bytenr we care about, we can find the block group, and thus will be able to lookup the corresponding per-bg root for whatever operation we're doing.
Track only shared blocks and data in the extent tree - CHANGING DRASTICALLY, SEE COMMENT ABOUT DROP TREES
This is the big scary thing. You can find a description of how references currently work here and here. I'm not proposing changing the items as they stand right now, simply the rules around how we change them.
We no longer add extent references for cowonly trees. This is simple and straightforward, we just don't update the extent tree with those references. These blocks only get a ref added on allocation and it removed on deletion, there's no special rules for them.
We no longer add extent references for non-shared fs tree blocks. This is where things get a little tricky. A non-snapshotted fs tree will have 0 entries for its metadata in the extent tree. In practice a non-snapshotted fs tree acts the same as a cowonly tree, we add a ref on allocation, delete it on deletion. The trick comes into play when we snapshot. The same rules will apply as they always have, with a slight change if we are COW'ing down from the owner root.
We will have a normal reference for everything at level 1 for A' but not for A. We'll cover two full examples: first COW'ing from A and then A', then the reverse.
COW from A
COW from A'
And now for the other way, starting with a cow down from A'
COW from A'
COW from A
The implied reference from the original owner is somewhat tricky, so the logic in update_for_cow() would need to be updated to account for these rules, which are simply
Data extent references
Data extent references need to continue to work as before, as we have more complicated operations we can do with data, such as clone. The only change here is we no longer do bookend extents. Currently the way we handle writing to the middle of an existing file extent is this (this is the normal non-compressed/encrypted case)
The space that is no longer referenced that exists in the slot where the new file extent was placed is now wasted, as it will not be free'd until the left and right side of the extents are eventually freed.
The new scheme will be the following.
The space in the area that the new file extent replaces will be freed to be re-allocated again.
The compression case will remain the same, as we have to have the entire compressed extent to extract the area the file extent points to, but this isn't as bad because we limit our compressed extent sizes to 128k, thus we can only waste 128k-4k worth of space per extent at any given time because of bookend extents, compared to 128m-4k worth of wasted space per extent with the normal case.
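The worst-case numbers in that comparison work out as follows (a back-of-the-envelope sketch, assuming 4K blocks):

```python
# Back-of-the-envelope worst-case bookend waste per extent: every
# block but one is overwritten, yet a surviving bookend reference
# pins the whole original extent on disk.
K, M = 1024, 1024 * 1024
BLOCK = 4 * K

def worst_case_waste(max_extent_size):
    """Wasted bytes when only one block of the extent is still live."""
    return max_extent_size - BLOCK

print(worst_case_waste(128 * M))  # 134213632 bytes (uncompressed, 128M cap)
print(worst_case_waste(128 * K))  # 126976 bytes (compressed, 128K cap)
```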
Stripe tree - PROBABLY NOT GOING TO DO THIS
EDIT: What I want to accomplish and what @morbidrsa want to accomplish are slightly different things, and trying to combine them will remove flexibility for both of us. I'm still going to tackle relocation, but this idea is being set aside.
This is fairly straightforward: it'll track the physical location of the logical address space. Traditionally this was just some math: we had a chunk that mapped the physical offset of a block group, so we would just take the offset of the logical address from the block group and use that offset off of the physical start of the block group.
We would replace this with a stripe tree that actually tracked physical locations for the logical offset, so it could be any arbitrary device and physical offset within that device. This would have the following items, again blatantly ripped off from @morbidrsa with some modifications for my uses as well
Relocation - CHANGING FROM MY ORIGINAL IDEA
EDIT: Since I'm not doing the stripe tree I want to handle this in a different way. My plan is to do something sort of like what is described below, but instead make a new REMAP tree. If we relocate a block group we will set a REMAP flag on its block group flags (maybe; I have to see if I can actually set the flags to something other than the data type), and then populate the REMAP tree with where I've relocated the extents inside the block group. On mount this gets loaded up and we will translate any IO to the new logical offset where the extent resides. Once all of the extents have been freed from the block group, the remap items will be deleted and the block group itself will be deleted. Of course the chunk will have been reclaimed by the time all of the blocks are remapped in the block group, so the space will be available; just the accounting will be removed later.
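The REMAP translation can be sketched as a toy model (hypothetical item layout): remap items map ranges of the relocated block group's logical space to their new logical addresses, and IO consults them before the usual chunk math.

```python
import bisect

# Hypothetical remap items, sorted by old logical start:
# (old_logical_start, length, new_logical_start)
remap_items = [(0, 65536, 1048576), (131072, 65536, 2097152)]
starts = [item[0] for item in remap_items]

def translate(logical):
    """Translate a logical address through the remap items; addresses
    outside any remapped range pass through unchanged."""
    i = bisect.bisect_right(starts, logical) - 1
    if i >= 0:
        start, length, new_start = remap_items[i]
        if logical < start + length:
            return new_start + (logical - start)
    return logical

print(translate(4096))   # 1052672 (inside the first remapped range)
print(translate(70000))  # 70000   (not remapped, passes through)
```

Once every extent covered by the items has been freed, the items (and the block group) can simply be deleted and lookups fall back to plain chunk math.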
The new relocation behavior would be to migrate block groups as normal, but instead of walking the extent tree we would walk the stripe tree finding all stripes that exist in our logical space. We would then go find gaps in other logical areas that could host our stripe, read in the bytes from a btrfs_stripe_extent, and then write them to a new stripe and update the on disk stripe for the extent. We would keep track of the in memory mapping in an extent_map, so the operation would look something like this
Direct and shared bytes tracking per root - PROBABLY NOT GOING TO DO THIS
EDIT: We still have to do lookups after the fact to figure out if a shared extent went from N to 1 references. And we can likely accomplish this behavior with the current qgroup code, just would require some changes to runtime tracking and we don't need to do it on disk.
There is one tricky problem with qgroups, and that is the conversion of a data extent from shared to exclusive. This is tricky because it requires a full backref lookup every time we modify an extent, so that we can determine if we need to convert the file extent bytes from shared to exclusive. There is not much that can be done about this problem unfortunately; however, we can make the intermediate tracking much simpler by storing the current shared and exclusive counts in the root items themselves. This would work like the following
This will reduce the complexity of tracking these counters across the board, and reduce the amount of backref lookups we have to do.
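As a rough illustration of the counters this would maintain (a sketch under assumed semantics; the actual item layout and update rules aren't specified here): snapshotting moves a root's bytes from exclusive to shared in both root items, and dropping the last other sharer moves the survivor's bytes back, which is exactly the N to 1 reference transition that otherwise needs a backref lookup.

```python
# Toy model of per-root shared/exclusive byte counters stored in the
# root items. Semantics are assumed for illustration only.
counters = {}  # root id -> {"exclusive": bytes, "shared": bytes}

def create_root(root, nbytes):
    counters[root] = {"exclusive": nbytes, "shared": 0}

def snapshot(src, dst):
    """All of src's bytes become shared between src and dst."""
    moved = counters[src]["exclusive"] + counters[src]["shared"]
    counters[src] = {"exclusive": 0, "shared": moved}
    counters[dst] = {"exclusive": 0, "shared": moved}

def drop_root(root, survivor):
    """Dropping one of two sharers converts the survivor's shared
    bytes back to exclusive (the N -> 1 transition)."""
    shared = counters.pop(root)["shared"]
    counters[survivor] = {"exclusive": shared, "shared": 0}

create_root(256, 1 << 20)
snapshot(256, 257)
drop_root(256, 257)
print(counters[257])  # {'exclusive': 1048576, 'shared': 0}
```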