-
Notifications
You must be signed in to change notification settings - Fork 83
Btrfs extent info
On this page I intend to write down my understanding of how btrfs handles extents on disk and how they can be re-read to build up enough state to count extent references. Most of this I learned as I worked on two projects - qgroups and btrfs-extent-same.
This document will not go into the delayed ref handling code (but that's all in memory anyhow so we need to understand what's on disk first).
https://en.wikipedia.org/wiki/Btrfs#Extents to understand that we have two mechanisms at work here. the extent-tree
is what is actually holding the extents. Files track their references to extents via extent data items
. The extent data items
are kept in leaf nodes of the corresponding fs tree.
btrfs-progs/qgroup-verify.c - the userspace quota group checker. Easier to understand than the kernel and by necessity uses the same algorithm.
btrfs-progs/ulist.c - A list implementation that is transparently handles duplicate entries. The API is frankly, ugly and error-prone. However ulist provides an important function in that it helps us avoid some otherwise naturally recursive algorithms.
An old e-mail from Josef Ignore the qgroup and defrag talk it's all ancient by now, but Josef does a good job of laying out out how extents and snapshots work together at the top of that e-mail.
The goal of qgroup rescan is to rebuild the qgroup counters. We do this by visiting every extent in the extent-tree
and finding all the subvolume roots which point to it. From there the qgroup code can easily figure which extents are exclusive or shared and adjust the counters appropriately. While the qgroup code is concerned with counting roots this is also essentially reference counting our extents.
The extent tree contains a few different item types depending on what the extent is used for. The ones we are concerned with are BTRFS_EXTENT_ITEM_KEY
and BTRFS_METADATA_ITEM_KEY
which point to data and metadata extents, respectively.
Aside from pointing to the data on disk, the extent tree items keep limited information on the number of refs and at least one backref pointer. A backref points to a either a tree root or a parent tree node (depending on type, more on that below).
We call the backrefs that are bundled in the extent item inline refs
- see struct btrfs_extent_inline_ref
and struct btrfs_extent_data_ref
for the specifics of each structure.
We get inline refs
by simply adding an extent to the filesystem - it will start with at least one inlined backpointer.
In addition to inline refs, there are keyed refs
which hold the same type of information but are stored within their own items inside the extent tree. Tree ordering ensures that keyed refs always come after the inline extent reference.
Btrfs uses keyed refs
when the space for inline refs in the extent item
is exhausted.
References (whether inline or keyed) have one of the following keys depending on the ref type and the extent type (data or metadata) - BTRFS_TREE_BLOCK_REF_KEY
, BTRFS_EXTENT_DATA_REF_KEY
, BTRFS_SHARED_DATA_REF_KEY
and BTRFS_SHARED_BLOCK_REF_KEY
The key type informs us as to what sort of information is stored. BTRFS_TREE_BLOCK_REF_KEY
and BTRFS_EXTENT_DATA_REF_KEY
contain a direct pointer to a root node - we can call these direct refs
. BTRFS_SHARED_DATA_REF_KEY
and BTRFS_SHARED_BLOCK_REF_KEY
contain a pointer to a parent tree block - these are indirect refs which have to be resolved.
Lastly, an implied ref is when a tree block has refs on it that may not exist in any of its child nodes. Even though the refs might not exist further down the tree, the fact that our interior node has a ref means we need to account all nodes below it.
We get implied refs when snapshotting a deep (> 1 level) tree. Basically btrfs wants the snapshotting operation to be as cheap as possible. In order to do this we just cow the root of the tree (NOTE: Double check, this may be the root + 1 level). This makes the snapshot extremely cheap but requires far more work at accounting time as we can no longer rely only on extent direct or indirect pointers to find the total number of roots referencing a given extent.
In the userspace code, this is handled by map_implied_refs()
and account_all_refs()
.
To find a root from an extent, we can simply follow the backpointers. If we have a direct ref, then we are done. Otherwise we use the parent pointer to look up the parent extent block and so on until we have reached a root.
To map implied refs, we visit every tree ref. If it's an interior node, we first map a root to it via the above algorithm. Then we descend down the actual tree node to each leaf. Note that this is the first time we're reading outside the extent tree. For each fs tree leaf we walk the items array and add refs for each EXTENT_DATA_KEY
we find (again via the above algorithm).
For setup, we find all the roots and allocate some tracking objects for each one.
Next is a straight-forward walk of the extent tree. This happens in scan_extents()
. For each extent ref (whether from a key or inline, direct, or indirect) we allocate a ref object. We also track tree interior nodes during this process to make the walk in map_implied_refs()
easier.
Then the ref objects are resolved via map_implied_refs()
first, then the rest in account_all_refs()
. The resolution algorithms used are described above.
After that we have a fully resolved tree of refs and can get our root counts by just walking them.
Clone generally places file_extent_items wherever asked. It makes no attempt to merge adjacent items that may be contiguous.