
Automatic balance after btrfs-cleaner #63

Open
Atemu opened this issue Feb 15, 2025 · 1 comment


Atemu commented Feb 15, 2025

Basic idea

  • When the cleaner finishes and has made usage% within a block group drop below a certain threshold, an automatic filtered balance is queued to get back above the threshold
  • This is on by default (important)
  • Can be turned off via mount option
  • Target % can be tuned via mount option
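
To make the interface concrete, a purely hypothetical sketch of what those mount options could look like (the option names are invented here only to illustrate the proposal; none of them exist in btrfs today):

    # hypothetical, illustrative option names (not real btrfs mount options)
    mount -o autobalance,autobalance_target=75 /dev/sdb /mnt    # tune the target usage%
    mount -o noautobalance /dev/sdb /mnt                        # opt out of the default-on behaviour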

Intended effect

  • This should mitigate many cases of ENOSPC
    • I've seen countless reports of users running into ENOSPC seemingly out of nowhere because they thought they had free space when in actuality it was all allocated but unused
  • It should also mitigate free space fragmentation to a degree
  • It would provide this baseline without requiring any userspace setup
  • It shouldn't cost very much to do in most cases
    • IME most chunks will be quite empty after the cleaner cleans a significant amount of data, allowing for quick compaction
    • If you free a whole bunch of data, you will suffer some amount of IOPS degradation for a longer period due to the cleaner anyway

What about btrfs-maintenance?

  • Yes, you can already use e.g. btrfs-maintenance to get most of the benefit
  • That runs on a schedule, however, rather than whenever space is actually freed
  • It requires userspace setup for each mounted btrfs
  • Users who like to run btrfs-maintenance could still do so, either as a replacement for auto balances or in addition to them
    • It requires setup for each mount point anyway, so it's not unreasonable to require setting a mount option
    • It doesn't really hurt to have both
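
For reference, what such a scheduled run boils down to is a periodic filtered balance roughly like this (the usage filter value is just illustrative):

    # compact only data block groups that are less than 50% used,
    # e.g. from a cron job or systemd timer
    btrfs balance start -dusage=50 /mnt/data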

Further refinements

  • Perhaps there should be some sort of hysteresis so as not to constantly queue balances; being slightly above or below the target is fine
  • ENOSPC could trigger an automatic balance to try and get the fs back into a usable state
  • Perhaps there could also be an absolute threshold for free chunks; to always try to keep some amount of chunks free, even if you'd need a higher usage% than the threshold to achieve that
    • It's better to have IOPS fall off a cliff due to automatic balances when the disk is close to full than to ENOSPC
  • The automatic balances could be run at a very low priority so as not to impact workloads as much

Alternatives

  • The automatic balance target could be per-chunk usage%
    • This would put better constraints on how much IO would be done, just like with manual balances
    • Easier to run into ENOSPC but that should only happen with very pathological usage patterns
    • Potentially more free space fragmentation?
  • Some way to notify a userspace process when the cleaner has finished
    • Would allow implementing all of this in userspace
    • Would lose the benefit of this being present OOTB for any btrfs without setup

Zygo commented Feb 15, 2025

Basic idea

Sounds like what echo 75 | tee /sys/fs/btrfs/*/allocation/data/bg_reclaim_threshold already does. Most of the proposal seems to be stuff we've already been running for years.

The only thing that is new or different in this section is the "on by default" part, which is a problematic change in behavior (but it could be enabled in userspace by distros etc).
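
For example, distros or admins can already wire that up in userspace with something like the following (a minimal sketch; it only covers filesystems mounted at the time it runs and would need re-applying for later mounts):

    # apply the data block group reclaim threshold to all currently mounted btrfs filesystems
    for f in /sys/fs/btrfs/*/allocation/data/bg_reclaim_threshold; do
        echo 75 > "$f"
    done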

should mitigate many cases of ENOSPC

I'm not sure what you mean here:

  1. ordinary ENOSPC, where write(2) returns an error because all data space is allocated and used, or
  2. btrfs catastrophic ENOSPC, where metadata runs out and forces the filesystem read-only.

These are very different and some of what you propose is counterproductive for one case but useful for the other.
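
A quick way to see which case a filesystem is heading towards is to compare allocated (Size) against Used for data and metadata:

    # per-profile Size (allocated) vs Used for Data, Metadata and System:
    # case 1 is data nearly full with nothing left to allocate,
    # case 2 is metadata nearly full with no unallocated space to grow into
    btrfs filesystem df /mnt/data
    # overall and per-device allocated vs unallocated space
    btrfs filesystem usage /mnt/data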

If you free a whole bunch of data, you will suffer some amount of IOPS degradation for a longer period

This is a problem, and balancing makes it worse. It would be useful to slow down the deletion process so that the IOPS are less degraded, e.g. keep it down to under 100 refs/sec and split it up into smaller transactions, to avoid long stalls during transaction commits that lock out all writers to the filesystem. Some relief for this might be available through extent tree v2 changes (TL;DR don't delete everything in a huge burst in the transaction critical section, write the delayed refs to disk and process them at a sustainable rate instead).

On filesystems as small as 20 TiB, big deletes can lock up the filesystem for 20 minutes or more. Balances then lock up the filesystem in 2-10 minute bursts for some hours after. We try to schedule both during maintenance windows, but sometimes you just have to delete something in the middle of the working day. We still want the balance in the maintenance window, and there's a good chance we've refilled the free space so balance doesn't have to do anything by then.

It doesn't really hurt to have both

I would say that if you're still scheduling balances even though the kernel has supported automatic balances for years now, it's because it absolutely hurts to have both.

This already exists through sysfs and it already allows using one, the other, both, or neither.

ENOSPC could trigger an automatic balance to try and get the fs back into a usable state

On medium-to-large filesystems, free space allocation speed drops dramatically somewhere above 95% utilization, but balances stop being possible somewhere above 90% utilization. Balances don't pack data as efficiently as normal writes do: theoretically because they can't change extent sizes, and practically because balancing also changes some other allocation parameters. There's some possible relief coming via #54 on the packing efficiency, but that could move the ENOSPC problem into REMAP block groups without solving it.

On small filesystems, there are few block groups and simply no space to put any data other than in existing block groups. On those filesystems there's no benefit from balancing, so there's never a need to balance. A naive automatic balancer can end up wasting IOPS all day, pushing data back and forth between the same two locations on disk.

Perhaps there could also be an absolute threshold for free chunks; to always try to keep some amount of chunks free, even if you'd need a higher usage% than the threshold to achieve that

That is a good idea. We are having a lot of success with the formula:

    min_unallocated + metadata_allocated_but_unused > SZ_1G * (3 + nr_devs)

which accounts for the worst case scenario:

  1. Minimum 1G reserved (512M isn't enough to delete a big snapshot)
  2. one block group locked by balance
  3. one block group locked by discard
  4. one block group locked for each device during scrub

After balancing with usage = 75%, if the above condition still isn't met, we send an alert for manual intervention. Balancing more block groups is generally futile as a filesystem over 90% full can't balance anyway, and might hit ENOSPC (case 2 above, the bad one) just for trying.
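
A rough shell sketch of that check, with placeholder example values (the real numbers would come from btrfs device usage and btrfs filesystem usage --raw):

    # example values in bytes; substitute the real ones for the filesystem in question
    min_unallocated=$((8 * 1024 * 1024 * 1024))    # smallest per-device unallocated space
    meta_size=$((32 * 1024 * 1024 * 1024))         # metadata block groups allocated
    meta_used=$((30 * 1024 * 1024 * 1024))         # metadata actually used
    nr_devs=2                                      # number of devices in the filesystem

    reserve=$((1024 * 1024 * 1024 * (3 + nr_devs)))    # SZ_1G * (3 + nr_devs)
    if (( min_unallocated + (meta_size - meta_used) > reserve )); then
        echo "enough slack, nothing to do"
    else
        echo "balance with usage=75; alert for manual intervention if that is not enough"
    fi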

It's better to have IOPS fall off a cliff due to automatic balances when the disk is close to full than to ENOSPC

Hard NAK on this statement. We want ENOSPC (case 1 above, which only returns an error to userspace) before the filesystem gets slow.

Right now, when we are somewhere over 95%, allocation speeds drop below 4K/second, but we have over a terabyte of data space free. It can take multiple hours to finish a commit and recover use of the filesystem if we immediately SIGSTOP or SIGKILL all writing applications. If we let applications keep trying to write, the commit time keeps exponentially increasing, until forced reboot becomes the only path to recovery (along with loss of any data that did manage to get written in the hours before the reboot).

We'd definitely like a knob that stops writes with ENOSPC well before that happens (ideally subtracting the unusable space from df too). We can figure out how much space is "too full"; all we need is a way to tell btrfs not to use more than that.
Applications in our workloads can handle ENOSPC easily, but they're not very good at handling their writing threads just grinding to a dead stop.

There are multiple problems that occur in this type of scenario. Running out of space simply isn't possible on many of our filesystems because the drives will crumble to dust long before btrfs can allocate the last data block. Metadata ENOSPC is simply impossible.

The automatic balances could be run at a very low priority so as not to impact workloads as much

That would require rework of the existing balance code. Right now a balance cannot be deprioritized because it holds the transaction lock for a long time, so lowering the priority causes priority inversion that prevents any other users from writing to the filesystem until the balances are done. Balances can only be deferred, i.e. scheduled to run at some later time when high-priority tasks are not running. Raising the priority of the balance helps a little, because it locks everything out for a shorter time.

That balance rework might already be coming (#54, #25) but it's not here yet, and that limits what can be done in the short term.
