
Automatic balance after btrfs-cleaner #63

Open
Atemu opened this issue Feb 15, 2025 · 1 comment


Atemu commented Feb 15, 2025

Basic idea

  • When the cleaner finishes and has made usage% within a block group drop below a certain threshold, an automatic filtered balance is queued to get back above the threshold
  • This is on by default (important)
  • Can be turned off via mount option
  • Target % can be tuned via mount option
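
To make the interface concrete, a purely hypothetical sketch of what those mount options could look like (the option names are invented here only to illustrate the proposal; none of them exist in btrfs today):

    # hypothetical, illustrative option names (not real btrfs mount options)
    mount -o autobalance,autobalance_target=75 /dev/sdb /mnt    # tune the target usage%
    mount -o noautobalance /dev/sdb /mnt                        # opt out of the default-on behaviour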

Intended effect

  • This should mitigate many cases of ENOSPC
    • I've seen countless reports of users running into ENOSPC seemingly out of nowhere because they thought they had free space when in actuality it was all allocated but unused
  • It should also mitigate free space fragmentation to a degree
  • It would provide this baseline without requiring any userspace setup
  • It shouldn't cost very much to do in most cases
    • IME most chunks will be quite empty after the cleaner cleans a significant amount of data, allowing for quick compaction
    • If you free a whole bunch of data, you will suffer some amount of IOPS degradation for a longer period due to the cleaner anyway

What about btrfs-maintenance?

  • Yes, you can already use e.g. btrfs-maintenance to get most of the benefit
  • That runs on a schedule, however, rather than whenever space is actually freed
  • It requires userspace setup for each mounted btrfs
  • Users who like to run btrfs-maintenance could still do so, either as a replacement for auto balances or in addition to them
    • It requires setup for each mount point anyway, so it's not unreasonable to require setting a mount option
    • It doesn't really hurt to have both
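
For reference, what such a scheduled run boils down to is a periodic filtered balance roughly like this (the usage filter value is just illustrative):

    # compact only data block groups that are less than 50% used,
    # e.g. from a cron job or systemd timer
    btrfs balance start -dusage=50 /mnt/data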

Further refinements

  • Perhaps there should be some sort of hysteresis so as not to constantly queue balances; being slightly above or below the target is fine
  • ENOSPC could trigger an automatic balance to try and get the fs back into a usable state
  • Perhaps there could also be an absolute threshold for free chunks; to always try to keep some amount of chunks free, even if you'd need a higher usage% than the threshold to achieve that
    • It's better to have IOPS fall off a cliff due to automatic balances when the disk is close to full than to ENOSPC
  • The automatic balances could be run at a very low priority so as not to impact workloads as much

Alternatives

  • The automatic balance target could be per-chunk usage%
    • This would put better constraints on how much IO would be done, just like with manual balances
    • Easier to run into ENOSPC but that should only happen with very pathological usage patterns
    • Potentially more free space fragmentation?
  • Some way to notify a userspace process when the cleaner has finished
    • Would allow implementing all of this in userspace
    • Would lose the benefit of this being present OOTB for any btrfs without setup

Zygo commented Feb 15, 2025

Basic idea

Sounds like what echo 75 | tee /sys/fs/btrfs/*/allocation/data/bg_reclaim_threshold already does. Most of the proposal seems to be stuff we've already been running for years.

The only thing that is new or different in this section is the "on by default" part, which is a problematic change in behavior (but it could be enabled in userspace by distros etc).
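
For example, distros or admins can already wire that up in userspace with something like the following (a minimal sketch; it only covers filesystems mounted at the time it runs and would need re-applying for later mounts):

    # apply the data block group reclaim threshold to all currently mounted btrfs filesystems
    for f in /sys/fs/btrfs/*/allocation/data/bg_reclaim_threshold; do
        echo 75 > "$f"
    done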

should mitigate many cases of ENOSPC

I'm not sure what you mean here:

  1. ordinary ENOSPC, where write(2) returns an error because all data space is allocated and used, or
  2. btrfs catastrophic ENOSPC, where metadata runs out and forces the filesystem read-only.

These are very different and some of what you propose is counterproductive for one case but useful for the other.
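
A quick way to see which case a filesystem is heading towards is to compare allocated (Size) against Used for data and metadata:

    # per-profile Size (allocated) vs Used for Data, Metadata and System:
    # case 1 is data nearly full with nothing left to allocate,
    # case 2 is metadata nearly full with no unallocated space to grow into
    btrfs filesystem df /mnt/data
    # overall and per-device allocated vs unallocated space
    btrfs filesystem usage /mnt/data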

If you free a whole bunch of data, you will suffer some amount of IOPS degradation for a longer period

This is a problem, and balancing makes it worse. It would be useful to slow down the deletion process so that the IOPS are less degraded, e.g. keep it down to under 100 refs/sec and split it up into smaller transactions, to avoid long stalls during transaction commits that lock out all writers to the filesystem. Some relief for this might be available through extent tree v2 changes (TL;DR don't delete everything in a huge burst in the transaction critical section, write the delayed refs to disk and process them at a sustainable rate instead).

On filesystems as small as 20 TiB, big deletes can lock up the filesystem for 20 minutes or more. Balances then lock up the filesystem in 2-10 minute bursts for some hours after. We try to schedule both during maintenance windows, but sometimes you just have to delete something in the middle of the working day. We still want the balance in the maintenance window, and there's a good chance we've refilled the free space so balance doesn't have to do anything by then.

It doesn't really hurt to have both

I would say that if you're still scheduling balances even though the kernel has supported automatic balances for years now, it's because it absolutely hurts to have both.

This already exists through sysfs and it already allows using one, the other, both, or neither.

ENOSPC could trigger an automatic balance to try and get the fs back into a usable state

On medium-to-large filesystems, free space allocation speed drops dramatically somewhere above 95% utilization, but balances stop being possible somewhere above 90% utilization. Balances don't pack data as efficiently as normal writes do: theoretically because they can't change extent sizes, and practically because balancing also changes some other allocation parameters. There's some possible relief coming via #54 on the packing efficiency, but that could move the ENOSPC problem into REMAP block groups without solving it.

On small filesystems, there are few block groups and simply no space to put any data other than in existing block groups. On those filesystems there's no benefit from balancing, so there's never a need to balance. A naive automatic balancer can end up wasting IOPS all day, pushing data back and forth between the same two locations on disk.

Perhaps there could also be an absolute threshold for free chunks; to always try to keep some amount of chunks free, even if you'd need a higher usage% than the threshold to achieve that

That is a good idea. We are having a lot of success with the formula:

    min_unallocated + metadata_allocated_but_unused > SZ_1G * (3 + nr_devs)

which accounts for the worst case scenario:

  1. Minimum 1G reserved (512M isn't enough to delete a big snapshot)
  2. one block group locked by balance
  3. one block group locked by discard
  4. one block group locked for each device during scrub

After balancing with usage = 75%, if the above condition still isn't met, we send an alert for manual intervention. Balancing more block groups is generally futile as a filesystem over 90% full can't balance anyway, and might hit ENOSPC (case 2 above, the bad one) just for trying.
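
A rough shell sketch of that check, with placeholder example values (the real numbers would come from btrfs device usage and btrfs filesystem usage --raw):

    # example values in bytes; substitute the real ones for the filesystem in question
    min_unallocated=$((8 * 1024 * 1024 * 1024))    # smallest per-device unallocated space
    meta_size=$((32 * 1024 * 1024 * 1024))         # metadata block groups allocated
    meta_used=$((30 * 1024 * 1024 * 1024))         # metadata actually used
    nr_devs=2                                      # number of devices in the filesystem

    reserve=$((1024 * 1024 * 1024 * (3 + nr_devs)))    # SZ_1G * (3 + nr_devs)
    if (( min_unallocated + (meta_size - meta_used) > reserve )); then
        echo "enough slack, nothing to do"
    else
        echo "balance with usage=75; alert for manual intervention if that is not enough"
    fi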

It's better to have IOPS fall off a cliff due to automatic balances when the disk is close to full than to ENOSPC

Hard NAK on this statement. We want ENOSPC (case 1 above, which only returns an error to userspace) before the filesystem gets slow.

Right now, when we are somewhere over 95%, allocation speeds drop below 4K/second, but we have over a terabyte of data space free. It can take multiple hours to finish a commit and recover use of the filesystem if we immediately SIGSTOP or SIGKILL all writing applications. If we let applications keep trying to write, the commit time keeps exponentially increasing, until forced reboot becomes the only path to recovery (along with loss of any data that did manage to get written in the hours before the reboot).

We'd definitely like a knob that stops writes with ENOSPC well before that happens (ideally subtracting the unusable space from df too). We can figure out how much space is "too full"; all we need is a way to tell btrfs not to use more than that.
Applications in our workloads can handle ENOSPC easily, but they're not very good at handling their writing threads just grinding to a dead stop.

There are multiple problems that occur in this type of scenario. Running out of space simply isn't possible on many of our filesystems because the drives will crumble to dust long before btrfs can allocate the last data block. Metadata ENOSPC is simply impossible.

The automatic balances could be run at a very low priority so as not to impact workloads as much

That would require rework of the existing balance code. Right now a balance cannot be deprioritized because it holds the transaction lock for a long time, so lowering the priority causes priority inversion that prevents any other users from writing to the filesystem until the balances are done. Balances can only be deferred, i.e. scheduled to run at some later time when high-priority tasks are not running. Raising the priority of the balance helps a little, because it locks everything out for a shorter time.

That balance rework might already be coming (#54, #25) but it's not here yet, and that limits what can be done in the short term.
