@ticpu ticpu commented Dec 2, 2025

Implement a new storage driver for bcachefs filesystems that uses subvolumes and snapshots for container layer management, similar to the existing btrfs driver.

Features:

  • Implementation using direct ioctl syscalls
  • Subvolume creation via BCH_IOCTL_SUBVOLUME_CREATE
  • Snapshot creation with BCH_SUBVOL_SNAPSHOT_CREATE flag
  • Subvolume detection using statx() with STATX_SUBVOL
  • Recursive nested subvolume deletion
  • Support for both root and rootless operation

Tested on my system with multiple images:

❯ podman run --rm docker.io/library/nginx:latest nginx -v
Trying to pull docker.io/library/nginx:latest...
Getting image source signatures
Copying blob 53d743880af4 done   | 
Copying blob 0e4bc2bd6656 done   | 
Copying blob 108ab8292820 done   | 
Copying blob 192e2451f875 done   | 
Copying blob 77fa2eb06317 done   | 
Copying blob b5feb73171bf done   | 
Copying blob de57a609c9d5 done   | 
Copying config 60adc2e137 done   | 
Writing manifest to image destination
/docker-entrypoint.sh: /docker-entrypoint.d/ is not empty, will attempt to perform configuration
/docker-entrypoint.sh: Looking for shell scripts in /docker-entrypoint.d/
/docker-entrypoint.sh: Launching /docker-entrypoint.d/10-listen-on-ipv6-by-default.sh
10-listen-on-ipv6-by-default.sh: info: Getting the checksum of /etc/nginx/conf.d/default.conf
10-listen-on-ipv6-by-default.sh: info: Enabled listen on IPv6 in /etc/nginx/conf.d/default.conf
/docker-entrypoint.sh: Sourcing /docker-entrypoint.d/15-local-resolvers.envsh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/20-envsubst-on-templates.sh
/docker-entrypoint.sh: Launching /docker-entrypoint.d/30-tune-worker-processes.sh
/docker-entrypoint.sh: Configuration complete; ready for start up
nginx version: nginx/1.29.3

❯ podman run --rm docker.io/library/python:3.12-slim python --version
Trying to pull docker.io/library/python:3.12-slim...
Getting image source signatures
Copying blob b7ba6d2a1fc7 done   | 
Copying blob 490b9a1c25e4 done   | 
Copying blob 0e4bc2bd6656 skipped: already exists  
Copying blob 0674d14a155c done   | 
Copying config 445121148b done   | 
Writing manifest to image destination
Python 3.12.12

❯ podman image mount ceph:v18 
/var/lib/containers/storage/bcachefs/subvolumes/7a295044d828c8a95725ef60009582c7a8a0c455ab9abd9ee9b350b0dd4c6d30

❯ ls /var/lib/containers/storage/bcachefs/subvolumes/7a295044d828c8a95725ef60009582c7a8a0c455ab9abd9ee9b350b0dd4c6d30
afs  bin  boot  dev  etc  home  lib  lib64  lost+found  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var

❯ podman run --rm --entrypoint=/bin/python3 -ti ceph:v18 
Python 3.9.21 (main, Feb 10 2025, 00:00:00) 
[GCC 11.5.0 20240719 (Red Hat 11.5.0-5)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

❯ podman image ls
REPOSITORY                TAG         IMAGE ID      CREATED       SIZE
docker.io/library/python  3.12-slim   445121148b18  13 days ago   123 MB
docker.io/library/nginx   latest      60adc2e137e7  13 days ago   155 MB
docker.io/library/alpine  latest      706db57fb206  7 weeks ago   8.62 MB
quay.io/ceph/ceph         v18         0f5473a1e726  6 months ago  1.27 GB

❯ podman image rm ceph:v18 
Error: image used by 085e9c2853013e627c41e8e1833655a7b73cf4ba45a19556102c3675dc840900: image is in use by a container: consider listing external containers and force-removing image

❯ podman rm -f ceph18 
ceph18

❯ podman image rm ceph:v18 
Untagged: quay.io/ceph/ceph:v18
Deleted: 0f5473a1e726b0feaff0f41f8de8341c0a94f60365d4584f4c10bd6b40d44bc1

❯ pwd && ls | wc -l
/var/lib/containers/storage/bcachefs/subvolumes
11

❯ podman image prune -a
WARNING! This command removes all images without at least one container associated with them.
Are you sure you want to continue? [y/N] y
706db57fb2063f39f69632c5b5c9c439633fda35110e65587c5d85553fd1cc38
60adc2e137e757418d4d771822fa3b3f5d3b4ad58ef2385d200c9ee78375b6d5
445121148b187db67e48799f002500623fa22d9f635e522f4e0f345414bd9107

❯ ls | wc -l
0

Signed-off-by: Jérôme Poulin <jeromepoulin@gmail.com>
@github-actions github-actions bot added the storage Related to "storage" package label Dec 2, 2025
podmanbot pushed a commit to podmanbot/buildah that referenced this pull request Dec 2, 2025
@podmanbot

✅ A new PR has been created in buildah to vendor these changes: containers/buildah#6559

Member

@Luap99 Luap99 left a comment

ref #146

Personally I am not convinced of the value of yet another storage driver in this code base. bcachefs was dropped from the upstream kernel, which makes any sort of testing more complicated, as it will likely not be available by default in either Fedora or Debian, which are the only distros we test in CI.

We already have largely unmaintained btrfs and zfs drivers in this code base so I don't want to add more of this.

Or, to turn the question around: what does using this driver offer over overlayfs on top of bcachefs?

cc @mtrmac @giuseppe @nalind @mheon

@giuseppe
Member

giuseppe commented Dec 2, 2025

How does it behave when you use user namespaces? E.g., what is the performance when you run a bunch of podman run --userns auto ... containers?

@mtrmac
Contributor

mtrmac commented Dec 2, 2025

Personally I am not convinced of the value of yet another storage driver in this code base.

As in #146 (comment), the future directions that have been most discussed are, indeed, ~incompatible with filesystem-snapshot-based layer storage.

OTOH, *shrug* the generic code is already constrained by the need to support unprivileged vfs, so having 2/3/4 snapshot-based graph drivers is not that much of a difference, assuming that everyone is fine with … “benign neglect”, where PR authors might be asked to ensure the other graph drivers compile but nothing beyond that.

E.g., AFAICS we do no other testing of ZFS in this repo. From time to time, there are issues reported, and they frequently go without substantive response.

But then again, the “no upstream testing” situation seems to work for FreeBSD … maybe, well enough? (Or maybe it is going so badly that I don’t even know how bad it is.)

There is some middle ground where a non-default approach is clearly not the primary focus, not recommended, and not shipped as a primary release artifact of e.g. Podman, but having support present in the main project’s repository makes things a bit easier for people interested in that non-default approach.

Maybe the important components for this are:

  • Most importantly, users are not misled into the non-default approach. (FreeBSD users know they don’t want to use Linux, and that was ~clearly a considered choice.)
  • The cost on the project’s primary deliverables, and maintainers, is close enough to zero that no-one is motivated to propose removing the support.
  • There is, nevertheless, some benefit to maintaining the non-default approach in the project’s repo directly.

Purely in the abstract, new filesystems are a Big Deal, and … potentially very valuable? The promise of Btrfs as the universal future, I guess, didn’t quite pan out that way (yet?), but maybe something (bcachefs or something created in the future) could happen in this decade (or the next one). And such future filesystems need some place to experiment and grow into maturity, even if users were kept strongly discouraged from deploying the new filesystems into production for many years. So, in principle, I might be fine with paying a trivial non-recurring cost just for the project to have an opportunity to experiment — as long as the experiment is, in some sense, going in the right direction. So, the overlay-over-bcachefs question.


bcachefs was dropped from the upstream kernel, which makes any sort of testing more complicated, as it will likely not be available by default in either Fedora or Debian, which are the only distros we test in CI.

This is important — I think realistically we would not be testing the proposed driver, limiting it to the “will be kept buildable” position.


Or, to turn the question around: what does using this driver offer over overlayfs on top of bcachefs?

Yes: the original discussion in #146 motivated this by saying that overlay on top of bcachefs is not possible, but later there was a pointer to a patch set implementing this. Why shouldn’t this be the universally adopted setup?

If overlay is viable in the short (or intermediate?) term (I don’t know), and if we’d prefer users to use overlay over a new graph driver (I’m not sure but it seems very likely), then it might be better to not add a new graph driver that could direct users away from the preferred path.
