Implement minimal ext2/3/4 filesystem driver #1179

Closed
wkozaczuk opened this issue Nov 28, 2021 · 4 comments · Fixed by #1310

@wkozaczuk (Collaborator) commented:

This is a verbatim copy of a conversation from the mailing list:

On Wed, Oct 13, 2021 at 12:16 AM Gregory Burd gr...@burd.me wrote:
Hello OSv-ers,

I'm a huge fan of ZFS; it's an amazing bit of work, and I'm thrilled it's a core component in OSv. That said, it's not a great choice in all cases - the overhead of ZFS can outweigh the benefits. I've heard many references to "adding another filesystem" into the mix in different contexts, most recently in the (amazing) talk given at p99conf by Waldek.

So, how about ext2 pulled straight from the BSD tree?
https://github.com/freebsd/freebsd-src/tree/main/sys/fs/ext2fs

Why ext2 and not something else? Well, it's not my favorite filesystem either, but it is popular and well known. It's easy for Linux users to get comfortable with, and the tools are generally installed by default on most distros. I would imagine that the BSD code is fairly complete and supported, and I believe it supports ext2, 3, and 4 (https://wiki.freebsd.org/Ext2fs).

anyone have thoughts?

On Wednesday, October 13, 2021 at 9:32:18 AM UTC-4 Nadav Har'El wrote:

I think it makes sense, but only if it's something that you personally care about for some reason - e.g., it's important to you that OSv be smaller and you believe that replacing ZFS will make it smaller, or some other advantage of ext2 over zfs is interesting to you.

Something worth keeping in mind is that one of the claimed advantages of OSv over, say, Linux, is that OSv does not need to support a gazillion different drivers and filesystems. It's not like anyone will ever plug an ext2-formatted or ntfs-formatted disk into OSv - so we don't need to support any of these filesystems. If we do want to support them, it should be out of some expected benefit - not out of necessity. So let's just spell out in advance what this benefit might be over the filesystems we already have (zfs, ramfs and rofs).

Waldek responded:
If we go ahead with implementing ext2 support, we should define a minimal subset of it we want to implement (do we need extents, large files, etc.?). We should also avoid repeating the mistake we made with ZFS and NOT implement tools equivalent to zpool.so, mkfs.so, etc. Let us delegate all admin functionality to the toolset on the host OS.

Also, I don't remember if Waldek did this or only partially (?), but if you're adding ext2 to reduce the kernel size, we first need to be able to compile a kernel without zfs. We could add a build-time option to remove zfs (see #1110) or build it into a shared library that doesn't need to be loaded (#1009). This would be similar to the Linux build system, which allows keeping some parts of the kernel out of the build, but also keeping some parts in the build as separate modules (sort of shared libraries).

anyone have suggestions on where to start?

I would start with making (at least to myself) the case for what the benefit of adding ext2 would be.
If you think it is the code size, I would start by trying to estimate how much smaller the kernel would be without ZFS. For that, I would begin by adding to our build system an option to compile without ZFS - or to compile ZFS into a shared library.

I think in my presentation (slides 9/10) I claimed that I was able to trim the size of the kernel by ~0.7MB (after enabling GC). Adding an option to build OSv without ZFS is also something I was planning to prepare proper patches for. However, this would come in the order described in the presentation - as the 3rd step - but it does not have to: we can work on the ability to compile ZFS out independently. I think I might have something ready for steps 1 and 2 (hide C++ std, enable GC) in the next 2-4 weeks. Meanwhile, my WIP branch - https://github.com/wkozaczuk/osv/commits/minimize_kernel_size - has all the code changes I made for the presentation, including commenting out ZFS in the right places, so you can start experimenting with it. This particular commit deals with ZFS - wkozaczuk@df98287.

Then of course you can start implementing ext2. I agree you should try to find existing code in FreeBSD. You can look at the other examples in the fs/ subdirectory (ramfs, rofs, devfs, nfs) to see how to plug that code into OSv.

Yeah, ideally, as Nadav points out, we would want to make the ext2 driver a pluggable shared library - a module. A good example of how to do it is how nfs was changed to become a shared library with this patch - 4ffb0fa. Hopefully, once you read the comments and the code changes, it will all make sense. For example, these fragments - 4ffb0fa#diff-4dcb4336d0285de24fc7f3ebdb6805eb0c10ea645d9848dcfd38daa7742b363c and 4ffb0fa#diff-df0d94aa12dd9f4772529c9060f40a3492dc1b034f8bc2ae79bd2f978231f8ed - are key to seeing how OSv would automatically try to find an ext2 shared library under /usr/lib/fs and call its INIT functions to let it register its vfsops structure.
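
To make the mechanism concrete, here is a rough C++ sketch of what such a pluggable filesystem INIT function might look like. Everything here - the placeholder vfsops layout and the register_filesystem() hook - is a hypothetical illustration, not OSv's actual interface; the real one is in the 4ffb0fa diffs above:

// Hypothetical sketch; types and the registration hook are placeholders.
struct mount;                        // opaque mount handle (placeholder)
struct vfsops {                      // minimal placeholder vfsops
    int (*vfs_mount)(mount* mp, const char* dev, int flags, const void* data);
    int (*vfs_unmount)(mount* mp, int flags);
    // ... remaining vfs operations (sync, vget, statfs, ...)
};

static int ext_mount(mount*, const char*, int, const void*) { return 0; }
static int ext_unmount(mount*, int) { return 0; }
static vfsops ext_vfsops = { ext_mount, ext_unmount };

// Assumed hook: OSv scans /usr/lib/fs for shared libraries, calls their
// INIT functions, and each library registers its vfsops under a name.
extern "C" void register_filesystem(const char* name, vfsops* ops);

extern "C" void INIT_ext_fs()
{
    register_filesystem("ext", &ext_vfsops);
}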

Now, with nfs it is easier to achieve because we have a thin adapter layer under modules/nfs which delegates to https://github.com/sahlberg/libnfs, which is built separately as libnfs.so.4.0.0. It would be ideal to do a similar thing with ext2 - but can you build https://github.com/freebsd/freebsd-src/tree/main/sys/fs/ext2fs as a library, or at least avoid copying it under the OSv tree and then being stuck with that version of the code forever? My concern is that if we fork or copy the code, it will be difficult or a burden to maintain in terms of bringing in bug fixes from upstream, for example. Is it possible to achieve?

One of the problems you'll encounter will be the cache. I have to admit I don't remember everything we did there (Waldek might have a fresher memory, as he did rofs more recently), but because ZFS has such an elaborate caching mechanism, and OSv used ZFS, we avoided having yet another page-cache layer. That means that if ext2 doesn't come with its own page cache (because FreeBSD assumes a different layer handles the caching), your ext2 will not do any caching, which isn't great. Waldek's rofs dealt a bit with caching, so maybe you can copy it, take inspiration from it, or copy some caching code from FreeBSD.

ROFS comes with its own simple cache layer (https://github.com/cloudius-systems/osv/blob/master/fs/rofs/rofs_cache.cc, and the commit to integrate it with the page cache - 54b3071), and here is the commit that finally integrated it into the page cache - 4c0bdbc. Now, the changes I made are enough for read-only functionality. If we want full read-write ext2 support, we would also need to figure out the "write" part.
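
As a rough illustration of the pattern rofs_cache implements - caching device reads keyed by block number so repeated reads skip the device - here is a minimal read-through block cache sketch; it is not OSv's actual code, and the device-read callback is injected as a placeholder:

// Minimal read-through block cache sketch (not the actual rofs_cache code).
#include <cstddef>
#include <cstdint>
#include <functional>
#include <unordered_map>
#include <vector>

class block_cache {
public:
    using device_reader = std::function<void(uint64_t block_no, uint8_t* buf)>;

    block_cache(size_t block_size, device_reader read_dev)
        : _block_size(block_size), _read_dev(std::move(read_dev)) {}

    // Return the cached block, reading it from the device on a miss.
    const std::vector<uint8_t>& read(uint64_t block_no) {
        auto it = _blocks.find(block_no);
        if (it == _blocks.end()) {
            std::vector<uint8_t> buf(_block_size);
            _read_dev(block_no, buf.data());
            it = _blocks.emplace(block_no, std::move(buf)).first;
        }
        return it->second;
    }

private:
    size_t _block_size;
    device_reader _read_dev;
    std::unordered_map<uint64_t, std::vector<uint8_t>> _blocks;
};

A real implementation would also bound the cache size, evict blocks, and - as the commits above show for rofs_cache - integrate with OSv's page cache.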

Now, having made all these comments about ext2, I think we should consider virtiofs as well. We already have a read-only implementation of virtiofs in OSv thanks to Fotis Xenakis (see this wiki page - https://github.com/cloudius-systems/osv/wiki/virtio-fs - it has many good references as well). Given that, we could consider adding write support to it and then delegating to the host to provide whatever sophisticated filesystem it comes with. That is the beauty of virtiofs. The downside is that virtiofs is only supported by some VMMs, like QEMU and Intel's Cloud Hypervisor. So I still think it would be nice to have simple, pluggable, and reasonably fast read-write filesystem support like ext2.

-greg

wkozaczuk added a commit that referenced this issue Dec 21, 2021
Originally I thought that extracting ZFS out of the kernel
as a shared library would not be as easy as it has turned out to
be - obviously after figuring out a couple of important gotchas, which
I describe below and in the code comments.

The advantages of moving ZFS to a separate library are the following:
- the kernel becomes ~900K smaller
- at least 10 fewer threads are needed to run a non-ZFS image
  (running a ROFS image on 1 cpu requires only 25 threads)

I also hope this patch provides a blueprint for how we could implement
an ext2/3/4 filesystem driver (see #1179) or other true kernel modules.

The essence of this patch is the set of changes to the main makefile to
build the new libsolaris.so, and changes to various ZFS-related parts of
the kernel - like the pagecache, arc_shrinker and the ZFS dev driver -
to make them call into libsolaris.so through a handful of dynamically
registered callbacks.

The new libsolaris.so is mainly composed of the solaris and zfs sets
as defined in the makefile (and no longer part of the kernel),
plus the bsd RPC code (xdr*), kobj, and finally the new
fs/zfs/zfs_initialize.c, which provides the main INIT function -
zfs_initialize(). It initializes various ZFS resources like threads and
memory, and registers various callback functions with the main kernel
(see comments in zfs_initialize.c).
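
The callback handshake can be sketched roughly as below; all the names (zfs_callbacks, register_zfs_callbacks, the member signatures) are hypothetical illustrations, not the actual symbols used in the patch:

// Hypothetical sketch of the kernel <-> libsolaris.so callback handshake.
#include <cstddef>

// In the kernel: function-pointer slots that stay empty until
// libsolaris.so is dlopen()ed and registers itself.
struct zfs_callbacks {
    void (*arc_shrink)(size_t target);  // called by the kernel's memory shrinker
    void (*sync_fs)();                  // called when the kernel flushes filesystems
};

static zfs_callbacks g_zfs_cb = {};

extern "C" void register_zfs_callbacks(const zfs_callbacks* cb)
{
    g_zfs_cb = *cb;  // from now on the kernel can call into libsolaris.so
}

// In libsolaris.so: the INIT function wires its implementations in.
static void my_arc_shrink(size_t) { /* shrink the ARC */ }
static void my_sync_fs()          { /* flush dirty ZFS state */ }

extern "C" void zfs_initialize()
{
    static const zfs_callbacks cb = { my_arc_shrink, my_sync_fs };
    register_zfs_callbacks(&cb);
}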

Two important gotchas I have discovered are:
1) libsolaris.so needs to be built with BIND_NOW so that all symbols
   are resolved eagerly; otherwise, if the ZFS code in libsolaris.so is
   called while handling other faults, the page faults taken to resolve
   its own symbols lazily would cause deadlocks.
2) libsolaris.so needs the osv-mlock note so that the dynamic linker
   populates the mappings. This is similar to the above: it avoids
   later page faults that would lead to deadlocks.
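
For reference, BIND_NOW corresponds to the linker flag -z now; with gcc/g++ the link line for the library would carry something like the following (variable names are placeholders):

# Link libsolaris.so with eager (load-time) symbol resolution.
$(CXX) -shared -o libsolaris.so $(libsolaris_objects) -Wl,-z,now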

Please note that libsolaris.so is built with most symbols hidden
and with code garbage collection on, to help minimize its size (804K)
and to expose the minimum number of symbols (< 100) needed by libzfs.so.
The latter also helps avoid possible symbol collisions with other apps.

We also change loader.cc to dlopen("/libsolaris.so") before we mount
the ZFS filesystem (for that reason, libsolaris.so needs to be part
of the bootfs for ZFS images). Because ZFS is the root filesystem, we
cannot use the same approach we used for nfs, which is also implemented
as a shared library but loaded in pivot_rootfs(), which happens much later.
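
As a sketch, the loader change boils down to something like this (the surrounding function is hypothetical; dlopen() and RTLD_NOW are the real libc API):

// Hypothetical sketch: load libsolaris.so from the bootfs before
// attempting to mount the ZFS root filesystem.
#include <dlfcn.h>

static void load_zfs_library()
{
    // Thanks to BIND_NOW and the osv-mlock note, this dlopen() fully
    // resolves symbols and pins the library's mappings up front.
    if (!dlopen("/libsolaris.so", RTLD_NOW)) {
        // abort the boot: the ZFS root cannot be mounted without it
    }
}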

In theory, we could build a mixed disk with two partitions: a first ROFS
partition with libsolaris.so on it, and a second ZFS partition that would
be mounted after we mount ROFS and load and initialize libsolaris.so from it.

I have tested this patch by running the unit tests (all pass), by using
tests/misc-zfs-io.cc, and by running a stress test of MySQL on a ZFS
image.

Fixes #1009

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
@wkozaczuk (Collaborator, Author) commented:

I have just come across this ext4 library which we could use to add ext4 support to OSv - https://github.com/gkostka/lwext4. We should build it as a shared library and load and initialize it like we do ZFS.
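
To give a feel for the lwext4 API surface, here is a minimal usage sketch based on its public header (ext4.h); the block-device setup is elided, and ext4_device_register()'s signature has varied across lwext4 versions, so treat the details as approximate:

// Minimal lwext4 usage sketch (C API from include/ext4.h in gkostka/lwext4).
#include <ext4.h>

int read_one_file(struct ext4_blockdev* bd)  // bd: an already-initialized block device
{
    // Register the block device under a name, then mount it read-write.
    if (ext4_device_register(bd, "ext4_fs") != EOK)
        return -1;
    if (ext4_mount("ext4_fs", "/mp/", false) != EOK)
        return -1;

    ext4_file f;
    if (ext4_fopen(&f, "/mp/hello.txt", "rb") == EOK) {
        char buf[64];
        size_t rcnt = 0;
        ext4_fread(&f, buf, sizeof(buf), &rcnt);
        ext4_fclose(&f);
    }
    return ext4_umount("/mp/");
}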

@fish4terrisa-MSDSM commented:

It would really help if ext4 support were added to OSv. I'm currently porting RVVM to OSv and booting it on a VPS. Since the VPS doesn't support ISOs, it would be very difficult for me to convert the VPS's default ext4 partition to ZFS. Ext4 support would let me boot OSv normally and fall back to Arch Linux when something goes wrong. (Also, the image I want to use is a bit large, which makes it impossible to install OSv on another partition.)

@raphaelsc (Member) commented:

> I have just come across this ext4 library which we could use to add ext4 support to OSv - https://github.com/gkostka/lwext4. We should build it as a shared library and load and initialize it like we do ZFS.

Great, it always felt to me that a lightweight OS would greatly benefit from a more lightweight fs than ZFS.

@wkozaczuk (Collaborator, Author) commented:

Unfortunately, we need to wait until I find some time or somebody else volunteers to implement it.

@fish4terrisa-MSDSM Last year I made some improvements to the OSv ZFS implementation and tooling, and these days there is a lot of flexibility in how to build and run ZFS, including building a ZFS disk image on the host and mounting it - see https://github.com/cloudius-systems/osv/wiki/Filesystems#zfs and https://github.com/cloudius-systems/osv/wiki/Filesystems#creating-and-manipulating-zfs-disks-on-host.

I hope it helps until we have extfs support.

wkozaczuk added a commit that referenced this issue Mar 19, 2024
This commit adds an initial implementation of an ext4
filesystem driver based on the lwext4 project
(https://github.com/gkostka/lwext4). It provides a lightweight
read-write alternative to the ZFS filesystem.

Please note this implementation is NOT thread-safe
and will need to be enhanced in the future to be so. However, it is
functional enough to support the test cases exercised by
modules/libext/test.sh.

One can build OSv like so:

./scripts/manifest_from_host.sh -w /usr/bin/find && ./scripts/build fs=rofs image=libext,native-example -j$(nproc) --append-manifest

Then create an ext4 filesystem:

mkdir -p ext_images
dd if=/dev/zero of=ext_images/ext4 bs=1M count=128
sudo mkfs.ext4 ext_images/ext4

Add some files to it if needed:

mkdir -p ext_images/image
sudo losetup -o 0 -f --show ext_images/ext4   # prints the loop device, e.g. /dev/loop0
sudo mount /dev/loop0 ext_images/image        # adjust if losetup printed a different device

.. update content

sudo umount ext_images/image
sudo losetup -d /dev/loop0

qemu-img convert -f raw -O qcow2 ext_images/ext4 ext_images/ext4.img

And then run it:

./scripts/run.py --execute='--mount-fs=ext,/dev/vblk1,/data /hello' --second-disk-image ./ext_images/ext4.img

or using test.sh:

./modules/libext/test.sh '/find /data/ -ls'

Fixes #1179

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
osvunikernel pushed a commit to osvunikernel/lwext4 that referenced this issue Jul 9, 2024
This patch modifies this fork of lwext4 to make it safe for
multiple threads running in OSv to interact with.

The key assumption is that the OSv VFS layer provides the necessary
locking around all interactions with lwext4 to guard modifications of
ext filesystem metadata (i-node table, directory entries, etc.)
confined to a specific vnode.

Beyond that, we add the necessary locking around 3 key shared data
structures:
- i-node bitmaps in ext4_ialloc.c
- data block bitmaps in ext4_balloc.c
- the metadata block cache in ext4_bcache.c and related files

More specifically, the following functions are protected with
inode_alloc_lock()/unlock() to make sure no two files/directories
are assigned the same inode number:
- ext4_ialloc_alloc_inode()
- ext4_ialloc_free_inode()

Next, the following functions are protected with block_alloc_lock()/unlock()
to make sure no two files/directories use the same data block:
- ext4_balloc_alloc_block()
- ext4_balloc_free_block()
- ext4_balloc_free_blocks()

Finally, these functions in ext4_bcache.c and related source files
are protected with bcache_lock()/unlock() to make sure access to the
global metadata block cache is synchronized:
- ext4_bcache_invalidate_lba() in __ext4_balloc_free_block() and
  __ext4_balloc_free_blocks()
- ext4_bcache_find_get(), ext4_block_flush_buf() and ext4_bcache_free()
  in ext4_block_flush_lba()
- ext4_block_get_noread(), ext4_bcache_test_flag() and ext4_bcache_free()
  in ext4_block_get()
- ext4_bcache_free() in ext4_block_set()
- ext4_block_get_noread() in ext4_trans_block_get_noread()

Ref gkostka#83
Ref cloudius-systems/osv#1179

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
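
In practice, the pattern amounts to wrapping each allocator entry point in a lock/unlock pair. A simplified sketch (inode_alloc_lock()/unlock() are the hooks named above; __ext4_ialloc_alloc_inode_impl is a hypothetical stand-in for the original unsynchronized body, and declarations are elided):

// Simplified sketch of the locking pattern added in this fork.
int ext4_ialloc_alloc_inode(struct ext4_fs* fs, uint32_t* idx, bool is_dir)
{
    inode_alloc_lock();    // serialize i-node bitmap updates across threads
    int r = __ext4_ialloc_alloc_inode_impl(fs, idx, is_dir);
    inode_alloc_unlock();
    return r;
}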