Implement minimal ext2/3/4 filesystem driver #1179
Comments
Originally I thought that extracting ZFS out of the kernel as a shared library would not be as easy as it has turned out to be. Obviously, that was only after figuring out a couple of important gotchas, which I describe below and in the code comments. The advantages of moving ZFS to a separate library are the following:
- the kernel becomes ~900K smaller
- at least 10 fewer threads are needed to run a non-ZFS image (running a ROFS image on 1 cpu requires only 25 threads)

I also hope this patch provides a blueprint of how we could implement another ext2/3/4 filesystem driver (see #1179) or other true kernel modules.

The essence of this patch is the set of changes to the main makefile to build the new libsolaris.so, and to various ZFS-related parts of the kernel, like the pagecache, arc_shrinker and the ZFS dev driver, to make them call into libsolaris.so through a handful of dynamically registered callbacks. The new libsolaris.so is mainly composed of the solaris and zfs sets as defined in the makefile (and no longer part of the kernel), plus the bsd RPC code (xdr*), kobj, and finally the new fs/zfs/zfs_initialize.c, which provides the main INIT function - zfs_initialize(). The zfs_initialize() function initializes various ZFS resources like threads and memory and registers various callback functions with the main kernel (see the comments in zfs_initialize.c).

The two important gotchas I have discovered are:
1) libsolaris.so needs to be built with BIND_NOW so that all symbols are resolved eagerly; otherwise, if the ZFS code in libsolaris.so were called while handling a page fault, lazily resolving those symbols would trigger nested page faults and cause deadlocks.
2) libsolaris.so needs the osv-mlock note so that the dynamic linker populates the mappings up front. This is similar to the above: it avoids later page faults that would lead to deadlocks.

Please note that libsolaris.so is built with most symbols hidden and with code garbage collection on, to help minimize its size (804K) and expose the minimum number of symbols (< 100) needed by libzfs.so. The latter also helps avoid possible symbol collisions with other apps.

We also change loader.cc to dlopen("/libsolaris.so") before we mount the ZFS filesystem (for that reason libsolaris.so needs to be part of the bootfs for ZFS images). Because ZFS is the root filesystem, we cannot use the same approach we used for nfs, which is also implemented as a shared library but is loaded in pivot_rootfs(), which happens much later. In theory we could build a mixed disk with two partitions: the 1st a ROFS one with libsolaris.so on it, and the 2nd a ZFS one that would be mounted after we mount ROFS and load and initialize libsolaris.so from it.

I have tested this patch by running the unit tests (all pass), by using tests/misc-zfs-io.cc, and by running a stress test of MySQL on a ZFS image.

Fixes #1009

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
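The callback-registration idea described above can be illustrated with a short sketch. This is not the actual OSv code: the names below (zfs_callbacks, kernel_register_zfs_ops(), the specific callback signatures) are hypothetical stand-ins for the general pattern of a shared library handing the kernel a table of entry points from its INIT function.

```c
/* Hypothetical sketch of a library registering callbacks with the kernel.
 * None of these names are the real symbols from the patch. */
#include <stddef.h>

/* Table of entry points the kernel needs from the library. */
struct zfs_callbacks {
    int  (*mount)(const char *dev, const char *mountpoint);
    void (*arc_shrink)(size_t target_bytes);  /* e.g. invoked by the page cache */
};

/* In the real patch this would live in the kernel; here it is a local
 * stub so the sketch is self-contained. */
static const struct zfs_callbacks *registered;
static void kernel_register_zfs_ops(const struct zfs_callbacks *cb)
{
    registered = cb;
}

static int  do_mount(const char *dev, const char *mp) { (void)dev; (void)mp; return 0; }
static void do_arc_shrink(size_t target)              { (void)target; }

/* Runs when the dynamic linker loads the library, i.e. the INIT function. */
__attribute__((constructor))
static void zfs_initialize(void)
{
    static const struct zfs_callbacks cb = {
        .mount      = do_mount,
        .arc_shrink = do_arc_shrink,
    };
    kernel_register_zfs_ops(&cb);
}
```

Because the kernel only ever calls through this table, it needs no ZFS symbols at link time, and the library can be left out entirely (as with ROFS images).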
I have just come across this ext4 library which we can use to add support to OSv - https://github.com/gkostka/lwext4. We should build it as a shared library and load and initialize it like ZFS.
It would really help if we added ext4 support to OSv. I'm currently porting RVVM to OSv and booting it on a VPS. Since the VPS doesn't support ISOs, it would be very difficult for me to convert the VPS's default ext4 partition to ZFS. Ext4 support would let me boot OSv normally and boot into Arch Linux when something goes wrong. (Also, the image I want to use is a bit large, which makes it impossible to install OSv on another partition.)
Great, it always felt to me that a lightweight OS would greatly benefit from a more lightweight filesystem than ZFS.
Unfortunately, we need to wait until I find some time or somebody else volunteers to implement it. @fish4terrisa-MSDSM Last year I made some improvements to the OSv ZFS implementation and tooling, and these days there is a lot of flexibility in how to build and run ZFS, including building a ZFS disk image on the host and mounting it - see https://github.com/cloudius-systems/osv/wiki/Filesystems#zfs and https://github.com/cloudius-systems/osv/wiki/Filesystems#creating-and-manipulating-zfs-disks-on-host. I hope this helps until we have extfs support.
This commit adds an initial implementation of an ext4 filesystem driver based on the lwext4 project (https://github.com/gkostka/lwext4). It provides a lightweight read-write alternative to the ZFS filesystem.

Please note this implementation is NOT thread-safe and will need to be enhanced to become so in the future. However, it is functional enough to support the test cases exercised by modules/libext/test.sh.

One can build OSv like so:

./scripts/manifest_from_host.sh -w /usr/bin/find && ./scripts/build fs=rofs image=libext,native-example -j$(nproc) --append-manifest

Then create an ext4 filesystem:

mkdir -p ext_images
dd if=/dev/zero of=ext_images/ext4 bs=1M count=128
sudo mkfs.ext4 ext_images/ext4

Add some files to it if needed:

sudo losetup -o 0 -f --show ext_images/ext4
sudo mount /dev/loop0 ext_images/image
.. update content
sudo umount ext_images/image
sudo losetup -d /dev/loop0
qemu-img convert -f raw -O qcow2 ext_images/ext4 ext_images/ext4.img

And then run it:

./scripts/run.py --execute='--mount-fs=ext,/dev/vblk1,/data /hello' --second-disk-image ./ext_images/ext4.img

or using test.sh:

./modules/libext/test.sh '/find /data/ -ls'

Fixes cloudius-systems#1179

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
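For orientation, a filesystem driver in a BSD-style VFS layer like OSv's typically plugs in by supplying a table of mount-level operations. The sketch below is illustrative only: the struct layout and names (vfsops, ext_mount(), ext_unmount()) are simplified stand-ins, not the actual OSv VFS interface.

```c
/* Simplified sketch of how an ext driver could hook into a BSD-style VFS;
 * the structs and names here are illustrative, not OSv's real API. */
struct mount;  /* opaque per-mount state owned by the VFS layer */

/* Mount-level operations the VFS dispatches to the driver. */
struct vfsops {
    int (*vfs_mount)(struct mount *mp, const char *dev, int flags, const void *data);
    int (*vfs_unmount)(struct mount *mp, int flags);
};

static int ext_mount(struct mount *mp, const char *dev, int flags, const void *data)
{
    /* Open the block device (e.g. /dev/vblk1) and have lwext4 read the
     * superblock and block-group descriptors here. */
    (void)mp; (void)dev; (void)flags; (void)data;
    return 0;
}

static int ext_unmount(struct mount *mp, int flags)
{
    /* Flush lwext4's block cache and close the device here. */
    (void)mp; (void)flags;
    return 0;
}

/* The table the VFS would look up under the filesystem name "ext". */
static const struct vfsops ext_vfsops = {
    .vfs_mount   = ext_mount,
    .vfs_unmount = ext_unmount,
};
```

In that scheme, a boot option like --mount-fs=ext,/dev/vblk1,/data from the run command above resolves the "ext" name to the driver's operation table and mounts the given device at /data.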
This patch modifies this fork of lwext4 to make it safe for multiple threads running in OSv to interact with. The key assumption is that the OSv VFS layer provides the necessary locking around all interactions with lwext4 to guard modifications of ext filesystem metadata (i-node table, directory entries, etc) confined to a specific vnode. Beyond that, we add the necessary locking around 3 key common data structures:
- i-node bitmaps in ext4_ialloc.c
- data block bitmaps in ext4_balloc.c
- the metadata block cache in ext4_bcache.c and related files

More specifically, the following functions are protected with inode_alloc_lock()/unlock() to make sure no two files/directories get assigned the same inode number:
- ext4_ialloc_alloc_inode()
- ext4_ialloc_free_inode()

Next, the following functions are protected with block_alloc_lock()/unlock() to make sure no two files/directories use the same data block:
- ext4_balloc_alloc_block()
- ext4_balloc_free_block()
- ext4_balloc_free_blocks()

Finally, these functions in ext4_bcache.c and related source files are protected with bcache_lock()/unlock() to make sure access to the global metadata block cache is synchronized:
- ext4_bcache_invalidate_lba() in __ext4_balloc_free_block() and __ext4_balloc_free_blocks()
- ext4_bcache_find_get(), ext4_block_flush_buf() and ext4_bcache_free() in ext4_block_flush_lba()
- ext4_block_get_noread(), ext4_bcache_test_flag() and ext4_bcache_free() in ext4_block_get()
- ext4_bcache_free() in ext4_block_set()
- ext4_block_get_noread() in ext4_trans_block_get_noread()

Ref gkostka#83
Ref cloudius-systems/osv#1179

Signed-off-by: Waldemar Kozaczuk <jwkozaczuk@gmail.com>
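As a concrete illustration of the locking pattern above, here is a minimal sketch assuming a pthread mutex per shared structure. The lock helpers mirror the inode_alloc_lock()/unlock() names from the patch, but the allocator body is a placeholder, not lwext4's real bitmap code.

```c
/* Minimal sketch of the serialization pattern; the bitmap scan is a
 * placeholder for what ext4_ialloc_alloc_inode() really does. */
#include <pthread.h>
#include <stdint.h>

static pthread_mutex_t inode_alloc_mutex = PTHREAD_MUTEX_INITIALIZER;

static void inode_alloc_lock(void)   { pthread_mutex_lock(&inode_alloc_mutex); }
static void inode_alloc_unlock(void) { pthread_mutex_unlock(&inode_alloc_mutex); }

/* Stand-in for the inode allocator: the find-free-bit and set-bit steps
 * must happen atomically with respect to other threads, or two files
 * could be assigned the same inode number. */
int alloc_inode(uint32_t *index_out)
{
    inode_alloc_lock();
    /* ... scan the i-node bitmap for a free bit and set it ... */
    *index_out = 0;  /* placeholder result */
    inode_alloc_unlock();
    return 0;
}
```

The same pattern applies to block_alloc_lock()/unlock() around the data-block bitmaps and bcache_lock()/unlock() around the shared block cache; per-vnode metadata needs no extra locking here because the OSv VFS layer already serializes it.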
This is a verbatim conversation from the mailing list:
On Wed, Oct 13, 2021 at 12:16 AM Gregory Burd gr...@burd.me wrote:
Hello OSv-ers,
I'm a huge fan of ZFS, it's an amazing bit of work and I'm thrilled it's a core component in OSv. That said, it's not a great choice in all cases, the overhead of ZFS can outweigh the benefits. I've heard many references to "adding another filesystem" into the mix in different contexts, most recently in the (amazing) talk given at p99conf by Waldek.
So, how about ext2 pulled straight from the BSD tree?
https://github.com/freebsd/freebsd-src/tree/main/sys/fs/ext2fs
Why ext2 and not something else? Well, it's not my favorite filesystem either, but it is popular and well known. It's easy for Linux users to get comfortable with, and the tools are generally installed by default on most distros. I would imagine that the BSD code is fairly complete and supported, and I believe it supports ext2, 3, and 4 (https://wiki.freebsd.org/Ext2fs).
anyone have thoughts?
anyone have suggestions on where to start?
-greg