
[WIP] Migrate the fuser code to be fully async #112

Closed

wants to merge 7 commits

Conversation

@ianoc (Contributor) commented Feb 9, 2021

This explores migrating the API/code to be async friendly.

Due to the nature of it, I'm not sure one can avoid the tokio dependency without some heavy feature usage.

This is really an exploration to gather thoughts and feedback. Making it all async is a pretty large change, alas, so it might not make sense to merge.

It needs more clean-up regardless.

@ianoc changed the title from "Migrate the fuser code to be fully async" to "[WIP] Migrate the fuser code to be fully async" on Feb 9, 2021
@@ -425,7 +425,7 @@ impl SimpleFS {
 }

 impl Filesystem for SimpleFS {
-    fn init(&mut self, _req: &Request, _config: &mut KernelConfig) -> Result<(), c_int> {
+    fn init(&self, _req: &Request, _config: &mut KernelConfig) -> Result<(), c_int> {
Contributor Author (@ianoc):

For the API to be async friendly, interior mutability in the filesystem really does seem to be a requirement.
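For illustration, a minimal sketch of that pattern, assuming handlers only get &self (SimpleFS and the field/method names here are placeholders, not the PR's actual code): any mutable state moves behind a lock.

use std::collections::BTreeMap;
use std::sync::RwLock;

struct SimpleFS {
    // hypothetical state; the real example keeps much more than this
    inodes: RwLock<BTreeMap<u64, Vec<u8>>>,
}

impl SimpleFS {
    // &self is enough even for mutation, because the lock provides it;
    // that is what lets handlers run concurrently on an async executor
    fn touch(&self, ino: u64) {
        self.inodes.write().unwrap().entry(ino).or_default();
    }
}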

@ianoc (Contributor Author) commented Feb 9, 2021

I've been using https://github.com/filebench/filebench to explore the performance/behavior of this.

Current config is something like:

set mode quit alldone
set $dir=/tmp/benchmark_fuse/
set $nfiles=400
set $meandirwidth=20
set $nthreads=30
set $size1=16384k

define fileset name=bigfileset, path=$dir, size=$size1, entries=$nfiles, dirwidth=$meandirwidth, prealloc=80

define process name=fileserver,instances=2
{
        thread name=fileserverthread, memsize=50m, instances=$nthreads
        {
                flowop createfile name=createfile1,filesetname=bigfileset,fd=1
                flowop writewholefile name=wrtfile1,srcfd=1,fd=1,iosize=1m,filesetname=bigfileset
                flowop closefile name=closefile1,fd=1
                flowop deletefile name=deletefile1,fd=1,filesetname=bigfileset

                flowop openfile name=openfile1,filesetname=bigfileset,fd=3
                flowop appendfilerand name=appendfilerand1,iosize=16k,fd=3
                flowop closefile name=closefile2,fd=3

                flowop readwholefile filesetname=bigfileset,name=readfile1,iosize=1m,fd=2,opennext

                flowop finishoncount name=finish, value=30000
        }
}

Benchmarks compare this personality across:

  • fuser with xmp migrated to fuser from zargony/fuse-rs#141
  • the async code in this PR
  • stackfs -- https://github.com/sbu-fsl/fuse-stackfs
  • a raw file system

The comparison is done on an SSD on my machine:




RUST


180.643: Run took 174 seconds...
180.646: Per-Operation Breakdown
finish               6673ops       38ops/s   0.0mb/s    0.000ms/op [0.000ms - 0.001ms]
readfile1            6733ops       39ops/s 616.3mb/s    4.185ms/op [0.952ms - 269.908ms]
closefile2           6673ops       38ops/s   0.0mb/s    1.415ms/op [0.001ms - 145.007ms]
appendfilerand1      6673ops       38ops/s   0.3mb/s    2.787ms/op [0.017ms - 248.646ms]
openfile1            6673ops       38ops/s   0.0mb/s    0.787ms/op [0.036ms - 55.331ms]
deletefile1          6605ops       38ops/s   0.0mb/s    3.150ms/op [1.641ms - 17.411ms]
closefile1           6673ops       38ops/s   0.0mb/s    0.013ms/op [0.002ms - 5.760ms]
wrtfile1             6673ops       38ops/s 613.6mb/s  455.436ms/op [52.915ms - 1726.729ms]
createfile1          6673ops       38ops/s   0.0mb/s    0.755ms/op [0.068ms - 44.040ms]
180.646: IO Summary: 53376 ops 306.740 ops/s 39/77 rd/wr 1230.1mb/s 58.576ms/op
180.646: Shutting down processes


RUST_ASYNC


64.775: Run took 46 seconds...
64.780: Per-Operation Breakdown
finish               4469ops       97ops/s   0.0mb/s    0.000ms/op [0.000ms - 0.001ms]
readfile1            4529ops       98ops/s 1558.8mb/s    7.112ms/op [0.057ms - 296.907ms]
closefile2           4469ops       97ops/s   0.0mb/s    0.018ms/op [0.005ms - 6.634ms]
appendfilerand1      4469ops       97ops/s   0.8mb/s    0.856ms/op [0.023ms - 34.194ms]
openfile1            4469ops       97ops/s   0.0mb/s    4.144ms/op [0.046ms - 120.010ms]
deletefile1          4395ops       96ops/s   0.0mb/s    4.001ms/op [0.027ms - 108.894ms]
closefile1           4469ops       97ops/s   0.0mb/s    0.098ms/op [0.007ms - 19.757ms]
wrtfile1             4469ops       97ops/s 1554.2mb/s   31.839ms/op [11.631ms - 197.638ms]
createfile1          4469ops       97ops/s   0.0mb/s    4.209ms/op [0.095ms - 141.379ms]
64.780: IO Summary: 35738 ops 776.812 ops/s 98/194 rd/wr 3113.8mb/s 6.541ms/op
64.780: Shutting down processes


STACK_FS_NO_SPLICE


62.421: Run took 44 seconds...
62.423: Per-Operation Breakdown
finish               4496ops      102ops/s   0.0mb/s    0.000ms/op [0.000ms - 0.001ms]
readfile1            4556ops      104ops/s 1640.5mb/s    7.767ms/op [0.048ms - 407.050ms]
closefile2           4496ops      102ops/s   0.0mb/s    0.663ms/op [0.010ms - 40.810ms]
appendfilerand1      4496ops      102ops/s   0.8mb/s    0.778ms/op [0.019ms - 38.119ms]
openfile1            4496ops      102ops/s   0.0mb/s    3.298ms/op [0.032ms - 149.472ms]
deletefile1          4422ops      100ops/s   0.0mb/s    6.810ms/op [0.306ms - 111.047ms]
closefile1           4496ops      102ops/s   0.0mb/s    0.805ms/op [0.012ms - 51.996ms]
wrtfile1             4496ops      102ops/s 1634.7mb/s   36.651ms/op [11.429ms - 249.419ms]
createfile1          4496ops      102ops/s   0.0mb/s    5.252ms/op [0.065ms - 148.200ms]
62.423: IO Summary: 35954 ops 817.040 ops/s 104/204 rd/wr 3276.0mb/s 7.755ms/op
62.423: Shutting down processes


STACK_FS_SPLICE


43.404: Run took 25 seconds...
43.408: Per-Operation Breakdown
finish               4278ops      171ops/s   0.0mb/s    0.000ms/op [0.000ms - 0.001ms]
readfile1            4338ops      174ops/s 2744.5mb/s    7.797ms/op [0.033ms - 341.664ms]
closefile2           4278ops      171ops/s   0.0mb/s    0.431ms/op [0.010ms - 38.044ms]
appendfilerand1      4278ops      171ops/s   1.3mb/s    0.565ms/op [0.019ms - 28.630ms]
openfile1            4278ops      171ops/s   0.0mb/s    2.515ms/op [0.038ms - 106.473ms]
deletefile1          4205ops      168ops/s   0.0mb/s    7.047ms/op [0.026ms - 113.189ms]
closefile1           4278ops      171ops/s   0.0mb/s    0.545ms/op [0.016ms - 30.744ms]
wrtfile1             4278ops      171ops/s 2737.7mb/s   29.499ms/op [9.590ms - 186.023ms]
createfile1          4278ops      171ops/s   0.0mb/s    3.711ms/op [0.070ms - 111.412ms]
43.408: IO Summary: 34211 ops 1368.319 ops/s 174/342 rd/wr 5483.5mb/s 6.515ms/op
43.408: Shutting down processes


RAW


16.356: Run took 13 seconds...
16.358: Per-Operation Breakdown
finish               6677ops      514ops/s   0.0mb/s    0.000ms/op [0.000ms - 0.014ms]
readfile1            6737ops      518ops/s 8244.0mb/s   13.104ms/op [0.004ms - 119.153ms]
closefile2           6677ops      514ops/s   0.0mb/s    0.010ms/op [0.001ms - 33.004ms]
appendfilerand1      6677ops      514ops/s   4.0mb/s    0.541ms/op [0.003ms - 62.801ms]
openfile1            6677ops      514ops/s   0.0mb/s    0.053ms/op [0.004ms - 58.258ms]
deletefile1          6616ops      509ops/s   0.0mb/s    7.698ms/op [1.436ms - 89.598ms]
closefile1           6677ops      514ops/s   0.0mb/s    0.013ms/op [0.001ms - 13.353ms]
wrtfile1             6677ops      514ops/s 8216.9mb/s   24.210ms/op [4.767ms - 171.393ms]
createfile1          6677ops      514ops/s   0.0mb/s    0.220ms/op [0.011ms - 39.077ms]
16.358: IO Summary: 53415 ops 4108.380 ops/s 518/1027 rd/wr 16464.9mb/s 5.737ms/op
16.358: Shutting down processes

@ianoc force-pushed the ianoc/async_fs branch 8 times, most recently from 721014e to df5f911, on February 9, 2021 at 22:28
@cberner (Owner) commented Feb 12, 2021

Wow, this looks really exciting, thanks! I've been wanting to add async for a while. I'll try to look it over this weekend and provide some feedback

@ianoc (Contributor Author) commented Feb 12, 2021

Great, looking forward to it, thanks!

Playing around some more to make the benchmarks more reproducible has been interesting:

Default features in the crate mean only a very small bump in perf -- with the low write size, I think a huge number of serial tiny 4k writes just swamps everything, so there are few gains to be had. Enabling abi-7-31 (though I think just 7-10 would do, looking at the code) yields:

(This uses cgroups to assign 2 logical cores to filebench and 3 logical cores to the client process, be it Rust or C code. Hyper-threading is enabled on this workstation, so there is a little fuzziness in the exact CPU allocation/usage -- done to make the numbers more stable.)

master: 74.340: IO Summary:  5028 ops 73.936 ops/s 15/15 rd/wr 3794.4mb/s 105.050ms/op
async: 53.401: IO Summary:  5023 ops 143.501 ops/s 29/29 rd/wr 7364.4mb/s 42.985ms/op
async second run for stability: 49.378: IO Summary:  5023 ops 162.021 ops/s 33/32 rd/wr 8314.9mb/s 46.103ms/op
stackfs: 53.201: IO Summary:  5028 ops 143.646 ops/s 29/29 rd/wr 7371.9mb/s 42.525ms/op
raw: 18.132: IO Summary:  5023 ops 334.839 ops/s 67/67 rd/wr 17183.9mb/s 19.829ms/op

The block size for the operation seemingly has a huge effect on the efficiency of the async one. I'm guessing the serial latency through the kernel is the limiting factor all told, but I wouldn't be 100% sure on that.

results_set_short.txt

@cberner (Owner) left a comment

Overall this looks about like I expected! I skimmed some of it, so may have missed something. I left a few questions and comments.

Also, there are a bunch of style check failures. Can you run the default make target? It should run all the checks for you.

Review threads on Cargo.toml, src/channel.rs, src/request.rs, src/lib.rs, and src/session.rs (resolved).
@ianoc (Contributor Author) commented Feb 15, 2021

I've addressed all the comments here, done some more error handling, and cleaned up some more comments and such. The failing test now is related to macOS and mounting -- it seems setting O_NONBLOCK on the macfuse implementation of fuse is failing. That could be raised upstream, but otherwise I'm not sure of a simple workaround.

Overall this is a rather large/breaking delta and would require tokio in the deps. An alternative thought is that we/I could split this crate into three:
-> Core/ABI specs (fuser-core)
-> Existing release/master API (fuser)
-> Async/tokio (fuser-tokio)

If in future we were to add an async-std implementation or other async-based ones, we could name them fuser-* and then have a fuser-async crate of shared code. (The alternative here is to have fuser-async and use feature flags on it for the impl?)

Thoughts?

@cberner (Owner) commented Feb 16, 2021

Cool! I'll take another look, and also try building fleetfs against this branch. That's the filesystem I use this crate for the most.

> it seems setting O_NONBLOCK on the macfuse implementation of fuse is failing

Ah ya, I've definitely noticed that macfuse has different behavior in some cases. I would just special case it.

Can you take a look at the GitHub Actions tests? It looks like they didn't run, and that's where most of the test coverage is. It includes things like the xfstests suite.

I'd like to keep everything in this crate. I know some other projects split features into different crates, but I don't really like that approach. Feature flags sound good, and will make this easy to merge & release, without immediately breaking backwards compatibility.

I haven't been following async-std vs tokio. Is async-std expected to become stable at some point?

@ianoc (Contributor Author) commented Feb 16, 2021

Yeah, I'm afraid I edited/added another test in the GitHub Actions config, which seems to have disabled them in this repo. (I wanted to get the Mac ones running in GitHub Actions since the AppVeyor run takes forever.) But I'll revert it here and try to add the GitHub Actions changes in a separate PR, so more of the coverage is done there, with better latency where possible.

@ianoc (Contributor Author) commented Feb 16, 2021

For the Mac one, since there's no O_NONBLOCK, I think I'll have to spawn the read operations onto the blocking thread pool. I'll see about setting up flags to control that behavior and get it working. It'll be a little less performant, I imagine, but how the Mac stuff works is a bit of black magic as far as I can tell, since the main impl is all closed source.

@cberner (Owner) commented Feb 16, 2021

Ya, it's too bad macfuse switched to closed source :( I wrote a pure Rust mount implementation for Linux, and FreeBSD should be doable too, but without seeing the source for macfuse I didn't want to reverse engineer it

@cberner (Owner) commented Feb 16, 2021

Switching my other project, fleetfs, was pretty easy. It was almost entirely mechanical changes, like the SimpleFS example. However, there seems to be a memory leak. I'm not sure if it's in your async changes or something I missed when switching fleetfs over to the async code. I'll look into it if the GitHub CI tests on this PR pass.

@ianoc force-pushed the ianoc/async_fs branch 2 times, most recently from ab7e4cb to 1cde8d6, on February 16, 2021 at 19:27
@ianoc (Contributor Author) commented Feb 19, 2021

OK -- it does all pass now for macOS/Linux.

-> I refactored some of the handling out into a sub-module. There should probably be a feature flag to pick either blocking or async handling there, but both work happily.

-> I replaced the timeouts with oneshot channels so teardown happens faster (see the sketch after this list) -- the xfstests break if the shutdown takes more than some relatively low number of milliseconds.

-> Even after that, some of the tests combined with other tests still seem to flake. For now I changed the xfstests to sleep for 1 second after issuing the umount, which has fixed them all. Let me know if that seems like something we can't have and I can try to see if there is more time we can shave off. (Or possibly it's a case of us firing a umount in the channel drop and it unmounting the new mount, if that's possible.)
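As a rough illustration of that oneshot-based teardown (a sketch assuming a tokio runtime; worker and do_one_request are made-up names, not the actual fuser internals), each worker selects on a shutdown receiver so it exits as soon as any other worker signals, instead of waiting out a polling timeout:

use tokio::sync::oneshot;

async fn worker(mut shutdown: oneshot::Receiver<()>) {
    loop {
        tokio::select! {
            // another worker hit unmount/EOF and signalled: exit immediately
            _ = &mut shutdown => break,
            _ = do_one_request() => {}
        }
    }
}

async fn do_one_request() {
    // read one FUSE request from the channel and dispatch it
}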

@cberner (Owner) commented Feb 20, 2021

Cool! I'll take a look this weekend

@cberner (Owner) left a comment

Overall this is looking good! I left a few comments.

Do you know why the xfstests fail without the sleep? I do think that needs to be fixed. The xfstests suite is pretty much the gold standard for verifying that a filesystem works correctly, so I'd guess it's uncovering a real race or some other issue.

src/channel.rs Outdated
Ok(res)
}

fn internal_new_on_err_cleanup(
Owner (@cberner):

This function doesn't seem to do anything? It always returns Ok

fuse_session,
) {
Ok(r) => Ok(r),
Err(err) => {
Owner (@cberner):

See above, I don't think this code path can trigger

Contributor Author (@ianoc):

After more iterations, I'm not sure the inner function here is adding much value at this point -- but it can fail via

Channel::create_sub_channels(&fd, worker_channel_count)?

The design/approach of the two layers was to let the inner function sprinkle ? everywhere and then have the outer function clean up. Now only one spot is fallible, though, so I'll collapse these and handle failure to create the sub-channels directly.

#[derive(Debug, Clone)]
pub struct SubChannel {
    fd: FileDescriptorRawHandle,
    shared: bool, // If it's a shared file handle, close is a no-op.
Owner (@cberner):

This seems error prone. Is there a way to make FileDescriptorRawHandle close on drop, and then pass around a reference counted file handle?

#[derive(Debug, Clone)]
pub struct SubChannel {
    fd: Arc<AsyncFd<FileDescriptorRawHandle>>,
    shared: bool, // If it's a shared file handle, close is a no-op.
Owner (@cberner):

Same comment about relying on Drop instead

Contributor Author (@ianoc):

Sounds good. I think Arc should be free to use/access outside of drop/clone and not involve any indirection, so either a single Arc wrapping each value or shared Arcs should achieve the same purpose and be much simpler. Will update.

Contributor Author (@ianoc):

Looking at this more, there is a little bit of awkwardness in that we want to be able to call close even though we won't know when everything has dropped (once one FD closes, we close the others to shut everything down). I've refactored it now to be clearer, I think: use Arc + Drop, but the inner FD also has a close method that synchronizes with an atomic boolean to ensure we only call it once.
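A rough sketch of that shape (FdHandle and share are illustrative names, not the PR's exact types): the handle is shared via Arc, close() is made idempotent by the atomic flag, and Drop falls back to the same path.

use std::os::unix::io::RawFd;
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

#[derive(Debug)]
struct FdHandle {
    fd: RawFd,
    is_closed: AtomicBool,
}

impl FdHandle {
    fn close(&self) {
        // swap returns the previous value, so only the first caller actually closes
        if !self.is_closed.swap(true, Ordering::SeqCst) {
            unsafe { libc::close(self.fd) };
        }
    }
}

impl Drop for FdHandle {
    fn drop(&mut self) {
        self.close();
    }
}

// sub-channels hold Arc<FdHandle> clones; whichever path runs first
// (an explicit close or the final drop) closes the fd exactly once
fn share(handle: &Arc<FdHandle>) -> Arc<FdHandle> {
    Arc::clone(handle)
}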

Contributor:

FWIW, as part of my work removing as much unsafe as possible, I've been replacing the raw fd handling (and calls to libc::write, libc::close, etc.) with a normal std::fs::File. That way we have confidence that things are only closed once and there are no data races, because such things are impossible without unsafe.

Contributor Author (@ianoc):

Is this a safe/guaranteed operation for fuse? I had wondered about this when looking at it, but I couldn't find any documentation on either side saying the atomic operations that fuse relies on in the kernel would be respected when going via the file API. This code relies on iovec writes, which are guaranteed to be handled atomically by the fuse kernel code. Will the std::fs::File API make the same guarantee? (I'm wondering whether, if writes wind up getting chunked into smaller/multiple writes, they could get interleaved rather than arriving as one fuse operation.)

Contributor Author (@ianoc):

Unless I'm mistaken, I think you're taking some of these guarantees more strongly than intended -- write_vectored doesn't guarantee that there is one underlying write call; indeed its default fallback is to call write on the first non-empty buffer (which would be an invalid write in our use case here). There is a nightly API to query whether the write would actually be vectored. This would corrupt the fuse operations afaik.

(One attempt doesn't necessarily mean one system call, no? It means that if there's a failure or partial write at the OS level it won't attempt to call write again and will instead return Err or Ok(num bytes processed). Since we explicitly require one-syscall-per-message parity, I don't think these are quite the same strength of contract.)

I guess for these internal uses, where the fuse kernel is abusing the filesystem API, I'm pretty slow to move to a higher-level API not designed for this. The current Unix implementations look right to me (though we pass the FD around more and explicitly call close to terminate operations, which isn't supported in the std::fs::File API), but I believe they could remain within their contract and still break fuse in subtle ways (the iovec write being an example).

Contributor:

> we pass the FD around more and explicitly call close to terminate operations which isn't supported in the std::fs::File api

This was one of my motivations for moving to the File API. I was working through the fuse_sys code and found a few places where it looked like we could close() twice or perform operations after the close. By using the File API and removing the unsafe blocks we'd be forced to get this right.

The exception is when linking against libfuse3 where the FD is by rights owned by the libfuse session, but my plan there would be to do File::into_raw_fd before destroying the session.

> where the fuse kernel is abusing the file system API

Fuse isn't that special in this regard. The concerns about atomic writes would apply to named pipes just the same (PIPE_BUF)

> I believe they could remain within their contract and break fuse in subtle ways (iovec write being an example).

Rust being a systems language, I think it's safe to assume that a single call to write_vectored results in a single call to writev. The Rust maintainers take backwards compatibility very seriously; changing this behaviour would break users who reasonably rely on it, and as such would be a compatibility break.

Contributor Author (@ianoc):

As an example, though, write_vectored calling writev is explicitly not part of the File API contract. The current implementation on Unix does this, but the API doesn't require it and by default doesn't do it.

(There's also the requirement, as far as I'm aware, that you have a mutable handle on the File in order to read/write it with the std::fs::File API, which wouldn't mix well with the current API where it's presumed that users can use the Reply passed in from another thread -- unless we wrap it in mutex locks?)

Contributor Author (@ianoc):

Anyway, I'm relatively new to this library so I'd defer to @cberner as the maintainer. This is a low-level library, so I'd be inclined to say it should aim to use the bare-metal API if it's at all ambiguous. Limiting the access surface of the unsafe code and using Arcs and drop behaviors should, I believe, have the same effect as a reduced-surface File API.

Owner (@cberner):

Even as the maintainer, I'm only medium familiar with this part of the code, since I took it over from the original author :)

I'd like to have as little unsafe code as possible. For the specific case of write_vectored though, I think we should stick with writev until there's a way to check that write_vectored is atomic. Debugging problems from a non-atomic write_vectored implementation would be frustrating
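For reference, the kind of call being defended here, sketched directly against libc (illustrative only; send_reply is a made-up name and the real fuser code differs in details): the header and payload go out in a single writev, which the fuse kernel side treats as one message.

use std::io;
use std::os::unix::io::RawFd;

fn send_reply(fd: RawFd, header: &[u8], payload: &[u8]) -> io::Result<()> {
    let iov = [
        libc::iovec {
            iov_base: header.as_ptr() as *mut libc::c_void,
            iov_len: header.len(),
        },
        libc::iovec {
            iov_base: payload.as_ptr() as *mut libc::c_void,
            iov_len: payload.len(),
        },
    ];
    // one writev == one syscall == one atomic message to /dev/fuse
    let rc = unsafe { libc::writev(fd, iov.as_ptr(), iov.len() as libc::c_int) };
    if rc < 0 {
        Err(io::Error::last_os_error())
    } else {
        Ok(())
    }
}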

join_handles.push(tokio::spawn(async move {
    let r =
        Session::spawn_worker_loop(active_session, ch, receiver, filesystem, idx).await;
    // once any worker finishes/exits, then then the entire session shout be shut down.
Owner (@cberner):

There are some typos in this comment

Contributor Author (@ianoc):

reworded

@@ -0,0 +1,1012 @@
//! Analogue of fusexmp
//!
//! See also a more high-level example: https://github.com/wfraser/fuse-mt/tree/master/example
Owner (@cberner):

Wrong link?

Contributor Author (@ianoc):

I actually cribbed this right from the fuse-rs PR https://github.com/zargony/fuse-rs/pull/141/files, but it's quite possibly out of date/incorrect; I'll take a look tomorrow.

Contributor Author (@ianoc):

The link is intentional, I think, since this seems to be a port of the code in https://github.com/wfraser/fuse-mt/blob/master/example/src/passthrough.rs, which is aimed at the libfuse C bindings. (I think the point of reference is that fusexmp is low-level/C, while the Rust one is higher level?) Open to new wording or changing anything here.

Contributor:

I'm actually in the process of changing fuse-mt over to fuser now fyi, heh

Owner (@cberner):

@wfraser cool! btw, what do you think of making the fuser API async?

@wmanley (Contributor) commented Feb 21, 2021

I'm surprised that using async helped XMP performance. I'd expect async not to help much, as the underlying FS APIs are all synchronous, so typically async FS APIs are just punting operations to a threadpool. That just imposes more copying and synchronisation overhead. Is the "RUST" (i.e. sync) example you benchmarked above using multiple threads, or just a single thread?

@cberner (Owner) commented Feb 23, 2021

> I think its because umount/the cycle of mount/unmount doesn't wait for the rust process to terminate, it just starts the next one. So the old one is still shutting down when the next one starts. I'm not really sure if that behavior is part of the test/accidental side product or important.

If you mean that the old process would race and unmount the new process' mountpoint, I think that case should be handled already. libfuse has logic to check that the fd is still valid before unmounting, and I implemented the same logic in fuse_unmount_pure

@ianoc (Contributor Author) commented Feb 23, 2021

@cberner That's pretty much exactly what I had been telling myself was causing this. Let me go take another look at it.

@cberner (Owner) commented Feb 23, 2021

Cool, will be interested in what you find! It's certainly possible the logic to guard against that race is broken

@ianoc (Contributor Author) commented Feb 25, 2021

I tracked this down to a few poorly interacting things:

  1. The tests run with libfuse2, which doesn't have the race-condition check (it was added in libfuse3). So autounmount being enabled (as it is in the sample) can, in the test environment, cause a bad unmount.
  2. There are some checks in the non-libfuse paths you've added, but the fuser unmount-on-error/drop code itself will also trigger this same bug by default; it's quite similar to the libfuse behavior in that regard. We could probably let the underlying system handle unmounting (i.e. autounmount?) on exit, or we could continue to do the unmount but perform the same checks there too.
  3. The move to the O_NONBLOCK code causes these issues to be highlighted. The unmount kernel interactions seem a bit special here in that they invalidate the file handle without causing polling to be woken up. Right now (since we had some hanging issues from this that I had fixed before), the code ensures we call read on the FD once per second, and if the fuse has gone away we exit.

(3) combined with (1) or (2) means that the process exits 1 second later and will unmount whatever new fuse session is in place, which is what is breaking the tests.

@ianoc (Contributor Author) commented Feb 26, 2021

A simpler example of what I believe is the same race condition is:

ianoc@b70310a

It's pretty hard to make a repro with the master code since it has the blocking reads and shuts down so quickly, though if the user code does anything after replying in some handlers it can trigger it.

(If you block before/arbitrarily, then unmount will complain that there's open activity on the filesystem.)

@cberner (Owner) commented Feb 26, 2021

Ah cool, nice find!

use std::time::Duration;
use tokio::time::timeout;

loop {
    if let Ok(guard_result) = timeout(Duration::from_millis(1000), self.fd.readable()).await
Owner (@cberner):

Instead of a 1sec timeout, is it possible to wait on both the read and the terminated future that's passed into Channel::receive? I think that would remove any unmount delay

Contributor Author (@ianoc):

The terminated future is only satisfied by one of these inner reads firing and triggering the exit cascade -- and we actually do wait on it outside of here; the outer select! should get satisfied based on terminated if/when it happens.

Owner (@cberner):

I see, maybe I misunderstood your previous comment, "(3) combined with (1) or (2), means that the process exits 1 second after". Is that 1sec delay caused by this timeout, or something else?

Contributor Author (@ianoc):

I probably didn't explain this super well. From an IO/FD perspective, if we go through the last request:

  1. The user/someone calls umount on the mount point
  2. Poll on the socket's readiness fires
  3. recv on the socket gets a Statfs request (for some ABIs the kernel always issues this after an unmount, it seems)
  4. readiness will return true on the next try
  5. recv returns E_WOULDBLOCK
  6. readiness blocks
  7. ...some delay in the kernel...
  8. The FD handles become invalid
  9. The timeout fires in the code
  10. We call recv again -- it returns an invalid-FD error, I think

The issue is that around step 8 we would expect the readiness call to return and wake things up, but this doesn't seem to happen with fuse, since the FD isn't closed/marked as readable by the kernel; it's just invalid somehow. The 1-second timeout (and the timeout in general) is an arbitrary workaround for this issue. (Maybe there is some lower-level thing that does fire if we used epoll directly?)

Owner (@cberner):

I see, and where in that sequence do we receive the Operation::Destroy? Since the code does select! on the terminated signal, it seems like that would happen before the timeout fires and avoid the 1sec delay

Contributor Author (@ianoc):

It might have changed in the meantime, but libfuse/libfuse@b9ea32f with the original changelog entry would explain never seeing it:

	* Add a synchronous DESTROY message to kernel interface.  This is
	invoked from umount, when the final instance of the filesystem is
	released.  It is only sent for filesystems mounted with the
	'blkdev' option for security reasons.

Owner (@cberner):

Oh, well that's good to know! Ok, I probably need to read through some of this again with that in mind.

So, if I understand correctly then, previously we only used libc::read which returns immediately when the FD becomes invalid. However, now we call readiness and that blocks for an unknown (and long) delay after the FD is invalidated, which is why we need a timeout now. Is that right?

Contributor Author (@ianoc):

Yep, that's it exactly. It seems that the kernel component of fuse must just kill off blocked read operations. But however it works (or possibly how it interacts with tokio's AsyncFd stack/mio in Rust), poll calls don't get woken up the same way. If the file handle were closed then the readiness call would wake properly, but something seemingly 'special' about fuse happens where it becomes invalid, so it doesn't wake.

Owner (@cberner):

Ok, ya it now makes sense why this races and causes more failures than before :/ Perhaps we should make the timeout configurable, and have a single worker loop that uses a short (~1ms) timeout, and the rest of them use the 1sec timeout

Contributor Author (@ianoc):

Yeah, I had been avoiding too low a timeout, but 1ms for just one FD seems reasonable enough to me. I've added it so the root/original FD now has a 1ms timeout, and re-enabled the mount tests. The race is still possible in the code, but it should be much less likely?

(Also, there's a lot of churn since I rebased on the master changes around the API separation.)
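Sketched out, the workaround discussed in this thread looks roughly like the following (assuming tokio's AsyncFd; wait_for_data and poll_interval are illustrative names): readiness never fires once the fuse fd is invalidated, so the wait is capped and the caller's recv() is what actually observes the dead fd. Per the change above, the root fd would get a ~1ms interval and the other workers the 1-second one.

use std::os::unix::io::RawFd;
use std::time::Duration;
use tokio::io::unix::AsyncFd;
use tokio::time::timeout;

async fn wait_for_data(fd: &AsyncFd<RawFd>, poll_interval: Duration) {
    // Either the fd becomes readable or the interval elapses; in both cases the
    // caller retries recv(), which returns an error once the mount has gone away.
    let _ = timeout(poll_interval, fd.readable()).await;
}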

}

impl FileDescriptorRawHandle {
    pub fn new(fd: RawFd) -> Self {
Contributor (@wmanley), Mar 2, 2021:

I think this function should be unsafe, and fd should be private too; otherwise it can be used from safe code to close or read from any arbitrary FD.

}

/// Receives data up to the capacity of the given buffer (can block).
fn blocking_receive(fd: &FileDescriptorRawHandle, buffer: &mut [u8]) -> io::Result<Option<usize>> {
Contributor (@wmanley), Mar 2, 2021:

This should also be unsafe; fd is not guaranteed to point at a valid, owned fd.

pub fn close(&self) {
    let already_closed = self
        .is_closed
        .swap(true, std::sync::atomic::Ordering::SeqCst);
Contributor:

The atomic here guarantees that a file is only closed once, but there are no guarantees that read or write aren't called on the file after it has been closed. This is unsafe as FDs are reused.

Contributor Author (@ianoc):

unsafe in Rust's definition isn't generally taken to include this type of logic behavior anymore, afaik.

nomicon: https://doc.rust-lang.org/nomicon/what-unsafe-does.html

And similar types in, say, tokio are not marked as unsafe, e.g. https://docs.rs/tokio/1.2.0/tokio/io/unix/struct.AsyncFd.html
(their new, by the logic above, should be unsafe (though it was discussed in their API and they decided not to), since there's no guarantee the FD exists, isn't duplicated, or anything else).

For this stuff we should limit how much can go wrong (try to make it as hard/impossible as we can) and put the right privacy controls in place, but from what I can work out, in Rust these days the unsafe keyword is mostly kept for memory safety.

Contributor Author (@ianoc):

(If you have access to a more concrete guide I'd love to see it; there does seem to be a lot of confusion in this space in general, from what I've seen on different crates.)

Contributor:

> similar types in say tokio are not marked as unsafe, e.g. https://docs.rs/tokio/1.2.0/tokio/io/unix/struct.AsyncFd.html

Note: AsyncFd<T> is not the same. In the case of AsyncFd<File>, the AsyncFd owns the File and the File owns the underlying FD and is responsible for closing it. In the case of AsyncFd<RawFd>, the AsyncFd owns the RawFd, but the RawFd is really only a (non-owning) reference to the underlying fd and is not responsible for closing it. Indeed, RawFd does nothing on drop; it's just a typedef for c_int.

This is the reason that FromRawFd::from_raw_fd is unsafe:

> This function is also unsafe as the primitives currently returned have the contract that they are the sole owner of the file descriptor they are wrapping. Usage of this function could accidentally allow violating this contract which can cause memory unsafety in code that relies on it being true.

AsyncFd<RawFd> will call poll on the underlying FD, but this should be safe as poll shouldn't have side-effects. Calling read, write, or close does have side-effects, so those are unsafe.
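To illustrate the ownership distinction being drawn here (standard-library behaviour, nothing fuser-specific; borrow_then_release is a made-up name): RawFd is just an integer and does nothing on drop, while File assumes sole ownership and closes the descriptor when dropped, which is exactly why from_raw_fd is unsafe.

use std::fs::File;
use std::os::unix::io::{FromRawFd, IntoRawFd, RawFd};

fn borrow_then_release(fd: RawFd) -> RawFd {
    // caller must guarantee `fd` is valid and not owned by anyone else
    let file = unsafe { File::from_raw_fd(fd) };
    // ... use the safe File API here ...
    // hand the fd back without closing it (e.g. when libfuse still owns it);
    // letting `file` drop instead would close the fd
    file.into_raw_fd()
}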

Contributor:

> unsafe in rust's definition isn't generally taken to include this type of logic behavior anymore/generally afaik. ...these days the unsafe keyword is more kept for memory safety.

To be clear: this isn't unsafe because it itself violates memory safety, but because it could cause memory unsafety in other code that does expect to be the sole owner of a file.

Contributor Author (@ianoc):

Does reading different data than expected, due to an interleaving seek, count as memory unsafety? That's the main sort of memory issue I could think of. But do you have something else in mind?

Contributor Author (@ianoc):

Memory mapping a file via the fd? Just curious here as to how these interact.

Contributor Author (@ianoc), Mar 3, 2021:

https://users.rust-lang.org/t/why-is-fromrawfd-unsafe/39670/5 seems to suggest this is a historical artifact from an older definition of unsafe? Or am I still not following?

Specifically when it comes to this code, I was planning to look at more code-level things once the overall API/behaviors seemed sane -- I think we probably want to migrate some of the sub-channel setup code from channel into the per-type files, and make it all private, with the goal that the exposed surface is a safer API that doesn't expose the raw fd in construction or anywhere else.

@ianoc (Contributor Author) commented Mar 5, 2021

In the vein of how we might want to merge this if we didn't want to break everyone, or wanted to allow other async APIs:

multi-crate: https://github.com/ianoc/fuser/tree/ianoc/multi-crate
feature-flags: https://github.com/ianoc/fuser/tree/ianoc/tokio_flagged

Both attempt better encapsulation of the different bits of the stack for the async parts, so we can add/flag more in, with a much smaller set of deps that are 'always on'. Thoughts @cberner?

@wmanley (Contributor) commented Mar 5, 2021

The way I'd imagine this working in the future is separate crates. The low-level ones would be:

  • fuse-proto: sans-io protocol implementation using enums, providing type-safe serialisation/deserialisation. Contains what is currently in src/ll/. Could also contain some helpers for managing objects like FileHandles, INodes, etc.
  • fuse-device: mounting/unmounting, opening the /dev/fuse device, and interacting with it, including read/write and ioctls. This is what is under src/channel/. It might make sense to add async/tokio-specific code here too behind a feature flag - like having Device implement AsyncRead and AsyncWrite, for example, in addition to Read and Write.

Then on top of that there would be higher-level crates that tie the two together. This would be the equivalent of the current Session. The job is to read from the fuse device, deserialise using fuse-proto, and dispatch to user code. This might look like the current sync trait-callback API - but I think it would be better to expose the enum API from fuse-proto. With an async implementation we could do some really cool things like using a multi-threaded dispatcher and automatically implementing cancellation based on FUSE_INTERRUPT.

FWIW I think the enum-based API could work really well for async. Rust is limited by the lack of async in traits - but an enum-based API wouldn't be bound in the same way.
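A very rough sketch of the enum-based request API being described (hypothetical variants, field names, and dispatch function; nothing like this exists in fuser today). Because the dispatcher matches on a decoded value rather than calling async trait methods, it sidesteps the async-in-traits limitation:

use std::ffi::OsStr;

// one variant per FUSE opcode, decoded by the sans-io protocol layer
enum Operation<'a> {
    Lookup { parent: u64, name: &'a OsStr },
    Read { ino: u64, fh: u64, offset: i64, size: u32 },
    Interrupt { unique: u64 },
    // ...
}

// a dispatcher (sync or async) just matches on the decoded operation
async fn dispatch(op: Operation<'_>) {
    match op {
        Operation::Read { ino, offset, size, .. } => {
            // look up the handle, schedule the read, reply...
            let _ = (ino, offset, size);
        }
        _ => { /* other opcodes */ }
    }
}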

@cberner (Owner) commented Mar 6, 2021

I'd like to go with feature flags, and not a multi-crate. I'm open to hearing arguments for why multi-crate would be better though.

My reasoning is that publishing separate crates is only useful if they have separate public APIs which users will build on top of. And, I don't really want to support a fuse-proto API which provides super low level deserialization APIs. Right now only the constants in fuse_api are public.

Another reason is trust, which I realize is subjective, but I dislike the approach of having many crates named "project-*" because it makes it difficult to know which are official crates of the project. For example, I know "tokio" is an official crate, and I know "tokio-util" is too (but only because I've looked it up). However, there are a bunch of other crates, like "tokio-file", and I don't know if they're official or not. By having a single crate with feature flags, it's clear to users that everything is from the official project.

@wmanley (Contributor) commented Mar 9, 2021

> I'd like to go with feature flags, and not a multi-crate

You've convinced me.

@ianoc: I like how your async_api is a module, preserving backwards compatibility. I'd quite like to put the existing sync API in a module too, but that's for another PR.
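Roughly what that gating could look like in lib.rs (the feature name and module contents here are illustrative, not the actual layout): the async surface only exists when the flag is enabled, so the default build and its public API stay unchanged.

// the existing blocking API stays at the crate root exactly as before
#[cfg(feature = "async-tokio")]
pub mod async_api {
    // the tokio-based Session, mount helpers, etc. would live here
    pub struct Session;
}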

@ianoc (Contributor Author) commented Mar 10, 2021

Superseded by doing separate, smaller PRs and feature flagging.

@ianoc closed this Mar 10, 2021
@cberner mentioned this pull request May 8, 2021
@PlamenHristov commented:

@ianoc, @cberner, has there been any progress on this one that I've missed?
