Hang in eio_linux tests #319
Comments
Sorry I had missed the original message, yes it hangs for me, rather fast:
with
I reduced the program that hangs as much as I could, down to this. I think my initial impression is correct: somehow the EOF is never seen by the reader.

    let test_copy () =
      Eio_linux.run ~queue_depth:10 @@ fun _stdenv ->
      Eio.Switch.run @@ fun sw ->
      let from_pipe, to_pipe = Eio_linux.pipe sw in
      let buffer = Cstruct.create 20 in
      Eio.Flow.copy (Eio.Flow.string_source "a") to_pipe;
      Eio.Flow.close to_pipe;
      let () =
        try
          while true do
            ignore (Eio.Flow.read from_pipe buffer)
          done
        with End_of_file -> ()
      in
      Eio.Flow.close from_pipe

strace here: https://gist.github.com/haesbaert/437fd9e30e4568cc3f5ba95f0387d63a

Writer is FD6, which is actually closed during the hang (checked by looking at /proc/foo); FD5 (reader) is still open and we are blocked in

The pattern I see is that if the

My only theory of why you can't trigger the bug is that in your tests the writer/reader dance always terminates in the order where the close only happens after the reader is queued; the order really depends on which CQE comes back first in the Fiber.both() tests. The program above should always trigger the bad case; I can hang it in < 5 seconds.

Tomorrow I wanna try to peek at the uring stats, like dropped requests and whatnot. I'll also write the equivalent in C and try to trigger it. At this point, it smells like a kernel bug though.
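For comparison, here is a minimal sketch of the same write, close, read-until-EOF sequence on a plain blocking pipe using only OCaml's standard Unix library, with no io_uring involved. It is not taken from the Eio code base; the name count_until_eof is hypothetical and only illustrates the behaviour the reader is expected to see: one byte, then a zero-length read signalling EOF.

```ocaml
(* Sketch only: the reproducer's sequence on ordinary blocking FDs.
   [count_until_eof] is a hypothetical name, not an Eio function. *)
let count_until_eof () =
  let from_pipe, to_pipe = Unix.pipe ~cloexec:true () in
  let buffer = Bytes.create 20 in
  (* Write a single byte, then close the write end. *)
  ignore (Unix.write_substring to_pipe "a" 0 1);
  Unix.close to_pipe;
  (* Read until the kernel reports EOF, i.e. read returns 0. *)
  let rec loop total =
    match Unix.read from_pipe buffer 0 (Bytes.length buffer) with
    | 0 -> total
    | n -> loop (total + n)
  in
  let total = loop 0 in
  Unix.close from_pipe;
  total
```

If the loop terminates, the reader saw the EOF; the hang described above corresponds to that final zero-length read never completing when the read goes through io_uring on a blocking pipe.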
Turns out the kernel needs to handle blocking file descriptors differently: it has to do thread-pooling, which is a different (and slower) code path. That path also seems to be buggy in some cases, which makes sense, as I believe most people don't test with blocking FDs. I believe we should make all our FDs non-blocking as a precaution. There is a program listed in issue ocaml-multicore#319 that replicates this. I could hang it in < 5 seconds, and it's now been running for 10 minutes with the fix, so kaboom. TLDR: linux being linux.
I can confirm the bug with a simple C program: https://gist.github.com/haesbaert/10d3e3bb5fa9171dfcf65e1f5b58e95c
OK, that hangs for me after a while too! Using Linux 5.19.9.
I've tested the kernel patch from axboe in axboe/liburing#665 (comment) and indeed it fixes the bug.
Turns out the kernel needs to handle blocking file descriptors differently: it has to do thread-pooling, which is a different (and slower) code path. That path also seems to be buggy in some cases, which makes sense, as I believe most people don't test with blocking FDs. I believe we should make all our FDs non-blocking as a precaution. There is a program listed in issue ocaml-multicore#319 that replicates this. I could hang it in < 5 seconds, and it's now been running for 10 minutes with the fix, so kaboom. A fix has been committed to mainline Linux (should be in Linux 6.1): torvalds/linux@46a525e
This issue turned out to be a kernel bug, which has been fixed in torvalds/linux@46a525e and was reported here: axboe/liburing#665 (comment). We can work around it by making sure pipes are non-blocking: uring considers pipes unbounded work and relies on uring worker threads if they are blocking, and the bug is only triggered in that case, so force them to be non-blocking. If we use splice, the splice call itself needs the worker threads and the bug surfaces again (I verified this with perf probes), so disable splice for now. Even with the fix, it's desirable to keep pipes non-blocking to avoid thread pooling. The splice call can return EAGAIN in uring, and this happens even with the kernel patched, so handle it for the future. We can tune this better by disabling splice only for unpatched kernels.
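The general shape of the workaround can be sketched with OCaml's standard Unix library alone. This is an illustration under assumptions, not the actual Eio change: nonblocking_pipe, read_would_block and read_retry are hypothetical names, and the EAGAIN handling shown is for a plain read rather than for splice under io_uring.

```ocaml
(* Hypothetical helper, not part of Eio: a pipe with O_NONBLOCK set on
   both ends, so operations on it never need to block a worker thread. *)
let nonblocking_pipe () =
  let r, w = Unix.pipe ~cloexec:true () in
  Unix.set_nonblock r;
  Unix.set_nonblock w;
  (r, w)

(* Demo only: true if a read on [fd] fails with EAGAIN / EWOULDBLOCK.
   Note that it consumes one byte if data happens to be available. *)
let read_would_block fd =
  let buf = Bytes.create 1 in
  match Unix.read fd buf 0 1 with
  | _ -> false
  | exception Unix.Unix_error ((Unix.EAGAIN | Unix.EWOULDBLOCK), _, _) -> true

(* Read into [buf]; on EAGAIN, wait until [fd] is readable and retry.
   Returns the number of bytes read, or 0 at EOF. *)
let rec read_retry fd buf =
  match Unix.read fd buf 0 (Bytes.length buf) with
  | n -> n
  | exception Unix.Unix_error ((Unix.EAGAIN | Unix.EWOULDBLOCK), _, _) ->
    (* A negative timeout means "wait without a time limit". *)
    ignore (Unix.select [ fd ] [] [] (-1.0));
    read_retry fd buf
```

With a non-blocking pipe, an operation that cannot make progress reports EAGAIN instead of parking a thread, so the caller has to be prepared to wait for readiness and retry, which is the same obligation the splice path has when it sees EAGAIN.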
Make sure we work around the uring pipe bug from ocaml-multicore#319 in Eio_unix.pipe. Zap Eio_linux.pipe since it's only used in tests.
Make sure we use pipes as non-blocking (see ocaml-multicore#319) in Eio_unix.pipe. Zap Eio_linux.pipe since it's only used in tests.
@haesbaert wrote:
We should try to find out what's causing this. I wrote: