eio_linux: drain ring on exit #469

talex5 · 2023-03-22T11:07:32Z

Normally, all operations should have finished by the time we exit because we don't exit until all fibers have finished, and a fiber can't finish while an operation is in progress. However, this is not the case for cancellation operations, which may still be active in rare cases. So drain any remaining CQEs at exit.

Other options here are to allow cancellation operations to block (which they used to do, but that caused other problems), or to have the main operation wait for its cancellation too (but that's tricky and affects the fast-path).

Fixes #467.
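As a rough illustration of the "drain at exit" idea described above, here is a minimal sketch on a toy model of a ring. The `ring` type and the `make_ring`, `wait` and `drain` functions are hypothetical stand-ins written for this example; they are not the `Uring` API and not the code in this PR. The toy `wait` returning `None` plays the role of a timeout.

```ocaml
(* Toy model of a ring: NOT the real Uring API, only an illustration. *)
type ring = {
  completions : int Queue.t;  (* completions that will eventually be delivered *)
  mutable active_ops : int;   (* operations submitted but not yet reaped *)
}

(* Hypothetical constructor: [in_flight] operations are active, of which
   [will_complete] will actually produce a completion. *)
let make_ring ~in_flight ~will_complete =
  let completions = Queue.create () in
  for id = 1 to will_complete do Queue.add id completions done;
  { completions; active_ops = in_flight }

(* Hypothetical wait: returns the next completion, or [None] to model a timeout. *)
let wait ring =
  match Queue.take_opt ring.completions with
  | None -> None
  | Some id ->
    ring.active_ops <- ring.active_ops - 1;
    Some id

(* Reap completions until nothing is active. Returns [Ok n] with the number
   reaped, or [Error remaining] if a wait timed out with operations still active. *)
let drain ring =
  let rec aux reaped =
    if ring.active_ops = 0 then Ok reaped
    else
      match wait ring with
      | Some _ -> aux (reaped + 1)
      | None -> Error ring.active_ops
  in
  aux 0
```

In this model, a ring holding only operations that complete promptly drains to `Ok n`, while anything that never completes surfaces as an `Error` instead of a hang.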

avsm · 2023-03-22T13:43:00Z

lib_eio_linux/sched.ml

+  let rec aux errors =
+    if Uring.active_ops uring = 0 then errors
+    else (
+      match Uring.wait ~timeout:1.0 uring with


I'm not so sure about a 1s timeout here. I could imagine an inflight operation taking longer than that (slow disk or net). Is there any use in having a timeout here at all? It might be better just to leave the process hanging if there's something weird going on with Uring requests not completing. It won't be deadlocked, since in theory the kernel can unwedge the process by pushing a completion event...

talex5 · 2023-03-22

If it's working correctly then the only things that can be in the ring here are cancellation requests for operations that have already finished, which should therefore complete immediately.

If there's anything else, then we'd probably like to see the error rather than hanging (because that means there's a bug in Eio).

avsm · 2023-03-22

I'm just wary of magic timeout numbers (the 1s). Looping indefinitely with a log seems better. In a stress test, it's kind of hard to spot log entries with errors (although there would be a non-zero exit code here), but quite easy to spot hanging processes taking up all the room on the machine...
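The "loop indefinitely with a log" alternative could look roughly like the sketch below. It uses a scripted toy ring so that it terminates when run; the `event` and `ring` types, `wait`, and `drain_logging` are all hypothetical names for this example, not part of `Uring` or Eio.

```ocaml
(* Toy model: each call to [wait] consumes one scripted event. *)
type event = Timed_out | Completed

type ring = {
  mutable script : event list;  (* what successive waits will observe *)
  mutable active_ops : int;     (* operations not yet reaped *)
}

(* Hypothetical wait: [true] if a completion was reaped, [false] on timeout. *)
let wait ring =
  match ring.script with
  | Completed :: rest ->
    ring.script <- rest;
    ring.active_ops <- ring.active_ops - 1;
    true
  | Timed_out :: rest ->
    ring.script <- rest;
    false
  | [] -> false  (* nothing scripted: behaves like a wait that always times out *)

(* Never give up: on each timeout, log and wait again.
   Returns the number of timeouts seen before the ring was empty. *)
let drain_logging ~log ring =
  let rec aux timeouts =
    if ring.active_ops = 0 then timeouts
    else if wait ring then aux timeouts
    else begin
      log (Printf.sprintf "still waiting for %d operation(s)" ring.active_ops);
      aux (timeouts + 1)
    end
  in
  aux 0
```

Here the timeout only controls how often a message is logged, not when the loop gives up, so a stuck operation shows up as a hanging process with repeated log lines.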

talex5 · 2023-03-23T10:03:59Z

Closing this in favour of #470. The main loop isn't supposed to finish until active_ops is zero, but it exits early if the root fiber raises an exception, which is wrong. Once that's fixed, this PR is unnecessary.
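The invariant described here can be sketched in a heavily simplified form. This is not the change made in #470 and not how the Eio scheduler is structured; `run`, `active_ops` and `step` are hypothetical names, with `step` standing for "process one completion".

```ocaml
(* Simplified model: run the root function, but do not return (or re-raise)
   until every outstanding operation has been accounted for. *)
let run ~active_ops ~step root =
  (* Capture the root's outcome instead of letting an exception escape early. *)
  let outcome = try Ok (root ()) with ex -> Error ex in
  (* Keep processing completions until nothing is active. *)
  while active_ops () > 0 do step () done;
  (* Only now report the root's result or failure. *)
  match outcome with
  | Ok v -> v
  | Error ex -> raise ex
```

With this shape, a root function that raises still leaves the loop running until the active count reaches zero, and the exception is re-raised afterwards.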

talex5 added the bug label Mar 22, 2023

avsm reviewed Mar 22, 2023

talex5 closed this Mar 23, 2023
