Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eio_linux: drain ring on exit #469

Closed
wants to merge 1 commit into from

Conversation

talex5
Copy link
Collaborator

@talex5 talex5 commented Mar 22, 2023

Normally, all operations should have finished by the time we exit because we don't exit until all fibers have finished, and a fiber can't finish while an operation is in progress. However, this is not the case for cancellation operations, which may still be active in rare cases. So drain any remaining CQEs at exit.

Other options here are to allow cancellation operations to block (which they used to do, but that caused other problems), or to have the main operation wait for its cancellation too (but that's tricky and affects the fast-path).

Fixes #467.

Normally, all operations should have finished by the time we exit
because we don't exit until all fibers have finished, and a fiber can't
finish while an operation is in progress. However, this is not the case
for cancellation operations, which may still be active in rare cases.
So drain any remaining CQEs at exit.

Other options here are to allow cancellation operations to block (which
they used to do, but that caused other problems), or to have the main
operation wait for its cancellation too (but that's tricky and affects
the fast-path).

Fixes ocaml-multicore#467.
@talex5 talex5 added the bug Something isn't working label Mar 22, 2023
let rec aux errors =
if Uring.active_ops uring = 0 then errors
else (
match Uring.wait ~timeout:1.0 uring with
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not so sure about a 1s timeout here. I could imagine an inflight operation taking longer than that (slow disk or net). Is there any use in having a timeout here at all? It might be better just to leave the process hanging if there's something weird going on with Uring requests not completing. It won't be deadlocked, since in theory the kernel can unwedge the process by pushing a completion event...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If it's working correctly then the only things that can be in the ring here are cancellation requests for operations that have already finished, which should therefore complete immediately.

If there's anything else, then we'd probably like to see the error rather than hanging (because that means there's a bug in Eio).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm just wary of magic timeout numbers (the 1s). Looping indefinitely with a log seems better. In a stress test, it's kind of hard to spot log entries with errors (although there would be a non-zero exit code here), but quite easy to spot hanging processes taking up all the room on the machine...

@talex5
Copy link
Collaborator Author

talex5 commented Mar 23, 2023

Closing this in favour of #470. The main loop isn't supposed to finish until active_ops is zero, but it exits early if the root fiber raises an exception, which is wrong. Once that's fixed, this PR is unnecessary.

@talex5 talex5 closed this Mar 23, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

eio_linux: race cancelling jobs and exiting at the same time
2 participants