Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

eio_linux: race cancelling jobs and exiting at the same time #467

Closed
talex5 opened this issue Mar 21, 2023 · 0 comments · Fixed by #470
Closed

eio_linux: race cancelling jobs and exiting at the same time #467

talex5 opened this issue Mar 21, 2023 · 0 comments · Fixed by #470
Labels
bug Something isn't working

Comments

@talex5
Copy link
Collaborator

talex5 commented Mar 21, 2023

When the main loop exits, it checks that no uring operations are still in progress. Normally this will be the case because we don't allow a fiber to resume until its operation is complete, and the event loop doesn't exit until all fibers have done so.

However, cancel operations don't stop and wait. So I think this can happen:

  1. Fiber A starts an operation.
  2. Fiber B asks to cancel it and finishes.
  3. Fiber A's operation completes by itself and fiber A finishes.
  4. We try to end the loop, but get an error because the cancel operation hasn't yet been processed.

I think this is the cause of #466 (comment).

We should probably just wait for cancel operations to complete the same way we do for other operations.

This issue was previously hidden because the exception was reported using Log.warn, which silently drops messages by default. #465 changed this to raise an exception instead.

@talex5 talex5 added the bug Something isn't working label Mar 21, 2023
talex5 added a commit to talex5/eio that referenced this issue Mar 22, 2023
Normally, all operations should have finished by the time we exit
because we don't exit until all fibers have finished, and a fiber can't
finish while an operation is in progress. However, this is not the case
for cancellation operations, which may still be active in rare cases.
So drain any remaining CQEs at exit.

Other options here are to allow cancellation operations to block (which
they used to do, but that caused other problems), or to have the main
operation wait for its cancellation too (but that's tricky and affects
the fast-path).

Fixes ocaml-multicore#467.
talex5 added a commit to talex5/eio that referenced this issue Mar 23, 2023
If a fiber raises an exception then the mainloop exits immediately.
This normally indicates a bug in Eio, since spawned fibers have
exception handlers that pass the error to the switch instead. However,
that wasn't the case for the root fiber.

If the root fiber raised an exception after submitting a cancel
operation then we would try to exit the ring too early, resulting in:

    Invalid_argument("exit: 1 request(s) still active!")

Fixes ocaml-multicore#467.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
1 participant