
Be more robust against file descriptors/handles disappearing under our feet #1272

Closed
njsmith opened this issue Oct 27, 2019 · 2 comments · Fixed by #1285

Comments


njsmith commented Oct 27, 2019

In this discussion we discovered that the epoll and Windows backends can get into quite a mess if a file descriptor/socket handle gets closed under our feet without telling us – deadlocks, loop crashes, etc. Technically this isn't our problem, because the user should have called notify_closing. But the experience would be better if we could provide clearer error messages and survive, instead of locking up.

Here's an example of one of the nasty cases:

import socket as stdlib_socket
import trio
from trio.testing import wait_all_tasks_blocked

# fill_socket() is a test helper that sends on the socket until its send
# buffer is full, so that wait_writable actually blocks.
a, b = stdlib_socket.socketpair()
with a, b, a.dup() as a2:
    a.setblocking(False)
    b.setblocking(False)
    fill_socket(a)
    async with trio.open_nursery() as nursery:
        nursery.start_soon(trio.hazmat.wait_readable, a)
        nursery.start_soon(trio.hazmat.wait_writable, a)
        await wait_all_tasks_blocked()
        a.close()
        nursery.cancel_scope.cancel()

Since we dup'ed a, the underlying kernel object survives the call to a.close(). So the wait_readable/wait_writable tasks keep going, and if we manipulated b we could even make them wake up as normal. But instead we cancel. Whichever task is cancelled first attempts to modify the underlying primitive to only wait for the other task, but the fd it uses to do this has disappeared under its feet, and things go bad quickly.

This case is particularly nasty on the epoll backend, because there it's actually impossible to deregister our interest in a: we're going to keep getting events for a whether we want them or not. The only way to survive is to close our epoll descriptor and rebuild it. This is possible, but I'm not sure whether it's actually better than crashing... The thing is that we can't really detect these situations reliably. It's possible in this particular test case, but in a more complex program the file descriptor might get reassigned between the call to close and the call to cancel, in which case notifications will silently be sent to the wrong task. So even if we can do some kind of best-effort recovery, we still need to somehow tell users to fix their code, not just pretend that everything is fine.


njsmith commented Oct 28, 2019

For the epoll issue, another possibility to consider would be to use EPOLLONESHOT all the time. That naturally limits the damage caused by a spurious file descriptor being left in the interest set, because it can cause at most 1 spurious wakeup. It would also let us skip explicitly unregistering fd's after each wait, which in theory should give a small speedup.

An annoying thing about EPOLLONESHOT is that you have to know whether you've previously used the fd with this epoll handle – the first time you have to "add" the fd with EPOLL_CTL_ADD, and on subsequent times you have to "re-arm" it with EPOLL_CTL_MOD. This is the main reason I gave up on it before: tracking this state seemed like it would add more complications and edge cases than it fixed.

However, on further thought, I think it would probably be OK to use an "optimistic" strategy, where we first try EPOLL_CTL_MOD, and if that fails we fall back on EPOLL_CTL_ADD. In the worst case we're still doing two syscalls, which is what we do now in every case – and since the vast majority of the time people pass in the same fd over and over, the common case becomes a single syscall.

Importantly, I think it's actually safe to leave stale fd's hanging around inside the epoll handle, because epoll only weak-references the file descriptions you pass in, so when a descriptor is closed then it automatically disappears from the epoll set. I guess notify_closing might want to explicitly unregister it anyway just to be tidy? I can't think of any reason this would matter either way.
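Here's a quick demonstration of that auto-removal behavior (Linux-only sketch using the stdlib select.epoll wrapper):

```python
import select
import socket

a, b = socket.socketpair()
ep = select.epoll()
fd = a.fileno()
ep.register(fd, select.EPOLLIN)
b.send(b"x")                    # make a readable
assert ep.poll(0) == [(fd, select.EPOLLIN)]
# Closing the last fd referring to this open file description removes
# it from the epoll interest set automatically -- no unregister needed.
a.close()
assert ep.poll(0) == []
```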

I do want to double-check what happens if you have the sequence:

  1. Fd A, referring to socket S1 gets registered with epoll.
  2. Fd A gets dup'ed, to create Fd B.
  3. Fd A gets closed. (But Fd B keeps socket S1 alive, so S1 remains in the epoll interest set.)
  4. Fd A gets reassigned to refer to socket S2.
  5. Fd A gets registered with epoll a second time, with the new socket.


njsmith commented Oct 28, 2019

OK yeah so I tried it out, and it looks like if you register the same file descriptor number twice, referring to two sockets, then you end up with two totally independent registrations, and epoll reports events for both. You can even unregister the live fd, and that works just fine. Of course when you get events you can't tell which socket they refer to, but I think this means the ONESHOT approach would work fine? Even if a closed fd hangs around in the "disabled" part of the interest set, that shouldn't interfere with new fd's with the same number being added.
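For the record, the experiment can be reproduced with the stdlib select.epoll wrapper (Linux-only sketch; os.dup2 is used here to force the fd number to be reassigned):

```python
import os
import select
import socket

ep = select.epoll()
a, b = socket.socketpair()
fd = a.fileno()
ep.register(fd, select.EPOLLIN)  # step 1: register fd A -> socket S1
dup_fd = os.dup(fd)              # step 2: dup keeps S1 alive
b.send(b"x")                     # S1 becomes readable
a.close()                        # step 3: fd number A is now free
c, d = socket.socketpair()
os.dup2(c.fileno(), fd)          # step 4: reassign number A to socket S2
ep.register(fd, select.EPOLLIN)  # step 5: second, independent registration
d.send(b"y")                     # S2 becomes readable too
events = ep.poll(0)
# Two independent interest-set entries report under the same fd number:
assert len(events) == 2 and all(f == fd for f, _ in events)
# Unregistering the live fd removes only the S2 entry; the stale S1
# entry keeps reporting under the same number.
ep.unregister(fd)
assert ep.poll(0) == [(fd, select.EPOLLIN)]
```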
