Prevent file descriptor leak in util.just_run #237

mariokostelac · 2022-06-28T15:42:45Z

While working with papermill I've found that repeated running of notebooks is eventually bumping into the limit of number of opened files.

Following he stacktrace led me to util.just_run function that creates asyncio loop in some cases, but doesn't close them, which leaves some eventpoll file descriptors opened forever.

This PR consists of two commits

The first commit creates a failing test, showing that the current implementation is leaving file descriptors opened after each run. On my machine (unix selectors), it leaves 3 FDs per just_run call. It also explicitly declares psutil as a dependency. It was already a transitive dependency, but I'm promoting to be explicit since I'm calling it directly.
The second commit fixes the issue by stopping and closing used loop, but just in cases where the method created the loop.

Happy to discuss the approach, missing context, and next steps to fix this bug.

mariokostelac · 2022-06-29T06:58:13Z

I see some type checks are failing. I'll probably need help to get around it. I've tried using MagicMock to swap loop object, but some code asserts that loop objects is AbstractEventLoop, which makes it fail.

davidbrochart

Thanks a lot @mariokostelac, that looks good.

nbclient/tests/test_util.py

davidbrochart · 2022-06-29T07:35:34Z

nbclient/tests/test_util.py

+    loop.stop = MagicMock(wraps=loop.stop)
+    loop.close = MagicMock(wraps=loop.close)


Let's just ignore type checks on these.

Suggested change

loop.stop = MagicMock(wraps=loop.stop)

loop.close = MagicMock(wraps=loop.close)

loop.stop = MagicMock(wraps=loop.stop) # type: ignore

loop.close = MagicMock(wraps=loop.close) # type: ignore

I think I might have a better solution for these, without ignoring type checks.

I can't run tests locally, though.

(venv) dev-envs-ad\mario@a-2zurb4h4bbny ➜ nbclient git:(mario/fix_fd_leak) ls binder CHANGELOG.md CONTRIBUTING.md docs LICENSE MANIFEST.in nbclient nbclient.egg-info pyproject.toml pytest.ini README.md RELEASING.md requirements-dev.txt requirements.txt setup.cfg setup.py venv (venv) dev-envs-ad\mario@a-2zurb4h4bbny ➜ nbclient git:(mario/fix_fd_leak) tox -e py38 ERROR: tox config file (either pyproject.toml, tox.ini, setup.cfg) not found

We're not using tox, you should just run pytest.

Ah I see, I was running pytest, but it was missing the type checking piece that failed. I think https://github.com/jupyter/nbclient/blob/main/CONTRIBUTING.md might need an update.

Re: type checking. Would it make sense to put it as pre-commit check?

Ah I see, I was running pytest, but it was missing the type checking piece that failed. I think https://github.com/jupyter/nbclient/blob/main/CONTRIBUTING.md might need an update.

Indeed!

Re: type checking. Would it make sense to put it as pre-commit check?

Definitely, if you want to open a PR for that, it would be great.

Co-authored-by: David Brochart <david.brochart@gmail.com>

mariokostelac · 2022-06-29T14:35:20Z

@davidbrochart I think this is in better shape now.

davidbrochart · 2022-06-29T14:56:12Z

Trigger CI.

mariokostelac · 2022-06-29T20:43:08Z

This is interesting. It seems that something in jupyter_client is now failing.
I've taken a look at the stacktrace, it might be that https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#14 or https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#18 is getting our closed loop, and trying to run on it.

It's also strange that the function I'm changing looks remarkably similar to the one that's in the stacktrace. Why is that? @davidbrochart I see that you've authored both of them (at least parts).

I'll do some debugging tomorrow or on Friday. If you have some context that might help, I'm all ears.

Also sorry for not running the whole testsuite and delaying this bug discovery till CI run.

mariokostelac · 2022-06-30T20:16:20Z

@davidbrochart there's some discussion in jupyter/jupyter_client#772 (comment).

mariokostelac · 2022-06-30T20:25:02Z

I've tried hiding our loop from jupyter_client entirely by running asyncio.set_event_loop(None) after https://github.com/jupyter/nbclient/pull/237/files#diff-ce3eb5ea15b9f32af7e1dc032980287831900dc5538d389fc9875d7133c52220R63.

Unfortunately,

nbclient/nbclient/tests/test_client.py

Line 488 in 33b6e5f

kc.shutdown()

(and test after) then fail because kc.shutdown eventually tries to send something to control channel, which expect some running loop.
I'm not 100% sure I can locate the bug. Is the bug in the test, running kc.shutdown outside of the loop?

mariokostelac · 2022-07-01T13:21:10Z

I've spent more time investigating this and I think this solution might be dangerous. It's changing the "current loop", and then stopping it. jupyter_client expects that the loop returned by get_event_loop is not closed, and there might be other libraries that expect the same. Shipping it could break client's processes in unexpected ways, and it might be very difficult for them to figure out why it happened.

There are several approaches we could take:

At the end of just_run, if there wasn't a running loop, set it to none with set_event_loop(None). Drawback of that approach is that, prior to just_run call, there might had been a non-running (but open!) loop attached to the running thread. In that case, just_run would create a new loop, overwrite the previous loop, and nothing would close the old loop. That'd leak FDs as well.
We could try harder to get some loop to run our coro, effectively do the same thing as https://github.com/jupyter/jupyter_client/blob/3697a0d5e528be4c61fdfaaec0923acbfe0c1ea8/jupyter_client/utils.py#L18. In cases of getting a non-running, non-closed loop, we'd reuse it (by running again). That solves the leaking problem, and doesn't leave the system in some possibly broken state (closed loop attached to the thread).
We could import jupyter_client.utils.run_sync method and use it directly, instead of reimplementing the same functionality.

Happy to hear your thoughts on these approaches, or new approaches entirely.

blink1073 · 2022-07-28T15:26:52Z

I like the idea of using jupyter_client.utils.run_sync here, and fixing it properly upstream.

mariokostelac added 2 commits June 28, 2022 17:43

Add a test ensuring that just_run doesn't leak fds

9a877a1

Make sure just_run doesn't leak file descriptors of new loops.

818f2a7

mariokostelac force-pushed the mario/fix_fd_leak branch from c87ce65 to 818f2a7 Compare June 28, 2022 15:43

davidbrochart requested changes Jun 29, 2022

View reviewed changes

mario-at-intercom and others added 3 commits June 29, 2022 10:32

Update nbclient/tests/test_util.py

47283dc

Co-authored-by: David Brochart <david.brochart@gmail.com>

Call sleep without helper function

fbf02e9

Rewrite test to avoid mypy issues

11fea9b

davidbrochart closed this Jun 29, 2022

davidbrochart reopened this Jun 29, 2022

mariokostelac mentioned this pull request Jun 30, 2022

Fix shutdown and cleanup behavior jupyter/jupyter_client#772

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prevent file descriptor leak in util.just_run #237

Prevent file descriptor leak in util.just_run #237

mariokostelac commented Jun 28, 2022 •

edited

Loading

mariokostelac commented Jun 29, 2022 •

edited

Loading

davidbrochart left a comment

davidbrochart Jun 29, 2022

mariokostelac Jun 29, 2022

davidbrochart Jun 29, 2022

mariokostelac Jun 29, 2022

davidbrochart Jun 29, 2022

mariokostelac commented Jun 29, 2022

davidbrochart commented Jun 29, 2022

mariokostelac commented Jun 29, 2022 •

edited

Loading

mariokostelac commented Jun 30, 2022

mariokostelac commented Jun 30, 2022

mariokostelac commented Jul 1, 2022

blink1073 commented Jul 28, 2022

		loop.stop = MagicMock(wraps=loop.stop)
		loop.close = MagicMock(wraps=loop.close)

Prevent file descriptor leak in util.just_run #237

Are you sure you want to change the base?

Prevent file descriptor leak in util.just_run #237

Conversation

mariokostelac commented Jun 28, 2022 • edited Loading

mariokostelac commented Jun 29, 2022 • edited Loading

davidbrochart left a comment

Choose a reason for hiding this comment

davidbrochart Jun 29, 2022

Choose a reason for hiding this comment

mariokostelac Jun 29, 2022

Choose a reason for hiding this comment

davidbrochart Jun 29, 2022

Choose a reason for hiding this comment

mariokostelac Jun 29, 2022

Choose a reason for hiding this comment

davidbrochart Jun 29, 2022

Choose a reason for hiding this comment

mariokostelac commented Jun 29, 2022

davidbrochart commented Jun 29, 2022

mariokostelac commented Jun 29, 2022 • edited Loading

mariokostelac commented Jun 30, 2022

mariokostelac commented Jun 30, 2022

mariokostelac commented Jul 1, 2022

blink1073 commented Jul 28, 2022

mariokostelac commented Jun 28, 2022 •

edited

Loading

mariokostelac commented Jun 29, 2022 •

edited

Loading

mariokostelac commented Jun 29, 2022 •

edited

Loading