-
Notifications
You must be signed in to change notification settings - Fork 203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
explicitly close runner loop of function process #4307
Conversation
Codecov Report
@@ Coverage Diff @@
## release/1.3.1 #4307 +/- ##
=================================================
+ Coverage 79.10% 79.12% +0.02%
=================================================
Files 467 467
Lines 34492 34498 +6
=================================================
+ Hits 27283 27292 +9
+ Misses 7209 7206 -3
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
Well, this could certainly be one of potential file leaks. However, it is impossible to say whether this was the only one. Could you or Daniele check out this branch (or apply the fix) and run the test again to see if now the file handles remain more or less constant? |
aiida/engine/processes/functions.py
Outdated
@@ -156,6 +156,7 @@ def kill_process(_num, _frame): | |||
if original_handler: | |||
signal.signal(signal.SIGINT, original_handler) | |||
runner.close() | |||
runner.loop.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking at the close
method of the Runner
, I see that it calls stop
:
aiida-core/aiida/engine/runners.py
Line 152 in f2d1e94
self.stop() |
Which in turn calls
stop
on its loop
:aiida-core/aiida/engine/runners.py
Line 142 in f2d1e94
self._loop.stop() |
However, apparently it does not call
close
on the loop
. Is it known that closing a loop is necessary to free all resources, including open file handles?
The question arises why Runner.close
should not also call loop.close()
? My guess is because the loop
could be shared since it can be passed into the constructor of the Runner
. Would it make sense though to have Runner.close()
also close the loop if and only if the loop was constructed by the runner itself? @muhrin what was the original design here? Would this difference in behavior I propose be more confusing in the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a chance to look through this now and here are my thoughts.
- Yes, it is expected that
close()
frees file descriptors, there's even a parameter to explicitly free all those created from the scope of the loop by client code (https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.close). - I agree that the
Runner
should take responsibility for cleaning it up if it created it. Personally, I would find that less confusing. Just drop a flag intoRunner.__init__
that keeps track.
One thing that will have to be born in mind is that this will have to change once @unkcpz 's work gets merged as there will only be one loop (used in a re-entrant manner). For this, I suggest we keep track of how many files are open before and after the run_get_node
execution so at least we know if it's leaking.
Sure, I can confirm that the basic file descriptor leak when running WorkChains is fixed (as @unkcpz mentioned). My question was more whether closing the loop is the right thing to do since I vaguely remembered some discussion on a related question on slack - if it is, then please go ahead with this change! As for releasing this, it seems to me this warrants a bug fix release. |
I understand that that was what you were asking, and I realize that @unkcpz checked that this fixes a file leak, but not specifically the one that you experienced. There could be more sources of leaks and without repeating the test that showed the initial leak, that is impossible to say if this PR fixes it.
I agree, and if we confirm that we see no more leaks in normal usage, I will make a patch release. |
aiida/engine/processes/functions.py
Outdated
@@ -156,6 +156,7 @@ def kill_process(_num, _frame): | |||
if original_handler: | |||
signal.signal(signal.SIGINT, original_handler) | |||
runner.close() | |||
runner.loop.close() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've had a chance to look through this now and here are my thoughts.
- Yes, it is expected that
close()
frees file descriptors, there's even a parameter to explicitly free all those created from the scope of the loop by client code (https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.close). - I agree that the
Runner
should take responsibility for cleaning it up if it created it. Personally, I would find that less confusing. Just drop a flag intoRunner.__init__
that keeps track.
One thing that will have to be born in mind is that this will have to change once @unkcpz 's work gets merged as there will only be one loop (used in a re-entrant manner). For this, I suggest we keep track of how many files are open before and after the run_get_node
execution so at least we know if it's leaking.
To summarize, all that remains to be done before we merge this: close the loop if and only if the loop was created by the runner itself in its constructor. If it was passed in as an argument, we do not claim ownership and will not close it. |
Sorry for didn't read Martin's reply carefully. I thought we were waiting for asyncio version to fix the issue. |
@unkcpz implementation looks good to me. The idea is to release this as a hotfix in |
If the loop is not closed, the file handles it managed during its lifetime might not be properly cleaned up, leading to file leaks. Therefore, if the loop is created by the `Runner` upon construction, it should also close it when the runner closes. It cannot be done for loops passed into the constructor, because they migth actually still be in use by other parts of the code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @unkcpz
A couple of files were opened and not closed Maybe this was partially related to aiidateam#4307 that closed as a "tentative" fix of aiidateam#2386, but maybe a few more open files were remaining? Also, I'm adding a return status to most of the verdi daemon commands, so tests will catch possible errors (they weren't being catched in my own setup). In addition, I'm also silencing some additional warnings that should be silenced (the main point of this PR). Note that while unmuting in general ResourceWarnings was good to spot some issues (see bug described above), there are a couple more that now issue it and are not obvious to fix (typically related to the daemon starting some process in the background - or being started itself - and not being closed before the test actually finished). I think this is an acceptable compromise - maybe we'll figure out how to selectively silence those, and keeping warnings visible will help us figure out possible leaks in the future.
Fixes #2386 and fixes #2895
This is the tentative fix of issue #2386. See the issue for more information.