
explicitly close runner loop of function process #4307

Merged: 1 commit merged into release/1.3.1 on Aug 27, 2020

Conversation

@unkcpz (Member) commented Aug 15, 2020

Fixes #2386 and fixes #2895

This is a tentative fix for issue #2386; see the issue for more information.

@codecov (bot) commented Aug 15, 2020

Codecov Report

Merging #4307 into release/1.3.1 will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@                Coverage Diff                @@
##           release/1.3.1    #4307      +/-   ##
=================================================
+ Coverage          79.10%   79.12%   +0.02%     
=================================================
  Files                467      467              
  Lines              34492    34498       +6     
=================================================
+ Hits               27283    27292       +9     
+ Misses              7209     7206       -3     
Flag         Coverage Δ
#django      71.04% <100.00%> (+0.02%) ⬆️
#sqlalchemy  71.91% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                      Coverage Δ
aiida/engine/runners.py             93.38% <100.00%> (+0.28%) ⬆️
aiida/transports/plugins/local.py   80.47% <0.00%> (+0.26%) ⬆️
aiida/engine/daemon/client.py       73.72% <0.00%> (+1.15%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ltalirz ltalirz requested review from sphuber and muhrin August 15, 2020 12:13
@ltalirz (Member) commented Aug 15, 2020

Thanks @unkcpz!
Can the others confirm that this is the proper way to address #2386?

@sphuber (Contributor) commented Aug 15, 2020

> Can the others confirm that this is the proper way to address #2386?

Well, this could certainly be one of the potential file leaks. However, it is impossible to say whether it was the only one. Could you or Daniele check out this branch (or apply the fix) and run the test again, to see whether the file handles now remain more or less constant?

@@ -156,6 +156,7 @@ def kill_process(_num, _frame):
if original_handler:
signal.signal(signal.SIGINT, original_handler)
runner.close()
runner.loop.close()
@sphuber (Contributor) commented on the diff:

Looking at the `close` method of the `Runner`, I see that it calls `stop`, which in turn calls `stop` on its loop: `self._loop.stop()`. However, apparently it does not call `close` on the loop. Is it known that closing a loop is necessary to free all resources, including open file handles?

The question arises why `Runner.close` should not also call `loop.close()`. My guess is that the loop could be shared, since it can be passed into the constructor of the `Runner`. Would it make sense, though, to have `Runner.close()` also close the loop if and only if the loop was constructed by the runner itself? @muhrin, what was the original design here? Would the difference in behavior I propose end up being more confusing?
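A minimal sketch of the ownership flag being discussed here; this is a simplified stand-in, not aiida-core's actual `Runner`, and the attribute name `_owns_loop` is illustrative:

```python
from tornado import ioloop


class Runner:
    """Simplified stand-in for aiida.engine.runners.Runner (illustrative only)."""

    def __init__(self, loop=None):
        # Remember whether this runner constructed the loop itself;
        # only in that case does it claim ownership.
        self._owns_loop = loop is None
        self._loop = loop if loop is not None else ioloop.IOLoop()

    def close(self):
        # Stop the loop, as the current implementation already does ...
        self._loop.stop()
        # ... but also close it, if and only if this runner created it.
        # A loop passed in by the caller may still be in use elsewhere.
        if self._owns_loop:
            self._loop.close()
```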

@muhrin (Contributor) replied:

I've had a chance to look through this now, and here are my thoughts.

  • Yes, it is expected that close() frees file descriptors; there is even a parameter to explicitly free all descriptors created by client code within the scope of the loop (https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.close).
  • I agree that the Runner should take responsibility for cleaning up the loop if it created it. Personally, I would find that less confusing. Just drop a flag into Runner.__init__ that keeps track.

One thing that will have to be borne in mind is that this will have to change once @unkcpz's work gets merged, as there will then be only one loop (used in a re-entrant manner). For that case, I suggest we keep track of how many files are open before and after the run_get_node execution, so at least we know if it is leaking.
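A rough sketch of that bookkeeping, assuming a Linux system where open descriptors are listed under /proc/self/fd; `MyWorkChain` is a placeholder, not a real class:

```python
import os

from aiida.engine import run_get_node


def count_open_fds():
    """Count the file descriptors currently open in this process (Linux only)."""
    return len(os.listdir('/proc/self/fd'))


fds_before = count_open_fds()
result, node = run_get_node(MyWorkChain)  # MyWorkChain is a placeholder process class
fds_after = count_open_fds()
# If the count keeps growing across repeated runs, something is leaking.
print('open file descriptors: {} -> {}'.format(fds_before, fds_after))
```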

@ltalirz (Member) commented Aug 15, 2020

> Could you or Daniele check out this branch (or apply the fix) and run the test again, to see whether the file handles now remain more or less constant?

Sure, I can confirm that the basic file descriptor leak when running WorkChains is fixed (as @unkcpz mentioned).

My question was more whether closing the loop is the right thing to do, since I vaguely remembered some discussion of a related question on Slack. If it is, then please go ahead with this change!

As for releasing this, it seems to me this warrants a bugfix release.

@sphuber (Contributor) commented Aug 16, 2020

> (as @unkcpz mentioned).

> My question was more whether closing the loop is the right thing to do, since I vaguely remembered some discussion of a related question on Slack.

I understand that that was what you were asking, and I realize that @unkcpz checked that this fixes a file leak, but not specifically the one that you experienced. There could be more sources of leaks, and without repeating the test that showed the initial leak, it is impossible to say whether this PR fixes it.

> As for releasing this, it seems to me this warrants a bugfix release.

I agree, and if we confirm that we see no more leaks in normal usage, I will make a patch release.

@muhrin previously requested changes Aug 17, 2020

@ltalirz (Member) commented Aug 24, 2020

@unkcpz Is it clear to you what @muhrin is asking? Feel free to ask for clarification if not.

It would be great to get this PR merged, since it fixes a major issue in 1.3.0, and we should probably make a separate patch release before moving to asyncio.

@sphuber (Contributor) commented Aug 24, 2020

To summarize, all that remains to be done before we merge this: close the loop if and only if the loop was created by the runner itself in its constructor. If it was passed in as an argument, we do not claim ownership and will not close it.
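Using the `Runner` sketch from earlier in the thread, this ownership rule would play out as follows (illustrative, not aiida-core's actual API surface):

```python
from tornado import ioloop

# The runner constructs its own loop, so it owns it:
# close() stops the loop and also closes it.
runner = Runner()
runner.close()

# The loop is supplied by the caller, so the runner does not own it:
# close() stops the loop but leaves it open for other users.
shared_loop = ioloop.IOLoop.current()
runner = Runner(loop=shared_loop)
runner.close()
```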

@unkcpz (Member, Author) commented Aug 25, 2020

Sorry for not reading Martin's reply carefully; I thought we were waiting for the asyncio version to fix the issue.
I have now pushed the changes. Is this the expected implementation?

@unkcpz unkcpz requested review from sphuber and muhrin August 25, 2020 11:39
@sphuber (Contributor) commented Aug 27, 2020

@unkcpz the implementation looks good to me. The idea is to release this as a hotfix in v1.3.1. I am just thinking about how to approach this, since the current develop branch also contains quite a few new features that should really go into v1.4.0; however, we cannot release that right now, as we are waiting for some additional features to be finalized. The tentative timeline for v1.4.0 is two to three weeks.

@sphuber sphuber changed the base branch from develop to release/1.3.1 August 27, 2020 15:51
If the loop is not closed, the file handles it managed during its
lifetime might not be properly cleaned up, leading to file leaks.
Therefore, if the loop was created by the `Runner` upon construction,
the runner should also close it when it is itself closed. This cannot
be done for loops passed into the constructor, because they might
actually still be in use by other parts of the code.
@sphuber sphuber dismissed muhrin’s stale review August 27, 2020 18:31

Request addressed

@sphuber (Contributor) left a comment:

Thanks @unkcpz

@sphuber sphuber merged commit f9f6c9c into aiidateam:release/1.3.1 Aug 27, 2020
giovannipizzi added a commit to ltalirz/aiida-core that referenced this pull request Feb 9, 2021
A couple of files were opened and not closed. Maybe this was partially
related to aiidateam#4307, which was merged as a "tentative" fix of
aiidateam#2386, but perhaps a few more open files remained?

Also, I'm adding a return status to most of the verdi daemon commands,
so tests will catch possible errors (they weren't being caught in my
own setup).

In addition, I'm also silencing some additional warnings that should be
silenced (the main point of this PR). Note that while unmuting
ResourceWarnings in general was good for spotting some issues (see the
bug described above), there are a couple more places that now issue
them and are not obvious to fix (typically related to the daemon
starting some process in the background, or being started itself, and
not being closed before the test actually finishes). I think this is an
acceptable compromise: maybe we'll figure out how to selectively
silence those, and keeping the warnings visible will help us spot
possible leaks in the future.