
explicitly close runner loop of function process #4307

Merged: 1 commit merged into release/1.3.1 on Aug 27, 2020

Conversation

@unkcpz (Member) commented Aug 15, 2020

Fixes #2386 and fixes #2895

This is a tentative fix for issue #2386; see the issue for more information.

@codecov (bot) commented Aug 15, 2020

Codecov Report

Merging #4307 into release/1.3.1 will increase coverage by 0.02%.
The diff coverage is 100.00%.


@@                Coverage Diff                @@
##           release/1.3.1    #4307      +/-   ##
=================================================
+ Coverage          79.10%   79.12%   +0.02%     
=================================================
  Files                467      467              
  Lines              34492    34498       +6     
=================================================
+ Hits               27283    27292       +9     
+ Misses              7209     7206       -3     
Flag         Coverage Δ
#django      71.04% <100.00%> (+0.02%) ⬆️
#sqlalchemy  71.91% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown.

Impacted Files                      Coverage Δ
aiida/engine/runners.py             93.38% <100.00%> (+0.28%) ⬆️
aiida/transports/plugins/local.py   80.47% <0.00%> (+0.26%) ⬆️
aiida/engine/daemon/client.py       73.72% <0.00%> (+1.15%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@ltalirz ltalirz requested review from sphuber and muhrin August 15, 2020 12:13
@ltalirz (Member) commented Aug 15, 2020

Thanks @unkcpz!
Can the others confirm that this is the proper way to address #2386?

@sphuber (Contributor) commented Aug 15, 2020

> Can the others confirm that this is the proper way to address #2386?

Well, this could certainly be one of the potential file leaks. However, it is impossible to say whether it was the only one. Could you or Daniele check out this branch (or apply the fix) and run the test again, to see whether the file handles now remain more or less constant?

@@ -156,6 +156,7 @@ def kill_process(_num, _frame):
if original_handler:
signal.signal(signal.SIGINT, original_handler)
runner.close()
runner.loop.close()
@sphuber (Contributor) commented on the diff:

Looking at the `close` method of the `Runner`, I see that it calls `stop`, which in turn calls `stop` on its loop: `self._loop.stop()`. However, apparently it does not call `close` on the loop. Is it known that closing a loop is necessary to free all resources, including open file handles?

The question arises why `Runner.close` should not also call `loop.close()`. My guess is that the loop could be shared, since it can be passed into the constructor of the `Runner`. Would it make sense, though, to have `Runner.close()` also close the loop if and only if the loop was constructed by the runner itself? @muhrin, what was the original design here? Would the difference in behavior I propose end up being more confusing?
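A minimal sketch of the ownership flag being discussed here; this is a simplified stand-in, not aiida-core's actual `Runner`, and the attribute name `_owns_loop` is illustrative:

```python
from tornado import ioloop


class Runner:
    """Simplified stand-in for aiida.engine.runners.Runner (illustrative only)."""

    def __init__(self, loop=None):
        # Remember whether this runner constructed the loop itself;
        # only in that case does it claim ownership.
        self._owns_loop = loop is None
        self._loop = loop if loop is not None else ioloop.IOLoop()

    def close(self):
        # Stop the loop, as the current implementation already does ...
        self._loop.stop()
        # ... but also close it, if and only if this runner created it.
        # A loop passed in by the caller may still be in use elsewhere.
        if self._owns_loop:
            self._loop.close()
```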

@muhrin (Contributor) replied:

I've had a chance to look through this now, and here are my thoughts.

  • Yes, it is expected that close() frees file descriptors; there is even a parameter to explicitly free all descriptors created by client code within the scope of the loop (https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop.close).
  • I agree that the Runner should take responsibility for cleaning up the loop if it created it. Personally, I would find that less confusing. Just drop a flag into Runner.__init__ that keeps track.

One thing that will have to be borne in mind is that this will have to change once @unkcpz's work gets merged, as there will then be only one loop (used in a re-entrant manner). For that case, I suggest we keep track of how many files are open before and after the run_get_node execution, so at least we know if it is leaking.
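A rough sketch of that bookkeeping, assuming a Linux system where open descriptors are listed under /proc/self/fd; `MyWorkChain` is a placeholder, not a real class:

```python
import os

from aiida.engine import run_get_node


def count_open_fds():
    """Count the file descriptors currently open in this process (Linux only)."""
    return len(os.listdir('/proc/self/fd'))


fds_before = count_open_fds()
result, node = run_get_node(MyWorkChain)  # MyWorkChain is a placeholder process class
fds_after = count_open_fds()
# If the count keeps growing across repeated runs, something is leaking.
print('open file descriptors: {} -> {}'.format(fds_before, fds_after))
```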

@ltalirz (Member) commented Aug 15, 2020

> Could you or Daniele check out this branch (or apply the fix) and run the test again, to see whether the file handles now remain more or less constant?

Sure, I can confirm that the basic file descriptor leak when running WorkChains is fixed (as @unkcpz mentioned).

My question was more whether closing the loop is the right thing to do, since I vaguely remembered some discussion of a related question on Slack. If it is, then please go ahead with this change!

As for releasing this, it seems to me this warrants a bugfix release.

@sphuber (Contributor) commented Aug 16, 2020

> (as @unkcpz mentioned).

> My question was more whether closing the loop is the right thing to do, since I vaguely remembered some discussion of a related question on Slack.

I understand that that was what you were asking, and I realize that @unkcpz checked that this fixes a file leak, but not specifically the one that you experienced. There could be more sources of leaks, and without repeating the test that showed the initial leak, it is impossible to say whether this PR fixes it.

> As for releasing this, it seems to me this warrants a bugfix release.

I agree, and if we confirm that we see no more leaks in normal usage, I will make a patch release.

@muhrin previously requested changes Aug 17, 2020

@ltalirz (Member) commented Aug 24, 2020

@unkcpz Is it clear to you what @muhrin is asking? Feel free to ask for clarification if not.

It would be great to get this PR merged, since it fixes a major issue in 1.3.0, and we should probably make a separate patch release before moving to asyncio.

@sphuber (Contributor) commented Aug 24, 2020

To summarize, all that remains to be done before we merge this: close the loop if and only if the loop was created by the runner itself in its constructor. If it was passed in as an argument, we do not claim ownership and will not close it.
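Using the `Runner` sketch from earlier in the thread, this ownership rule would play out as follows (illustrative, not aiida-core's actual API surface):

```python
from tornado import ioloop

# The runner constructs its own loop, so it owns it:
# close() stops the loop and also closes it.
runner = Runner()
runner.close()

# The loop is supplied by the caller, so the runner does not own it:
# close() stops the loop but leaves it open for other users.
shared_loop = ioloop.IOLoop.current()
runner = Runner(loop=shared_loop)
runner.close()
```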

@unkcpz (Member, Author) commented Aug 25, 2020

Sorry for not reading Martin's reply carefully; I thought we were waiting for the asyncio version to fix the issue.
I have now pushed the changes. Is this the expected implementation?

@unkcpz unkcpz requested review from sphuber and muhrin August 25, 2020 11:39
@sphuber (Contributor) commented Aug 27, 2020

@unkcpz the implementation looks good to me. The idea is to release this as a hotfix in v1.3.1. I am just thinking about how to approach this, since the current develop branch also contains quite a few new features that should really go into v1.4.0; however, we cannot release that right now, as we are waiting for some additional features to be finalized. The tentative timeline for v1.4.0 is two to three weeks.

@sphuber sphuber changed the base branch from develop to release/1.3.1 August 27, 2020 15:51
If the loop is not closed, the file handles it managed during its
lifetime might not be properly cleaned up, leading to file leaks.
Therefore, if the loop was created by the `Runner` upon construction,
the runner should also close it when it is itself closed. This cannot
be done for loops passed into the constructor, because they might
actually still be in use by other parts of the code.
@sphuber sphuber dismissed muhrin’s stale review August 27, 2020 18:31

Request addressed

@sphuber (Contributor) left a comment:

Thanks @unkcpz

@sphuber sphuber merged commit f9f6c9c into aiidateam:release/1.3.1 Aug 27, 2020
giovannipizzi added a commit to ltalirz/aiida-core that referenced this pull request Feb 9, 2021
A couple of files were opened and not closed. Maybe this was partially
related to aiidateam#4307, which was merged as a "tentative" fix of
aiidateam#2386, but perhaps a few more open files remained?

Also, I'm adding a return status to most of the verdi daemon commands,
so tests will catch possible errors (they weren't being caught in my
own setup).

In addition, I'm also silencing some additional warnings that should be
silenced (the main point of this PR). Note that while unmuting
ResourceWarnings in general was good for spotting some issues (see the
bug described above), there are a couple more places that now issue
them and are not obvious to fix (typically related to the daemon
starting some process in the background, or being started itself, and
not being closed before the test actually finishes). I think this is an
acceptable compromise: maybe we'll figure out how to selectively
silence those, and keeping the warnings visible will help us spot
possible leaks in the future.