Data store memory leak fixes #3484
Conversation
So with the complex suite:
[memory usage screenshots elided: the scheduler a few minutes in, and the UIS a few minutes in]
Great! How are you producing these dumps? They look really useful.
Here it is all condensed (pretty manual route):
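A minimal sketch of that kind of condensed dump, sizing the top-level attributes of a live object with pympler; the `dump_sizes` name and the hook point are assumptions, not the original script:

```python
# Rough sketch: report the deep size of each attribute of a live object.
# Requires pympler (pip install pympler). "obj" would be, e.g., the running
# scheduler instance; the function name and hook point are hypothetical.
from pympler import asizeof


def dump_sizes(obj, top=20):
    """Print the largest attributes of obj by deep (recursive) size."""
    sizes = {
        name: asizeof.asizeof(value)
        for name, value in vars(obj).items()
    }
    for name, size in sorted(
            sizes.items(), key=lambda kv: kv[1], reverse=True)[:top]:
        print(f"{size / 1024:10.1f} KiB  {name}")


# e.g. called periodically from inside the scheduler's main loop:
# dump_sizes(self)
```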
Useful indeed. Might come in handy when troubleshooting. Perhaps include it in #113 or in this issue.
Not a great way of showing evolution over time, but I just knocked it up and put it in the main loop (it just outputs to the busy suite log as well).
Thanks! It might actually be good to get this into the code base for future debugging, e.g. so Cylc Flow has on-board profiling.
Code looks good. I ran mprof but got similar figures, not a saw-tooth pattern (memory was still going up after 5 minutes), with this suite:
```
[cylc]
    [[parameters]]
        m = 0..20
        n = 0..20
[scheduling]
    initial cycle point = 3000
    max active cycle points = 10  # <----
    [[queues]]
        [[[default]]]
            limit = 20
    [[graph]]
        P1Y = """foo[-P1Y] => foo => bar<m> => qux<m=1,n> => waz
                 qux<m-1,n> => qux<m,n>"""
```
Will have another look either this weekend or next Tuesday, after the holidays, and check more calmly 👍
@kinow Not sure what you mean? This only addresses memory leaks in the data store (so it won't solve the 400 MB thing). I guess the above figures are misleading; the saw-tooth pattern won't appear if:
Sorry @dwsutherland. I was using a test from #3474, where @hjoliver was able to show a constant increase of memory over time in a suite (the same one I copied above). On my notebook I ran the same suite from both branches. Next week I will leave it running for some 20-30 minutes at NIWA and see what happens 👍
So, this PR is to fix the data store's contribution with respect to "slowly increasing without the memory going down". It doesn't claim to fix that in general; there may be other things contributing to the problem. I've shown above that the job pool is now getting pruned like everything else, so if the data store (not the application as a whole) is still not leveling out, then either the suite is designed to take a long time to remove items from the pool, or there is still something wrong in the data-store management (happy to accept there may be!). I'll test it with your suite above.
Probably no need to test it @dwsutherland, I've just finished reviewing and am sending the approval with some comments 👍
Initially I was getting similar numbers with master and with this branch. Then I realized it was the weekend and I was using the wrong branch. @hjoliver, I compared master against the SOD branch merged with this one.

First, the master branch for hmem1, the suite @hjoliver posted on that PR (this is where the discussion about memory leaks in the data store started): [graph elided]

Then the Spawn-On-Demand branch merged with this branch (no conflicts 🎉), same suite: [graph elided]

If you compare master with this branch, as I was doing in my previous comments, the graphs look pretty similar.
Also tested with objgraph + tracemalloc, and found nothing wrong. Just 3 executions of the main loop in scheduler.py. For each run it lists the top 12 objects in objgraph, how much each type increased since the last run, and then the 10 lines where the most memory was allocated.
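A minimal sketch of that kind of instrumentation, assuming objgraph is installed; the hook point in the scheduler's main loop and the limits are illustrative, not the exact script used:

```python
# Sketch: per-iteration memory diagnostics combining objgraph and tracemalloc.
import tracemalloc

import objgraph

tracemalloc.start()


def profile_iteration():
    """Call once per main-loop iteration (hook point is hypothetical)."""
    objgraph.show_most_common_types(limit=12)  # top 12 object types by count
    objgraph.show_growth(limit=12)             # growth since the previous call
    snapshot = tracemalloc.take_snapshot()
    for stat in snapshot.statistics('lineno')[:10]:
        print(stat)                            # top 10 allocating source lines
```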
run 1
list 17099
dict 10499
function 9616
OrderedDictWithDefaults 6097
tuple 5049
weakref 2171
getset_descriptor 1431
wrapper_descriptor 1384
set 1308
method_descriptor 1241
type 1232
builtin_function_or_method 1217
list 17099 +17099
function 9616 +9616
dict 9562 +9562
OrderedDictWithDefaults 6097 +6097
tuple 5008 +5008
weakref 2170 +2170
getset_descriptor 1429 +1429
wrapper_descriptor 1384 +1384
set 1308 +1308
method_descriptor 1241 +1241
type 1231 +1231
builtin_function_or_method 1217 +1217
Top 10 lines
#1: flow/data_store_mgr.py:150: 307.1 KiB
#2: flow/task_proxy.py:324: 261.2 KiB
#3: flow/task_pool.py:496: 39.8 KiB
#4: flow/state_summary_mgr.py:159: 32.7 KiB
#5: <frozen importlib._bootstrap_external>:525: 31.3 KiB
#6: python3.7/sre_compile.py:783: 29.8 KiB
#7: flow/task_id.py:46: 27.4 KiB
#8: site-packages/objgraph.py:311: 23.3 KiB
#9: flow/task_proxy.py:341: 21.7 KiB
#10: flow/state_summary_mgr.py:133: 18.7 KiB
452 other: 422.9 KiB
Total allocated size: 1215.8 KiB
run 2
list 17082
function 9616
dict 9587
OrderedDictWithDefaults 6097
tuple 5503
weakref 2171
getset_descriptor 1429
wrapper_descriptor 1384
set 1309
method_descriptor 1241
type 1231
builtin_function_or_method 1220
tuple 5033 +25
dict 9566 +4
builtin_function_or_method 1220 +3
ISO8601Point 504 +3
weakref 2171 +1
set 1309 +1
Snapshot 1 +1
_Traces 1 +1
Top 10 lines
#1: flow/data_store_mgr.py:150: 610.3 KiB
#2: flow/task_proxy.py:324: 261.1 KiB
#3: flow/data_store_mgr.py:676: 106.7 KiB
#4: flow/data_store_mgr.py:670: 102.0 KiB
#5: flow/task_id.py:46: 54.8 KiB
#6: flow/data_store_mgr.py:639: 41.4 KiB
#7: flow/task_pool.py:496: 39.9 KiB
#8: flow/data_store_mgr.py:473: 36.2 KiB
#9: flow/data_store_mgr.py:459: 36.0 KiB
#10: flow/state_summary_mgr.py:159: 33.8 KiB
528 other: 670.7 KiB
Total allocated size: 1992.9 KiB
run 3
list 17342
dict 9682
function 9616
OrderedDictWithDefaults 6097
tuple 5565
weakref 2171
getset_descriptor 1429
wrapper_descriptor 1384
set 1329
method_descriptor 1241
type 1231
builtin_function_or_method 1220
list 17342 +243
dict 9667 +101
TaskActionTimer 100 +40
set 1329 +20
ISO8601Point 524 +20
TaskProxy 504 +20
TaskState 504 +20
TaskOutputs 504 +20
OrderedDict 335 +1
2020-01-25T21:38:20+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
2020-01-25T21:38:20+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
2020-01-25T21:38:21+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
2020-01-25T21:38:21+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
Top 10 lines
#1: flow/data_store_mgr.py:150: 614.2 KiB
#2: flow/task_proxy.py:324: 261.3 KiB
#3: flow/task_id.py:46: 54.8 KiB
#4: flow/task_proxy.py:336: 43.1 KiB
#5: flow/task_pool.py:496: 39.2 KiB
#6: flow/data_store_mgr.py:459: 36.0 KiB
#7: flow/state_summary_mgr.py:159: 34.2 KiB
#8: <frozen importlib._bootstrap_external>:525: 31.3 KiB
#9: python3.7/sre_compile.py:783: 30.6 KiB
#10: flow/task_proxy.py:228: 25.0 KiB
540 other: 546.4 KiB
Total allocated size: 1716.1 KiB
Not a lot of new objects created/retained, looking good 👍 (though I still need to get a better feeling for how Cylc workflows impact memory with every run, how it varies per suite, etc., but at least it looks way better!)
Everything looking good. Great work @dwsutherland !
BTW, the two top lines allocating more memory (not related to this change, just FWIW):
@kinow Wow, thanks, very thorough! BTW here's a breakdown of the memory usage, a few minutes in and again 12 min later: [object size dumps elided]

You can see the reason the data store manager comes out the same size as the scheduler. I'm still puzzled as to why the Cylc application as a whole uses so much memory... maybe all the imports do add up. The running objects appear not to contribute an appreciable amount. Those top two allocation lines make sense, as all deltas run through them.
Done. However, for the sake of being thread safe it probably makes sense to work on a copy and do an atomic update (although the "work on" part is probably just a collection of atomic operations).
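A generic sketch of that pattern, building the new state on a copy and publishing it with a single reference assignment; the class and names are illustrative, not the actual data store code:

```python
# Sketch: update by working on a copy and publishing it with one atomic
# reference rebind, so readers see either the old or the new mapping,
# never a half-merged one.
import copy


class Store:
    def __init__(self):
        self.data = {}  # readers only ever look up store.data[key]

    def apply_deltas(self, updated, pruned):
        new_data = copy.copy(self.data)  # shallow copy of the mapping
        new_data.update(updated)         # merge added/updated elements
        for key in pruned:
            new_data.pop(key, None)      # drop pruned elements
        self.data = new_data             # single-name rebind: atomic in CPython
```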
BTW, the coverage failure is mostly from a nomenclature fix in an area that is not tested and hard to test.
Usually the bulk of the memory of a suite is in the config and task_pool objects. Part of the difference in size between the "outer" measurement (as recorded from outside the process) and what you see when profiling the objects inside it is shared memory. Once you've filtered out shared memory, the remainder of the difference should (I think) be Python itself and the code you are running. I haven't done much Python 3 profiling, but I did a bit of Python 2.6 work with Cylc and found that a hell of a lot of memory was going into code objects, which suggested that perhaps the size of the Cylc codebase (and its dependency graph) could be to blame; if so, a more modular architecture might help. More investigation needed...
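For reference, one rough way to get a feel for how much memory sits in code objects; a hedged sketch, not the measurement originally used, and sys.getsizeof is shallow so it undercounts:

```python
# Sketch: estimate memory held in compiled code objects by walking all
# tracked functions. Shallow sizes only (ignores co_consts, co_names, ...),
# and misses code objects not attached to a function, so treat it as a
# lower bound / rough indicator of codebase-size overhead.
import gc
import sys
import types

code_objs = {
    obj.__code__
    for obj in gc.get_objects()
    if isinstance(obj, types.FunctionType)
}
total = sum(sys.getsizeof(code) for code in code_objs)
print(f"{len(code_objs)} code objects, ~{total / 1024:.1f} KiB (shallow)")
```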
Code looks good to me, awaiting test results...
@oliver-sanders can probably merge this? Two approvals, and the coverage failure is justified (nomenclature change in unrelated, uncovered logic).
Jan 2013 - start of the infamously frustrating test battery!
Weird, I thought I had linked an issue about logging and monitoring. Have no idea where I got this number 😅
These changes partially address cylc/cylc-uiserver#112
The jobs topic of the data store wasn't getting pruned, so job elements were piling up. This is now fixed, and a couple of other delta-merge field accumulations are fixed as well.
Requirements check-list:
- I have read CONTRIBUTING.md and added my name as a Code Contributor.