
Data store memory leak fixes #3484

Merged
merged 2 commits into cylc:master on Jan 28, 2020

Conversation

dwsutherland
Member

These changes partially address cylc/cylc-uiserver#112

The jobs topic of the data store wasn't getting pruned, so the job elements were piling up:

data-store:   44.45 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 58,   11.74 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   44.62 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 59,   11.91 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   45.44 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 64,   12.73 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   45.60 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 65,   12.89 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   46.42 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 70,   13.71 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   46.59 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 71,   13.88 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   47.41 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 76,   14.70 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B

This is now fixed, along with a couple of other delta-merge field accumulations:

data-store:   39.70 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 29,    6.98 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   39.53 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 28,    6.82 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   39.70 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 29,    6.98 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   39.53 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 28,    6.82 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
data-store:   39.70 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 29,    6.98 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B|delta_times: 7,  960     B
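The shape of the fix, as a minimal sketch only (illustrative names, assuming job elements record the ID of their owner task proxy; this is not the literal data_store_mgr code):

def prune_jobs(data, pruned_task_proxy_ids):
    """Drop job elements whose owner task proxy was pruned (sketch)."""
    pruned = set(pruned_task_proxy_ids)
    stale = [
        j_id for j_id, job in data['jobs'].items()
        if job.task_proxy in pruned  # assumed owner-reference field
    ]
    for j_id in stale:
        del data['jobs'][j_id]

With that in place the jobs count tracks the task proxies instead of growing monotonically, as the second set of figures above shows.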

Requirements check-list

  • I have read CONTRIBUTING.md and added my name as a Code Contributor.
  • Contains logically grouped changes (else tidy your branch by rebase).
  • Does not contain off-topic changes (use other PRs for other changes).
  • Already covered by existing tests.
  • No change log entry required (Cylc8 not released).
  • No documentation update required.

@dwsutherland dwsutherland added the bug label on Jan 24, 2020
@dwsutherland dwsutherland added this to the cylc-8.0a2 milestone Jan 24, 2020
@dwsutherland dwsutherland self-assigned this Jan 24, 2020
@dwsutherland
Member Author

So with the complex suite:
WFS
startup

Scheduler:   19.14 MB
Task Pool:   16.35 MB
Job Pool:    6.70 KB
Server:   19.14 MB
Publisher:   14.20 KB
2020-01-24T02:11:59Z INFO - [client-command] identify sutherlander@cortex-vbox:cylc-uiserver
data-store mgr:   19.14 MB
data-store:  973.17 KB|edges: 1999,  537.09 KB|families: 94,   20.32 KB|family_proxies: 94,   21.75 KB|jobs: 8,    1.85 KB|tasks: 1009,  213.95 KB|task_proxies: 736,  177.05 KB|workflow: 1,   80     B
[ Top 20 ]
<frozen importlib._bootstrap_external>:525: size=14.5 MiB, count=138935, average=109 B
<frozen importlib._bootstrap>:219: size=1479 KiB, count=11334, average=134 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:55: size=1296 KiB, count=21585, average=61 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/data_store_mgr.py:149: size=960 KiB, count=7886, average=125 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/taskdef.py:71: size=848 KiB, count=3996, average=217 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:49: size=769 KiB, count=16036, average=49 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/task_outputs.py:72: size=722 KiB, count=11681, average=63 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/task_proxy.py:228: size=638 KiB, count=2017, average=324 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/task_proxy.py:324: size=563 KiB, count=1004, average=574 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/prerequisite.py:129: size=539 KiB, count=4570, average=121 B
/home/sutherlander/.envs/flow/lib/python3.7/linecache.py:137: size=512 KiB, count=5033, average=104 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:34: size=480 KiB, count=8822, average=56 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/numpy/core/overrides.py:175: size=472 KiB, count=4504, average=107 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/config.py:1540: size=454 KiB, count=2536, average=183 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/config.py:1676: size=431 KiB, count=6894, average=64 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/graph_parser.py:463: size=416 KiB, count=5021, average=85 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/config.py:1655: size=404 KiB, count=5170, average=80 B
/home/sutherlander/.envs/flow/lib/python3.7/abc.py:126: size=360 KiB, count=1666, average=221 B
/home/sutherlander/.envs/flow/lib/python3.7/sre_compile.py:783: size=335 KiB, count=933, average=367 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/util.py:266: size=327 KiB, count=2202, average=152 B

few minutes in

Scheduler:   17.37 MB
Task Pool:   15.44 MB
Job Pool:   10.24 KB
Server:   17.37 MB
Publisher:   14.20 KB
2020-01-24T02:14:06Z INFO - [client-command] identify sutherlander@cortex-vbox:cylc-uiserver
data-store mgr:   17.37 MB
data-store:  974.34 KB|edges: 1999,  537.09 KB|families: 94,   20.32 KB|family_proxies: 94,   21.75 KB|jobs: 13,    3.02 KB|tasks: 1009,  213.95 KB|task_proxies: 736,  177.05 KB|workflow: 1,   80     B
[ Top 20 ]
<frozen importlib._bootstrap_external>:525: size=14.5 MiB, count=139221, average=109 B
<frozen importlib._bootstrap>:219: size=1481 KiB, count=11338, average=134 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:55: size=1294 KiB, count=21553, average=61 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/data_store_mgr.py:149: size=961 KiB, count=7896, average=125 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/taskdef.py:71: size=848 KiB, count=3996, average=217 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:49: size=767 KiB, count=15998, average=49 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/task_outputs.py:72: size=726 KiB, count=11744, average=63 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/task_proxy.py:228: size=644 KiB, count=2026, average=325 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/task_proxy.py:324: size=566 KiB, count=1009, average=574 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/prerequisite.py:129: size=539 KiB, count=4570, average=121 B
/home/sutherlander/.envs/flow/lib/python3.7/linecache.py:137: size=512 KiB, count=5033, average=104 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:34: size=480 KiB, count=8822, average=56 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/numpy/core/overrides.py:175: size=472 KiB, count=4504, average=107 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/config.py:1540: size=454 KiB, count=2536, average=183 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/config.py:1676: size=431 KiB, count=6894, average=64 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/graph_parser.py:463: size=416 KiB, count=5021, average=85 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/config.py:1655: size=392 KiB, count=5019, average=80 B
/home/sutherlander/.envs/flow/lib/python3.7/abc.py:126: size=360 KiB, count=1666, average=221 B
/home/sutherlander/.envs/flow/lib/python3.7/sre_compile.py:783: size=335 KiB, count=934, average=367 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/util.py:266: size=327 KiB, count=2202, average=152 B

UIS
startup (no workflows)

UIS:    5.69 KB
WFS Mgr:    5.69 KB
data-store mgr:    5.69 KB
data-store:  240     B
[ Top 20 ]
<frozen importlib._bootstrap_external>:525: size=62.1 KiB, count=694, average=92 B
/usr/lib/python3.7/mimetypes.py:219: size=45.9 KiB, count=690, average=68 B
/usr/lib/python3.7/urllib/parse.py:607: size=42.4 KiB, count=708, average=61 B
/home/sutherlander/.envs/uiserver/lib/python3.7/abc.py:143: size=34.4 KiB, count=357, average=99 B
/usr/lib/python3.7/mimetypes.py:228: size=24.4 KiB, count=469, average=53 B
/usr/lib/python3.7/mimetypes.py:93: size=24.1 KiB, count=99, average=250 B
/usr/lib/python3.7/mimetypes.py:92: size=18.3 KiB, count=2, average=9376 B
/home/sutherlander/.envs/uiserver/lib/python3.7/sre_parse.py:426: size=14.9 KiB, count=18, average=848 B
/usr/lib/python3.7/mimetypes.py:95: size=13.8 KiB, count=428, average=33 B
/home/sutherlander/.envs/uiserver/lib/python3.7/sre_compile.py:783: size=13.0 KiB, count=38, average=350 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/pympler/asizeof.py:2075: size=10.7 KiB, count=19, average=576 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/language/visitor.py:188: size=9152 B, count=19, average=482 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:49: size=8648 B, count=144, average=60 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/tornado/ioloop.py:907: size=6848 B, count=37, average=185 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/language/parser.py:124: size=5752 B, count=84, average=68 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/config.py:93: size=5608 B, count=37, average=152 B
/usr/lib/python3.7/json/decoder.py:353: size=5495 B, count=70, average=78 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/tornado/httputil.py:88: size=5024 B, count=74, average=68 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/traitlets/traitlets.py:239: size=4864 B, count=76, average=64 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/validation/validation.py:35: size=4464 B, count=11, average=406 B

few minutes in

UIS: 1006.26 KB
WFS Mgr: 1006.26 KB
data-store mgr: 1006.26 KB
data-store:  974.93 KB|edges: 1999,  537.09 KB|families: 94,   20.32 KB|family_proxies: 94,   21.75 KB|jobs: 13,    3.02 KB|tasks: 1009,  213.95 KB|task_proxies: 736,  177.05 KB|workflow: 1,   80     B|delta_times: 7,  960     B
[ Top 20 ]
/usr/lib/python3.7/asyncio/base_events.py:405: size=9645 KiB, count=84005, average=118 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/execution/executors/asyncio.py:69: size=5411 KiB, count=43625, average=127 B
/home/sutherlander/.envs/uiserver/lib/python3.7/_weakrefset.py:84: size=5386 KiB, count=42730, average=129 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/network/schema.py:315: size=1209 KiB, count=19343, average=64 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/data_store_mgr.py:149: size=961 KiB, count=7896, average=125 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/promise/promise.py:433: size=639 KiB, count=2210, average=296 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/execution/executor.py:255: size=614 KiB, count=10470, average=60 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/promise/promise.py:798: size=587 KiB, count=8107, average=74 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/promise/promise.py:402: size=395 KiB, count=3667, average=110 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/execution/executors/asyncio.py:72: size=353 KiB, count=1, average=353 KiB
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/promise/promise_list.py:89: size=351 KiB, count=3680, average=98 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphene/types/resolver.py:2: size=303 KiB, count=4535, average=68 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/promise/promise.py:705: size=198 KiB, count=2805, average=72 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/execution/executor.py:241: size=197 KiB, count=1486, average=136 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/network/resolvers.py:202: size=184 KiB, count=335, average=563 B
/usr/lib/python3.7/typing.py:811: size=163 KiB, count=2979, average=56 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/execution/executor.py:529: size=126 KiB, count=1472, average=88 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/promise/promise.py:317: size=122 KiB, count=742, average=168 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/execution/executor.py:450: size=121 KiB, count=735, average=168 B
/home/sutherlander/.envs/uiserver/lib/python3.7/site-packages/graphql/execution/executor.py:250: size=99.4 KiB, count=1582, average=64 B

@dwsutherland dwsutherland force-pushed the data_store_leak_fixes branch from 6408417 to f522dc7 on January 24, 2020 02:55
@oliver-sanders
Member

Great!

How are you producing these dumps:

data-store:   44.45 KB|edges: 32,    8.23 KB|families: 9, ...

They look really useful.

@dwsutherland
Member Author

Great!

How are you producing these dumps:

data-store:   44.45 KB|edges: 32,    8.23 KB|families: 9, ...

They look really useful.

Here it is all condensed (pretty manual route):

from pympler import asizeof
from pympler.util import stringutils

import tracemalloc

# export PYTHONTRACEMALLOC=1
tracemalloc.start()

def print_topmemalloc(num):
    snapshot = tracemalloc.take_snapshot()
    top_stats = snapshot.statistics('lineno')
    print(f'[ Top {num} ]')
    for stat in top_stats[:num]:
        print(stat)


class MainOrScheduler:
    def data_stats(self):
        """Print a size breakdown of the data store and major scheduler objects."""
        all_data = stringutils.pp(asizeof.asizeof(self.data_store_mgr.data))
        breakdown = f'data-store: {all_data}'
        # Element count and size per topic (the workflow topic is a single element).
        for wfs in self.data_store_mgr.data:
            for key, val in self.data_store_mgr.data[wfs].items():
                val_size = stringutils.pp(asizeof.asizeof(val))
                if key != 'workflow':
                    key_count = len(val.keys())
                else:
                    key_count = 1
                breakdown = f'{breakdown}|{key}: {key_count}, {val_size}'
        print('Scheduler: %s' % stringutils.pp(asizeof.asizeof(self)))
        print('Task Pool: %s' % stringutils.pp(asizeof.asizeof(self.pool)))
        print('Job Pool: %s' % stringutils.pp(asizeof.asizeof(self.job_pool)))
        print('Server: %s' % stringutils.pp(asizeof.asizeof(self.server)))
        print('Publisher: %s' % stringutils.pp(asizeof.asizeof(self.publisher)))
        print('data-store mgr: %s' % stringutils.pp(asizeof.asizeof(self.data_store_mgr)))
        print(breakdown)
        print_topmemalloc(20)

@kinow
Member

kinow commented Jan 24, 2020

Useful indeed. Might come in handy when troubleshooting. Perhaps include it in #113 or in this issue.

@dwsutherland
Member Author

Useful indeed. Might come in handy when troubleshooting. Perhaps include it in #113

Not a great way of showing evolution over time, but I just knocked it up and put it in the main loop (it also just outputs to the busy suite log).
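For anyone wanting to reproduce it, a hypothetical wiring (illustrative names only; the real scheduler main loop differs):

class Scheduler(MainOrScheduler):
    async def main_loop(self):
        while not self.stop_requested:       # assumed shutdown flag
            await self.process_everything()  # stand-in for the real scheduler work
            self.data_stats()  # dump the memory breakdown to the suite log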

@oliver-sanders
Member

oliver-sanders commented Jan 24, 2020

Thanks! It might actually be good to get this into the code base for future debugging?

e.g. Cylc Flow has on-board profiling.

@kinow kinow left a comment
Member

Code looks good. I ran mprof but got similar figures, not a sawtooth pattern (memory was going up for 5 minutes), with this suite:

[cylc]
    [[parameters]]
        m = 0..20
        n = 0..20
[scheduling]
    initial cycle point = 3000
    max active cycle points = 10  # <----
    [[queues]]
        [[[default]]]
            limit = 20
    [[graph]]
        P1Y = """foo[-P1Y] => foo => bar<m> => qux<m=1,n> => waz
                 qux<m-1,n> => qux<m,n>"""

Will have another look either this weekend or Tuesday after the holidays, and check more calmly 👍

@dwsutherland
Member Author

Code looks good. I ran mprof but got similar figures, not a sawtooth pattern (memory was going up for 5 minutes), with this suite.

@kinow Not sure what you mean? This only addresses memory leaks in the data store (so it won't solve the 400MB thing). I guess the above figures are misleading because the tracemalloc instrumentation above doesn't appear to capture the application as a whole...

I guess the sawtooth pattern won't appear if:

  • New nodes/edges arrive at the same time as nodes/edges get pruned.
  • You're looking at the application memory use as a whole, because the data-store is minuscule in comparison.

@kinow
Member

kinow commented Jan 25, 2020

Sorry @dwsutherland. I was using a test from #3474, where @hjoliver was able to show the constant increase of memory over time in a suite (the same one I copied above).

On my notebook, running the same suite from both master and this branch produced the same pattern, slowly increasing without the memory going down. But I had little time to run it (~15 minutes for both runs).

Next week I will leave it running for some 20-30 minutes at NIWA and see what happens 👍

@dwsutherland
Member Author

Sorry @dwsutherland. I was using a test from #3474, where @hjoliver was able to show the constant increase of memory over time in a suite (the same one I copied above).

On my notebook, running the same suite from both master and this branch produced the same pattern, slowly increasing without the memory going down. But I had little time to run it (~15 minutes for both runs).

Next week I will leave it running for some 20-30 minutes at NIWA and see what happens 👍

So, this PR fixes the data store's contribution to "slowly increasing without the memory going down". It doesn't claim to fix "slowly increasing without the memory going down" in general; there may be other things contributing to the problem.

I've shown above that the job pool is now getting pruned like everything else, so if the data store (not the application as a whole) is still not levelling out, then either the suite is designed to take a long time to remove items from the pool, or there is still something wrong in the data store management (happy to accept there may be!).

I'll test it with your suite above.

@kinow
Member

kinow commented Jan 25, 2020

Probably no need to test it @dwsutherland; just finished reviewing and sending the approval with some comments 👍

@kinow kinow left a comment
Member

Initially I was getting similar numbers with master and with this branch. Then I realized it was the weekend and I was using the wrong branch. @hjoliver I compared master against the SOD branch merged with this one.

master branch for hmem1, the suite @hjoliver posted on that PR (this is where the discussion about memory leaks in the data store started):

[mprof graph: master branch]

Then, the Spawn-On-Demand branch merged with this branch (no conflicts 🎉) and the same suite:

[mprof graph: SOD branch merged with this branch]

If you compare master with this branch — as I was doing in my previous comments — the graph will look pretty similar.

Also tested with objgraph + tracemalloc and found nothing wrong, over just three executions of the main loop in scheduler.py. It lists the top 12 object types from objgraph, how much each count increased since the last run, and then the 10 lines where most memory was allocated.
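For reference, a sketch of that instrumentation (assumed placement in the main loop; the objgraph and tracemalloc calls are real APIs, the wrapper is illustrative):

import objgraph
import tracemalloc

tracemalloc.start()

def memory_report():
    # Top 12 most common live object types.
    objgraph.show_most_common_types(limit=12)
    # Growth in object counts since the previous call.
    objgraph.show_growth(limit=12)
    # Top 10 source lines by allocated memory.
    snapshot = tracemalloc.take_snapshot()
    print('Top 10 lines')
    for index, stat in enumerate(snapshot.statistics('lineno')[:10], 1):
        print(f'#{index}: {stat}')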

run 1

list                       17099
dict                       10499
function                   9616
OrderedDictWithDefaults    6097
tuple                      5049
weakref                    2171
getset_descriptor          1431
wrapper_descriptor         1384
set                        1308
method_descriptor          1241
type                       1232
builtin_function_or_method 1217
list                          17099    +17099
function                       9616     +9616
dict                           9562     +9562
OrderedDictWithDefaults        6097     +6097
tuple                          5008     +5008
weakref                        2170     +2170
getset_descriptor              1429     +1429
wrapper_descriptor             1384     +1384
set                            1308     +1308
method_descriptor              1241     +1241
type                           1231     +1231
builtin_function_or_method     1217     +1217
Top 10 lines
#1: flow/data_store_mgr.py:150: 307.1 KiB
#2: flow/task_proxy.py:324: 261.2 KiB
#3: flow/task_pool.py:496: 39.8 KiB
#4: flow/state_summary_mgr.py:159: 32.7 KiB
#5: <frozen importlib._bootstrap_external>:525: 31.3 KiB
#6: python3.7/sre_compile.py:783: 29.8 KiB
#7: flow/task_id.py:46: 27.4 KiB
#8: site-packages/objgraph.py:311: 23.3 KiB
#9: flow/task_proxy.py:341: 21.7 KiB
#10: flow/state_summary_mgr.py:133: 18.7 KiB
452 other: 422.9 KiB
Total allocated size: 1215.8 KiB

run 2

list                       17082
function                   9616
dict                       9587
OrderedDictWithDefaults    6097
tuple                      5503
weakref                    2171
getset_descriptor          1429
wrapper_descriptor         1384
set                        1309
method_descriptor          1241
type                       1231
builtin_function_or_method 1220
tuple                          5033       +25
dict                           9566        +4
builtin_function_or_method     1220        +3
ISO8601Point                    504        +3
weakref                        2171        +1
set                            1309        +1
Snapshot                          1        +1
_Traces                           1        +1
Top 10 lines
#1: flow/data_store_mgr.py:150: 610.3 KiB
#2: flow/task_proxy.py:324: 261.1 KiB
#3: flow/data_store_mgr.py:676: 106.7 KiB
#4: flow/data_store_mgr.py:670: 102.0 KiB
#5: flow/task_id.py:46: 54.8 KiB
#6: flow/data_store_mgr.py:639: 41.4 KiB
#7: flow/task_pool.py:496: 39.9 KiB
#8: flow/data_store_mgr.py:473: 36.2 KiB
#9: flow/data_store_mgr.py:459: 36.0 KiB
#10: flow/state_summary_mgr.py:159: 33.8 KiB
528 other: 670.7 KiB
Total allocated size: 1992.9 KiB

run 3

list                       17342
dict                       9682
function                   9616
OrderedDictWithDefaults    6097
tuple                      5565
weakref                    2171
getset_descriptor          1429
wrapper_descriptor         1384
set                        1329
method_descriptor          1241
type                       1231
builtin_function_or_method 1220
list               17342      +243
dict                9667      +101
TaskActionTimer      100       +40
set                 1329       +20
ISO8601Point         524       +20
TaskProxy            504       +20
TaskState            504       +20
TaskOutputs          504       +20
OrderedDict          335        +1
2020-01-25T21:38:20+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
2020-01-25T21:38:20+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
2020-01-25T21:38:21+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
2020-01-25T21:38:21+13:00 INFO - [client-command] put_messages kinow@ranma:cylc-message
Top 10 lines
#1: flow/data_store_mgr.py:150: 614.2 KiB
#2: flow/task_proxy.py:324: 261.3 KiB
#3: flow/task_id.py:46: 54.8 KiB
#4: flow/task_proxy.py:336: 43.1 KiB
#5: flow/task_pool.py:496: 39.2 KiB
#6: flow/data_store_mgr.py:459: 36.0 KiB
#7: flow/state_summary_mgr.py:159: 34.2 KiB
#8: <frozen importlib._bootstrap_external>:525: 31.3 KiB
#9: python3.7/sre_compile.py:783: 30.6 KiB
#10: flow/task_proxy.py:228: 25.0 KiB
540 other: 546.4 KiB
Total allocated size: 1716.1 KiB

Not a lot of new objects created/retained, looking good 👍 (though I still need to get a better feeling for how Cylc workflows impact memory with every run, how it varies per suite, etc., but at least it looks way better!)

Everything looking good. Great work @dwsutherland !

@kinow
Member

kinow commented Jan 25, 2020

BTW, the top two lines allocating the most memory (not related to this change, just FWIW):

@dwsutherland
Member Author

@kinow Wow, thanks, very thorough!

BTW, here's a breakdown of the data_store_mgr (for my same suite):
stable

2020-01-25T21:57:27+13:00 INFO - [qar.12341001T0101+13] -triggered off ['qar.12340901T0101+13', 'qar.12340901T0101+13']
Scheduler:  592.47 KB
Task Pool:  434.50 KB
Job Pool:   21.01 KB
Server:  592.47 KB
Publisher:   14.20 KB
data-store mgr:  592.47 KB
data-store mgr:  592.47 KB|ancestors: 1,    3.63 KB|cycle_states: 1,    6.29 KB|data: 1,   38.95 KB|deltas: 1,    1.32 KB|descendants: 1,    2.23 KB|edge_points: 1,    2.77 KB|max_point: 1,   56     B|min_point: 1,   56     B|parents: 1,    3.66 KB|pool_points: 1,  448     B|schd: 1,  592.47 KB|updates: 1,   13.04 KB|updates_pending: 1,   24     B|workflow_id: 1,   72     B
data-store:   38.95 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 28,    6.82 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B

12 min later

2020-01-25T22:09:38+13:00 INFO - [qar.12370501T0101+13] -health check settings: execution timeout=None
Scheduler:  630.09 KB
Task Pool:  474.38 KB
Job Pool:   21.01 KB
Server:  630.09 KB
Publisher:   14.20 KB
data-store mgr:  630.09 KB
data-store mgr:  630.09 KB|ancestors: 1,    3.63 KB|cycle_states: 1,    6.47 KB|data: 1,   38.95 KB|deltas: 1,    1.32 KB|descendants: 1,    2.23 KB|edge_points: 1,    2.77 KB|max_point: 1,   56     B|min_point: 1,   56     B|parents: 1,    3.66 KB|pool_points: 1,  448     B|schd: 1,  630.09 KB|updates: 1,    3.40 KB|updates_pending: 1,   24     B|workflow_id: 1,   72     B
data-store:   38.95 KB|edges: 32,    8.23 KB|families: 9,    1.70 KB|family_proxies: 45,   12.02 KB|jobs: 28,    6.82 KB|tasks: 8,    1.55 KB|task_proxies: 32,    7.48 KB|workflow: 1,   80     B

You can see that the data store manager reports the same size as the scheduler (630.09 KB) because of the circular reference data_store_mgr.schd.
(The per-slot breakdown was produced with the following, where breakdown2 is initialised like breakdown above:

        for slot in self.data_store_mgr.__slots__:
            val = getattr(self.data_store_mgr, slot)
            val_size = stringutils.pp(asizeof.asizeof(val))
            key_count = len(val)
            breakdown2 = f'{breakdown2}|{slot}: {key_count}, {val_size}'

)

I'm still puzzled as to why the Cylc application as a whole uses so much memory... maybe all the imports do add up:

[ Top 50 ]
<frozen importlib._bootstrap_external>:525: size=13.2 MiB, count=124033, average=112 B
<frozen importlib._bootstrap>:219: size=1868 KiB, count=14851, average=129 B
/home/sutherlander/.envs/flow/lib/python3.7/linecache.py:137: size=512 KiB, count=5033, average=104 B
/home/sutherlander/.envs/flow/lib/python3.7/abc.py:126: size=356 KiB, count=1647, average=222 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/numpy/core/overrides.py:181: size=264 KiB, count=3133, average=86 B
<frozen importlib._bootstrap_external>:59: size=234 KiB, count=1765, average=135 B
<__array_function__ internals>:2: size=186 KiB, count=960, average=199 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py:3349: size=142 KiB, count=41, average=3537 B
/home/sutherlander/.envs/flow/lib/python3.7/abc.py:127: size=128 KiB, count=1080, average=121 B
<frozen importlib._bootstrap_external>:1416: size=126 KiB, count=180, average=717 B
/home/sutherlander/.envs/flow/lib/python3.7/sre_compile.py:783: size=125 KiB, count=424, average=302 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/numpy/core/function_base.py:511: size=110 KiB, count=186, average=603 B
<frozen importlib._bootstrap_external>:1408: size=107 KiB, count=1756, average=63 B
/usr/lib/python3.7/email/feedparser.py:469: size=106 KiB, count=16, average=6810 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/numpy/ma/core.py:902: size=101 KiB, count=52, average=1994 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py:3439: size=99.7 KiB, count=31, average=3293 B
<frozen importlib._bootstrap>:371: size=91.4 KiB, count=1200, average=78 B
<frozen importlib._bootstrap_external>:606: size=85.1 KiB, count=1405, average=62 B
/home/sutherlander/.envs/flow/lib/python3.7/copy.py:289: size=82.8 KiB, count=355, average=239 B
/home/sutherlander/.envs/flow/lib/python3.7/site.py:170: size=81.7 KiB, count=360, average=232 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/numpy/core/function_base.py:508: size=74.9 KiB, count=63, average=1217 B
<frozen importlib._bootstrap_external>:916: size=71.2 KiB, count=767, average=95 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphene/utils/orderedtype.py:9: size=70.0 KiB, count=1273, average=56 B
/home/sutherlander/.envs/flow/lib/python3.7/collections/__init__.py:397: size=70.0 KiB, count=659, average=109 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py:1145: size=69.2 KiB, count=535, average=132 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphql/language/visitor_meta.py:77: size=67.1 KiB, count=307, average=224 B
/home/sutherlander/.envs/flow/lib/python3.7/functools.py:54: size=65.4 KiB, count=211, average=318 B
<frozen importlib._bootstrap>:36: size=65.0 KiB, count=837, average=79 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphene/types/mountedtype.py:19: size=64.4 KiB, count=932, average=71 B
/home/sutherlander/.envs/flow/lib/python3.7/collections/__init__.py:464: size=62.3 KiB, count=249, average=256 B
/home/sutherlander/cylc/cylc-flow/cylc/flow/parsec/OrderedDict.py:49: size=56.9 KiB, count=926, average=63 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py:1156: size=55.7 KiB, count=356, average=160 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphene/types/typemap.py:294: size=52.2 KiB, count=774, average=69 B
/home/sutherlander/.envs/flow/lib/python3.7/functools.py:60: size=48.9 KiB, count=567, average=88 B
/home/sutherlander/.envs/flow/lib/python3.7/enum.py:170: size=47.2 KiB, count=240, average=201 B
/home/sutherlander/.envs/flow/lib/python3.7/sre_parse.py:426: size=47.2 KiB, count=57, average=848 B
<frozen importlib._bootstrap>:420: size=47.0 KiB, count=750, average=64 B
<frozen importlib._bootstrap_external>:1352: size=46.9 KiB, count=857, average=56 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphene/types/argument.py:84: size=46.6 KiB, count=322, average=148 B
<frozen importlib._bootstrap_external>:887: size=46.4 KiB, count=1022, average=46 B
/home/sutherlander/.envs/flow/lib/python3.7/sre_parse.py:112: size=43.6 KiB, count=620, average=72 B
<frozen importlib._bootstrap>:971: size=40.2 KiB, count=130, average=317 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphene/types/typemap.py:303: size=40.1 KiB, count=399, average=103 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py:3581: size=39.1 KiB, count=16, average=2504 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py:2660: size=38.6 KiB, count=24, average=1647 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/pkg_resources/_vendor/pyparsing.py:3552: size=37.7 KiB, count=68, average=568 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphql/type/definition.py:315: size=36.8 KiB, count=320, average=118 B
<string>:1: size=36.3 KiB, count=428, average=87 B
/home/sutherlander/.envs/flow/lib/python3.7/site-packages/graphene/types/inputobjecttype.py:20: size=36.3 KiB, count=3, average=12.1 KiB
<frozen importlib._bootstrap>:316: size=36.0 KiB, count=1, average=36.0 KiB

The running objects appear not to contribute an appreciable amount.

Those top two allocation lines make sense, as all deltas run through apply_deltas, and

data_ele = data[key].setdefault(element.id, MESSAGE_MAP[key]())

is used to create an element, or to reference an existing one, when manipulating or merging with the delta.
Hold on, I can change that to get rid of the assignment when it's not needed (should be more efficient too)...
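Something like this (a sketch of the intent, not the exact diff): look the element up first, so the protobuf default is only constructed when the element is genuinely new:

if element.id in data[key]:
    data_ele = data[key][element.id]
else:
    data_ele = data[key][element.id] = MESSAGE_MAP[key]()

setdefault evaluates MESSAGE_MAP[key]() eagerly on every call, so the old form allocated a throwaway message even when the element already existed.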

@dwsutherland
Member Author

Done. However, for the sake of thread safety it probably makes sense to work on a copy and do an atomic update (although the "work on" is probably just a collection of atomic operations)...
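Roughly (an illustration of the copy-then-swap idea, with a hypothetical apply_delta_to helper):

new_topic = data[key].copy()
apply_delta_to(new_topic, delta)  # hypothetical helper mutating the copy
data[key] = new_topic  # single rebind (atomic in CPython), so readers
                       # never observe a half-applied delta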

@dwsutherland
Member Author

dwsutherland commented Jan 25, 2020

BTW, the coverage failure is mostly from a nomenclature fix in an area that is not tested and hard to test (scheduler.py).

@oliver-sanders oliver-sanders added the efficiency label on Jan 26, 2020
@oliver-sanders
Member

oliver-sanders commented Jan 26, 2020

I'm still puzzled as to why the Cylc application as a whole uses so much memory

Usually the bulk of the memory of a suite is in the config and task_pool objects.

The difference in size between the "outer" measurement (as recorded by /usr/bin/time, psutil, /proc/meminfo, whatever) and the "inner" measurement (as recorded by cProfile, Pympler, whatever) should come down to libraries, code and Python itself. This memory usage isn't necessarily as scary as it looks, as there may be many shared memory objects. If you are using RSS measurements, all shared memory will show up against what you are profiling. A USS measurement will filter this stuff out; not many tools offer this, so in the past I've used awk to sum the memory objects reported by /proc/<pid>/smaps to get USS (and PSS) values.
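For example (a rough Python equivalent of that awk approach, assuming Linux /proc/<pid>/smaps; field names per the kernel docs, values in kB):

def uss_pss(pid):
    uss = pss = 0
    with open(f'/proc/{pid}/smaps') as smaps:
        for line in smaps:
            if line.startswith(('Private_Clean:', 'Private_Dirty:')):
                uss += int(line.split()[1])  # private pages only
            elif line.startswith('Pss:'):
                pss += int(line.split()[1])  # proportional share of shared pages
    return uss, pss  # both in kB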

Once you've filtered out shared memory, the remainder of the difference should (I think) be Python itself and the code you are running. I haven't done much Python 3 profiling, but I did a bit of Python 2.6 work with Cylc and found that a hell of a lot of memory was going into code objects, which suggested that perhaps the size of the Cylc codebase (and its dependency graph) could be to blame; if so, a more modular architecture might help. More investigation needed...

@oliver-sanders oliver-sanders left a comment
Member

Code looks good to me, awaiting test results...

@oliver-sanders oliver-sanders left a comment
Member

Replicated the memory leak with a simple cycling suite.

Before

[memory graph: master]

After

[memory graph: this branch]

That result looks pretty conclusive to me!

@dwsutherland
Member Author

@oliver-sanders can probably merge this? Two approvals (and the coverage failure is justified: a nomenclature change in unrelated, uncovered logic).

@oliver-sanders oliver-sanders merged commit 5288546 into cylc:master Jan 28, 2020
@hjoliver
Member

Useful indeed. Might come in handy when troubleshooting. Perhaps include it in #113 or in this issue.

Jan 2013 - start of the infamously frustrating test battery!

@kinow
Member

kinow commented Jan 29, 2020

Useful indeed. Might come in handy when troubleshooting. Perhaps include it in #113 or in this issue.

Jan 2013 - start of the infamously frustrating test battery!

Weird, I thought I had linked an issue about logging and monitoring. Have no idea where I got this number 😅

@dwsutherland dwsutherland deleted the data_store_leak_fixes branch March 27, 2020 00:37