Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Large suites: more efficient dependency matching. #1688

Closed
hjoliver opened this issue Dec 6, 2015 · 6 comments
Closed

Large suites: more efficient dependency matching. #1688

hjoliver opened this issue Dec 6, 2015 · 6 comments
Assignees
Milestone

Comments

@hjoliver
Copy link
Member

hjoliver commented Dec 6, 2015

Companion to #1689. The speed of dependency matching goes down as the number of task proxies goes up. Currently indiscriminate: we get all waiting task proxies to ask all non-waiting ones to satisfy their prerequisites (albeit through a middle-man that records everyone's outputs). Given that the graph specifies exactly who depends on who, we should be able to have a newly completed output immediately update the prerequisites of exactly those tasks that are known to depend on it. This should scale much better to huge suite sizes.

@matthewrmshin
Copy link
Contributor

👍

@hjoliver
Copy link
Member Author

hjoliver commented Dec 7, 2015

(like #1689, this is also an old idea: #108 (comment))

benfitzpatrick added a commit to benfitzpatrick/cylc that referenced this issue Mar 18, 2016
@benfitzpatrick benfitzpatrick self-assigned this Mar 18, 2016
@benfitzpatrick
Copy link
Contributor

I had a look, and I think we can get our performance boost just by using set intersection to cut things down in the Prerequisite.satisfy_me method: see d8b48b3. In our dev/suites/busy suite, this cut broker.negotiate down from the highest bottleneck (57 seconds of cumulative time) to one of the most insignificant (1.1 seconds).

@benfitzpatrick
Copy link
Contributor

Some results for our dev/suites/busy with 10x the tasks (30,000 tasks in total) under this cylc change.

N.B. I ran it with ~/cylc-run/busy symlinked to /dev/shm/busy (a tmpfs/'RAM drive') to spare our NFS home filesystem. Under those conditions, I had to impose a queue limit and switch to background submission to not destroy the machine 😄 ...

         879689015 function calls (861810653 primitive calls) in 4520.325 CPU seconds

   Ordered by: cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.874    0.874 4469.523 4469.523 /opt/cylc/lib/cylc/scheduler.py:905(run)
      781    0.426    0.001  941.539    1.206 /opt/cylc/lib/cylc/scheduler.py:1088(update_state_summary)
      781  235.597    0.302  941.050    1.205 /opt/cylc/lib/cylc/network/suite_state.py:65(update)
      785  786.158    1.001  786.158    1.001 {time.sleep}
 34184093  531.937    0.000  531.937    0.000 {method 'items' of 'dict' objects}
      785  168.758    0.215  485.606    0.619 /opt/cylc/lib/cylc/task_pool.py:1228(process_queued_db_ops)
      783   54.567    0.070  469.201    0.599 /opt/cylc/lib/cylc/task_pool.py:463(submit_tasks)
      781   34.449    0.044  307.822    0.394 /opt/cylc/lib/cylc/suite_state_dumping.py:53(dump)
      783    0.453    0.001  306.895    0.392 /opt/cylc/lib/cylc/task_pool.py:528(submit_task_jobs)
    30000    0.462    0.000  303.674    0.010 /opt/cylc/lib/cylc/task_proxy.py:1262(prep_submit)
      785   20.788    0.026  275.883    0.351 /opt/cylc/lib/cylc/task_pool.py:1033(process_queued_task_event_handlers)
      786    0.769    0.001  262.350    0.334 /opt/cylc/lib/cylc/mp_pool.py:192(handle_results_async)
      745    1.433    0.002  247.798    0.333 /opt/cylc/lib/cylc/task_pool.py:600(submit_task_jobs_callback)
      745   97.774    0.131  246.365    0.331 /opt/cylc/lib/cylc/task_pool.py:914(_manip_task_jobs_callback)
    30000    1.571    0.000  243.200    0.008 /opt/cylc/lib/cylc/task_proxy.py:1294(_prep_submit_impl)
      784    8.113    0.010  230.541    0.294 /opt/cylc/lib/cylc/task_pool.py:1292(spawn_tasks)
 12303944   47.707    0.000  223.320    0.000 /opt/cylc/lib/cylc/task_proxy.py:1774(dump_state)
      785   12.708    0.016  217.209    0.277 /opt/cylc/lib/cylc/task_pool.py:1028(process_queued_task_messages)
  8301000   26.949    0.000  204.479    0.000 /opt/cylc/lib/cylc/task_proxy.py:1576(process_incoming_messages)
1072004/42002   23.805    0.000  202.614    0.005 /opt/cylc/lib/parsec/util.py:90(replicate)
    30000    0.243    0.000  197.365    0.007 /opt/cylc/lib/cylc/task_pool.py:1280(force_spawn)
    40000   15.766    0.000  197.222    0.005 /opt/cylc/lib/cylc/task_proxy.py:215(__init__)
201000/57000    0.663    0.000  195.757    0.003 /opt/cylc/lib/parsec/util.py:113(pdeepcopy)
      784   15.861    0.020  184.593    0.235 /opt/cylc/lib/cylc/task_pool.py:1011(match_dependencies)
      748  141.108    0.189  182.957    0.245 /opt/cylc/lib/cylc/task_pool.py:1144(_process_task_job_logs_register)
    30000    0.157    0.000  181.161    0.006 /opt/cylc/lib/cylc/task_proxy.py:1778(spawn)
 14614155   57.717    0.000  158.322    0.000 /opt/cylc/lib/cylc/task_proxy.py:499(ready_to_run)
 13960620   46.828    0.000  141.544    0.000 /opt/cylc/lib/cylc/task_proxy.py:616(prerequisites_are_all_satisfied)
 24784511  126.993    0.000  139.298    0.000 {sorted}
   548816    2.913    0.000  133.880    0.000 /opt/cylc/lib/cylc/task_proxy.py:459(log)
      784   30.354    0.039  132.735    0.169 /opt/cylc/lib/cylc/task_pool.py:1339(remove_spent_tasks)
   548816    3.046    0.000  130.967    0.000 /usr/lib64/python2.6/logging/__init__.py:1112(log)
 12303944   97.813    0.000  127.762    0.000 /opt/cylc/lib/cylc/task_proxy.py:1814(get_state_summary)
    40000    0.910    0.000  126.106    0.003 /opt/cylc/lib/cylc/network/task_msgqueue.py:29(__init__)
 15308025  124.331    0.000  124.331    0.000 {method 'write' of 'file' objects}
   200006    1.777    0.000  123.528    0.001 /usr/lib64/python2.6/logging/__init__.py:1163(_log)
    30000    0.241    0.000  123.468    0.004 /opt/cylc/lib/cylc/task_pool.py:604(<lambda>)
    30000    1.192    0.000  123.227    0.004 /opt/cylc/lib/cylc/task_proxy.py:846(job_submit_callback)
    40007    0.376    0.000  112.120    0.003 /opt/cylc/lib/cylc/network/pyro_base.py:46(__init__)
    40007    0.480    0.000  111.662    0.003 /usr/lib/python2.6/site-packages/Pyro/core.py:43(__init__)
    40007    3.107    0.000  111.168    0.003 /usr/lib/python2.6/site-packages/Pyro/util.py:327(getGUID)
  8319000   14.915    0.000  108.663    0.000 /opt/cylc/lib/cylc/task_proxy.py:1822(not_fully_satisfied)
114672073  106.210    0.000  106.210    0.000 /opt/cylc/lib/cylc/task_state.py:170(is_currently)
   200006    0.854    0.000  104.228    0.001 /usr/lib64/python2.6/logging/__init__.py:1183(handle)
     1570    1.625    0.001  101.017    0.064 /opt/cylc/lib/cylc/rundb.py:316(execute_queued_items)
    40007  100.244    0.003  100.244    0.003 {_socket.gethostbyname}
    60000    4.230    0.000   98.170    0.002 /opt/cylc/lib/cylc/task_proxy.py:1586(process_incoming_message)
32671439/18966820   17.156    0.000   95.930    0.000 {all}
    15112    0.179    0.000   94.110    0.006 /opt/cylc/lib/cylc/rundb.py:367(_execute_stmt)
    15112   92.951    0.006   92.951    0.006 {method 'executemany' of 'sqlite3.Connection' objects}
   200006    4.048    0.000   88.623    0.000 /usr/lib64/python2.6/logging/__init__.py:1212(callHandlers)
  3626233   41.402    0.000   87.457    0.000 /opt/cylc/lib/parsec/OrderedDict.py:68(keys)
      785    0.030    0.000   86.859    0.111 /opt/cylc/lib/cylc/scheduler.py:1112(process_tasks)
      785    9.969    0.013   86.809    0.111 /opt/cylc/lib/cylc/task_pool.py:1513(waiting_tasks_ready)
  8649285   15.541    0.000   86.708    0.000 /opt/cylc/lib/parsec/OrderedDict.py:61(__setitem__)
 27921240   33.232    0.000   85.663    0.000 /opt/cylc/lib/cylc/task_proxy.py:617(<genexpr>)
   200006    1.763    0.000   84.576    0.000 /usr/lib64/python2.6/logging/__init__.py:674(handle)
      784   45.260    0.058   83.060    0.106 /opt/cylc/lib/cylc/task_pool.py:1323(_get_earliest_unsatisfied_point)
  2638125    5.988    0.000   79.158    0.000 /opt/cylc/lib/parsec/OrderedDict.py:105(__nonzero__)
 12303944   70.945    0.000   78.864    0.000 /opt/cylc/lib/cylc/task_state.py:197(dump)
   200006    1.914    0.000   73.791    0.000 /usr/lib64/python2.6/logging/handlers.py:68(emit)
  1491331    4.488    0.000   71.519    0.000 /opt/cylc/lib/parsec/config.py:99(get)
  8649299   63.142    0.000   71.167    0.000 /usr/lib64/python2.6/collections.py:47(__setitem__)
    30000    1.880    0.000   70.877    0.002 /opt/cylc/lib/cylc/task_proxy.py:1065(job_submission_succeeded)
  8361785   25.459    0.000   70.270    0.000 /usr/lib64/python2.6/Queue.py:86(qsize)
  1491330    3.063    0.000   66.509    0.000 /opt/cylc/lib/parsec/config.py:94(expand)
    30000    1.216    0.000   60.011    0.002 /opt/cylc/lib/cylc/job_file.py:43(write)
   871508    7.816    0.000   53.604    0.000 /opt/cylc/lib/cylc/cfgspec/globalcfg.py:433(get_host_item)
 13960620   39.960    0.000   52.431    0.000 /opt/cylc/lib/cylc/prerequisite.py:122(is_satisfied)
    90000    1.187    0.000   51.871    0.001 /opt/cylc/lib/cylc/task_proxy.py:902(setup_event_handlers)
   942063    7.908    0.000   51.767    0.000 /opt/cylc/lib/parsec/OrderedDict.py:47(__init__)
   619632    2.175    0.000   51.069    0.000 /opt/cylc/lib/cylc/rundb.py:279(add_update_item)
        1    0.001    0.001   50.690   50.690 /opt/cylc/lib/cylc/scheduler.py:190(configure)
      785   30.411    0.039   49.648    0.063 /opt/cylc/lib/cylc/task_pool.py:1442(check_task_timers)
    90000    6.806    0.000   49.049    0.001 /opt/cylc/lib/cylc/task_proxy.py:464(command_log)
   619632   44.111    0.000   48.894    0.000 /opt/cylc/lib/cylc/rundb.py:127(add_update_item)
    60000    4.791    0.000   48.259    0.001 /opt/cylc/lib/cylc/task_proxy.py:1230(register_job_logs)
   300754    2.690    0.000   47.646    0.000 /opt/cylc/lib/cylc/cfgspec/globalcfg.py:396(get_derived_host_item)
 12393944   22.782    0.000   45.161    0.000 /opt/cylc/lib/cylc/task_id.py:40(split)
   270000    4.640    0.000   44.084    0.000 /opt/cylc/lib/cylc/task_proxy.py:199(get_job_log_dir)
   942066   22.751    0.000   43.860    0.000 /usr/lib64/python2.6/collections.py:31(__init__)
20505179/18186290   42.008    0.000   42.802    0.000 /opt/cylc/lib/parsec/OrderedDict.py:52(__getitem__)
   389816    7.281    0.000   42.296    0.000 /opt/cylc/lib/cylc/task_proxy.py:433(_get_events_conf)
    70000    7.004    0.000   40.955    0.001 /opt/cylc/lib/cylc/task_proxy.py:1177(set_from_rtconfig)
   200006    2.081    0.000   39.800    0.000 /usr/lib64/python2.6/logging/handlers.py:139(shouldRollover)
 59815297   39.023    0.000   39.023    0.000 /usr/lib64/python2.6/collections.py:66(__iter__)
  1120000    2.998    0.000   35.159    0.000 /opt/cylc/lib/cylc/rundb.py:266(add_insert_item)
    90000    3.031    0.000   34.246    0.000 /opt/cylc/lib/cylc/task_proxy.py:970(setup_custom_event_handlers)
  8913154   34.115    0.000   34.115    0.000 {built-in method acquire}
        1    0.000    0.000   33.100   33.100 /opt/cylc/lib/cylc/scheduler.py:645(configure_suite)
        1    0.000    0.000   33.093   33.093 /opt/cylc/lib/cylc/scheduler.py:607(load_suiterc)
  1120000   26.599    0.000   32.160    0.000 /opt/cylc/lib/cylc/rundb.py:104(add_insert_item)
   200006    1.040    0.000   32.077    0.000 /usr/lib64/python2.6/logging/__init__.py:859(emit)
   988096    8.445    0.000   31.986    0.000 /opt/cylc/lib/parsec/OrderedDict.py:80(items)
   200006    3.021    0.000   31.037    0.000 /usr/lib64/python2.6/logging/__init__.py:772(emit)
      784   22.387    0.029   30.365    0.039 /opt/cylc/lib/cylc/broker.py:39(register)
   150791   29.080    0.000   29.080    0.000 {open}
 30140996   28.872    0.000   28.872    0.000 /opt/cylc/lib/cylc/task_state.py:167(has_spawned)
    13611    0.033    0.000   28.864    0.002 /opt/cylc/lib/cylc/config.py:127(get_inst)
        1    1.580    1.580   28.831   28.831 /opt/cylc/lib/cylc/config.py:152(__init__)
  3342801    5.462    0.000   28.528    0.000 /opt/cylc/lib/cylc/broker.py:59(negotiate)
    30000    0.882    0.000   27.985    0.001 /opt/cylc/lib/cylc/task_proxy.py:1846(_create_job_log_path)
    30000    5.471    0.000   26.635    0.001 /opt/cylc/lib/cylc/job_file.py:219(_write_environment_1)
 14088444   23.491    0.000   26.063    0.000 /opt/cylc/lib/cylc/cycling/iso8601.py:110(__cmp__)
  8319000   16.909    0.000   25.030    0.000 /opt/cylc/lib/cylc/task_proxy.py:1788(ready_to_spawn)
  1162137    5.133    0.000   24.783    0.000 /opt/cylc/lib/cylc/wallclock.py:103(get_time_string)
        1    0.000    0.000   24.612   24.612 /opt/cylc/lib/cylc/config.py:2040(load_graph)
        1    0.072    0.072   24.612   24.612 /opt/cylc/lib/cylc/config.py:2169(parse_graph)
     4000    2.430    0.001   24.458    0.006 /opt/cylc/lib/cylc/config.py:1306(process_graph_line)
   400012    2.657    0.000   24.068    0.000 /usr/lib64/python2.6/logging/__init__.py:651(format)
  3342801   14.787    0.000   23.066    0.000 /opt/cylc/lib/cylc/task_proxy.py:1827(satisfy_me)
 12613748   22.818    0.000   22.818    0.000 {method 'split' of 'str' objects}
   200006   22.697    0.000   22.697    0.000 {method 'seek' of 'file' objects}
   400012    4.715    0.000   21.412    0.000 /usr/lib64/python2.6/logging/__init__.py:431(format)
   972066   11.370    0.000   21.408    0.000 /usr/lib64/python2.6/_abcoll.py:466(update)
    10000    0.225    0.000   20.606    0.002 /opt/cylc/lib/cylc/config.py:1621(generate_taskdefs)
   612125    2.484    0.000   20.255    0.000 /opt/cylc/lib/cylc/wallclock.py:80(get_current_time_string)
  1162137   19.649    0.000   19.649    0.000 {method 'strftime' of 'datetime.date' objects}
    30000    0.169    0.000   19.603    0.001 /opt/cylc/lib/cylc/task_pool.py:607(<lambda>)
    30000    3.140    0.000   19.433    0.001 /opt/cylc/lib/cylc/task_proxy.py:875(job_cmd_out_callback)
    30745    0.770    0.000   18.879    0.001 /opt/cylc/lib/cylc/mp_pool.py:180(put_command)
    30745    0.439    0.000   18.093    0.001 /usr/lib64/python2.6/multiprocessing/pool.py:186(apply_async)
   390035   18.011    0.000   18.011    0.000 {posix.stat}
        1    0.070    0.070   17.575   17.575 /opt/cylc/bin/cylc-run:121(load_tasks)
   400012   17.554    0.000   17.554    0.000 /usr/lib64/python2.6/logging/__init__.py:574(filter)
   550012    2.711    0.000   17.381    0.000 /opt/cylc/lib/cylc/wallclock.py:177(get_time_string_from_unix_time)
  4976199   10.449    0.000   17.098    0.000 /opt/cylc/lib/cylc/task_proxy.py:619(suicide_prerequisites_are_all_satisfied)
    10000    0.079    0.000   16.756    0.002 /opt/cylc/lib/cylc/get_task_proxy.py:23(get_task_proxy)
 32402767   16.372    0.000   16.372    0.000 {method 'get' of 'dict' objects}
    40000    0.807    0.000   16.093    0.000 /opt/cylc/lib/cylc/task_pool.py:193(add_to_runahead_pool)
    30000    0.070    0.000   15.813    0.001 /opt/cylc/lib/cylc/mkdir_p.py:35(mkdir_p)
60003/30000    0.380    0.000   15.743    0.001 /usr/lib64/python2.6/os.py:136(makedirs)
10867928/10438928   13.124    0.000   15.682    0.000 /opt/cylc/lib/parsec/OrderedDict.py:99(__contains__)
  2043106   11.961    0.000   15.551    0.000 /usr/lib64/python2.6/posixpath.py:59(join)
 12583760   14.671    0.000   14.671    0.000 /opt/cylc/lib/cylc/task_state.py:158(get_status)
   200006    1.020    0.000   14.366    0.000 /usr/lib64/python2.6/logging/__init__.py:1150(makeRecord)
   150760    1.633    0.000   14.183    0.000 /usr/lib64/python2.6/threading.py:176(Condition)
   400012    0.874    0.000   13.517    0.000 /opt/cylc/lib/cylc/suite_logging.py:87(formatTime)
  8804517   13.428    0.000   13.428    0.000 {built-in method release}
   200006    7.865    0.000   13.346    0.000 /usr/lib64/python2.6/logging/__init__.py:226(__init__)
    30000    0.235    0.000   13.295    0.000 /opt/cylc/lib/cylc/task_proxy.py:1010(custom_event_handler_callback)
    40003    1.227    0.000   13.080    0.000 /usr/lib64/python2.6/Queue.py:22(__init__)
   140000    4.404    0.000   13.046    0.000 /opt/cylc/lib/cylc/poll_timer.py:31(__init__)
   128271    0.989    0.000   12.986    0.000 /usr/lib64/python2.6/threading.py:272(notify)
  8421785    9.268    0.000   12.799    0.000 /usr/lib64/python2.6/Queue.py:200(_qsize)
 24688738   12.665    0.000   12.788    0.000 {method 'setdefault' of 'dict' objects}
      785    6.349    0.008   12.727    0.016 /opt/cylc/lib/cylc/task_pool.py:251(release_runahead_tasks)
    30746    0.490    0.000   12.597    0.000 /usr/lib64/python2.6/Queue.py:107(put)
   150760   12.072    0.000   12.550    0.000 /usr/lib64/python2.6/threading.py:181(__init__)
 12303944   12.237    0.000   12.237    0.000 /opt/cylc/lib/cylc/suite_state_dumping.py:85(<lambda>)
    30000    1.378    0.000   12.061    0.000 /opt/cylc/lib/cylc/task_proxy.py:1440(_populate_job_conf)
    60021    8.210    0.000   11.996    0.000 /usr/lib64/python2.6/genericpath.py:15(exists)
 37998935   11.814    0.000   11.814    0.000 {method 'append' of 'list' objects}
 20882129    9.407    0.000   11.620    0.000 {isinstance}
   200006    0.710    0.000   11.451    0.000 /usr/lib64/python2.6/logging/__init__.py:765(flush)
 10077201   10.874    0.000   10.874    0.000 /opt/cylc/lib/cylc/cycling/__init__.py:111(__str__)
    40000    2.619    0.000   10.788    0.000 /opt/cylc/lib/cylc/task_proxy.py:366(_add_prerequisites)
   159816    2.079    0.000   10.588    0.000 /opt/cylc/lib/cylc/task_proxy.py:1761(set_status)
   200014   10.497    0.000   10.497    0.000 {method 'flush' of 'file' objects}
   443016    9.289    0.000   10.474    0.000 /usr/lib64/python2.6/copy.py:65(copy)
  8375446   10.310    0.000   10.310    0.000 {getattr}
    90000    1.080    0.000   10.088    0.000 /opt/cylc/lib/cylc/task_proxy.py:447(_get_host_conf)

(lots more below 10 seconds that I've omitted)

The 10x busy suite I ran with (note: runtime uses wrong range for i2, but that doesn't matter):

#!jinja2
title = Suite-1000-x3
description = A suite of 1000 tasks per cycle
[cylc]
    UTC mode = True # Ignore DST
[scheduling]
    initial cycle point = 20130101T00
    final cycle point = 20130103T00
    runahead limit = PT12H
    [[queues]]
        [[[everything]]]
            limit = 100
            members = root
    [[dependencies]]
        [[[T00]]]
            graph="""
                {% for i0 in range(10) -%}
                {% for i1 in range(10) -%}
                {% for i2 in range(10) -%}
                t{{i0}}{{i1}}{{i2}}9[-P1D] => t{{i0}}{{i1}}{{i2}}0
                t{{i0}}{{i1}}{{i2}}0 => t{{i0}}{{i1}}{{i2}}1 => \
                t{{i0}}{{i1}}{{i2}}2 => t{{i0}}{{i1}}{{i2}}3
                t{{i0}}{{i1}}{{i2}}3 => t{{i0}}{{i1}}{{i2}}4 => \
                t{{i0}}{{i1}}{{i2}}5 => t{{i0}}{{i1}}{{i2}}6
                t{{i0}}{{i1}}{{i2}}6 => t{{i0}}{{i1}}{{i2}}7 => \
                t{{i0}}{{i1}}{{i2}}8 => t{{i0}}{{i1}}{{i2}}9
                {% endfor -%}
                {% endfor -%}
                {% endfor -%}
            """
[runtime]
    [[root]]
        command scripting = sleep 1
        [[[event hooks]]]
            succeeded handler = true
            failed handler = true
            retry handler = true
            submission failed handler = true
            submission timeout handler = true
            execution timeout handler = true
            execution timeout = PT6M
            submission timeout = PT1M
{% for i0 in range(10) -%}
{% for i1 in range(10) -%}
{% for i2 in range(1) -%}
{% for i3 in range(10) -%}
    [[t{{i0}}{{i1}}{{i2}}{{i3}}]]
{% endfor -%}
{% endfor -%}
{% endfor -%}
{% endfor -%}

@hjoliver
Copy link
Member Author

That's amazing, good job. Let's get this in! 😀

benfitzpatrick added a commit to benfitzpatrick/cylc that referenced this issue Mar 21, 2016
benfitzpatrick added a commit to benfitzpatrick/cylc that referenced this issue Mar 24, 2016
@hjoliver
Copy link
Member Author

Closing this as #1769 more or less does what was proposed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants