Fixes for Adaptive #63

Merged
merged 70 commits into from
Jul 16, 2018
Changes from 4 commits
Commits
7c56b1d
add job ids to dask workers
May 18, 2018
6b35688
pad job id
May 18, 2018
507be82
parse PBS/slurm job ids
May 18, 2018
99d0f1f
track workers individually (sort of)
May 21, 2018
78a22ff
add _adaptive_options
May 21, 2018
62e050c
generalize the parsing of the job id
May 22, 2018
d121180
fix typo
May 22, 2018
2329bfe
changes for review
May 22, 2018
92eaf4e
add pluggin (untested)
May 24, 2018
9084a35
a few fixes + tests
May 24, 2018
4776892
respond to first round of comments
May 29, 2018
ef62f59
fix list addition
May 29, 2018
a4e007a
mark test modules (again)
May 29, 2018
c19e4da
fixes while testing on pbs
May 29, 2018
5d5fd85
remove extra if block
May 29, 2018
115b0c1
use for/else
May 29, 2018
cde3ca4
fix two failing tests
May 30, 2018
75f2c6a
Merge branch 'jobids' of github.com:jhamman/dask-jobqueue into jobids
May 30, 2018
66db52d
respond to review comments
Jun 3, 2018
ea7d56d
fix bug in scale down
Jun 4, 2018
a6d31d2
fix marks
Jun 4, 2018
25965c0
Merge branch 'master' of github.com:dask/dask-jobqueue into jobids
Jun 15, 2018
ab4363a
more docs
Jun 21, 2018
604a563
debugging ci
Jun 21, 2018
1e0455e
more ci
Jun 21, 2018
ace37ad
only stop jobs if there are jobs to stop
Jun 21, 2018
56e2990
refactor remove_worker method
Jun 21, 2018
914244c
debug
Jun 21, 2018
359be59
print debug info
Jun 21, 2018
1441634
longer waits in tests
Jun 21, 2018
0bf53d1
refactor use of jobids, add scheduler plugin
May 18, 2018
cc2628f
debug stop workers
Jun 26, 2018
90dd730
fix tests of stop_workers
Jun 26, 2018
9fe2178
Merge branch 'master' into jobids
mrocklin Jun 26, 2018
18dfe31
cleanup
Jun 26, 2018
d98b141
Merge branch 'master' of github.com:dask/dask-jobqueue into jobids
Jun 26, 2018
b303275
more flexible stop workers
Jun 26, 2018
13e5dc3
Merge branch 'jobids' of github.com:jhamman/dask-jobqueue into jobids
Jun 26, 2018
b4877ad
Merge remote-tracking branch 'origin/jobids' into jobids
Jun 26, 2018
667369e
remove Job class
Jun 27, 2018
bf99d29
Merge branch 'master' of github.com:dask/dask-jobqueue into jobids
Jun 27, 2018
292b595
fix for worker name (again)
Jun 27, 2018
627f873
debug
Jun 27, 2018
f2b2a92
perform a backflip
Jun 27, 2018
1f0dc71
Merge branch 'jobids' of github.com:jhamman/dask-jobqueue into jobids
Jun 27, 2018
a1b102d
Merge branch 'master' of github.com:dask/dask-jobqueue into jobids
Jul 3, 2018
c988f1e
isort after merge conflict resolution
Jul 3, 2018
619047f
more docs stuff
Jul 3, 2018
8db65eb
update worker name template to use bracket delims
Jul 3, 2018
ef16298
add a few more comments
Jul 3, 2018
3803918
roll back changes in tests
Jul 3, 2018
aad58d4
fix slurm tests and missing scheduler plugin after merge conflict res…
Jul 3, 2018
fa1b717
fix threads in test
Jul 3, 2018
aeea2e5
debug on travis
Jul 6, 2018
b93a7c4
simplify tests
Jul 6, 2018
0a2e304
more debugging and more robust sleeps
Jul 7, 2018
1a1fe75
unify basic test
Jul 7, 2018
8c32872
unify adaptive tests
Jul 7, 2018
9c43f43
Merge remote-tracking branch 'origin/jobids' into jobids
Jul 9, 2018
a02abc8
Merge branch 'master' of github.com:dask/dask-jobqueue into jobids
Jul 9, 2018
ee89f20
debug statements and some nice fixups
Jul 9, 2018
4abcea1
future div
Jul 13, 2018
0c7425a
add logging stuff
Jul 13, 2018
5d4552d
-s for pytest
Jul 13, 2018
8a150c9
use --job_id-- for name
mrocklin Jul 13, 2018
ca0c727
fix memory in sge tests
mrocklin Jul 13, 2018
ce007df
remove pending jobs when scaling down
Jul 14, 2018
c23ce7c
remove pending jobs
Jul 14, 2018
7618467
cleanup after lots of debugging
Jul 15, 2018
d5e42b3
additional cleanup
Jul 16, 2018
32 changes: 21 additions & 11 deletions dask_jobqueue/core.py
@@ -1,4 +1,7 @@
+from __future__ import absolute_import, division, print_function
+
 import logging
+import math
 import os
 import shlex
 import socket
@@ -133,8 +136,11 @@ def __init__(self,
         if memory is not None:
             self._command_template += " --memory-limit %s" % memory
         if name is not None:
-            self._command_template += " --name %s" % name
-            self._command_template += "-%(n)d"  # Keep %(n) to be replaced later
+            # worker names follow this template: {NAME}-{JOB_ID}-{WORKER_NUM}
+            self._command_template += " --name %s" % name  # e.g. "dask-worker"
+            # Keep %(n) to be replaced later (worker id on this job)
+            # ${JOB_ID} is an environment variable describing this job
+            self._command_template += "-${JOB_ID}-%(n)d"
         if death_timeout is not None:
             self._command_template += " --death-timeout %s" % death_timeout
         if local_directory is not None:
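
A minimal sketch of how the worker-name template in this hunk is expanded in two stages. The job id `12345` and worker number `3` are made up for illustration, and `string.Template` only stands in for the shell, which performs the `${JOB_ID}` expansion when the job script actually runs.

```python
from string import Template

name = "dask-worker"  # example value, as in the diff comment

# Build the template the same way the diff does.
command_template = " --name %s" % name
command_template += "-${JOB_ID}-%(n)d"

# Stage 1: Python substitutes the worker number on this job.
command = command_template % {"n": 3}
print(command)

# Stage 2: the shell expands ${JOB_ID} at run time; string.Template
# is only a stand-in for that expansion in this sketch.
worker_name = Template(command).substitute(JOB_ID="12345").split()[-1]
print(worker_name)

# scale_down recovers the job id from the second-to-last field.
print(worker_name.split("-")[-2])
```

The `split("-")[-2]` parse relies on the job id itself containing no hyphen; hyphens in the name prefix are harmless because the fields are counted from the right.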
@@ -161,7 +167,8 @@ def job_file(self):
     def start_workers(self, n=1):
         """ Start workers and point them to our local scheduler """
         workers = []
-        for _ in range(n):
+        num_jobs = min(1, math.ceil(n / self.worker_processes))
Member

Why use min here? If I'm not mistaken, this would always start only one job.

Member Author

Good point. I've removed this.

+        for _ in range(num_jobs):
Member Author

This is a breaking change that I want to make sure everyone is aware of. For a hypothetical setup with 10 workers per job, the current behavior would be to call:

cluster.start_workers(1)

...and get 1 job and 10 workers.

I'd like to change this so that start_workers(n) gives us n workers and as many jobs as needed to make that happen.

Member

Historically start_workers was a semi-convention between a few projects. This has decayed, so I have no strong thoughts here. I do think that we need to be consistent on scale though, which seems a bit more standard today.

Member

Will this really help adaptive? Wouldn't there still be a problem with workers starting in groups?

With your example, calling cluster.start_workers(1) will still lead to 1 job and 10 workers!

But adaptive may handle this well; I don't know. If it does, is this breaking change needed at all?
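
The arithmetic discussed in this thread can be sketched as follows. `jobs_needed` is a hypothetical helper written for illustration, not a function from dask-jobqueue, and the figure of 10 workers per job is the hypothetical setup from the comments above.

```python
import math


def jobs_needed(n_workers, workers_per_job):
    """Fewest jobs that supply at least n_workers workers."""
    if n_workers <= 0:
        return 0
    return int(math.ceil(n_workers / float(workers_per_job)))


workers_per_job = 10  # the hypothetical setup from the thread
for n in (1, 10, 11, 25):
    jobs = jobs_needed(n, workers_per_job)
    capped = min(1, jobs)  # the expression questioned in review
    # requested workers, jobs, workers actually started, capped job count
    print(n, jobs, jobs * workers_per_job, capped)
```

Two points from the review show up in the numbers: wrapping the result in `min(1, ...)` never yields more than one job, and because workers arrive in whole jobs, a request for 1 worker still starts 10.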

             with self.job_file() as fn:
                 out = self._call(shlex.split(self.submit_command) + [fn])
                 job = self._job_id_from_submit_output(out.decode())
@@ -196,12 +203,12 @@ def _calls(self, cmds):
         Also logs any stderr information
         """
         logger.debug("Submitting the following calls to command line")
+        procs = []
         for cmd in cmds:
             logger.debug(' '.join(cmd))
-        procs = [subprocess.Popen(cmd,
-                                  stdout=subprocess.PIPE,
-                                  stderr=subprocess.PIPE)
-                 for cmd in cmds]
+            procs.append(subprocess.Popen(cmd,
+                                          stdout=subprocess.PIPE,
+                                          stderr=subprocess.PIPE))

         result = []
         for proc in procs:
@@ -232,10 +239,13 @@ def scale_up(self, n, **kwargs):

     def scale_down(self, workers):
         ''' Close the workers with the given addresses '''
-        if isinstance(workers, dict):
-            names = {v['name'] for v in workers.values()}
-            job_ids = {name.split('-')[-2] for name in names}
-            self.stop_workers(job_ids)
+        if not isinstance(workers, dict):
+            raise ValueError(
+                'Expected dictionary of workers, got %s' % type(workers))
+        names = {v['name'] for v in workers.values()}
+        # This will close down the full group of workers
+        job_ids = {name.split('-')[-2] for name in names}
+        self.stop_workers(job_ids)
Member Author

I think there is a better way to do this. Currently, scaling down removes the entire job from the system. So if Adaptive tells us to remove 1 worker (say we have 10 workers per job), we remove all 10.

@mrocklin - Would it make sense to add logic to Adaptive so it knows how to bundle groups of workers? Otherwise, we could bundle here and check whether we're being asked to scale down an entire group.
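
The behavior described in this comment can be reproduced with a small stand-alone sketch. The addresses, the job id `4242`, and the helper `job_ids_for` are made up for illustration; only the name parsing is taken from the diff.

```python
def job_ids_for(workers):
    """Mirror the name parsing in scale_down: {NAME}-{JOB_ID}-{WORKER_NUM}."""
    names = {info['name'] for info in workers.values()}
    return {name.split('-')[-2] for name in names}


# Hypothetical cluster: one job (id 4242) running 10 workers.
running = {
    'tcp://10.0.0.1:%d' % (4000 + i): {'name': 'dask-worker-4242-%d' % i}
    for i in range(10)
}

# Adaptive asks to close a single worker ...
address = 'tcp://10.0.0.1:4003'
to_close = {address: running[address]}
stopped_jobs = job_ids_for(to_close)

# ... but stopping that job takes down every worker running on it.
lost = [addr for addr, info in running.items()
        if info['name'].split('-')[-2] in stopped_jobs]
print(stopped_jobs, len(lost))
```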

Member

The key= parameter to workers_to_close (passed through from retire_workers) seems relevant here. I believe that it was made for this purpose.

https://github.com/dask/distributed/blob/master/distributed/scheduler.py#L2525-L2548

Member

Glad to see that grouped workers are handled in adaptive!

Another comment, not linked to this PR: I find the job_ids variable name misleading. It should be something like worker_ids.
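
A rough sketch of the grouping idea raised in this thread: a key function maps each worker to its job, and only jobs whose workers are all slated for removal are stopped. `job_key` and `groups_fully_requested` are hypothetical names, and this is not how distributed implements `workers_to_close`; the exact signature of its `key=` parameter is not shown in this thread.

```python
from collections import defaultdict


def job_key(worker_info):
    """Grouping key: the job id embedded in the worker name."""
    return worker_info['name'].split('-')[-2]


def groups_fully_requested(requested, running, key=job_key):
    """Return the keys of groups whose workers are all in `requested`."""
    members = defaultdict(set)
    for address, info in running.items():
        members[key(info)].add(address)
    requested = set(requested)
    return {k for k, addresses in members.items() if addresses <= requested}


# Two hypothetical jobs (101 and 102) with two workers each.
running = {
    'tcp://a:1': {'name': 'dask-worker-101-0'},
    'tcp://a:2': {'name': 'dask-worker-101-1'},
    'tcp://b:1': {'name': 'dask-worker-102-0'},
    'tcp://b:2': {'name': 'dask-worker-102-1'},
}
requested = ['tcp://a:1', 'tcp://a:2', 'tcp://b:1']
print(groups_fully_requested(requested, running))
```

With this selection, job 102 keeps running because one of its workers was not requested for removal, so no worker is closed that the caller did not ask to close.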


     def __enter__(self):
         return self
3 changes: 2 additions & 1 deletion dask_jobqueue/pbs.py
@@ -73,7 +73,7 @@ def __init__(self,
         # Try to find a project name from environment variable
         project = project or os.environ.get('PBS_ACCOUNT')

-        header_lines = []
+        header_lines = ['#!/usr/bin/env bash']
         # PBS header build
         if self.name is not None:
             header_lines.append('#PBS -N %s' % self.name)
@@ -93,6 +93,7 @@ def __init__(self,
         if walltime is not None:
             header_lines.append('#PBS -l walltime=%s' % walltime)
         header_lines.extend(['#PBS %s' % arg for arg in job_extra])
+        header_lines.append('JOB_ID=${PBS_JOBID%.*}')

         # Declare class attribute that shall be overriden
         self.job_header = '\n'.join(header_lines)
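
The `${PBS_JOBID%.*}` expansion added in this hunk removes the shortest suffix matching `.*`, that is, everything from the last dot onward. The sketch below emulates that in Python alongside the `%%` form, which removes the longest such suffix. Both helpers and the sample job ids are illustrative assumptions; the actual format of `PBS_JOBID` depends on the site.

```python
def strip_shortest_suffix(value, sep):
    """Emulate the POSIX shell expansion ${VAR%<sep>*}."""
    head, found, _tail = value.rpartition(sep)
    return head if found else value


def strip_longest_suffix(value, sep):
    """Emulate the POSIX shell expansion ${VAR%%<sep>*}."""
    return value.split(sep, 1)[0]


# Made-up job ids: single-dot server, multi-dot server, no server part.
for job_id in ('12345.pbsserver', '12345.pbs.example.org', '12345'):
    print(job_id,
          strip_shortest_suffix(job_id, '.'),
          strip_longest_suffix(job_id, '.'))
```

If a site reports a fully qualified server name, the single-`%` form leaves part of the host name attached to the job id, which may matter for the hyphen-based name parsing in `scale_down`.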
5 changes: 3 additions & 2 deletions dask_jobqueue/sge.py
@@ -1,3 +1,5 @@
+from __future__ import absolute_import, division, print_function
+
 import logging

 import dask
@@ -52,8 +54,7 @@ def __init__(self,

         super(SGECluster, self).__init__(**kwargs)

-        header_lines = ['#!/bin/bash']
-
+        header_lines = ['#!/usr/bin/env bash']
Member

So you don't have a solution for propagating the JOB_ID variable in the SGE script?

Member Author

Good catch. I thought I did, but I'll have to sort it out.

Member Author

I take that back. SGE uses JOB_ID for its environment variable, so it's already present.

         if self.name is not None:
             header_lines.append('#$ -N %(name)s')
         if queue is not None:
3 changes: 2 additions & 1 deletion dask_jobqueue/slurm.py
@@ -71,7 +71,7 @@ def __init__(self,
         super(SLURMCluster, self).__init__(**kwargs)

         # Always ask for only one task
-        header_lines = []
+        header_lines = ['#!/usr/bin/env bash']
         # SLURM header build
         if self.name is not None:
             header_lines.append('#SBATCH -J %s' % self.name)
@@ -99,6 +99,7 @@ def __init__(self,

         if walltime is not None:
             header_lines.append('#SBATCH -t %s' % walltime)
+        header_lines.append('JOB_ID=${SLURM_JOB_ID%;*}')
         header_lines.extend(['#SBATCH %s' % arg for arg in job_extra])

         # Declare class attribute that shall be overriden
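
One detail worth checking in this hunk is line order. According to the sbatch documentation, `#SBATCH` directives stop being processed once the first non-comment, non-whitespace line of the script is reached, so directives placed after the `JOB_ID=` assignment may be ignored. The sketch below is a hypothetical checker, not part of dask-jobqueue, and `--exclusive` is only an example `job_extra` value.

```python
def ignored_directives(script):
    """Return '#SBATCH' lines that follow the first executable line."""
    seen_command = False
    ignored = []
    for line in script.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # blank lines do not end directive processing
        if stripped.startswith('#'):
            # Comments (including the shebang) are skipped; a directive
            # found after a command is one sbatch would not process.
            if seen_command and stripped.startswith('#SBATCH'):
                ignored.append(stripped)
            continue
        seen_command = True
    return ignored


# Line order as in this hunk: assignment first, then job_extra.
as_in_diff = '\n'.join([
    '#!/usr/bin/env bash',
    '#SBATCH -J dask-worker',
    '#SBATCH -t 00:30:00',
    'JOB_ID=${SLURM_JOB_ID%;*}',
    '#SBATCH --exclusive',
])

# Alternative order: all directives first, assignment last.
reordered = '\n'.join([
    '#!/usr/bin/env bash',
    '#SBATCH -J dask-worker',
    '#SBATCH -t 00:30:00',
    '#SBATCH --exclusive',
    'JOB_ID=${SLURM_JOB_ID%;*}',
])

print(ignored_directives(as_in_diff))
print(ignored_directives(reordered))
```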