Kick-off doc section about common work-arounds. #430

Merged (1 commit) on May 18, 2020

74 changes: 74 additions & 0 deletions docs/source/common-work-arounds.rst
@@ -0,0 +1,74 @@
Common work-arounds
===================

The universe of HPC clusters is extremely diverse, with different job
schedulers, different configurations, and different decisions (about security,
usage, etc.) made by each HPC cluster. An unfortunate consequence of this is
that it is impossible for Dask-Jobqueue to cover every tiny edge case of every
HPC cluster.

This page attempts to document work-arounds that are likely to be useful on
more than one cluster (ideally strictly more than one, although that is hard
to be sure of ...).

Skipping unrecognised lines in the submission script with ``header_skip``
--------------------------------------------------------------------------

On some clusters the submission script generated by Dask-Jobqueue (you can
inspect it with ``print(cluster.job_script())``) may not work on your HPC
cluster because of some configuration quirk of that cluster. There are
probably good reasons behind this configuration quirk, of course.
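
Before changing anything, it helps to look at the generated script. Here is a
minimal sketch (``SLURMCluster`` and the resource values are illustrative
assumptions, use the cluster class that matches your job scheduler):

.. code-block:: python

    from dask_jobqueue import SLURMCluster  # or PBSCluster, LSFCluster, ...

    # illustrative resources, adjust for your cluster
    cluster = SLURMCluster(cores=8, memory='24GB')

    # show the submission script Dask-Jobqueue would use,
    # without actually submitting any job
    print(cluster.job_script())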

You'll get an error when calling ``cluster.scale`` (i.e. when you actually
submit some jobs) telling you that your job scheduler is not happy with your
job submission script (see the examples below). The main parameter you can use
to work around this is ``header_skip``:

.. code-block:: python

    # this will remove any line containing either '--mem' or
    # 'another-string' from the job submission script
    cluster = YourCluster(
        header_skip=['--mem', 'another-string'],
        **other_options_go_here)
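
Once ``header_skip`` is set, a quick sanity check is to print the job script
again and verify that the offending lines are gone (a sketch, reusing the
``cluster`` object created above):

.. code-block:: python

    # the skipped strings should no longer appear in the script
    assert '--mem' not in cluster.job_script()

    # submitting jobs should now get past the scheduler's checks
    cluster.scale(jobs=2)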


An example of this problem is detailed very well in this `blog post
<https://blog.dask.org/2019/08/28/dask-on-summit#invalid-operations-in-the-job-script>`_
by Matthew Rocklin. In his case, the error was:

.. code-block:: text

    Command:
    bsub /tmp/tmp4874eufw.sh
    stdout:

    Typical usage:
    bsub [LSF arguments] jobscript
    bsub [LSF arguments] -Is $SHELL
    bsub -h[elp] [options]
    bsub -V

    NOTES:
    * All jobs must specify a walltime (-W) and project id (-P)
    * Standard jobs must specify a node count (-nnodes) or -ln_slots. These jobs cannot specify a resource string (-R).
    * Expert mode jobs (-csm y) must specify a resource string and cannot specify -nnodes or -ln_slots.

    stderr:
    ERROR: Resource strings (-R) are not supported in easy mode. Please resubmit without a resource string.
    ERROR: -n is no longer supported. Please request nodes with -nnodes.
    ERROR: No nodes requested. Please request nodes with -nnodes.
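
Based on these error messages, a work-around on such a system might look like
the following sketch (the resource values and project id are illustrative
assumptions, see the blog post for the actual configuration used):

.. code-block:: python

    from dask_jobqueue import LSFCluster

    cluster = LSFCluster(
        cores=16,
        memory='128GB',
        project='ABC123',           # -P: project id, required on this cluster
        walltime='00:30',           # -W: walltime, required on this cluster
        job_extra=['-nnodes 1'],    # request nodes the way this LSF expects
        header_skip=['-R', '-n '],  # drop the generated lines it rejects
    )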

Another example of this issue is this GitHub `issue
<https://github.com/dask/dask-jobqueue/issues/238>`_, where ``--mem`` is not an
accepted option on some SLURM clusters. The error looked something like this:

.. code-block:: text

    $ sbatch submit_slurm.sh
    sbatch: error: Memory specification can not be satisfied
    sbatch: error: Batch job submission failed: Requested node configuration is not available
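
In that case, skipping the generated ``--mem`` line was enough. A sketch based
on the issue above (``memory`` is still needed so that Dask knows how much
memory each worker may use):

.. code-block:: python

    from dask_jobqueue import SLURMCluster

    cluster = SLURMCluster(
        cores=24,
        memory='100GB',         # still used to size the Dask workers
        header_skip=['--mem'],  # drop the '#SBATCH --mem' line
    )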

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -70,6 +70,7 @@ A good entry point to know more about how to use ``dask-jobqueue`` is
   configuration-setup
   examples
   configurations
   common-work-arounds
   api

.. toctree::