Job Defense Shield

A general version is in development. In the meantime, please see the version that is specific to our institution.

Job Defense Shield is simple Python code for sending automated email alerts to users and for creating reports for system administrators. It is part of the Jobstats platform.

About

The software in this repo creates a report of problem users and problem jobs. The software identifies the following:

  • actively running jobs where a GPU has zero utilization
  • the heaviest users with low CPU or GPU utilization
  • jobs that use the large-memory nodes but do not need them
  • jobs that could have been run on MIG GPUs instead of full A100 GPUs
  • multinode CPU jobs where one or more nodes have zero utilization
  • jobs with excessive run time limits
  • jobs with CPU or GPU fragmentation (e.g., 1 GPU per node over 4 nodes)
  • jobs with the most CPU-cores and jobs with the most GPUs
  • pending jobs with the longest queue times
  • jobs that request more than the default memory but do not use it

The script does not identify:

  • abuses of file storage or I/O

How to Use

The following example shows how to check for zero GPU utilization of actively running jobs:

$ ./job_defense_shield.py --email \
                          --days=1 \
                          --zero-gpu-utilization \
                          --files /nfs/.shield/violations

The example below runs several alerts at once:

$ ./job_defense_shield.py --email \
                          --days=7 \
                          --zero-util-gpu-hours \
                          --gpu-fragmentation \
                          --mig \
                          --low-xpu-efficiency \
                          --low-time-efficiency \
                          --datascience \
                          --longest-queued   

Installation

The Job Defense Shield is written in Python. The requirements are:

  • Python 3.7+
  • Pandas
  • jobstats (if looking to send emails about actively running jobs)
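
A minimal installation sketch (the clone URL is the GitHub repository for this project; adjust the pip invocation to your site's Python setup):

$ git clone https://github.com/PrincetonUniversity/job_defense_shield.git
$ cd job_defense_shield
$ python3 -m pip install --user pandas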

Sample Emails

Actively Running Jobs with Zero GPU Utilization

Hi Alan,

You have a GPU job that has been running for more than 1 hour but it appears to
not be using the GPU:

      JobID     NetID  Cluster  GPUs-Allocated  GPUs-Unused GPU-Util  Hours
     49412914  aturing  della         1             1          0%     2.8  

Please consider canceling the job listed above by using the "scancel" command,
for example:

     $ scancel 49412914

The utilization of each allocated GPU is measured every 30 seconds. All
measurements for the job above have been reported as 0%. You can see this by
running the "jobstats" command, for example:

     $ jobstats 49412914

Follow the link at the bottom of the "jobstats" output for more detailed
information.

If the GPU is not being used then you need to take action now to resolve this
issue. Wasting resources prevents other users from getting their work done and
it causes your subsequent jobs to have a lower priority. Users that continually
underutilize the GPUs risk having their accounts suspended.

Toward resolving this issue, please consult the documentation for the code that
you are running. Is it GPU-enabled?

For general information about GPU computing at Princeton:

     https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing

Please monitor your jobs using the "jobstats" command and the web interface:

     https://researchcomputing.princeton.edu/support/knowledge-base/job-stats

Add the following lines to your Slurm scripts to receive an email report with
GPU utilization information after each job finishes:

     #SBATCH --mail-type=end
     #SBATCH --mail-user=aturing@princeton.edu

Consider attending an in-person Research Computing help session for assistance:

     https://researchcomputing.princeton.edu/support/help-sessions

Replying to this automated email will open a support ticket with Research
Computing. Let us know if we can be of help.

Low GPU Utilization

Hi Alan,

Over the last 8 days you have used the 10th most GPU-hours on Della (GPU) but
your mean GPU efficiency is only 12%:

      NetID  Partition(s)  Jobs  GPU-hours GPU-rank Efficiency
     aturing     gpu       1670    1902     10/118     12%    

Please investigate the reason(s) for the low efficiency. Common reasons for low
GPU efficiency include:

  1. Misconfigured application scripts. Be sure to read the documentation of the
     software to make sure that you are using it properly. This includes creating
     the appropriate software environment. For a general overview of GPU computing:
     https://researchcomputing.princeton.edu/support/knowledge-base/gpu-computing

  2. Using an A100 GPU when a MIG GPU would be sufficient. Some codes do not have
     enough work to keep an A100 GPU busy. If you encounter this on the Della
     cluster then consider using a MIG GPU:
     https://researchcomputing.princeton.edu/systems/della#gpus

  3. Training deep learning models while only using a single CPU-core. Codes such as
     PyTorch and TensorFlow show performance benefits when multiple CPU-cores are
     used for the data loading. For PyTorch see:
     https://researchcomputing.princeton.edu/support/knowledge-base/pytorch#multi

  4. Using too many GPUs for a job. You can find the optimal number of GPUs and
     CPU-cores by performing a scaling analysis:
     https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

  5. Writing job output to the /tigress or /projects storage systems. Actively
     running jobs should be writing output files to /scratch/gpfs/aturing which is
     a much faster filesystem. For more information:
     https://researchcomputing.princeton.edu/support/knowledge-base/data-storage

Consult the documentation or write to the mailing list of the software that you
are using for additional reasons for low GPU efficiency and for potential
solutions. You may also consider attending a Research Computing help session:

     https://researchcomputing.princeton.edu/support/help-sessions

Add the following lines to your Slurm scripts to receive an email report with GPU
efficiency information after each job finishes:

     #SBATCH --mail-type=end
     #SBATCH --mail-user=aturing@princeton.edu

You can check the efficiency of completed and actively running jobs by using the
'jobstats' command:

     https://researchcomputing.princeton.edu/support/knowledge-base/job-stats

Replying to this email will open a support ticket with CSES. Let us know if we
can be of help.

Low CPU Utilization

Hi Alan,

Over the last 8 days you have used the 11th most CPU-hours on TigerCPU but
your mean CPU efficiency is only 47%:

     NetID    Partition(s)   Jobs  CPU-hours CPU-rank Efficiency
     aturing cpu,ext,serial   11    52876     11/63      47%    

Please investigate the reason(s) for the low efficiency. Common reasons for low
CPU efficiency include:

  1. Running a serial code using multiple CPU-cores. Make sure that your code is
     written to run in parallel before using multiple CPU-cores. Learn more:
     https://researchcomputing.princeton.edu/support/knowledge-base/parallel-code

  2. Using too many CPU-cores for parallel jobs. You can find the optimal number
     of CPU-cores by performing a scaling analysis:
     https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

  3. Writing job output to the /tigress or /projects storage systems. Actively
     running jobs should be writing output files to /scratch/gpfs/aturing which is
     a much faster filesystem. For more information:
     https://researchcomputing.princeton.edu/support/knowledge-base/data-storage

  4. Using the MPICH library instead of an MPI library that was built for our
     clusters. Some software installed using 'conda' is built against an MPI
     library that is not optimized for our systems. Run 'conda list' after
     activating the environment and look for 'mpich' to see if you are using this
     library.

  5. Using 'mpirun' instead of 'srun' for parallel codes. Please use 'srun'.

Consult the documentation or write to the mailing list of the software that you
are using for additional reasons for low CPU efficiency and for potential
solutions. You may also consider attending a Research Computing help session:

     https://researchcomputing.princeton.edu/support/help-sessions

Add the following lines to your Slurm scripts to receive an email report with CPU
efficiency information after each job finishes:

     #SBATCH --mail-type=end
     #SBATCH --mail-user=aturing@princeton.edu

You can check the efficiency of completed and actively running jobs by using the
"jobstats" command:

     https://researchcomputing.princeton.edu/support/knowledge-base/job-stats

Replying to this email will open a support ticket with CSES. Let us know if we
can be of help.

Consider Using the MIG GPUs on Della

Hi Alan,

Below are jobs that ran on an A100 GPU on Della in the past 10 days:

   JobID   NetID  GPU-Util GPU-Mem-Used CPU-Mem-Used  Hours
  45933239 aturing  10%        3 GB         3 GB       50  
  45933241 aturing   9%        3 GB         3 GB       50  
  45948433 aturing  10%        2 GB         2 GB       55  
  45948435 aturing   8%        2 GB         2 GB       82

The jobs above have low GPU utilization and use less than 10 GB of GPU
memory and less than 32 GB of CPU memory. Such jobs could be run on the MIG
GPUs. A MIG GPU is essentially a small A100 GPU with 1/7th the performance and
memory of an A100. To run on a MIG GPU, add the "partition" directive to your
Slurm script:

  #SBATCH --nodes=1
  #SBATCH --ntasks=1
  #SBATCH --cpus-per-task=1
  #SBATCH --gres=gpu:1
  #SBATCH --partition=mig

For interactive sessions use, for example:

  $ salloc --nodes=1 --ntasks=1 --time=1:00:00 --gres=gpu:1 --partition=mig

If you are using Jupyter OnDemand then set the "Custom partition" to "mig" when
creating the session.

A job can use a MIG GPU when the following constraints are satisfied:

  1. The required number of GPUs is 1
  2. The required number of CPU-cores is 1
  3. The required GPU memory is less than 10 GB
  4. The required CPU memory is less than 32 GB

All MIG jobs are automatically allocated 32 GB of CPU memory and 10 GB of GPU
memory.

By running future jobs on the MIG GPUs you will experience shorter queue
times and you will help keep A100 GPUs free for jobs that need them. Since
your jobs satisfy the above constraints, please use the MIG GPUs. For more:

  https://researchcomputing.princeton.edu/systems/della#gpus

As an alternative to MIG, you may consider trying to improve the GPU
utilization of your code. A good place to start is the mailing list of
the software you are using.

Add the following lines to your Slurm scripts to receive an email report with
memory usage information after each job finishes:

   #SBATCH --mail-type=end
   #SBATCH --mail-user=aturing@princeton.edu

Replying to this email will open a support ticket with CSES. Let us know if we
can be of help.

Excess CPU Memory Allocations

Hi Alan,

Below are 10 of your 1671 jobs that ran on Della (CPU) in the past 7 days:

        JobID     Memory-Used Memory-Allocated Percent-Used  Cores  Hours
    49271341_287      1 GB         32 GB            3%         1      24  
    49271341_288      1 GB         32 GB            3%         1      24  
    49271341_289      1 GB         32 GB            3%         1      24  
    49271341_290      1 GB         32 GB            3%         1      24  
    49271341_291      1 GB         32 GB            3%         1      24  
    49271341_292      1 GB         32 GB            3%         1      24  
    49271341_293      1 GB         32 GB            3%         1      24  
    49271341_294      1 GB         32 GB            3%         1      24  
    49271341_295      1 GB         32 GB            3%         1      24  
    49271341_296      1 GB         32 GB            3%         1      24  

It appears that you are requesting too much CPU memory for your jobs since you
are only using on average 3% of the allocated memory (for the 1671 jobs). This
has resulted in 507 TB-hours of unused memory. A TB-hour is the allocation
of 1 terabyte of memory for 1 hour.

Please request less memory by modifying the --mem-per-cpu or --mem Slurm
directive. This will lower your queue times and make the resources available
to other users. For instance, if your job requires 8 GB per node then use:

    #SBATCH --mem=10G

The value above includes an extra 20% for safety. A good target value for
Percent-Used is 80%. For more on allocating CPU memory with Slurm:

    https://researchcomputing.princeton.edu/support/knowledge-base/memory

You can check the CPU memory utilization of completed and actively running jobs
by using the "jobstats" command. For example:

    $ jobstats 49271341_1448

The command above can also be used to see suggested values for the --mem-per-cpu
and --mem Slurm directives.

Add the following lines to your Slurm scripts to receive an email report with
CPU memory utilization information after each job finishes:

    #SBATCH --mail-type=end
    #SBATCH --mail-user=aturing@princeton.edu

Consider attending an in-person Research Computing help session for assistance:

    https://researchcomputing.princeton.edu/support/help-sessions

Replying to this automated email will open a support ticket with Research
Computing. Let us know if we can be of help.

Serial Jobs Using Multiple CPU-cores

Hi Alan,

Below are jobs that ran on Della in the past 7 days:

     JobID     NetID   Partition  CPU-cores CPU-Util 100%/CPU-cores  Hours
    49229759  aturing    cpu          8       11.5%       12.5%        12  
    49229772  aturing    cpu          8       12.5%       12.5%        75  
    49234330  aturing    cpu          8       11.2%       12.5%        15  

The CPU utilization (CPU-Util) of each job above is approximately equal to
100% divided by the number of allocated CPU-cores (100%/CPU-cores). This
suggests that you may be running a code that can only use 1 CPU-core. If this is
true then allocating more than 1 CPU-core is wasteful. A good target value for
CPU utilization is 90% and above.

Please consult the documentation of the software to see if it is parallelized.
For a general overview of parallel computing:

    https://researchcomputing.princeton.edu/support/knowledge-base/parallel-code

If the code cannot run in parallel then please use the following Slurm
directives:

    #SBATCH --nodes=1
    #SBATCH --ntasks=1
    #SBATCH --cpus-per-task=1

You will experience shorter queue times by allocating only 1 CPU-core per job.
In some cases this will also allow you to run more jobs simultaneously.

If you believe that the code is capable of using more than 1 CPU-core then
consider attending an in-person Research Computing help session for assistance
with running parallel jobs:

    https://researchcomputing.princeton.edu/support/help-sessions

You can check the CPU utilization of completed and actively running jobs by using
the "jobstats" command. For example:

    $ jobstats 49229759

Add the following lines to your Slurm scripts to receive an email report with
CPU utilization information after each job finishes:

    #SBATCH --mail-type=end
    #SBATCH --mail-user=aturing@princeton.edu

Replying to this automated email will open a support ticket with Research
Computing. Let us know if we can be of help.

Jobs Using Too Many Nodes

Hi Alan,

Below are your recent jobs which appear to be using more nodes than necessary:

   JobID   Cluster  Nodes Memory-per-Node-Used Cores-per-Node  Min-Nodes-Needed
  49362222  della    4            3 GB               16               2        
  49362290  della    4            3 GB               16               2        
  49362294  della    4            3 GB               16               2        

The "Nodes" column shows the number of nodes used to run the job. The
"Min-Nodes-Needed" column shows the minimum number of nodes needed to run the
job (these values are based on the number of requested CPU-cores while taking
into account the CPU memory usage of the job).

When possible please try to minimize the number of nodes per job by using all
of the CPU-cores of each node. This will help to maximize the overall job
throughput of the cluster.

Della is composed of nodes with 32 CPU-cores and 190 GB of CPU memory. If your
job requires 64 CPU-cores then use, for example:

  #SBATCH --nodes=2
  #SBATCH --ntasks-per-node=32

For more information about the nodes on Della:

  https://researchcomputing.princeton.edu/systems/della

If you are unsure about the meanings of --nodes, --ntasks, --ntasks-per-node and
--cpus-per-task, see our Slurm webpage:

  https://researchcomputing.princeton.edu/support/knowledge-base/slurm

Additionally, see this general overview on parallel computing:

  https://researchcomputing.princeton.edu/support/knowledge-base/parallel-code

It is very important to conduct a scaling analysis to find the optimal number
of nodes and CPU-cores to use for a given parallel job. The calculation of
"Min-Nodes-Needed" above is based on your choice of the total CPU-cores which
may not be optimal. For information on conducting a scaling analysis:

  https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

See detailed information about each job by running the "jobstats" command:

  $ jobstats 49362222

Add the following lines to your Slurm scripts to receive an email report with
efficiency information after each job finishes:

  #SBATCH --mail-type=end
  #SBATCH --mail-user=aturing@princeton.edu

Consider attending an in-person Research Computing help session for assistance:

  https://researchcomputing.princeton.edu/support/help-sessions

Replying to this automated email will open a support ticket with Research
Computing. Let us know if we can be of help.

Fragmented GPU Jobs on Della

Hi Alan,

Below are jobs that ran on Della in the past 5 days that used 1 GPU per node
over multiple nodes:

     JobID    NetID   GPUs  Nodes  GPUs-per-Node  Hours State GPU-eff
    48550174 aturing   4     4          1          1     COM    86%  
    48550175 aturing   4     4          1          1     COM    86%  
    48625465 aturing   4     4          1          1     COM    96%  

The GPU nodes on Della have either 2 GPUs per node or 4 GPUs per node. For future
jobs, please try to use as few nodes as possible by allocating more GPUs per node.
This is done by modifying the --gres Slurm directive as explained here:

    https://researchcomputing.princeton.edu/support/knowledge-base/slurm#gpus
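
For example, a 4-GPU job that currently runs with 1 GPU on each of 4 nodes
could instead request all 4 GPUs on a single node (set --ntasks and
--cpus-per-task according to the needs of your code):

     #SBATCH --nodes=1
     #SBATCH --gres=gpu:4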

For more information about the Della GPU nodes:

    https://researchcomputing.princeton.edu/systems/della#gpus

When using more than 1 GPU per job, be sure to conduct a scaling analysis to find
the optimal number of GPUs:

    https://researchcomputing.princeton.edu/support/knowledge-base/scaling-analysis

Consider attending an in-person Research Computing help session for assistance:

    https://researchcomputing.princeton.edu/support/help-sessions

Replying to this automated email will open a support ticket with Research
Computing. Let us know if we can be of help.

Requesting Too Much Time for Jobs

Hi Alan,

Below are 10 of your 174 jobs that ran on Della (CPU) in the past 7 days:

       JobID      NetID  Time-Used Time-Allocated Percent-Used  CPU-Cores
    49152228_1   aturing  01:01:33    16:00:00         6%          32    
    49152228_2   aturing  01:01:31    16:00:00         6%          32    
    49152228_3   aturing  01:00:16    16:00:00         6%          32    
    49152228_4   aturing  01:01:19    16:00:00         6%          32    
    49152228_5   aturing  01:02:12    16:00:00         6%          32    
    49152228_6   aturing  00:59:56    16:00:00         6%          32    
    49152228_7   aturing  01:01:00    16:00:00         6%          32    
    49152228_8   aturing  01:01:36    16:00:00         6%          32    
    49152228_9   aturing  01:02:29    16:00:00         7%          32    
    49152228_10  aturing  01:02:25    16:00:00         7%          32    

It appears that you are requesting too much time for your jobs since you are
only using on average 7% of the allocated time (for the 174 jobs). This has
resulted in 72507 CPU-hours that you scheduled but did not use (this time was
made available to other users, however).

Please request less time by modifying the --time Slurm directive. This will
lower your queue times and allow the Slurm job scheduler to work more
effectively for all users. For instance, if your job requires 8 hours then use:

    #SBATCH --time=10:00:00

The value above includes an extra 20% for safety. This is important because jobs
that exceed the run time limit are killed. A good target for Percent-Used is 80%.

Time-Used is the wallclock time that the job needed. Time-Allocated is the
total time allocated for the job. The format is DD-HH:MM:SS where DD is days,
HH is hours, MM is minutes and SS is seconds (the DD- prefix is omitted for
jobs shorter than one day). Percent-Used is Time-Used divided by
Time-Allocated.

For more information on allocating time via Slurm:

    https://researchcomputing.princeton.edu/support/knowledge-base/slurm

Consider attending an in-person Research Computing help session for assistance:

    https://researchcomputing.princeton.edu/support/help-sessions

Replying to this automated email will open a support ticket with Research
Computing. Let us know if we can be of help.

Underutilization of the Large-Memory Nodes

Hi Alan,

Below are jobs that ran on the large-memory (datascience) nodes on Della in the 
past 7 days:

    JobID     NetID  Memory-Used Memory-Allocated Large-Memory-Needed?  Hours
   46171956  aturing    105 GB        300 GB               No             8   
   46236937  aturing    104 GB        300 GB               No             3   
   46247483  aturing     72 GB        300 GB               No             2   

The large-memory nodes should only be used for jobs that require 190 GB or more.
It appears that none of the jobs above needed one of these nodes. For future jobs,
please lower the value of the --mem-per-cpu or --mem Slurm directive so that the
overall memory requirement of each job is less than 190 GB. You should use the
smallest value possible but include an extra 20% for safety.
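
For example, the first job above used 105 GB, so requesting roughly 130 GB
(105 GB plus a 20% safety margin, rounded up) would allow it to run on a
regular node:

   #SBATCH --mem=130G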

For more information on the large-memory nodes and allocating CPU memory:

   https://researchcomputing.princeton.edu/systems/della#large_memory
   https://researchcomputing.princeton.edu/support/knowledge-base/memory

Users that continually run jobs on the large-memory nodes without justification
risk losing access to these nodes since doing so prevents others from getting
their work done.

Add the following lines to your Slurm scripts to receive an email report with
memory usage information after each job finishes:

   #SBATCH --mail-type=end
   #SBATCH --mail-user=aturing@princeton.edu

You can also see memory usage information by using the "jobstats" command:

   $ jobstats 46171956

Replying to this email will open a support ticket with CSES. Let us know if we
can be of help.

How to Create a New Alert

To create a new alert, write a new Python class that derives from the Alert base class and implement the _filter_and_add_new_fields and send_emails_to_users methods. There are several examples of this procedure in the alert directory.
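
A minimal, hypothetical sketch of such a class is shown below. The import path, the DataFrame attribute and column names, and the send_email helper are assumptions made for illustration; see the existing classes in the alert directory for the actual interface.

    # zero_cpu_utilization.py -- a hypothetical new alert
    from alert import Alert  # assumed import path for the base class


    class ZeroCpuUtilization(Alert):
        """Alert on actively running jobs with 0% CPU utilization."""

        def _filter_and_add_new_fields(self):
            # Reduce the jobs DataFrame (assumed to be self.df) to the
            # offending jobs and add any columns needed for the email.
            self.df = self.df[(self.df.state == "RUNNING") &
                              (self.df.cpu_util == 0)].copy()
            self.df["Hours"] = self.df.elapsed_seconds / 3600

        def send_emails_to_users(self):
            # Send one email per user listing that user's offending jobs.
            for user, jobs in self.df.groupby("netid"):
                body = (f"Hi {user},\n\nThe jobs below appear to have 0% "
                        f"CPU utilization:\n\n{jobs.to_string(index=False)}\n")
                self.send_email(body, f"{user}@princeton.edu")  # assumed helper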
