gpudash

The gpudash command displays a text-based GPU utilization dashboard (no graphics) for the last hour:

[screenshot: gpudash example output]

The dashboard can be generated for a specific user:

[screenshot: gpudash output for a single user]

The gpudash command is part of the Jobstats platform. Here is the help menu:

usage: gpudash [-h] [-u NETID] [-n] [-c]

GPU utilization dashboard for the last hour

optional arguments:
  -h, --help       show this help message and exit
  -u NETID         create dashboard for a single user
  -n, --no-legend  flag to hide the legend

Utilization is the percentage of time during a sampling window (< 1 second) that
a kernel was running on the GPU. The format of each entry in the dashboard is
username:utilization (e.g., aturing:90). Utilization varies between 0 and 100%.

Examples:

  Show dashboard for all users:
    $ gpudash

  Show dashboard for the user aturing:
    $ gpudash -u aturing

  Show dashboard for all users without displaying legend:
    $ gpudash -n

Getting Started

The gpudash command builds on the Jobstats platform. It requires Python 3.6+ and version 1.17+ of the Python blessed package.

1. Create a script to pull data from Prometheus

The query_prometheus.sh script below makes three queries to the Prometheus server, removes old files, and then calls the extract.py Python script, which extracts the data and writes column files. The column files are read by gpudash.

$ cat query_prometheus.sh
#!/bin/bash

printf -v SECS '%(%s)T' -1
DATA='/path/to/gpudash/data'
PROM_QUERY='http://vigilant2.sn17:8480/api/v1/query?'

curl -s ${PROM_QUERY}query=nvidia_gpu_duty_cycle > ${DATA}/util.${SECS}
curl -s ${PROM_QUERY}query=nvidia_gpu_jobUid     > ${DATA}/uid.${SECS}
curl -s ${PROM_QUERY}query=nvidia_gpu_jobId      > ${DATA}/jobid.${SECS}

# remove any files that are more than 70 minutes old
find ${DATA} -type f -mmin +70 -exec rm -f {} \;

# extract the data from the Prometheus queries, convert UIDs to usernames, write column files
python3 /path/to/extract.py

Be sure to customize nodelist in extract.py for your system. The Bash script above will generate column files with the following format:

$ head -n 5 column.1
{"timestamp": "1678144802", "host": "comp-g1", "index": "0", "user": "ft130", "util": "92", "jobid": "46034275"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "1", "user": "ft130", "util": "99", "jobid": "46015684"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "2", "user": "ft130", "util": "99", "jobid": "46015684"}
{"timestamp": "1678144802", "host": "comp-g1", "index": "3", "user": "kt415", "util": "44", "jobid": "46048505"}
{"timestamp": "1678144802", "host": "comp-g2", "index": "0", "user": "kt415", "util": "82", "jobid": "46015407"}

The column files are read by gpudash to generate the dashboard.
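
The actual extract.py is provided by the repository; the sketch below is only an illustration of what such a script might do. It assumes the exporter reports the host in the instance label and the GPU index in the minor_number label, and that column files are numbered column.1 (oldest) onward; the paths, label names, and nodelist are placeholders.

#!/usr/bin/env python3
# Illustrative sketch of extract.py -- not the actual script from the repo.
import glob
import json
import os

DATA = "/path/to/gpudash/data"
nodelist = ["comp-g1", "comp-g2"]  # customize for the given system

# UID-to-username mapping produced in step 2
uid2user = {}
with open("/path/to/uid2user.csv") as f:
    for line in f:
        uid, user = line.strip().split(",")
        uid2user[uid] = user

def parse(path):
    """Return {(host, gpu_index): value} from a saved Prometheus query."""
    with open(path) as f:
        result = json.load(f)["data"]["result"]
    values = {}
    for item in result:
        host = item["metric"]["instance"].split(":")[0]   # e.g. comp-g1:9445 -> comp-g1
        index = item["metric"]["minor_number"]            # assumed GPU index label
        values[(host, index)] = item["value"][1]          # value is [timestamp, "string"]
    return values

# one column file per sampling time, oldest first
util_files = sorted(glob.glob(os.path.join(DATA, "util.*")),
                    key=lambda p: int(p.rsplit(".", 1)[-1]))
for col, util_file in enumerate(util_files, start=1):
    secs = util_file.rsplit(".", 1)[-1]
    util = parse(util_file)
    uid = parse(os.path.join(DATA, f"uid.{secs}"))
    jobid = parse(os.path.join(DATA, f"jobid.{secs}"))
    with open(os.path.join(DATA, f"column.{col}"), "w") as f:
        for (host, index), u in sorted(util.items()):
            if host.split(".")[0] not in nodelist:
                continue
            user = uid2user.get(uid.get((host, index), "").split(".")[0], "unknown")
            entry = {"timestamp": secs,
                     "host": host,
                     "index": index,
                     "user": user,
                     "util": str(int(float(u))),
                     "jobid": jobid.get((host, index), "").split(".")[0]}
            f.write(json.dumps(entry) + "\n")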

2. Generate a CSV file containing UIDs and the corresponding usernames

Here is a sample of the file:

$ head -n 5 uid2user.csv
153441,ft130
150919,lc455
224256,sh235
329819,bb274
347117,kt415

The above file can be generated by running the following command:

$ getent passwd | awk -F":" '{print $3","$1}' > /path/to/uid2user.csv

3. Create two entries in crontab

0,10,20,30,40,50 * * * * /path/to/query_prometheus.sh > /dev/null 2>&1
0 6 * * 1 getent passwd | awk -F":" '{print $3","$1}' > /path/to/uid2user.csv 2> /dev/null

The first entry calls the script that queries the Prometheus server every 10 minutes. The second entry regenerates the CSV file of UIDs and usernames every Monday at 6 AM.

4. Download gpudash

gpudash is pure Python. Its only dependency is the blessed Python package. On Ubuntu Linux, this can be installed with:

$ apt-get install python3-blessed

Then put gpudash in a location like /usr/local/bin:

$ cd /usr/local/bin
$ wget https://raw.githubusercontent.com/PrincetonUniversity/gpudash/main/gpudash
$ chmod 755 gpudash

Next, edit gpudash: replace cluster1 with the beginning of the login node's name, modify all_nodes to generate the list of compute node names, and set SBASE to the path containing the column files produced by extract.py. Also make sure that the shebang line at the top points to python3.
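
The exact code around these variables differs between versions, but as a rough illustration (hypothetical values only), the customizations might look like:

#!/usr/bin/env python3
# top of gpudash after editing -- illustrative values only

# beginning of the login node's hostname (placeholder)
cluster1 = "adroit"

# all_nodes must evaluate to the list of GPU compute node names to display
all_nodes = ["comp-g%d" % i for i in range(1, 11)]

# directory holding the column files written by extract.py
SBASE = "/path/to/gpudash/data"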

With these steps in place, you can use the gpudash command:

$ gpudash

About the Design

The choice was made to enter the node names in the script (i.e., all_nodes) rather than reading the Prometheus configuration file or using the output of the sinfo command. The code looks for data on each of the specified nodes and only updates the values for a given node if data is found. Calling sinfo has the disadvantage that no node names are available if the command fails, and one would also have to specify partitions. Reading the Prometheus server configuration file(s) is reasonable, but changes would be required if Prometheus were replaced with an alternative.
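
As a rough, self-contained illustration of that update rule (not the actual gpudash code), a cell keeps its placeholder unless an entry for that node appears in a column file:

# every specified node starts with an empty cell; only overwrite when data exists
import json

all_nodes = ["comp-g1", "comp-g2", "comp-g3"]        # hypothetical node names
cells = {node: "----" for node in all_nodes}

with open("/path/to/gpudash/data/column.1") as f:    # one column file from extract.py
    for line in f:
        entry = json.loads(line)
        if entry["host"] in cells:
            cells[entry["host"]] = f"{entry['user']}:{entry['util']}"

print(cells)   # nodes without data keep their placeholder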

Troubleshooting

The two most common problems are (1) setting the correct paths throughout the procedure and (2) installing the Python blessed package.

Getting Help

Please post an issue to this repo. Extensions to the code are welcome via pull requests.
