This repository provides the shell command ncparallel
for running arbitrary commands on
NetCDF files in parallel by dividing then combining the files along arbitrary dimensions.
We use the GFDL Flexible Modelling System mppnccombine.c
tool for combining datasets
along non-record dimensions (also see the
mppnccombine-fast.c
tool developed for
the Modular Ocean Modelling system).
Using this tool won't always result in a speedup. For relatively fast commands, the overhead of creating a bunch of temporary NetCDF files can exceed the original command execution time.
However, this tool is exceedingly useful in two situations:
- For very slow, labor-intensive processes, the parallelization will result in an obvious speedup.
- For very large files, e.g. approaching the available system RAM, your computer may
run out of memory and have to use the hard disk for "virtual" RAM. This significantly
slows down the process and can grind hard disk access to a crawl, getting in the way
of other processes. With this tool, you can use the
-p
and-n
flags to serially process the file in chunks, eliminating the memory bottleneck.
Download this repository with
cd $HOME && git clone https://github.com/lukelbd/ncparallel
and add the resulting folder to your PATH
by adding the line
export PATH="$HOME/ncparallel:$PATH"
to your shell configuration file, usually named $HOME/.bashrc
or $HOME/.bash_profile
. If the file
does not exist, you can create it, and its contents should be run every time you open up a terminal.
Example usage is as follows:
ncparallel -d=lat -p=8 -n=32 command input1.nc [input2.nc ... inputN.nc] output.nc
The first positional argument is the command written as you would type it into the
command line -- for example, ./script.sh
, 'python script.py'
, or
'ncap2 -s "math-goes-here"'
. Note that the command must be surrounded by quotes
if it consists of more than one word. The final positional arguments are the input file
name(s) and the output file name.
For input file(s) named input1.nc
, input2.nc
, etc. and an output file named
output.nc
, parallel processing is achieved as follows:
- Each input file
inputN.nc
is split up along some dimension into chunks, in this case namedinputN.0000.nc
,inputN.0001.nc
, etc. - The input command is called on the file chunks serially or in parallel (depending on
the value passed to
-p
), in this case withcommand input1.0000.nc [input2.0000.nc ... inputN.0000.nc] output.0000.nc
, etc. - The resulting output files are combined along the same dimension into the requested
output file name, in this case
output.nc
.
The optional arguments are as follows:
-d=NAME
: The dimension name along which we split the file. Default islat
.-n=NUM
: The number of file chunks to generate. Default is8
.-p=NUM
: The maximum number of parallel processes. Default is the-n
setting, but this can also be smaller.
If you do not want parallel processing and instead just want to split up the file into
more manageable chunks, simply use -p=1
. As explained above, this is very useful when
your command execution time is limited by available memory.
The flags are as follows:
-h
: Print help information.-r
: If passed, the output file comes before in the input file(s), rather than after.-f
: If passed and dimension is has unlimited length, it is changed to fixed length.-s
: If passed, silent mode is enabled.-k
: If passed, temporary files are not deleted.
Below are performance metrics for longitude-time Randel and Held (1991) spectral decompositions of a 400MB file with 64 latitudes on a high-performance server with 32 cores and 32GB of RAM.
Since the task is computationally intensive and the server is well-suited for
parallelization, ncparallel
provides an obvious speedup. In this case, the optimal
performance was reached with 16 latitude chunks and 16 parallel processes. Restricting
the number of parallel processes did not improve performance because the process was not
limited by available memory, and increasing the number of chunks yielded diminishing
performance returns.
The optimal performance for you will depend on your data, your code, and your system architecture.
Sample file: ../test.nc
Sample command: python ./spectra.py
Splitting along: lat
Number of files: 1
Parallelization: 1
real 106s user 78s sys 20s
Number of files: 2
Parallelization: 2
real 80s user 90s sys 38s
Number of files: 4
Parallelization: 4
real 53s user 107s sys 31s
Parallelization: 2
real 81s user 102s sys 31s
Number of files: 8
Parallelization: 8
real 45s user 147s sys 40s
Parallelization: 4
real 61s user 130s sys 35s
Parallelization: 2
real 94s user 123s sys 33s
Number of files: 16
Parallelization: 16
real 41s user 232s sys 52s
Parallelization: 8
real 53s user 192s sys 47s
Parallelization: 4
real 75s user 177s sys 43s
Parallelization: 2
real 127s user 164s sys 41s
Number of files: 32
Parallelization: 32
real 42s user 389s sys 78s
Parallelization: 16
real 50s user 319s sys 66s
Parallelization: 8
real 64s user 270s sys 61s
Parallelization: 4
real 106s user 259s sys 56s
Parallelization: 2
real 178s user 249s sys 56s
Below are performance metrics for eddy heat and momentum flux calculations with the same dataset used above.
This time, the optimal performance was reached with only 4 latitude chunks. While
ncparallel
did improve performance, the improvement was marginal, and increasing the
number of chunks yielded even worse performance than the performance without
ncparallel
.
This test emphasizes the fact that ncparallel
should be used only with careful
consideration of the system architecture and the task at hand.
Sample file: ../test.nc
Sample command: python ./fluxes.py
Splitting along: lat
Number of files: 1
Parallelization: 1
real 25s user 12s sys 12s
Number of files: 2
Parallelization: 2
real 20s user 18s sys 16s
Number of files: 4
Parallelization: 4
real 17s user 26s sys 17s
Parallelization: 2
real 23s user 25s sys 13s
Number of files: 8
Parallelization: 8
real 22s user 41s sys 21s
Parallelization: 4
real 24s user 35s sys 15s
Parallelization: 2
real 32s user 33s sys 14s
Number of files: 16
Parallelization: 16
real 34s user 71s sys 35s
Parallelization: 8
real 36s user 58s sys 23s
Parallelization: 4
real 39s user 53s sys 20s
Parallelization: 2
real 51s user 52s sys 19s
Number of files: 32
Parallelization: 32
real 60s user 129s sys 54s
Parallelization: 16
real 61s user 114s sys 42s
Parallelization: 8
real 65s user 95s sys 33s
Parallelization: 4
real 74s user 94s sys 30s
Parallelization: 2
real 90s user 94s sys 29s