
map file generation is slow and fails for big problems #5

Open
matt-long opened this issue Apr 10, 2019 · 16 comments
Labels
help wanted Extra attention is needed

Comments

@matt-long
Collaborator

I recently wanted to generate weights to map ETOPO1 (1-minute data) to 0.1° POP. The esmlab.regrid function failed.

I resorted to running ESMF_RegridWeightGen under MPI on 12 Cheyenne nodes.

#!/bin/bash
#PBS -N ESMF_RegridWeightGen
#PBS -q regular
#PBS -A NCGD0011
#PBS -l select=12:ncpus=36:mpiprocs=4:mem=109GB
#PBS -l walltime=06:00:00
#PBS -o logs/
#PBS -e logs/
#PBS -j oe

module purge
module load ncarenv/1.2
module load intel/17.0.1
module load netcdf/4.6.1
module load mpt/2.19

module load esmf_libs/7.1.0r
module load esmf-7.1.0r-ncdfio-mpi-O

SRC=/glade/work/mclong/esmlab-regrid/etopo1.nc
DST=/glade/work/mclong/esmlab-regrid/POP_tx0.1v3.nc

WEIGHT_FILE=/glade/work/mclong/esmlab-regrid/etopo1_to_POP_tx0.1v3_conservative.nc
METHOD=conserve

# Remove previous log files
rm -f PET*.RegridWeightGen.Log

mpirun -np 48 ESMF_RegridWeightGen --netcdf4 --ignore_unmapped -s ${SRC} -d ${DST} -m ${METHOD} -w ${WEIGHT_FILE}
@andersy005
Contributor

The esmlab.regrid function failed.

I am curious to know what kind of error it was (MemoryError, etc.), or was it just too slow?

@matt-long
Collaborator Author

Pretty sure it was a memory error, but I don't recall the specific message. I had to use several nodes to get over the memory hurdle with MPI.

@andersy005
Contributor

Per the xesmf documentation: https://xesmf.readthedocs.io/en/latest/limitations.html

xESMF currently only runs in serial. Parallel options are being investigated.

JiaweiZhuang/xESMF#3

I just found out about it.

@matt-long
Collaborator Author

We are currently using xESMF, but don't have to. ESMPy does support MPI:
http://www.earthsystemmodeling.org/esmf_releases/last_built/esmpy_doc/html/examples.html?highlight=mpi

though it's not clear how that would integrate with dask.

@andersy005
Contributor

though it's not clear how to integrate with dask.

Introducing MPI and ESMPy's complicated interface :), and integrating these with Xarray and Dask, would definitely be a conundrum.

I am curious: what is the highest priority for esmlab-regrid? Is it usability? Performance? Do we want users to be able to perform regridding with one line of code? If usability is not the highest priority, it would be worth looking into the MPI and ESMPy functionality.

@andersy005
Contributor

It looks like the Dask folks are looking into this kind of workflow: Running Dask and MPI programs together (an experiment)
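
For reference, a minimal sketch of the pattern from that post, assuming the dask_mpi package is installed and the script is launched under mpirun (the script name regrid_weights.py is hypothetical). This only bootstraps a Dask cluster inside an MPI allocation; it does not by itself parallelize ESMPy weight generation:

# Minimal sketch: start a Dask cluster inside an MPI job (dask-mpi pattern).
# Launch with e.g.: mpirun -np 48 python regrid_weights.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()       # rank 0 becomes the scheduler, rank 1 keeps running this script,
                   # all remaining ranks become workers
client = Client()  # connect to the scheduler started by initialize()

# ... submit tasks to `client` here; ESMPy weight generation itself would
# still have to run as a separate MPI program.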

@andersy005
Contributor

@matt-long, correct me if I'm wrong: this kind of parallelism is only needed when generating the weights. Once you have the weights, you don't need the ESMPy/MPI machinery anymore. Applying the weights, which is just a sparse matrix multiplication, could be done without this heavy machinery, using Scipy/Dask/Xarray, right?

andersy005 added the help wanted (Extra attention is needed) and priority: high labels on Apr 12, 2019
@matt-long
Collaborator Author

matt-long commented Apr 12, 2019

I think our focus should remain on an end-to-end workflow and usability in the near term, but keep performance through parallelism on the radar.

We could consider prototyping an MPI implementation as a standalone script, analogous to that shown here.

@andersy005, you are correct. The weight files are sparse matrices and are handled well by scipy.sparse.
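
For illustration, a minimal sketch of applying such a weight file with scipy.sparse, assuming the standard ESMF_RegridWeightGen output layout (variables S, row, col and dimensions n_a, n_b); the function and its handling of shapes are illustrative, not the esmlab-regrid implementation:

# Minimal sketch: apply an ESMF-generated weight file as a sparse matrix.
import numpy as np
import xarray as xr
from scipy import sparse

def apply_weights(weight_file, src_field):
    """Regrid a source field using ESMF_RegridWeightGen output."""
    w = xr.open_dataset(weight_file)
    n_src = w.sizes['n_a']   # number of source cells
    n_dst = w.sizes['n_b']   # number of destination cells
    # ESMF stores 1-based indices; convert to 0-based for scipy.
    matrix = sparse.coo_matrix(
        (w['S'].values, (w['row'].values - 1, w['col'].values - 1)),
        shape=(n_dst, n_src),
    ).tocsr()
    # Reshaping the result back onto the destination grid is left to the caller.
    return matrix.dot(np.asarray(src_field).ravel())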

@andersy005
Contributor

andersy005 commented Apr 12, 2019

@matt-long, was the work you were doing to generate WEIGHT_FILE=/glade/work/mclong/esmlab-regrid/etopo1_to_POP_tx0.1v3_conservative.nc connected to the content of this notebook https://gist.github.com/matt-long/87630e97dc787ffc27b33e944dcd1473 ?

@matt-long
Collaborator Author

Yes

@andersy005
Contributor

Since you are not using xesmf and ESMF/ESMPy, and the code deals with raw NumPy, I was thinking of exploring some optimization with numba and dask. Do you see any value in this, or am I missing anything before I end up going down a rabbit hole? :)

@matt-long
Collaborator Author

By "connected" I mean that that code was used in the same project. It does not compute the weight files, but rather only the grid file. It's fast enough as is, I'd say. Not a high priority for optimization.

@andersy005
Contributor

By "connected" I mean that that code was used in the same project. It does not compute the weight files, but rather only the grid file.

Good point. Does this mean that the failing component is the _gen_weights method?

def _gen_weights(self, overwrite_existing):
    """ Generate regridding weights """
    grid_file_dir = esmlab.config.get('regrid.gridfile-directory')
    weights_dir = f'{grid_file_dir}/weights'

@matt-long
Collaborator Author

Yes.

@andersy005
Contributor

Thank you for the clarification! Speaking of high priority, is there anything on your plate I can help with? :)

@JiaweiZhuang

JiaweiZhuang commented Aug 6, 2019

Not sure if related to JiaweiZhuang/xESMF#29. Parallel weight generation is very hard (if possible at all) to rewrite in a non-MPI way. But after the weights are generated, applying them to data using dask is much easier.

My plan is to clearly separate the "weight generation" and "weight application" phases:

  • The latter phase doesn't depend on ESMF/ESMPy (you don't even need to have it installed), and it is easy to rewrite with pure dask/xarray/scipy/numba/cython or whatever modern Python libraries. Chunking in lev/time can be implemented trivially (xESMF v0.2 already supports it; see the sketch after this list), and chunking in the horizontal (for extremely large grids) still seems doable, as it is just a parallel sparse matrix multiplication problem.
  • Parallelizing the first phase probably has to rely on ESMPy + MPI, as no one would want to reinvent the wheel that ESMF already provides (and has been developing for decades). Although configuring MPI is much more annoying than configuring Dask, this laborious task only needs to be done once; the weights can be reused and even shared between platforms and users. Public clouds actually have decent support for MPI (think of all the cloud-HPC business), so in principle everyone should be able to generate giant regridding weight files, even without access to NCAR supercomputers.
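
As an illustration of the lev/time chunking mentioned above, a minimal sketch with xarray.apply_ufunc, assuming weights is the scipy.sparse matrix built from a weight file and that the horizontal dimensions are named lat/lon (all names here are illustrative):

# Minimal sketch: apply precomputed weights slice-by-slice over time/lev chunks.
import numpy as np
import xarray as xr

def _regrid_slice(field_2d, weights, dst_shape):
    # field_2d: a single horizontal (lat, lon) slice as a NumPy array
    return weights.dot(field_2d.ravel()).reshape(dst_shape)

def regrid(da_src, weights, dst_shape):
    """Apply a sparse weight matrix to a time/lev-chunked DataArray."""
    return xr.apply_ufunc(
        _regrid_slice,
        da_src,
        kwargs={'weights': weights, 'dst_shape': dst_shape},
        input_core_dims=[['lat', 'lon']],   # horizontal dims must not be chunked
        output_core_dims=[['y_dst', 'x_dst']],
        dask_gufunc_kwargs={'output_sizes': {'y_dst': dst_shape[0],
                                             'x_dst': dst_shape[1]}},
        dask='parallelized',
        vectorize=True,                     # loop over the remaining (time/lev) dims
        output_dtypes=[da_src.dtype],
    )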

Such a separation will be much clearer after resolving JiaweiZhuang/xESMF#11. My plan is to have a "mini-xesmf" installation that doesn't depend on ESMPy -- it will just construct a complete regridder from existing weight files, generated by an ESMPy program running elsewhere (potentially a huge MPI run, potentially with an xesmf wrapper for better usability).
