
map file generation is slow and fails for big problems #5

Open
matt-long opened this issue Apr 10, 2019 · 16 comments
Labels
help wanted Extra attention is needed

Comments

@matt-long
Collaborator

I recently wanted to generate weights to map ETOPO1 (1-minute data) to 0.1° POP. The esmlab.regrid function failed.

I resorted to running ESMF_RegridWeightGen under MPI on 12 Cheyenne nodes.

#!/bin/bash
#PBS -N ESMF_RegridWeightGen
#PBS -q regular
#PBS -A NCGD0011
#PBS -l select=12:ncpus=36:mpiprocs=4:mem=109GB
#PBS -l walltime=06:00:00
#PBS -o logs/
#PBS -e logs/
#PBS -j oe

module purge
module load ncarenv/1.2
module load intel/17.0.1
module load netcdf/4.6.1
module load mpt/2.19

module load esmf_libs/7.1.0r
module load esmf-7.1.0r-ncdfio-mpi-O

SRC=/glade/work/mclong/esmlab-regrid/etopo1.nc
DST=/glade/work/mclong/esmlab-regrid/POP_tx0.1v3.nc

WEIGHT_FILE=/glade/work/mclong/esmlab-regrid/etopo1_to_POP_tx0.1v3_conservative.nc
METHOD=conserve

# Remove previous log files
rm -f PET*.RegridWeightGen.Log

mpirun -np 48 ESMF_RegridWeightGen --netcdf4 --ignore_unmapped -s ${SRC} -d ${DST} -m ${METHOD} -w ${WEIGHT_FILE}
@andersy005
Contributor

The esmlab.regrid function failed.

I am curious to know what kind of error it was (MemoryError, etc.), or was it just too slow?

@matt-long
Collaborator Author

Pretty sure it was a memory error, but I don't recall the specific message. I had to use several nodes to get over the memory hurdle with MPI.

@andersy005
Contributor

Per the xesmf documentation: https://xesmf.readthedocs.io/en/latest/limitations.html

xESMF currently only runs in serial. Parallel options are being investigated.

JiaweiZhuang/xESMF#3

I just found out about it.

@matt-long
Collaborator Author

We are currently using xESMF, but don't have to. ESMPy does support MPI:
http://www.earthsystemmodeling.org/esmf_releases/last_built/esmpy_doc/html/examples.html?highlight=mpi

though it's not clear how that would integrate with dask.

@andersy005
Contributor

though it's not clear how to integrate with dask.

Introducing MPI and ESMPy's complicated interface :), and integrating these with Xarray and Dask, would definitely be a conundrum.

I am curious: what is the highest priority for esmlab-regrid? Is it usability? Performance? Do we want users to be able to perform regridding with one line of code? If usability is not the highest priority, it would be worth looking into the MPI and ESMPy functionality.

@andersy005
Contributor

It looks like the Dask folks are looking into this kind of workflow: Running Dask and MPI programs together (an experiment)
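
For reference, a minimal sketch of the pattern from that post, assuming the dask_mpi package is installed and the script is launched under mpirun (the script name regrid_weights.py is hypothetical). This only bootstraps a Dask cluster inside an MPI allocation; it does not by itself parallelize ESMPy weight generation:

# Minimal sketch: start a Dask cluster inside an MPI job (dask-mpi pattern).
# Launch with e.g.: mpirun -np 48 python regrid_weights.py
from dask_mpi import initialize
from dask.distributed import Client

initialize()       # rank 0 becomes the scheduler, rank 1 keeps running this script,
                   # all remaining ranks become workers
client = Client()  # connect to the scheduler started by initialize()

# ... submit tasks to `client` here; ESMPy weight generation itself would
# still have to run as a separate MPI program.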

@andersy005
Contributor

@matt-long, correct me if I'm wrong: this kind of parallelism is only needed when generating the weights. Once you have the weights, you don't need the ESMPy/MPI machinery anymore. Applying the weights, which is just a sparse matrix multiplication, could be done without this heavy machinery, using Scipy/Dask/Xarray, right?

andersy005 added the help wanted (Extra attention is needed) and priority: high labels on Apr 12, 2019
@matt-long
Collaborator Author

matt-long commented Apr 12, 2019

I think our focus should remain on an end-to-end workflow and usability in the near term, but keep performance through parallelism on the radar.

We could consider prototyping an MPI implementation as a standalone script, analogous to that shown here.

@andersy005, you are correct. The weight files are sparse matrices and are handled well by scipy.sparse.
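
For illustration, a minimal sketch of applying such a weight file with scipy.sparse, assuming the standard ESMF_RegridWeightGen output layout (variables S, row, col and dimensions n_a, n_b); the function and its handling of shapes are illustrative, not the esmlab-regrid implementation:

# Minimal sketch: apply an ESMF-generated weight file as a sparse matrix.
import numpy as np
import xarray as xr
from scipy import sparse

def apply_weights(weight_file, src_field):
    """Regrid a source field using ESMF_RegridWeightGen output."""
    w = xr.open_dataset(weight_file)
    n_src = w.sizes['n_a']   # number of source cells
    n_dst = w.sizes['n_b']   # number of destination cells
    # ESMF stores 1-based indices; convert to 0-based for scipy.
    matrix = sparse.coo_matrix(
        (w['S'].values, (w['row'].values - 1, w['col'].values - 1)),
        shape=(n_dst, n_src),
    ).tocsr()
    # Reshaping the result back onto the destination grid is left to the caller.
    return matrix.dot(np.asarray(src_field).ravel())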

@andersy005
Contributor

andersy005 commented Apr 12, 2019

@matt-long, was the work you were doing to generate WEIGHT_FILE=/glade/work/mclong/esmlab-regrid/etopo1_to_POP_tx0.1v3_conservative.nc connected to the content of this notebook https://gist.github.com/matt-long/87630e97dc787ffc27b33e944dcd1473 ?

@matt-long
Collaborator Author

Yes

@andersy005
Contributor

Since you are not using xesmf and ESMF/ESMPy, and the code deals with raw NumPy, I was thinking of exploring some optimization with numba and dask. Do you see any value in this, or am I missing anything before I end up going down a rabbit hole? :)

@matt-long
Collaborator Author

By "connected" I mean that that code was used in the same project. It does not compute the weight files, but rather only the grid file. It's fast enough as is, I'd say. Not a high priority for optimization.

@andersy005
Contributor

By "connected" I mean that that code was used in the same project. It does not compute the weight files, but rather only the grid file.

Good point. Does this mean that the failing component is the _gen_weights method?

def _gen_weights(self, overwrite_existing):
    """ Generate regridding weights """
    grid_file_dir = esmlab.config.get('regrid.gridfile-directory')
    weights_dir = f'{grid_file_dir}/weights'

@matt-long
Collaborator Author

Yes.

@andersy005
Contributor

Thank you for the clarification! Speaking of high priority, is there anything on your plate I can help with? :)

@JiaweiZhuang

JiaweiZhuang commented Aug 6, 2019

Not sure if related to JiaweiZhuang/xESMF#29. Parallel weight generation is very hard (if possible at all) to rewrite in a non-MPI way. But after the weights are generated, applying them to data using dask is much easier.

My plan is to clearly separate the "weight generation" and "weight application" phases:

  • The latter phase doesn't depend on ESMF/ESMPy (you don't even need to have it installed), and it is easy to rewrite with pure dask/xarray/scipy/numba/cython or whatever modern Python libraries. Chunking in lev/time can be implemented trivially (xESMF v0.2 already supports it; see the sketch after this list), and chunking in the horizontal (for extremely large grids) still seems doable, as it is just a parallel sparse matrix multiplication problem.
  • Parallelizing the first phase probably has to rely on ESMPy + MPI, as no one would want to reinvent the wheel that ESMF already provides (and has been developing for decades). Although configuring MPI is much more annoying than configuring Dask, this laborious task only needs to be done once; the weights can be reused and even shared between platforms and users. Public clouds actually have decent support for MPI (think of all the cloud-HPC business), so in principle everyone should be able to generate giant regridding weight files, even without access to NCAR supercomputers.
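
As an illustration of the lev/time chunking mentioned above, a minimal sketch with xarray.apply_ufunc, assuming weights is the scipy.sparse matrix built from a weight file and that the horizontal dimensions are named lat/lon (all names here are illustrative):

# Minimal sketch: apply precomputed weights slice-by-slice over time/lev chunks.
import numpy as np
import xarray as xr

def _regrid_slice(field_2d, weights, dst_shape):
    # field_2d: a single horizontal (lat, lon) slice as a NumPy array
    return weights.dot(field_2d.ravel()).reshape(dst_shape)

def regrid(da_src, weights, dst_shape):
    """Apply a sparse weight matrix to a time/lev-chunked DataArray."""
    return xr.apply_ufunc(
        _regrid_slice,
        da_src,
        kwargs={'weights': weights, 'dst_shape': dst_shape},
        input_core_dims=[['lat', 'lon']],   # horizontal dims must not be chunked
        output_core_dims=[['y_dst', 'x_dst']],
        dask_gufunc_kwargs={'output_sizes': {'y_dst': dst_shape[0],
                                             'x_dst': dst_shape[1]}},
        dask='parallelized',
        vectorize=True,                     # loop over the remaining (time/lev) dims
        output_dtypes=[da_src.dtype],
    )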

Such a separation will be much clearer after resolving JiaweiZhuang/xESMF#11. My plan is to have a "mini-xesmf" installation that doesn't depend on ESMPy -- it will just construct a complete regridder from existing weight files, generated by an ESMPy program running elsewhere (potentially a huge MPI run, potentially with an xesmf wrapper for better usability).
