Invoking xesmf with mpirun #79

Closed

nichannah opened this issue Dec 14, 2019 · 3 comments

nichannah commented Dec 14, 2019

I am using mpirun to run my Python program across multiple nodes in a cluster. Each instance of the program uses MPI to determine its own rank and the number of processes, but nothing else. Each program also uses xESMF to do some regridding.
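For reference, the launch pattern looks roughly like this (a minimal sketch assuming mpi4py; the task list and file names are placeholders, not my actual code):

```python
from mpi4py import MPI

# Each mpirun-launched instance only queries its own rank and the total
# number of processes; there is no further inter-process communication.
comm = MPI.COMM_WORLD
rank = comm.Get_rank()
size = comm.Get_size()

# Hypothetical round-robin split of independent regridding tasks.
all_tasks = [f"input_{i:03d}.nc" for i in range(100)]  # placeholder file names
my_tasks = all_tasks[rank::size]
print(f"rank {rank} of {size} handles {len(my_tasks)} files")
```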

The problem is that the underlying ESMF library then tries to decompose the regridding task across the ranks. xESMF does not handle this and fails with an error.

Since xESMF does not support parallel regridding (yet) - is there a way to ensure that the underlying library does not try to do this?

Any thoughts or work-around ideas would be much appreciated.

JiaweiZhuang (Owner) commented Dec 14, 2019

Each instance of the program uses MPI to determine its own rank and the number of processes but nothing else.

So you don't need any inter-process communication using mpi4py? In that case I would suggest not using mpirun to launch your Python script, but using a scheduler feature like Slurm Job Array Support and getting your job ID via os.environ['SLURM_ARRAY_TASK_ID'].
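A minimal sketch of that pattern (the file naming and array range are just for illustration):

```python
import os

# SLURM sets SLURM_ARRAY_TASK_ID for each element of a job array,
# e.g. one submitted with: sbatch --array=0-99 run_regrid.sh
task_id = int(os.environ["SLURM_ARRAY_TASK_ID"])

# Hypothetical mapping from the array index to an independent piece of work.
input_file = f"input_{task_id:03d}.nc"
print(f"array task {task_id} will regrid {input_file}")
```

Each array task then runs as an ordinary serial Python process, so ESMF never sees multiple MPI ranks.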

Since xESMF does not support parallel regridding (yet)

Parallel weight construction is not supported yet, but the weights can be applied in parallel via Dask. See a long discussion at #3.

Does your use case actually need MPI-style parallelization? If the data can be chunked in vertical/time dimension, Dask should be sufficient. Any reason for having to chunk in the horizontal?
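For completeness, a rough sketch of the Dask route (file names, variable name, chunk size, and method are assumptions for illustration):

```python
import xarray as xr
import xesmf as xe

# Chunk along time so the serially built weights are applied chunk-by-chunk.
ds_in = xr.open_dataset("input.nc", chunks={"time": 10})
ds_out = xr.open_dataset("target_grid.nc")  # defines the destination lat/lon grid

regridder = xe.Regridder(ds_in, ds_out, "bilinear")  # weight construction is serial
out = regridder(ds_in["air"])  # lazy on Dask-backed input; applied per chunk
out = out.compute()
```

No mpirun is needed here; the Dask scheduler (threaded or distributed) provides the parallelism.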

nichannah (Author) commented Dec 14, 2019

@JiaweiZhuang thank you for the very quick reply and useful suggestions.

Yes that is correct, the only reason to use MPI-style parallelization is to launch across multiple nodes.

The SLURM suggestion is a good one, and this is what I'm doing on a cluster that has it installed. However, I also need to get this running on a PBS cluster, which uses MPI for task launching.

I have tried disconnecting the MPI communicator (comm.Disconnect()) after start-up but this seems to crash ESMF with a seg fault.

nichannah (Author) commented

OK, I think you've answered this. The best approach is probably to use job arrays. An alternative might be to use ESMF compiled without MPI support.

aulemahal pushed a commit to Ouranosinc/xESMF that referenced this issue May 18, 2021:
…ble-vars (Allow non-regriddable vars in datasets)