Parallel regridding support #3
I generally agree with your summary. I added a few comments. It would also be good to get @rokuingh's thoughts.
Yes, mostly. The only caveat is grid size. Very large grids can be prohibitive for serial operations. Dealing with big data is the responsibility of the user (provided the software works in the first place), but excluding the ability to generate weights in parallel using horizontal decompositions would remove important functionality, especially since grid resolutions are, for the most part, increasing.
Again, large grids. There would also be the case of time-varying grid/mesh spatial structures. That is a very special case though! There is also the parallelization scheme used by ESMF's weight application (see below).
Generating the weights and creating the ESMF route handle are the major bottlenecks. Which takes longer depends on the regridding method and source/destination structures. The route handle contains the sparse matrix application optimizations. Hence, once the route handle is created, the weight application is very fast. Note that SMM optimizations must use horizontal decompositions (correct me if I'm wrong @rokuingh). It is entirely possible to avoid ESMF's SMM routines and write your own that interprets a weight file.

It is worth noting that ESMPy does not expose Python bindings for the route handle stuff yet. @rokuingh is very close to to-weight-file/from-weight-file operations...

For large grids, memory usage is also an issue. We've developed a workflow that uses spatial subsetting (to ensure appropriate source/destination overlap) and ESMF to tile weight and SMM operations. We'd like to bring this into ESMF proper at some point.

Anyway, you definitely have a good handle on things. Let us know if you have any questions.
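For anyone wanting to try the "interpret a weight file yourself" route, here is a minimal sketch, assuming an ESMF/SCRIP-style netCDF weight file that stores the sparse factors as `col`, `row` (1-based indices) and `S` (weight values); the function names and size arguments are just for illustration:

```python
import scipy.sparse as sps
import xarray as xr

def read_weights(filename, n_src, n_dst):
    """Build a scipy CSR matrix from an ESMF-style weight file (grid sizes assumed known)."""
    ds = xr.open_dataset(filename)
    col = ds["col"].values - 1   # source cell indices, converted to 0-based
    row = ds["row"].values - 1   # destination cell indices, converted to 0-based
    s = ds["S"].values           # weight values
    return sps.coo_matrix((s, (row, col)), shape=(n_dst, n_src)).tocsr()

def apply_weights(weights, src_field_2d, dst_shape):
    """Apply the weights to one horizontal slice: flatten, sparse dot, reshape."""
    return (weights @ src_field_2d.ravel()).reshape(dst_shape)
```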
I generally agree with @bekozi. ESMPy does use horizontal decompositions, but we have considered bringing the customizable decomposition capability of ESMF out to the Python layer. However, this still would not allow the case mentioned by @JiaweiZhuang of distributing over non-gridded dimensions like time and levels, unless the grid was created with enough dimensions to incorporate the non-gridded dimensions, but that would probably turn into a big mess. The ability to write weights to file and read weights from file in ESMPy is almost finished. I requested time for a development sprint on this from my supervisor yesterday to finish it off, will let you know how that goes. @bekozi has done a fair amount of profiling of the ESMF SMM code, is that stuff publicly available?
Great. That's currently the most important part for me.
You are right, but it also depends on actual use cases. For me, the highest priority for xESMF is usability. I want users to be able to perform regridding with one line of code, and I don't specifically target very large grids. If there's an easy and elegant way to gain speed-up (say, dask) I will be happy to go for it, but I don't want users to have to deal with MPI.

Also, I want xESMF to provide regridding capability for my cubedsphere package and the models it serves (FV3, GEOS-Chem...). The horizontal resolution should be around 50 km and there are always many vertical levels and variables, so parallelizing over extra dimensions with dask seems a reasonable way. Would like to hear from @rabernat and @jhamman about your use cases, as you are particularly interested in regridding.
@JiaweiZhuang - first let me say that the package you've put together looks pretty sweet. The conundrum you're discussing here is why I stayed away from ESMF (and other existing low-level remapping libraries). I basically could not see a clear path toward integration with the xarray ecosystem (dask). As I'm thinking about it more, I have one idea for how you could make this work using a hybrid approach.
The transform step is fairly straightforward. As for your actual question, my most common use case is remapping 3D and 4D data between well-defined grids (regular, equal-area, etc.) using either conservative or linear remapping schemes. So, in theory, ESMF includes everything I need.
@jhamman Thanks!
I totally agree with this kind of interoperability. The currently popular remapping packages are ESMF and Tempest. They both output regridding weights in SCRIP format (SCRIP itself was retired). I tried to write a Python interface for Tempest (written in C++) but then realized that ESMPy allows me to do everything in Python... I think the representations of regridding weights are pretty consistent among packages.
Glad to see that! Just to point out that the first-order conservative scheme is also a linear mapping. A scheme is linear as long as the weight matrix is independent of the data field, which means we can pre-calculate the weights and apply them to any data. The only non-linear remapping I know of is high-order conservative remapping with monotonic constraints, used in semi-Lagrangian advection schemes.
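To make the linearity statement concrete, here is a tiny numerical check with a random sparse matrix standing in for a real weight matrix (purely illustrative):

```python
import numpy as np
import scipy.sparse as sps

# Random sparse "weight matrix" W and two arbitrary source fields x, y.
W = sps.random(50, 80, density=0.05, format="csr")
x, y = np.random.rand(80), np.random.rand(80)
a, b = 2.0, -3.0

# Linearity: regridding a linear combination equals the combination of regridded fields,
# so the same pre-computed W can be applied to any data field.
assert np.allclose(W @ (a * x + b * y), a * (W @ x) + b * (W @ y))
```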
Yes, it is very sparse (see #2). The full dense matrix would have size N_s*N_d, where N_s and N_d are the numbers of grid points in the source and destination grids, but the number of non-zero elements should be on the order of max(N_s, N_d). Good to know that dask already has sparse matrix support. It seems that calling sparse.tensordot on dask arrays means executing scipy.sparse.csr_matrix.dot in parallel... I would try it out.
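A quick back-of-the-envelope check of why the sparse representation matters, using made-up grid sizes and assuming a few non-zero weights per destination cell:

```python
N_s, N_d = 192 * 288, 180 * 360        # hypothetical source / destination grid sizes
dense_bytes = N_s * N_d * 8            # full dense double-precision weight matrix
nnz = 4 * max(N_s, N_d)                # ~a few non-zero weights per destination cell
coo_bytes = nnz * (8 + 4 + 4)          # value (float64) + row and col indices (int32)

print(f"dense: {dense_bytes / 1e9:.1f} GB, sparse COO: {coo_bytes / 1e6:.1f} MB")
```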
My first impression is that creating a large sparse matrix may not be the right approach here. Instead, we might consider how each block in the input array affects each block in the output array, sum up those effects, and build up a graph of tasks manually. My guess is that this requires a non-trivial amount of blackboard/notepad time to get right.
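A toy illustration of that idea (not xESMF code): build the task graph by hand with dask.delayed, where each output block is the sum of contributions from every input block, and small dense arrays stand in for the weight sub-matrices.

```python
import numpy as np
import dask

@dask.delayed
def contribution(w_block, x_block):
    # effect of one input block on one output block
    return w_block @ x_block

@dask.delayed
def combine(*parts):
    # sum the contributions that land in the same output block
    return sum(parts)

# Toy 2x2 block structure: W[i][j] maps input block j onto output block i.
rng = np.random.default_rng(0)
W = [[rng.random((5, 5)) for _ in range(2)] for _ in range(2)]
inputs = [rng.random(5) for _ in range(2)]

out_blocks = [combine(*(contribution(W[i][j], inputs[j]) for j in range(2)))
              for i in range(2)]
result = dask.compute(*out_blocks)  # tuple of two length-5 output blocks
```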
@pelson has also asked about this topic before.
If my intuition above is correct, then my first instinct would be to consider the action of regridding on a small 10x10 array cut into 5x5 chunks, and work that out by hand.
@mrocklin Thanks for the suggestion! This block-by-block approach looks more like ESMF/ESMPy's original MPI implementation. It is very hard to replicate that in Python so I'll first try a simpler approach.
@cpelley has a huge amount of experience with chunking regridding operations (esp. huge input to moderate output).
I'm curious, what makes this hard in Python?
I have developed a generalised decomposition framework for the project ANTS (a toolkit for generating ancillaries for the unified model here at the Met Office). This works by fetching 'pieces' of the target and corresponding overlapping source 'pieces' and sending them, along with an operation (whether regrid or otherwise), to any number of processes. Our method abstracts away the decomposition technology utilised; in doing so, it allows users to gain the benefits of IPython parallel for multi-machine parallelism, simple single-machine parallelism from multiprocessing, or just serial decomposition, without any change to their code and with a simple one-line API. More information can be found here:

The reasons for choosing horizontal decomposition are that performing the overlap calculations is incredibly fast and broadcasting over the other dimensions is incredibly fast. Doing so provides the benefit of hardware utilisation along with control of memory usage, while maintaining the speed benefits of numpy broadcasting.

I hope to spend some time looking at the possibility of Dask usage for us in future. A few years ago I deemed it wouldn't help us due to the complex runtime relationship between source and target in the decomposition. Dask seems to have moved along quite a bit now so I suspect the situation could easily have changed...

This works well for us and our users and is a Python-based framework, made possible by utilising powerful libraries like iris, cartopy, numpy, shapely etc.
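A toy sketch of the "fetch the overlapping source piece for each target piece" step (not ANTS code), for a regular 1-D coordinate and with a hypothetical halo argument to cover interpolation stencils:

```python
import numpy as np

def overlapping_slice(src_coord, tgt_min, tgt_max, halo=1):
    """Indices of the source columns that overlap a target tile, plus a small halo."""
    idx = np.nonzero((src_coord >= tgt_min) & (src_coord <= tgt_max))[0]
    lo, hi = idx[0], idx[-1] + 1
    return slice(max(lo - halo, 0), min(hi + halo, src_coord.size))

src_lon = np.arange(0.0, 360.0, 0.5)
piece = overlapping_slice(src_lon, 100.0, 120.0)  # source columns needed for this target tile
```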
The link you provide seems to require a Met Office login. Do you have anything that is publicly available?
Ah OK, here you go: [ants_decomposition.pdf](https://github.com/JiaweiZhuang/xESMF/files/1437724/ants_decomposition.pdf). Hope this is useful in some way. Cheers
Is this open source? If so do you have a link to source code?
I'm afraid not :(
Just realized that it shouldn't be that hard 😅 I was thinking about domain decomposition and MPI communication. But here, since the weight matrix is already generated, the problem is only about parallelizing a sparse matrix multiplication (with broadcasting). See #6 for the serial implementation. The choice is either chunking over extra/broadcasting dimensions (time and level) or chunking over the horizontal field. A typical sparse matrix structure is shown below. By chunking over the "output field" dimension we should be able to avoid writing to the same output grid point at the same time.
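A minimal sketch of the first option, chunking only over an extra dimension and applying the sparse weights to each horizontal slice independently (a random SciPy sparse matrix and made-up sizes stand in for the real weights):

```python
import dask.array as da
import scipy.sparse as sps

N_s, N_d, nlev = 10_000, 8_000, 40                               # hypothetical sizes
weights = sps.random(N_d, N_s, density=1.0 / N_s, format="csr")  # stand-in weight matrix

# Source data flattened to (level, horizontal point), chunked over levels only.
field = da.random.random((nlev, N_s), chunks=(1, N_s))

def regrid_block(block):
    # block has shape (levels_in_chunk, N_s); apply the sparse dot to each level
    return weights.dot(block.T).T

out = field.map_blocks(regrid_block, chunks=(1, N_d), dtype=field.dtype)
result = out.compute()                                           # shape (nlev, N_d)
```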
Checking in here. Does the latest release manage parallelism across non-grid dimensions like time and level?
Assuming the answer above is "yes, this works for simple cases on dask arrays" (which may not be true), have you tried this out on a distributed system? One might try this by setting up a single-machine dask.distributed cluster with the following lines of Python:

```python
from dask.distributed import Client

client = Client()  # starts a few single-threaded worker processes

# normal xarray with dask code
```
The current version (v0.1.1) only runs in serial, although extra-dimension parallelism should be quite easy to add (literally just a parallel sparse matrix multiplication).
Thanks for the suggestion. I haven't needed to regrid any data that is large enough to require a distributed cluster. If there are any specific research needs, I am willing to try.
I think that XArray will handle things for you if you use methods like `apply_ufunc`.
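A minimal sketch of that route (not the eventual xESMF API), assuming the horizontal grid has been flattened into a single hypothetical dimension called "points" and that `weights` is a scipy.sparse CSR matrix of shape (n_out, n_in); recent xarray versions take the output sizes via `dask_gufunc_kwargs`:

```python
import xarray as xr

def regrid_with_apply_ufunc(dr_in, weights, n_out):
    def kernel(data):
        # data arrives with the core ("points") dimension last: shape (..., n_in)
        extra_shape = data.shape[:-1]
        flat = data.reshape(-1, data.shape[-1])
        return weights.dot(flat.T).T.reshape(extra_shape + (n_out,))

    return xr.apply_ufunc(
        kernel,
        dr_in,
        input_core_dims=[["points"]],
        output_core_dims=[["points_out"]],
        dask="parallelized",              # let dask parallelize over the non-core dims
        output_dtypes=[dr_in.dtype],
        dask_gufunc_kwargs={"output_sizes": {"points_out": n_out}},
    )
```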
We're certainly dealing with multi-terabyte datasets for which distributed computation isn't necessary, but is certainly convenient. I'll be giving a couple of talks about distributed XArray for data analysis in the next couple of weeks. It would be nice to point to this work as an example of an active community developing functionality with serial computation in mind that "just works" without having to think much about distributed computation. I believe that XArray's `apply_ufunc` machinery should get you there.
I agree with Matt that we are talking about two different types of optimization here:

1. Making the in-memory regridding kernel (the sparse matrix multiplication) itself fast, e.g. with Numba.
2. Scaling to larger-than-memory and distributed datasets with xarray and dask.

So far @JiaweiZhuang has been focused on optimizing 1. Those of us with more xarray and dask experience could help with 2.
Tried at the very beginning.
I agree with Matt and Ryan's comments on out-of-core computing with dask and apply_ufunc. I've just done some experiments with apply_ufunc; see this notebook in a standalone repo: apply_ufunc_with_dask.ipynb

I've simplified the problem so it is only about sparse matrix multiplication. The repo already contains the regridding weight file, so you don't need to use xESMF and ESMPy at all. It is helpful to take a look at sparse_dot_benchmark.ipynb first. It contains detailed explanations of the sparse matrix multiplication algorithm and a successful parallel implementation with Numba, which can be used to benchmark the dask method.
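For readers who don't open the notebooks, here is a rough sketch of the kind of Numba-parallel COO kernel being benchmarked there (the exact kernel in the notebooks may differ); the parallel loop runs over the extra dimension, so different threads never write to the same output row:

```python
import numpy as np
import scipy.sparse as sps
from numba import njit, prange

@njit(parallel=True)
def apply_coo(row, col, s, src, n_dst):
    """Apply COO weights (row, col, s) to src of shape (n_extra, n_src)."""
    n_extra = src.shape[0]
    out = np.zeros((n_extra, n_dst), dtype=src.dtype)
    for k in prange(n_extra):        # parallel over the extra (time/level) dimension
        for i in range(s.size):      # serial sparse dot for one horizontal slice
            out[k, row[i]] += s[i] * src[k, col[i]]
    return out

# Toy usage with a random COO matrix standing in for real regridding weights.
W = sps.random(8000, 10000, density=1e-4, format="coo")
src = np.random.rand(40, 10000)
dst = apply_coo(W.row, W.col, W.data, src, 8000)  # shape (40, 8000)
```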
And I guess it can also parallelize over multiple variables in a Dataset?
Yeah, I agree that you don't necessarily want to invoke dask array operations here. It sounds like you have built a good numpy -> numpy function. I believe that apply_ufunc will help apply that function in parallel across many slices of many dataarrays of a distributed dataset.
@JiaweiZhuang Try using
Thanks! Adding
Yeah, to be clear, Dask-done-well is almost never faster than Numba-done-well. In the context of XArray, the main advantage of dask is larger-than-memory computing.
You might want to try profiling, perhaps with
You might want to try timing with
I would try using the Numba dot product (without Numba's parallelization) inside dask. Something seems to be going wrong with your benchmarks using dask. Maybe dot products with scipy's coo_matrix don't release the GIL?
Using the Numba dot product inside dask: Numba on a dask array shows somewhat similar performance to SciPy on a dask array. There is some speed-up, but the parallel efficiency is not great. All details are in this new notebook: numba_on_dask.ipynb
Just chiming in to support the push for xESMF handling out-of-memory data! I'm getting a MemoryError while regridding some GFDL model output on a local computer with 32 GB of RAM. It would be great to avoid looping over files. Thanks @JiaweiZhuang for a great package, it's been very useful so far.
Glad that xESMF helps! I will utilize dask to support out-of-core regridding.
How about Here's the example. I'm not familiar with
v0.2 now supports parallel regridding with dask. Distributed regridding is left to pangeo-data/pangeo#334
Weights are now returned in memory by default, instead of being written to file. Backward compatibility with the `filename` and `reuse_weights` arguments is preserved.
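A hedged usage sketch of what that looks like: the `filename` and `reuse_weights` argument names come from the note above, while the grid construction and method name are purely illustrative.

```python
import numpy as np
import xarray as xr
import xesmf as xe

# Illustrative 1-degree input and 2-degree output grids.
ds_in = xr.Dataset(coords={"lat": np.arange(-89.5, 90, 1.0),
                           "lon": np.arange(0.5, 360, 1.0)})
ds_out = xr.Dataset(coords={"lat": np.arange(-89.0, 90, 2.0),
                            "lon": np.arange(1.0, 360, 2.0)})

# Default: weights are computed and kept in memory, no file written.
regridder = xe.Regridder(ds_in, ds_out, "bilinear")

# Backward-compatible path: read weights from an existing file on disk.
regridder_from_file = xe.Regridder(ds_in, ds_out, "bilinear",
                                   filename="bilinear_1deg_to_2deg.nc",
                                   reuse_weights=True)  # assumes this weight file exists
```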
"Parallel regridding" could mean two things: (see #2 for more about this two-step regridding)
The native parallel support in ESMF/ESMPy is based on MPI and horizontal domain decomposition. It works for both generating and applying weights. See https://github.com/nawendt/esmpy-tutorial/blob/master/esmpy_mpi_example.py as an example.
MPI-based horizontal domain decomposition makes perfect sense for earth system model simulation, but for data analysis I would absolutely want to avoid MPI's complexity. With dask, there's a simple way to go: since the weights are applied to each horizontal slice independently, chunking over the extra dimensions (time, level, variables) with dask.array will be trivial. Is there any case where we have to parallelize over horizontal dimensions?
PS: Need to profile the regridding on very large data sets and figure out the bottleneck (generating vs applying weights) before starting to implement anything.