Initial support for MPI parallelization #159
Hi, I've been curiously tracking #167 and it seems that MPI will inevitably become a core part of Trixi.jl, i.e. you cannot run Trixi.jl without MPI anymore. I'm not familiar with the Trixi.jl numerics, but is it possible to add distributed parallelism as a layer on top of the existing package? Can a distributed Trixi.jl model be built by stitching together a bunch of Trixi.jl models (1 per rank) that communicate with each other as needed? I ask since I've been working on supporting distributed parallelism in Oceananigans.jl (CliMA/Oceananigans.jl#590, but haven't done much since January...) and I'm trying this distributed-layer approach so the existing package behavior is completely unmodified, but I'm not sure if it's going to work. Maybe it'll work since Oceananigans.jl is finite volume, but I'm curious to know whether you guys have considered a similar approach (or if it's even possible).
Hi @ali-ramadhan! Thank you for your comments - I'll try to respond to them one by one.
This is true in the sense that we made MPI a mandatory dependency. One example can be found here: Trixi.jl/src/auxiliary/auxiliary.jl, lines 16 to 33 in d635cbd.
Those not interested in parallel computation only need to read and understand the implementation of the parse_parameters_file methods starting in lines 16 and 17, but not the one starting in line 21. While this undoubtedly introduces some code redundancies, it gives us the freedom to choose not to implement a particular feature for parallel simulations, and it keeps the serial code free of MPI specifics (except for the added Val parameter used for dispatch).
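To make the dispatch idea more concrete, here is a minimal sketch of the pattern (illustrative only, not Trixi.jl's actual source; the mpi_parallel helper, the TOML-based parsing, and the broadcast of the raw file contents are assumptions):

```julia
using MPI, TOML

# Sketch of Val-based dispatch: serial users never have to look at the MPI method.
# `mpi_parallel` and the TOML file format are illustrative assumptions.
mpi_parallel() = Val(MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) > 1)

parse_parameters_file(filename) = parse_parameters_file(filename, mpi_parallel())

# Serial method: no MPI specifics except the extra Val argument used for dispatch.
parse_parameters_file(filename, ::Val{false}) = TOML.parsefile(filename)

# Parallel method: only the root rank touches the file system, then broadcasts.
function parse_parameters_file(filename, ::Val{true})
  comm = MPI.COMM_WORLD
  text = MPI.Comm_rank(comm) == 0 ? read(filename, String) : ""
  text = MPI.bcast(text, comm)  # root defaults to 0; older MPI.jl versions use bcast(obj, root, comm)
  return TOML.parse(text)
end
```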
However, this is not to say that we might not change this dependency to an optional one in the future... the final verdict is still out on this, since we wanted to gather more practical experience first.
In general - and with some restrictions - such an approach would be possible with the DGSEM scheme as well, yes, by leveraging the fact that in the DG method elements are only coupled via the solutions on their faces. As far as I can tell, in CliMA/Oceananigans.jl#590 you try to keep MPI out of the core implementation by using a special boundary condition that exchanges all required data in a way that is transparent to the user. However, I believe there are certain functional limitations you will not get around with such an approach, e.g., if you use explicit time stepping with a global time step or for everything related to I/O. Also, what happens if you need other quantities than the state variables in your halo layer? Thus, depending on the complexity of the numerical methods you want to support, a completely orthogonal separation of MPI code and solver code might make many implementations much more complicated than they necessarily have to be. We therefore opted for an approach where it is OK to add MPI-specific code to the solver, but to use dynamic dispatch to keep it away as much as possible from a user who does not care about distributed parallelism.
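To make one of these limitations concrete: an explicit scheme with a global time step requires a reduction across all ranks, which a halo-exchange boundary condition cannot provide on its own. A minimal MPI.jl sketch, with a placeholder value standing in for the local time step estimate:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Each rank computes a stable time step from its local elements only
# (placeholder value here instead of a real CFL-based estimate) ...
dt_local = 1.0e-3 * (1 + MPI.Comm_rank(comm))

# ... but a global explicit time step requires agreeing on the minimum over all ranks.
dt_global = MPI.Allreduce(dt_local, MPI.MIN, comm)
```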
As stated above, I think this is possible, yes, but only if you accept some restrictions in terms of supported features and performance. For example, with an approach that hides all communication inside the boundary conditions, it becomes hard to overlap communication with computation (probably even impossible, unless you have multiple sweeps over your boundary conditions in each time step). I guess in the end it boils down to what your priorities are for Oceananigans.jl.

OK, this turned out to be a somewhat longer post than expected, and I am still not sure whether I fully answered your questions 🤔 Let me know if not; I'd be happy to further discuss possible MPI strategies in Julia 😉
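For reference, the communication/computation overlap mentioned above typically follows a pattern like the sketch below. The helpers start_halo_exchange!, compute_interior!, and compute_interfaces! are hypothetical placeholders for solver-specific code, not part of any package's API:

```julia
using MPI

# Sketch of overlapping halo communication with interior computation.
# `start_halo_exchange!`, `compute_interior!`, and `compute_interfaces!` are
# hypothetical helpers standing in for solver-specific code.
function rhs_with_overlap!(du, u, solver, cache)
  requests = start_halo_exchange!(cache, u)  # post nonblocking MPI sends/receives

  compute_interior!(du, u, solver)           # work that does not need any halo data

  MPI.Waitall(requests)                      # MPI.Waitall! in older MPI.jl versions

  compute_interfaces!(du, u, solver, cache)  # work that needs the received halo data
  return du
end
```

Hiding all communication inside a boundary condition that is applied once per stage leaves no natural place to split the computation like this, which is the performance restriction referred to above.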
Thank you for the detailed reply @sloede!
That would be awesome! Certainly I feel like there's a lack of high performance models accessible to beginners and Julia/Trixi.jl could help a lot here.
Yeah I get the feeling that new solutions are possible with Julia but there aren't many examples out there (if any?) of large HPC simulation packages that are both beginner-friendly and super performant, so I guess we have to try out different solutions.
Yes, those are definitely limitations. I guess for Oceananigans.jl it only makes sense to take global time steps, but the I/O limitations are definitely a concern. We would have to add extra code to handle I/O across ranks. My current idea is for each rank to output to its own file, since this is possible with existing output writers, then have some post-processor combine all the files at the end of a simulation or something.
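A minimal sketch of that per-rank output idea (the file naming scheme and contents are made up; this is not Oceananigans.jl's actual output writer API):

```julia
using MPI

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# Each rank writes only its local data to its own file; a post-processing step
# can later combine the per-rank files into a single global dataset.
filename = "output_rank$(lpad(rank, 4, '0')).dat"  # hypothetical naming scheme
open(filename, "w") do io
    write(io, "placeholder for this rank's local fields\n")
end
```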
Hmmm, that's an interesting point. I guess we only store primitive state variables in memory and all intermediate variables are computed on-the-fly in CPU/GPU kernels. This was an optimization we took to maximize available memory on the GPU so we can fit larger models in memory, perhaps to the detriment of CPU models (slightly). But maybe it helps us with MPI if I understood your point correctly?
That makes a lot of sense. Perhaps I've convinced myself that a completely orthogonal approach should be fine for Oceananigans.jl, but I might encounter unpleasant surprises haha. For sure, some existing features will have to be re-implemented or modified for distributed models via dispatch, and maybe some features won't make it into distributed models.
Julia makes me feel like we can have everything haha, although we have favored GPU optimizations over CPU optimizations so far, so maybe we will find out that our design choices hurt us when scaling up distributed models vs. the Really Hard Problem™. I believe we can have scalable and HPC-optimized code that is beginner-friendly with Julia; it might just take a lot of work and refactoring to get there. I guess I'll have to finish up CliMA/Oceananigans.jl#590 and see what the benchmarks look like.
@sloede I guess we can update some points?
Thanks, yes indeed. I updated the description to reflect the work that has been done. There are a few points that I moved to the "unassigned" area at the bottom, plus WP6 (multi-physics parallelization). I am leaning towards moving the multi-physics stuff into its own issue, but I am not sure whether the other points should become separate issues or be collected in a meta issue until they are actually being worked on. What's your take on this? Ultimately, I would like to close this issue, since I think we have reached the point of "initial support for MPI parallelization".
As long as there aren't any plans for somebody to work on this, it's fine to keep the status quo. If we start working on it, we can track discussions/progress either in PRs or in separate issues. I don't have a strong opinion on this either.
Correct.
The last point in the list can be ticked (or refer to #1332). |
This issue is to keep track of the necessary steps towards an initial parallel octree implementation in Trixi. It should thus be amended as we progress and gather more experience about which ideas work and which don't. The general goal of the initial MPI implementation is to support nearly all of Trixi's current features in parallel and in the simplest possible way (at the cost of scalability).
(achievements: basic infrastructure for parallelization in place, parallelized DG solver)
(achievements: domain partitioning/space filling curve)
(achievements: infrastructure for parallel mesh and solver adaptation)
(achievements: ability to redistribute solver data and to do "online restarts")
(Extend ParallelP4estMesh to 3D #1062, Extend parallel P4estMesh mortar+AMR support to 3D #1091) (achievements: everything works in 3D on the P4estMesh)
TreeMesh (achievements: all Euler-gravity tests work in parallel)
Unassigned:
calc_blending_factors
TreeMesh: Implement partitioning scheme that preserves spatial locality (probably Morton or Hilbert space-filling curve)
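For the space-filling-curve item, here is a minimal Morton (Z-order) index sketch (purely illustrative; not Trixi.jl's actual partitioning code):

```julia
# Minimal sketch of a Morton (Z-order) index for 2D cell coordinates.
# Sorting cells by this index and splitting the sorted list into contiguous
# chunks per MPI rank yields a partition that preserves spatial locality.
function morton_index(x::UInt32, y::UInt32)
    z = zero(UInt64)
    for i in 0:31
        z |= UInt64((x >> i) & 0x1) << (2i)      # even bits come from x
        z |= UInt64((y >> i) & 0x1) << (2i + 1)  # odd bits come from y
    end
    return z
end

# Example: order a small 4x4 block of cells along the Z-curve.
cells = [(UInt32(i), UInt32(j)) for i in 0:3 for j in 0:3]
sorted_cells = sort(cells; by = c -> morton_index(c[1], c[2]))
```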