Initial support for MPI parallelization #159
Hi, I've been curiously tracking #167 and it seems that MPI will inevitably become a core part of Trixi.jl, i.e. you cannot run Trixi.jl without MPI anymore. I'm not familiar with the Trixi.jl numerics, but is it possible to add distributed parallelism as a layer on top of the existing package? Can a distributed Trixi.jl model be built by stitching together a bunch of Trixi.jl models (1 per rank) that communicate with each other as needed? I ask since I've been working on supporting distributed parallelism in Oceananigans.jl (CliMA/Oceananigans.jl#590, but haven't done much since January...) and I'm trying this distributed-layer approach so the existing package behavior is completely unmodified, but I'm not sure if it's going to work. Maybe it'll work since Oceananigans.jl is finite volume, but I'm curious to know whether you guys have considered a similar approach (or if it's even possible).
Hi @ali-ramadhan! Thank you for your comments - I'll try to respond to them one by one.
This is true in the sense that we made MPI a mandatory dependency. One example can be found here: Trixi.jl/src/auxiliary/auxiliary.jl, lines 16 to 33 in d635cbd.
Those not interested in parallel computation only need to read and understand the implementation of the parse_parameters_file methods starting in lines 16 and 17, but not the one starting in line 21. While this undoubtedly introduces some code redundancies, it gives us the freedom to choose not to implement a particular feature for parallel simulations, and it keeps the serial code free of MPI specifics (except for the added Val parameter used for dispatch).
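To make the dispatch idea more concrete, here is a minimal sketch of the pattern (illustrative only, not Trixi.jl's actual source; the mpi_parallel helper, the TOML-based parsing, and the broadcast of the raw file contents are assumptions):

```julia
using MPI, TOML

# Sketch of Val-based dispatch: serial users never have to look at the MPI method.
# `mpi_parallel` and the TOML file format are illustrative assumptions.
mpi_parallel() = Val(MPI.Initialized() && MPI.Comm_size(MPI.COMM_WORLD) > 1)

parse_parameters_file(filename) = parse_parameters_file(filename, mpi_parallel())

# Serial method: no MPI specifics except the extra Val argument used for dispatch.
parse_parameters_file(filename, ::Val{false}) = TOML.parsefile(filename)

# Parallel method: only the root rank touches the file system, then broadcasts.
function parse_parameters_file(filename, ::Val{true})
  comm = MPI.COMM_WORLD
  text = MPI.Comm_rank(comm) == 0 ? read(filename, String) : ""
  text = MPI.bcast(text, comm)  # root defaults to 0; older MPI.jl versions use bcast(obj, root, comm)
  return TOML.parse(text)
end
```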
However, this is not to say that we might not change this dependency to an optional one in the future... the final verdict is still out on this, since we wanted to gather more practical experience first.
In general - and with some restrictions - such an approach would be possible with the DGSEM scheme as well, yes, by leveraging the fact that in the DG method elements are only coupled via the solutions on their faces. As far as I can tell, in CliMA/Oceananigans.jl#590 you try to keep MPI out of the core implementation by using a special boundary condition that exchanges all required data in a way that is transparent to the user. However, I believe there are certain functional limitations you will not get around with such an approach, e.g., if you use explicit time stepping with a global time step or for everything related to I/O. Also, what happens if you need other quantities than the state variables in your halo layer? Thus, depending on the complexity of the numerical methods you want to support, a completely orthogonal separation of MPI code and solver code might make many implementations much more complicated than they necessarily have to be. We therefore opted for an approach where it is OK to add MPI-specific code to the solver, but to use dynamic dispatch to keep it away as much as possible from a user who does not care about distributed parallelism.
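To make one of these limitations concrete: an explicit scheme with a global time step requires a reduction across all ranks, which a halo-exchange boundary condition cannot provide on its own. A minimal MPI.jl sketch, with a placeholder value standing in for the local time step estimate:

```julia
using MPI

MPI.Init()
comm = MPI.COMM_WORLD

# Each rank computes a stable time step from its local elements only
# (placeholder value here instead of a real CFL-based estimate) ...
dt_local = 1.0e-3 * (1 + MPI.Comm_rank(comm))

# ... but a global explicit time step requires agreeing on the minimum over all ranks.
dt_global = MPI.Allreduce(dt_local, MPI.MIN, comm)
```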
As stated above, I think this is possible, yes, but only if you accept some restrictions in terms of supported features and performance. For example, with an approach that hides all communication inside the boundary conditions, it becomes hard to overlap communication with computation (probably even impossible, unless you have multiple sweeps over your boundary conditions in each time step). I guess in the end it boils down to what your priorities are for Oceananigans.jl.

OK, this turned out to be a somewhat longer post than expected, and I am still not sure whether I fully answered your questions 🤔 Let me know if not; I'd be happy to further discuss possible MPI strategies in Julia 😉
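For reference, the communication/computation overlap mentioned above typically follows a pattern like the sketch below. The helpers start_halo_exchange!, compute_interior!, and compute_interfaces! are hypothetical placeholders for solver-specific code, not part of any package's API:

```julia
using MPI

# Sketch of overlapping halo communication with interior computation.
# `start_halo_exchange!`, `compute_interior!`, and `compute_interfaces!` are
# hypothetical helpers standing in for solver-specific code.
function rhs_with_overlap!(du, u, solver, cache)
  requests = start_halo_exchange!(cache, u)  # post nonblocking MPI sends/receives

  compute_interior!(du, u, solver)           # work that does not need any halo data

  MPI.Waitall(requests)                      # MPI.Waitall! in older MPI.jl versions

  compute_interfaces!(du, u, solver, cache)  # work that needs the received halo data
  return du
end
```

Hiding all communication inside a boundary condition that is applied once per stage leaves no natural place to split the computation like this, which is the performance restriction referred to above.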
Thank you for the detailed reply @sloede!
That would be awesome! Certainly I feel like there's a lack of high performance models accessible to beginners and Julia/Trixi.jl could help a lot here.
Yeah I get the feeling that new solutions are possible with Julia but there aren't many examples out there (if any?) of large HPC simulation packages that are both beginner-friendly and super performant, so I guess we have to try out different solutions.
Yes, those are definitely limitations. I guess for Oceananigans.jl it only makes sense to take global time steps, but the I/O limitations are definitely a concern. We would have to add extra code to handle I/O across ranks. My current idea is for each rank to output to its own file, since this is possible with existing output writers, then have some post-processor combine all the files at the end of a simulation or something.
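A minimal sketch of that per-rank output idea (the file naming scheme and contents are made up; this is not Oceananigans.jl's actual output writer API):

```julia
using MPI

MPI.Init()
rank = MPI.Comm_rank(MPI.COMM_WORLD)

# Each rank writes only its local data to its own file; a post-processing step
# can later combine the per-rank files into a single global dataset.
filename = "output_rank$(lpad(rank, 4, '0')).dat"  # hypothetical naming scheme
open(filename, "w") do io
    write(io, "placeholder for this rank's local fields\n")
end
```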
Hmmm, that's an interesting point. I guess we only store primitive state variables in memory and all intermediate variables are computed on-the-fly in CPU/GPU kernels. This was an optimization we took to maximize available memory on the GPU so we can fit larger models in memory, perhaps to the detriment of CPU models (slightly). But maybe it helps us with MPI if I understood your point correctly?
That makes a lot of sense. Perhaps I've convinced myself that a completely orthogonal approach should be fine for Oceananigans.jl, but I might encounter unpleasant surprises haha. For sure, some existing features will have to be re-implemented or modified for distributed models via dispatch, and maybe some features won't make it into distributed models.
Julia makes me feel like we can have everything haha, although we have favored GPU optimizations over CPU optimizations so far, so maybe we will find out that our design choices hurt us when scaling up distributed models vs. the Really Hard Problem™. I believe we can have scalable and HPC-optimized code that is beginner-friendly with Julia; it might just take a lot of work and refactoring to get there. I guess I'll have to finish up CliMA/Oceananigans.jl#590 and see what the benchmarks look like.
@sloede I guess we can update some points?
Thanks, yes indeed. I updated the description to reflect the work that has been done. There are a few points that I moved to the "unassigned" area at the bottom, plus WP6 (multi-physics parallelization). I am leaning towards moving the multi-physics stuff into its own issue, but I am not sure whether the other points should become separate issues or be collected in a meta issue until they are actually being worked on. What's your take on this? Ultimately, I would like to close this issue, since I think we have reached the point of "initial support for MPI parallelization".
As long as there aren't any plans for somebody to work on this, it's fine to keep the status quo. If we start working on it, we can track discussions/progress either in PRs or in separate issues. I don't have a strong opinion on this either.
Correct.
The last point in the list can be ticked (or refer to #1332). |
This issue is to keep track of the necessary steps towards an initial parallel octree implementation in Trixi. It should thus be amended as we progress and gather more experience about which ideas work and which don't. The general goal of the initial MPI implementation is to support nearly all of Trixi's current features in parallel and in the simplest possible way (at the cost of scalability).
(achievements: basic infrastructure for parallelization in place, parallelized DG solver)
(achievements: domain partitioning/space filling curve)
(achievements: infrastructure for parallel mesh and solver adaptation)
(achievements: ability to redistribute solver data and to do "online restarts")
(Extend ParallelP4estMesh to 3D #1062, Extend parallel P4estMesh mortar+AMR support to 3D #1091) (achievements: everything works in 3D on the P4estMesh)
TreeMesh (achievements: all Euler-gravity tests work in parallel)
Unassigned:
calc_blending_factors
TreeMesh: Implement partitioning scheme that preserves spatial locality (probably Morton or Hilbert space-filling curve)
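For the space-filling-curve item, here is a minimal Morton (Z-order) index sketch (purely illustrative; not Trixi.jl's actual partitioning code):

```julia
# Minimal sketch of a Morton (Z-order) index for 2D cell coordinates.
# Sorting cells by this index and splitting the sorted list into contiguous
# chunks per MPI rank yields a partition that preserves spatial locality.
function morton_index(x::UInt32, y::UInt32)
    z = zero(UInt64)
    for i in 0:31
        z |= UInt64((x >> i) & 0x1) << (2i)      # even bits come from x
        z |= UInt64((y >> i) & 0x1) << (2i + 1)  # odd bits come from y
    end
    return z
end

# Example: order a small 4x4 block of cells along the Z-curve.
cells = [(UInt32(i), UInt32(j)) for i in 0:3 for j in 0:3]
sorted_cells = sort(cells; by = c -> morton_index(c[1], c[2]))
```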