MOOSE performance being significantly slower compared to other programs #28185
---
Hello,

Thanks for looking into this. A lot goes into performance, and it is understandably harder to reach peak performance in MOOSE than in more reduced packages. I don't think anyone on the team has availability to reproduce these results and re-profile to get a better view of the differences right now. @lindsayad for awareness.

One thing I have wondered is how unoptimized our libMesh and PETSc packages delivered through conda are. They are built with no architecture-specific optimization and could be slower than libMesh or PETSc built from source.

With regards to explicit solves, optimizing them is being looked at.

Guillaume
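A minimal sketch of such a from-source rebuild, using the helper scripts shipped in the MOOSE repository, which forward extra arguments to PETSc's configure. The flag choices and the application path are assumptions to adapt to your own hardware and project:

```bash
# Sketch: rebuild PETSc and libMesh with architecture-specific optimization.
# Flags and paths are assumptions; tune them for your own machine.
cd ~/projects/moose

# Extra arguments are forwarded to PETSc's ./configure
./scripts/update_and_rebuild_petsc.sh \
    --COPTFLAGS='-O3 -march=native' \
    --CXXOPTFLAGS='-O3 -march=native' \
    --FOPTFLAGS='-O3 -march=native'

# Build libMesh in optimized mode only (skip the dbg/devel methods)
METHODS=opt ./scripts/update_and_rebuild_libmesh.sh

# Rebuild the application against the new stack
cd ~/projects/your_app   # hypothetical application directory
make clean && make -j8
```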
---
From experience, you can get hugely different performance depending on how you build the various components of MOOSE. We have done quite a lot of investigation into this with different compilers and MPI implementations, which can add up to at least an order of magnitude of performance difference, and we ended up with a best set of compilers and options for our systems.

Echoing what has been said above: if you are just using the conda binaries for benchmarking, you are not going to get a real comparison. You should build MOOSE and all of its components from source with flags tuned for your system, and probably the same flags you are using to build libMesh and PETSc. If you do that, you should get a fairly significant speed-up.

@GiudGiud, might it be worth adding to the documentation for the conda install that it is not necessarily optimised and that people should build from source if they want pure performance?

Other things to consider:
- Benchmarking real systems is a nebulous topic, and you really need to take care that you are executing in the same environment with the same options to get fair comparisons (a sketch of such a like-for-like run follows below).
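For instance, a hedged sketch of a like-for-like run that keeps the MPI launcher, rank count, and PETSc solver options identical between MOOSE and the code it is being compared against; the binary and input names are placeholders:

```bash
# Like-for-like benchmark sketch: same launcher, rank count, and PETSc
# options for both codes. Binary and input names are placeholders.
RANKS=8
PETSC_OPTS="-ksp_type cg -pc_type hypre -log_view"

# MOOSE passes its command line through to PetscInitialize, so raw
# PETSc options can be supplied directly.
mpiexec -n "$RANKS" ./your_app-opt -i benchmark.i $PETSC_OPTS

# The reference bare-PETSc implementation with the identical options
mpiexec -n "$RANKS" ./petsc_reference $PETSC_OPTS
```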
---
Thanks everyone for your interest and comments!

Yes, we are only compiling following the default conda environment from the website, and as @GiudGiud, @milljm and @ABallisat are suggesting, the performance is not optimized on our system on the compilation side. And surely, since it is not optimized, the performance comparison we have between MOOSE and the other software is not fair.

It would be really helpful if there were a "MOOSE Performance Assessment Suite", as @GiudGiud said, so we could compare how far our current performance is from an officially optimized one and know where we stand right now. The information from @ABallisat that an order-of-magnitude speed-up is achievable is encouraging, and it would be even better if there were an official version we could compare against for a few specific cases.

Forgive me, but as a startup company we are worried that, no matter how much we optimize, this would still be orders of magnitude slower than the other software... For example, for the smeared cracking model case, a one-order-of-magnitude speed increase still cannot close the x450 gap... If we had a comparison that gave us an estimate of how good the eventual performance can be, we would have more confidence.

A few other points:

@GiudGiud Thanks for sharing information about the explicit solver. We were actually using the NewmarkBeta implicit time integrator and switched to explicit because we wanted to increase the speed. Although MOOSE is not optimized for explicit solves, this still gave us a 2-3x speed increase, which is already reflected in the performance analysis here.

@ABallisat Thanks for your suggestions! We didn't use a distributed mesh in our comparisons. When comparing to the bare PETSc implementation, we did make sure all the options were the same. For compilation, we just followed the default conda instructions with make -j8; we didn't define any optimization flags of our own.
---
Question
Hello there!
We are developing a multiphysics application based on MOOSE, but our tests indicate that MOOSE is significantly slower than other programs in various aspects. We've compared:

- a dynamic poroelastic problem against legacy FEniCS;
- an isothermal porous flow problem against TRM and a bare PETSc implementation;
- a smeared cracking model against DefMOD.
All these scenarios show MOOSE being at least x10 slower than the other programs. We've already performed some profiling to optimize the input files, but the order-of-magnitude difference in performance persists. After some searching, we've also found tests done by others regarding performance differences. For example, a slide we found also suggests that MOOSE is slower than many other toolsets, including libMesh, on which it is based:
We enjoy using MOOSE and appreciate the effort that went into building this great program, but we are also quite confused about why the performance is so different (a few times slower would make sense, but here it is >x10...). We would greatly appreciate any feedback on how to improve the performance!
Detailed Comparisons:
1. Dynamic poroelastic problem
This is an application we're developing based on MOOSE, benchmarked against a case run on legacy FEniCS (https://www.sciencedirect.com/science/article/abs/pii/S0045782523005108?via%3Dihub). Both cases use a 2D grid of 500,000 nodes with 2,500,000 DoF. Running them in the same environment yields:
We see MOOSE running more than x10 slower than the legacy version of FEniCS.
2. The isothermal porous flow problem
We compared the performance of MOOSE and TRM on the isothermal porous flow problem:
We also ran a comparison between MOOSE and a bare PETSc finite element implementation:
In both comparisons, MOOSE is at least x10 slower than the other programs.
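For this kind of comparison, one way to confirm that both runs really use the same solver configuration is to dump it from PETSc in each case; a short sketch with placeholder binary and input names:

```bash
# Print the assembled KSP/PC configuration and flag any options that were
# set but never used; both outputs should match for a fair comparison.
mpiexec -n 8 ./your_app-opt -i porous_flow.i -ksp_view -options_left
mpiexec -n 8 ./petsc_reference -ksp_view -options_left
```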
3. The smeared cracking model (SCM)
We also compared a 2D problem using the smeared cracking model in MOOSE with DefMOD (https://doi.org/10.56952/ARMA-2023-0493). Both cases use explicit solvers (central difference in MOOSE). The performance difference is again more than x10, if not x100. Note that the MOOSE performance here has already been optimized to some degree based on PerfGraphOutput:
Even when we change the problem to purely elastic, MOOSE's speed only increases by a factor of ~2, which still doesn't close the significant performance gap. We've performed scaling tests for the MOOSE smeared cracking model, and the runtime scales linearly as we increase the number of nodes. All these tests were done for 50 timesteps.
We've also performed profiling using both PerfGraphOutput and oprof. The current performance was achieved by reducing the auxvariables and output from the smeared cracking models, so that NonlinearSystemBase::Kernels takes up most of the time rather than AuxiliarySystem. When we profile with oprof, a significant amount of time is attributed to something labeled "unknown", which doesn't seem helpful.
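If the missing symbols come from a stripped binary, a from-source build with MOOSE's oprof method (optimized but profiling-friendly) might help; a sketch, assuming a from-source checkout and placeholder names:

```bash
# Sketch: rebuild with MOOSE's profiling build method, then profile with
# OProfile. The application and input names are placeholders.
make -j8 METHOD=oprof

# Record one run and report the hottest symbols
operf ./your_app-oprof -i smeared_cracking.i
opreport --symbols --threshold 1 | head -n 40
```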
We would greatly appreciate any insights or suggestions on how to improve MOOSE's performance for these types of problems. If additional information about our test configurations or profiling results would be helpful, please let us know.
MOOSE Environment:
The environment we used for tests 1 and 3 is Windows Subsystem for Linux 2 (WSL2). The machine is equipped with an i9-13900H CPU and 32 GB of memory. We achieved similar performance on an M3 MacBook Pro.
Part of the output from Diagnostic.sh showing compilers and environment versions:
Compiler(s) (CC CXX FC F77 F90):
CC=/home/ycli/miniforge/envs/moose/bin/mpicc
CC -show:
x86_64-conda-linux-gnu-cc -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
CC version: x86_64-conda-linux-gnu-cc (conda-forge gcc 10.4.0-19) 10.4.0
CXX=/home/ycli/miniforge/envs/moose/bin/mpicxx
CXX -show:
x86_64-conda-linux-gnu-c++ -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpicxx -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
CXX version: x86_64-conda-linux-gnu-c++ (conda-forge gcc 10.4.0-19) 10.4.0
FC=/home/ycli/miniforge/envs/moose/bin/mpif90
FC -show:
x86_64-conda-linux-gnu-gfortran -I/home/ycli/miniforge/envs/moose/include -fallow-argument-mismatch -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpifort -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
FC version: GNU Fortran (conda-forge gcc 10.4.0-19) 10.4.0
F77=/home/ycli/miniforge/envs/moose/bin/mpif77
F77 -show:
x86_64-conda-linux-gnu-gfortran -I/home/ycli/miniforge/envs/moose/include -fallow-argument-mismatch -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpifort -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
F77 version: GNU Fortran (conda-forge gcc 10.4.0-19) 10.4.0
F90=/home/ycli/miniforge/envs/moose/bin/mpif90
F90 -show:
x86_64-conda-linux-gnu-gfortran -I/home/ycli/miniforge/envs/moose/include -fallow-argument-mismatch -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpifort -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
F90 version: GNU Fortran (conda-forge gcc 10.4.0-19) 10.4.0
##################################################################################################
CONDA MOOSE Packages
moose-dev 2024.05.13 build_0 https://conda.software.inl.gov/public
moose-libmesh 2024.05.05 build_0 https://conda.software.inl.gov/public
moose-libmesh-vtk 9.2.6 build_9 https://conda.software.inl.gov/public
moose-mpich 4.0.2 build_16 https://conda.software.inl.gov/public
moose-peacock 2023.04.11 hb6770a3_0 https://conda.software.inl.gov/public
moose-petsc 3.20.3 build_1 https://conda.software.inl.gov/public
moose-tools 2024.05.02 h4a78fc2_0 https://conda.software.inl.gov/public
moose-wasp 2024.05.08 build_0 https://conda.software.inl.gov/public