MOOSE performance being significantly slower compared to other programs #28185
---
Hello,

Thanks for looking into this. A lot goes into performance, and it is understandably harder to reach peak performance in MOOSE than in more reduced packages. I don't think anyone on the team has availability to reproduce these results and re-profile to get a better view of the differences right now. @lindsayad for awareness.

One thing I have wondered is how unoptimized our libMesh and PETSc packages delivered through conda are. They are built with no architecture-specific optimization and could be slower than libMesh or PETSc built from source.

With regards to explicit solves, optimizing them is being looked at.

Guillaume
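A minimal sketch of such a from-source rebuild, using the helper scripts shipped in the MOOSE repository, which forward extra arguments to PETSc's configure. The flag choices and the application path are assumptions to adapt to your own hardware and project:

```bash
# Sketch: rebuild PETSc and libMesh with architecture-specific optimization.
# Flags and paths are assumptions; tune them for your own machine.
cd ~/projects/moose

# Extra arguments are forwarded to PETSc's ./configure
./scripts/update_and_rebuild_petsc.sh \
    --COPTFLAGS='-O3 -march=native' \
    --CXXOPTFLAGS='-O3 -march=native' \
    --FOPTFLAGS='-O3 -march=native'

# Build libMesh in optimized mode only (skip the dbg/devel methods)
METHODS=opt ./scripts/update_and_rebuild_libmesh.sh

# Rebuild the application against the new stack
cd ~/projects/your_app   # hypothetical application directory
make clean && make -j8
```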
---
From experience, you can get hugely different performance depending on how you build the various components of MOOSE. We have done quite a lot of investigation into this with different compilers and MPI implementations, which can add up to at least an order of magnitude of performance difference, and we ended up with a best set of compilers and options for our systems.

Echoing what has been said above: if you are just using the conda binaries for benchmarking, you are not going to get a real comparison. You should build MOOSE and all of its components from source with flags tuned for your system, and probably the same flags you are using to build libMesh and PETSc. If you do that, you should get a fairly significant speed-up.

@GiudGiud, might it be worth adding to the documentation for the conda install that it is not necessarily optimised and that people should build from source if they want pure performance?

Other things to consider:
- Benchmarking real systems is a nebulous topic, and you really need to take care that you are executing in the same environment with the same options to get fair comparisons (a sketch of such a like-for-like run follows below).
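For instance, a hedged sketch of a like-for-like run that keeps the MPI launcher, rank count, and PETSc solver options identical between MOOSE and the code it is being compared against; the binary and input names are placeholders:

```bash
# Like-for-like benchmark sketch: same launcher, rank count, and PETSc
# options for both codes. Binary and input names are placeholders.
RANKS=8
PETSC_OPTS="-ksp_type cg -pc_type hypre -log_view"

# MOOSE passes its command line through to PetscInitialize, so raw
# PETSc options can be supplied directly.
mpiexec -n "$RANKS" ./your_app-opt -i benchmark.i $PETSC_OPTS

# The reference bare-PETSc implementation with the identical options
mpiexec -n "$RANKS" ./petsc_reference $PETSC_OPTS
```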
---
Thanks everyone for your interest and comments!

Yes, we are only compiling following the default conda environment from the website, and as @GiudGiud, @milljm and @ABallisat are suggesting, the performance is not optimized on our system on the compilation side. And surely, since it is not optimized, the performance comparison we have between MOOSE and the other software is not fair.

It would be really helpful if there were a "MOOSE Performance Assessment Suite", as @GiudGiud said, so we could compare how far our current performance is from an officially optimized one and know where we stand right now. The information from @ABallisat that an order-of-magnitude speed-up is achievable is encouraging, and it would be even better if there were an official version we could compare against for a few specific cases.

Forgive me, but as a startup company we are worried that, no matter how much we optimize, this would still be orders of magnitude slower than the other software... For example, for the smeared cracking model case, a one-order-of-magnitude speed increase still cannot close the x450 gap... If we had a comparison that gave us an estimate of how good the eventual performance can be, we would have more confidence.

A few other points:

@GiudGiud Thanks for sharing information about the explicit solver. We were actually using the NewmarkBeta implicit time integrator and switched to explicit because we wanted to increase the speed. Although MOOSE is not optimized for explicit solves, this still gave us a 2-3x speed increase, which is already reflected in the performance analysis here.

@ABallisat Thanks for your suggestions! We didn't use a distributed mesh in our comparisons. When comparing to the bare PETSc implementation, we did make sure all the options were the same. For compilation, we just followed the default conda instructions with make -j8; we didn't define any optimization flags of our own.
---
Question
Hello there!
We are developing a multiphysics application based on MOOSE, but our tests indicate that MOOSE is significantly slower than other programs in various aspects. We've compared:

- a dynamic poroelastic problem against legacy FEniCS;
- an isothermal porous flow problem against TRM and a bare PETSc implementation;
- a smeared cracking model against DefMOD.
All these scenarios show MOOSE being at least x10 slower than the other programs. We've already performed some profiling to optimize the input files, but the order-of-magnitude difference in performance persists. After some searching, we've also found tests done by others regarding performance differences. For example, a slide we found also suggests that MOOSE is slower than many other toolsets, including libMesh, on which it is based:
We enjoy using MOOSE and appreciate the effort that went into building this great program, but we are also quite confused about why the performance is so different (a few times slower would make sense, but here it is >x10...). We would greatly appreciate any feedback on how to improve the performance!
Detailed Comparisons:
1. Dynamic poroelastic problem
This is an application we're developing based on MOOSE, benchmarked against a case run on legacy FEniCS (https://www.sciencedirect.com/science/article/abs/pii/S0045782523005108?via%3Dihub). Both cases use a 2D grid of 500,000 nodes with 2,500,000 DoF. Running them in the same environment yields:
We see MOOSE running more than x10 slower than the legacy version of FEniCS.
2. The isothermal porous flow problem
We compared the performance of MOOSE and TRM on the isothermal porous flow problem:
We also ran a comparison between MOOSE and a bare PETSc finite element implementation:
In both comparisons, MOOSE is at least x10 slower than the other programs.
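For this kind of comparison, one way to confirm that both runs really use the same solver configuration is to dump it from PETSc in each case; a short sketch with placeholder binary and input names:

```bash
# Print the assembled KSP/PC configuration and flag any options that were
# set but never used; both outputs should match for a fair comparison.
mpiexec -n 8 ./your_app-opt -i porous_flow.i -ksp_view -options_left
mpiexec -n 8 ./petsc_reference -ksp_view -options_left
```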
3. The smeared cracking model (SCM)
We also compared a 2D problem using the smeared cracking model in MOOSE with DefMOD (https://doi.org/10.56952/ARMA-2023-0493). Both cases use explicit solvers (central difference in MOOSE). The performance difference is again more than x10, if not x100. Note that the MOOSE performance here has already been optimized to some degree based on PerfGraphOutput:
Even when we change the problem to purely elastic, MOOSE's speed only increases by a factor of ~2, which still doesn't close the significant performance gap. We've performed scaling tests for the MOOSE smeared cracking model, and the runtime scales linearly as we increase the number of nodes. All these tests were done for 50 timesteps.
We've also performed profiling using both PerfGraphOutput and oprof. The current performance was achieved by reducing the auxvariables and output from the smeared cracking models, so that NonlinearSystemBase::Kernels takes up most of the time rather than AuxiliarySystem. When we profile with oprof, a significant amount of time is attributed to something labeled "unknown", which doesn't seem helpful.
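If the missing symbols come from a stripped binary, a from-source build with MOOSE's oprof method (optimized but profiling-friendly) might help; a sketch, assuming a from-source checkout and placeholder names:

```bash
# Sketch: rebuild with MOOSE's profiling build method, then profile with
# OProfile. The application and input names are placeholders.
make -j8 METHOD=oprof

# Record one run and report the hottest symbols
operf ./your_app-oprof -i smeared_cracking.i
opreport --symbols --threshold 1 | head -n 40
```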
We would greatly appreciate any insights or suggestions on how to improve MOOSE's performance for these types of problems. If additional information about our test configurations or profiling results would be helpful, please let us know.
MOOSE Environment:
The environment we used for tests 1 and 3 is Windows Subsystem for Linux 2 (WSL2). The machine is equipped with an i9-13900H CPU and 32 GB of memory. We achieved similar performance on an M3 MacBook Pro.
Part of the output from Diagnostic.sh showing compilers and environment versions:
Compiler(s) (CC CXX FC F77 F90):
CC=/home/ycli/miniforge/envs/moose/bin/mpicc
CC -show:
x86_64-conda-linux-gnu-cc -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
CC version: x86_64-conda-linux-gnu-cc (conda-forge gcc 10.4.0-19) 10.4.0
CXX=/home/ycli/miniforge/envs/moose/bin/mpicxx
CXX -show:
x86_64-conda-linux-gnu-c++ -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpicxx -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
CXX version: x86_64-conda-linux-gnu-c++ (conda-forge gcc 10.4.0-19) 10.4.0
FC=/home/ycli/miniforge/envs/moose/bin/mpif90
FC -show:
x86_64-conda-linux-gnu-gfortran -I/home/ycli/miniforge/envs/moose/include -fallow-argument-mismatch -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpifort -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
FC version: GNU Fortran (conda-forge gcc 10.4.0-19) 10.4.0
F77=/home/ycli/miniforge/envs/moose/bin/mpif77
F77 -show:
x86_64-conda-linux-gnu-gfortran -I/home/ycli/miniforge/envs/moose/include -fallow-argument-mismatch -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpifort -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
F77 version: GNU Fortran (conda-forge gcc 10.4.0-19) 10.4.0
F90=/home/ycli/miniforge/envs/moose/bin/mpif90
F90 -show:
x86_64-conda-linux-gnu-gfortran -I/home/ycli/miniforge/envs/moose/include -fallow-argument-mismatch -L/home/ycli/miniforge/envs/moose/lib -Wl,-rpath,/home/ycli/miniforge/envs/moose/lib -I/home/ycli/miniforge/envs/moose/include -I/home/ycli/miniforge/envs/moose/include -L/home/ycli/miniforge/envs/moose/lib -lmpifort -Wl,-rpath -Wl,/home/ycli/miniforge/envs/moose/lib -Wl,--enable-new-dtags -lmpi
F90 version: GNU Fortran (conda-forge gcc 10.4.0-19) 10.4.0
##################################################################################################
CONDA MOOSE Packages
moose-dev 2024.05.13 build_0 https://conda.software.inl.gov/public
moose-libmesh 2024.05.05 build_0 https://conda.software.inl.gov/public
moose-libmesh-vtk 9.2.6 build_9 https://conda.software.inl.gov/public
moose-mpich 4.0.2 build_16 https://conda.software.inl.gov/public
moose-peacock 2023.04.11 hb6770a3_0 https://conda.software.inl.gov/public
moose-petsc 3.20.3 build_1 https://conda.software.inl.gov/public
moose-tools 2024.05.02 h4a78fc2_0 https://conda.software.inl.gov/public
moose-wasp 2024.05.08 build_0 https://conda.software.inl.gov/public