Parallelization issue during iMTD-GC in crest 3.0 #284
Moreover, I also ran into a seg fault during optimization.
Hey,
Two comments from my side:
I don't think it will slow down, and I haven't really seen evidence for this so far. Both parallelization strategies still work via the OpenMP library; the threads setting in the .toml adjusts the very same thing as the OMP_NUM_THREADS variable. In CREST 2.12, threads are evenly distributed to the MDs/MTDs only when the number of threads exceeds the number of jobs by at least a factor of two. In contrast, CREST 3.0 has all the parallelization done internally by the OpenMP library. Since tblite, gfn0 and gfnff all use OpenMP themselves, my understanding is that the "outer" OMP loop running the jobs uses only the required number of threads, while all threads exceeding this count go into the parallelization of the potential. These potential calls are quite short compared to the outer OMP loop, so they should not produce much traceable thread load. If anything, this is the cleaner way of doing it compared to 2.12. Now, this doesn't explain the error from the second comment. It seems to stem from the …
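As a side note on the two knobs mentioned here, the equivalence looks roughly like this (a hedged sketch: the threads key name is taken from this thread, and the surrounding file layout is an assumption, not checked against the CREST documentation):

```toml
# CREST 2.12 and earlier: thread count taken from the environment,
#   export OMP_NUM_THREADS=52

# CREST 3.0: the same OpenMP thread pool sized from the toml input
threads = 52
```

Either way, the same single OpenMP runtime setting is being adjusted; only where it is specified differs.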
I just repeated one of the meta-dynamics simulations performed during the iMTD-GC run independently, using the following setup (system size is 101 atoms). Moreover, using the --legacy option each simulation in the first round of the iMTD-GC workflow completed in around 16 minutes, so there appears to be some benefit to the old parallelization scheme. While doing that I also noted an inconsistent Vbias prefactor during the first iteration between the CREST 3.0 iMTD-GC run and the one with the --legacy option for the same system. Visualization of the trajectories confirms the different behavior, as the MD with --legacy appears to move much more than the one without the flag. Old version: Many thanks in advance! I will try to recompile CREST on my machine in the meantime to see if that resolves the optimization issue.
Fair enough, it seems OpenMP doesn't handle the potential parallelization as automatically as I thought. I'll have to look into how, and if, the parallelization can be coaxed into behavior similar to before. The kbias should not be different, but the printout declaration in CREST <3.0 and with the …
I looked into the parallelization issue. Unfortunately without much success, but with some insight. What I was originally hoping for was automatic nested parallelism in the OpenMP regions. Since this was obviously not the case, I tried defining it explicitly, and in the process revised some of the automatic OpenMP handling of the code in general (#288). The problem with nested OpenMP parallel regions is that they can create severe overhead for managing the threads, which might slow down the overall process. I think we are seeing this here as well. Unfortunately, this is nowhere near the speedup a single standalone MD would receive from 3 threads (that would result in ~30 sec. wall time), but it is still a little better than the old code version and uses the available threads more effectively. In summary, this is unfortunately the best nested OpenMP can achieve for the MDs. I think that to actually restore behavior similar to the xtb subprocess calls, we'd need to resort to a combination of OpenMP and MPI code. At some point I might want to do this, but not right now.
Adding some timing data for the updated parallelization with the same system: So the speedup appears to be more meaningful for larger systems.
Not bad, that's a relief.
Let me see, I haven't been able to reproduce this, but I have a hunch where it might come from (which is maybe not tblite after all).
I had to finish some other things and might be able to get to it tomorrow. The double dictionary is in a serial part of the code; there is no OMP parallelization there. There is now also a dependency on toml-f.
@pitsteinbach I think it's not related to … @kevinkong98 Could you please try a run with the changes committed in #289? This should circumvent the implicit allocation which threw the error.
@pprcht It worked! The entire conformational search finished without any errors!
great, thanks for testing!
During the iMTD-GC workflow, if the number of threads specified in CREST 3.0 (using the pre-compiled version) is bigger than the number of MTDs running concurrently (for example, I have 52 threads on my workstation), many threads are left unused (the thread count is specified in the .toml input file) and each MTD simulation uses only one thread. In contrast, CREST 2.12 uses all threads specified by the OMP_NUM_THREADS environment variable. This probably causes a significant slowdown when using CREST with large systems. Is there any way to recover the old parallelization behavior?
Many thanks!