
Segmentation fault with Espresso 4.1.1 using p3m, Gay-Berne potential, Lennard-Jones potential and virtual sites #3713

Closed
Kischy opened this issue May 14, 2020 · 66 comments

Comments


Kischy commented May 14, 2020

I have a very weird segmentation fault in one of my simulations. The curious thing is that I run multiple simulations in exactly the same manner, and the only thing I change between them is the position of a virtual particle that carries a charge. For one particular charge position the simulation fails with a segmentation fault; all the others work just fine. The point at which it fails is different in every run, but it fails consistently. I also tested it on a different machine and with a different version of Espresso, but the error persists. I can give more information if needed.

Here is an example of the error message:

[dhcp-184:16942] *** Process received signal ***
[dhcp-184:16942] Signal: Segmentation fault (11)
[dhcp-184:16942] Signal code: Address not mapped (1)
[dhcp-184:16942] Failing at address: 0xfffffffc0271d940
[dhcp-184:16942] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f8f7abf6f20]
[dhcp-184:16942] [ 1] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(void p3m_do_assign_charge<7>(double, Utils::Vector<double, 3ul>&, int)+0x21c)[0x7f8f757211bc]
[dhcp-184:16942] [ 2] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(p3m_charge_assign(ParticleRange const&)+0x759)[0x7f8f75713ee9]
[dhcp-184:16942] [ 3] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(Coulomb::calc_long_range_force(ParticleRange const&)+0x35)[0x7f8f75725745]
[dhcp-184:16942] [ 4] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(force_calc(CellStructure&)+0x168)[0x7f8f75662248]
[dhcp-184:16942] [ 5] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(integrate_vv(int, int)+0x163)[0x7f8f756786f3]
[dhcp-184:16942] [ 6] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(+0xed469)[0x7f8f7562d469]
[dhcp-184:16942] [ 7] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(Communication::detail::callback_reduce_t<std::plus<int>, int (*)(int, int), int, int>::operator()(boost::mpi::communicator const&, boost::mpi::packed_iarchive&) const+0x42)[0x7f8f75635822]
[dhcp-184:16942] [ 8] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(mpi_loop()+0xa2)[0x7f8f7562e162]
[dhcp-184:16942] [ 9] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/python/espressomd/_init.so(PyInit__init+0x67b)[0x7f8f759dd24b]
[dhcp-184:16942] [10] /usr/bin/python3(_PyImport_LoadDynamicModuleWithSpec+0x16f)[0x5fb06f]
[dhcp-184:16942] [11] /usr/bin/python3[0x5fb2ed]
[dhcp-184:16942] [12] /usr/bin/python3(PyCFunction_Call+0x13e)[0x566d9e]
[dhcp-184:16942] [13] /usr/bin/python3(_PyEval_EvalFrameDefault+0x53e0)[0x510f50]
[dhcp-184:16942] [14] /usr/bin/python3[0x507d64]
[dhcp-184:16942] [15] /usr/bin/python3[0x509a90]
[dhcp-184:16942] [16] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [17] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [18] /usr/bin/python3[0x509758]
[dhcp-184:16942] [19] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [20] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [21] /usr/bin/python3[0x509758]
[dhcp-184:16942] [22] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [23] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [24] /usr/bin/python3[0x509758]
[dhcp-184:16942] [25] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [26] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [27] /usr/bin/python3[0x509758]
[dhcp-184:16942] [28] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [29] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 0 on node dhcp-184 exited on signal 11 (Segmentation fault).

Kischy commented May 14, 2020

Might be connected to issue #3663, but I am not sure. I do not remove any particles during the simulation.

@Kischy Kischy changed the title Segmentation fault with Espresso 4.1.1 using p3m Segmentation fault with Espresso 4.1.1 using p3m and virtual sites May 14, 2020
@Kischy Kischy changed the title Segmentation fault with Espresso 4.1.1 using p3m and virtual sites Segmentation fault with Espresso 4.1.1 using p3m, Gay-Berne potential, Lennard-Jones potential and virtual sites May 14, 2020

jngrad commented May 14, 2020

This looks like espresso 4.1.2, failing in function void p3m_do_assign_charge<7>(double, Utils::Vector<double, 3ul>&, int). This part of the code was refactored recently in the dev version. Is this bug reproducible without MPI? If not, is the failure reproducible on the same MPI rank by using the same particle position?


jngrad commented May 14, 2020

In 4.1.1 there are additional checks available to generate meaningful error messages before a segfault:

#ifdef ADDITIONAL_CHECKS
  if (pos < -skin * p3m.params.ai[d]) {
    fprintf(stderr, "%d: rs_mesh underflow! (pos %f)\n", this_node,
            real_pos[d]);
    fprintf(stderr, "%d: allowed coordinates: %f - %f\n", this_node,
            local_geo.my_left()[d] - skin, local_geo.my_right()[d] + skin);
  }
  if ((nmp + cao) > p3m.local_mesh.dim[d]) {
    fprintf(stderr, "%d: rs_mesh overflow! (pos %f, nmp=%d)\n", this_node,
            real_pos[d], nmp);
    fprintf(stderr, "%d: allowed coordinates: %f - %f\n", this_node,
            local_geo.my_left()[d] - skin, local_geo.my_right()[d] + skin);
  }
#endif

Could you please recompile your version of espresso with the extra feature ADDITIONAL_CHECKS in your myconfig.hpp and share with us if there is a runtime error instead of a segfault? This would spare us from running GDB in MPI-parallel code.


Kischy commented May 14, 2020

Could you please recompile your version of espresso with the extra feature ADDITIONAL_CHECKS in your myconfig.hpp and share with us if there is a runtime error instead of a segfault? This would spare us from running GDB in MPI-parallel code.

I will try that tomorrow and come back with the output. It might take a while; the simulation does not fail right at the start.


Kischy commented May 14, 2020

Is this bug reproducible without MPI?

I have never tried without MPI, because the tuning of the p3m algorithm never finishes, or at least not in a reasonable time.

If not, is the failure reproducible on the same MPI rank by using the same particle position?

What exactly do you mean by this question?


jngrad commented May 14, 2020

If not, is the failure reproducible on the same MPI rank by using the same particle position?

What exactly do you mean by this question?

You mentioned the system crashes depending on the charge and position of the virtual particle. The backtrace shows the failure happened on MPI rank 8. The MPI ranks are assigned to slices of the cell system. If you re-run the same script, the virtual particle will be in the same domain. If MPI rank 8 fails repeatedly, we can attach GDB to ranks 0 and 8 to get an opportunity to print the values of all local variables in every function shown in the backtrace. However, it's quite tricky, and there is no need to do that if ADDITIONAL_CHECKS catches the error.


Kischy commented May 14, 2020

It only depends on the position of the virtual particle; the value of the charge does not change between different simulation runs. I will try ADDITIONAL_CHECKS tomorrow and come back. Thank you.

This is another run:

[dhcp-184:18535] *** Process received signal ***
[dhcp-184:18535] Signal: Segmentation fault (11)
[dhcp-184:18535] Signal code: Address not mapped (1)
[dhcp-184:18535] Failing at address: 0xfffffffc03824450
[dhcp-184:18535] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f245fa0ef20]
[dhcp-184:18535] [ 1] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z20p3m_do_assign_chargeILi7EEvdRN5Utils6VectorIdLm3EEEi+0x21c)[0x7f245a5391bc]
[dhcp-184:18535] [ 2] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z17p3m_charge_assignRK13ParticleRange+0x759)[0x7f245a52bee9]
[dhcp-184:18535] [ 3] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_ZN7Coulomb21calc_long_range_forceERK13ParticleRange+0x35)[0x7f245a53d745]
[dhcp-184:18535] [ 4] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z10force_calcR13CellStructure+0x168)[0x7f245a47a248]
[dhcp-184:18535] [ 5] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z12integrate_vvii+0x163)[0x7f245a4906f3]
[dhcp-184:18535] [ 6] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z13mpi_integrateii+0x77)[0x7f245a446d67]
[dhcp-184:18535] [ 7] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z16python_integrateibb+0xde)[0x7f245a490ace]
[dhcp-184:18535] [ 8] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/python/espressomd/integrate.so(+0xc0ac)[0x7f24372240ac]
[dhcp-184:18535] [ 9] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/python/espressomd/script_interface.so(+0xf59c)[0x7f245825f59c]
[dhcp-184:18535] [10] /usr/bin/python3(_PyObject_FastCallKeywords+0x19c)[0x5a9cbc]
[dhcp-184:18535] [11] /usr/bin/python3[0x50a5c3]
[dhcp-184:18535] [12] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:18535] [13] /usr/bin/python3[0x507d64]
[dhcp-184:18535] [14] /usr/bin/python3(PyEval_EvalCode+0x23)[0x50ae13]
[dhcp-184:18535] [15] /usr/bin/python3[0x634c82]
[dhcp-184:18535] [16] /usr/bin/python3(PyRun_FileExFlags+0x97)[0x634d37]
[dhcp-184:18535] [17] /usr/bin/python3(PyRun_SimpleFileExFlags+0x17f)[0x6384ef]
[dhcp-184:18535] [18] /usr/bin/python3(Py_Main+0x591)[0x639091]
[dhcp-184:18535] [19] /usr/bin/python3(main+0xe0)[0x4b0d00]
[dhcp-184:18535] [20] /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xe7)[0x7f245f9f1b97]
[dhcp-184:18535] [21] /usr/bin/python3(_start+0x2a)[0x5b250a]
[dhcp-184:18535] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node dhcp-184 exited on signal 11 (Segmentation fault).

and yet another run:

[dhcp-184:16942] *** Process received signal ***
[dhcp-184:16942] Signal: Segmentation fault (11)
[dhcp-184:16942] Signal code: Address not mapped (1)
[dhcp-184:16942] Failing at address: 0xfffffffc0271d940
[dhcp-184:16942] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f8f7abf6f20]
[dhcp-184:16942] [ 1] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z20p3m_do_assign_chargeILi7EEvdRN5Utils6VectorIdLm3EEEi+0x21c)[0x7f8f757211bc]
[dhcp-184:16942] [ 2] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z17p3m_charge_assignRK13ParticleRange+0x759)[0x7f8f75713ee9]
[dhcp-184:16942] [ 3] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_ZN7Coulomb21calc_long_range_forceERK13ParticleRange+0x35)[0x7f8f75725745]
[dhcp-184:16942] [ 4] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z10force_calcR13CellStructure+0x168)[0x7f8f75662248]
[dhcp-184:16942] [ 5] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z12integrate_vvii+0x163)[0x7f8f756786f3]
[dhcp-184:16942] [ 6] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(+0xed469)[0x7f8f7562d469]
[dhcp-184:16942] [ 7] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_ZNK13Communication6detail17callback_reduce_tISt4plusIiEPFiiiEJiiEEclERKN5boost3mpi12communicatorERNS8_15packed_iarchiveE+0x42)[0x7f8f75635822]
[dhcp-184:16942] [ 8] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/core/EspressoCore.so(_Z8mpi_loopv+0xa2)[0x7f8f7562e162]
[dhcp-184:16942] [ 9] /home/chaege/espresso-4_1_1_ILC/espresso/build/src/python/espressomd/_init.so(PyInit__init+0x67b)[0x7f8f759dd24b]
[dhcp-184:16942] [10] /usr/bin/python3(_PyImport_LoadDynamicModuleWithSpec+0x16f)[0x5fb06f]
[dhcp-184:16942] [11] /usr/bin/python3[0x5fb2ed]
[dhcp-184:16942] [12] /usr/bin/python3(PyCFunction_Call+0x13e)[0x566d9e]
[dhcp-184:16942] [13] /usr/bin/python3(_PyEval_EvalFrameDefault+0x53e0)[0x510f50]
[dhcp-184:16942] [14] /usr/bin/python3[0x507d64]
[dhcp-184:16942] [15] /usr/bin/python3[0x509a90]
[dhcp-184:16942] [16] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [17] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [18] /usr/bin/python3[0x509758]
[dhcp-184:16942] [19] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [20] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [21] /usr/bin/python3[0x509758]
[dhcp-184:16942] [22] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [23] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [24] /usr/bin/python3[0x509758]
[dhcp-184:16942] [25] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [26] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] [27] /usr/bin/python3[0x509758]
[dhcp-184:16942] [28] /usr/bin/python3[0x50a48d]
[dhcp-184:16942] [29] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:16942] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 8 with PID 0 on node dhcp-184 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


Kischy commented May 15, 2020

So I compiled using ADDITIONAL_CHECKS and reran the simulation. Unfortunately the error looks much the same:

[dhcp-184:33068] *** Process received signal ***
[dhcp-184:33068] Signal: Segmentation fault (11)
[dhcp-184:33068] Signal code: Address not mapped (1)
[dhcp-184:33068] Failing at address: 0xfffffffc02e89630
[dhcp-184:33068] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fb2894eef20]
[dhcp-184:33068] [ 1] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(_Z20p3m_do_assign_chargeILi7EEvdRN5Utils6VectorIdLm3EEEi+0x7b4)[0x7fb28401b174]
[dhcp-184:33068] [ 2] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(_Z17p3m_charge_assignRK13ParticleRange+0x759)[0x7fb28400bf89]
[dhcp-184:33068] [ 3] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(_ZN7Coulomb21calc_long_range_forceERK13ParticleRange+0x35)[0x7fb28401f4f5]
[dhcp-184:33068] [ 4] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(_Z10force_calcR13CellStructure+0x168)[0x7fb283f5a278]
[dhcp-184:33068] [ 5] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(_Z12integrate_vvii+0x163)[0x7fb283f70eb3]
[dhcp-184:33068] [ 6] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(+0xed469)[0x7fb283f25469]
[dhcp-184:33068] [ 7] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(_ZNK13Communication6detail17callback_reduce_tISt4plusIiEPFiiiEJiiEEclERKN5boost3mpi12communicatorERNS8_15packed_iarchiveE+0x42)[0x7fb283f2d822]
[dhcp-184:33068] [ 8] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/core/EspressoCore.so(_Z8mpi_loopv+0xa2)[0x7fb283f26162]
[dhcp-184:33068] [ 9] /home/chaege/espresso-4_1_1_ILC_add_checks/espresso/build/src/python/espressomd/_init.so(PyInit__init+0x67b)[0x7fb2842d52bb]
[dhcp-184:33068] [10] /usr/bin/python3(_PyImport_LoadDynamicModuleWithSpec+0x16f)[0x5fb06f]
[dhcp-184:33068] [11] /usr/bin/python3[0x5fb2ed]
[dhcp-184:33068] [12] /usr/bin/python3(PyCFunction_Call+0x13e)[0x566d9e]
[dhcp-184:33068] [13] /usr/bin/python3(_PyEval_EvalFrameDefault+0x53e0)[0x510f50]
[dhcp-184:33068] [14] /usr/bin/python3[0x507d64]
[dhcp-184:33068] [15] /usr/bin/python3[0x509a90]
[dhcp-184:33068] [16] /usr/bin/python3[0x50a48d]
[dhcp-184:33068] [17] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:33068] [18] /usr/bin/python3[0x509758]
[dhcp-184:33068] [19] /usr/bin/python3[0x50a48d]
[dhcp-184:33068] [20] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:33068] [21] /usr/bin/python3[0x509758]
[dhcp-184:33068] [22] /usr/bin/python3[0x50a48d]
[dhcp-184:33068] [23] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:33068] [24] /usr/bin/python3[0x509758]
[dhcp-184:33068] [25] /usr/bin/python3[0x50a48d]
[dhcp-184:33068] [26] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:33068] [27] /usr/bin/python3[0x509758]
[dhcp-184:33068] [28] /usr/bin/python3[0x50a48d]
[dhcp-184:33068] [29] /usr/bin/python3(_PyEval_EvalFrameDefault+0x444)[0x50bfb4]
[dhcp-184:33068] *** End of error message ***
--------------------------------------------------------------------------
mpirun noticed that process rank 9 with PID 0 on node dhcp-184 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------


fweik commented May 15, 2020

It's not obvious to me what is happening here from the traces. Can you please provide a (minimal) script that reproduces the issue?


KaiSzuttor commented May 15, 2020

Doesn't that look like a blow-up of forces, if the top of the stack is p3m_do_assign_charge?


Kischy commented May 15, 2020

I can send you the whole script. I can't post it here, because it's not published yet. It is not minimal, because this is hard to reproduce.

@fweik fweik removed their assignment May 15, 2020

Kischy commented May 15, 2020

I just sent you the script. Thank you for looking into this. If there is anything I can do, please let me know.


Kischy commented May 15, 2020

Doesn't that look like a blow-up of forces, if the top of the stack is p3m_do_assign_charge?

Might be possible. In this simulation the charge is relatively exposed in comparison with the other simulations.


fweik commented May 15, 2020

Did you set the min_global_cut to the maximal distance between virtual and reference particle?


Kischy commented May 15, 2020

Almost, I would say. rel_virt_pos[2] holds the maximal distance. I had problems with the exact value, which I assumed to stem from floating point inaccuracies, so I added a tiny bit.

md_system.min_global_cut = rel_virt_pos[2] + 0.0000001 (line 53, file particle_setup_q_2_zc_1_5.py)


fweik commented May 15, 2020

This is an open source project; we cannot provide private support to you. If you want our help, the discussion has to be public, so that it is also helpful to others.


Kischy commented May 15, 2020

This is an open source project; we cannot provide private support to you. If you want our help, the discussion has to be public, so that it is also helpful to others.

I think this must be some kind of misunderstanding: I never asked you for private support. I simply wanted to report a bug, which I will definitely not do again in such a hostile environment. You asked me for a simulation script and I sent it to you. I understand your need for a public discussion, but with a pending paper I simply cannot publish the script here yet. I can provide the script here once the paper is published.


Kischy commented May 15, 2020

By the way: This is the second time I feel like I am being attacked for trying to contribute. For me, that is unacceptable for any kind of open source project.


fweik commented May 15, 2020

I'm sorry you feel that way; that was not my intention. But referring to a private mail in a public forum just isn't very helpful.


Kischy commented May 15, 2020

But referring to a private mail in a public forum just isn't very helpful.

That is right, but as I said, I can provide it later. And I did not refer to the private mail; I simply answered your question in a meaningful way. I could have simply stated "yes", which is not as helpful once the scripts are provided. The reference was a mere bonus.


Kischy commented May 15, 2020

I just realized that if this simulation does not work, it will not be included in the publication anyway. So here you go:
q_2_zc_1_5.zip

You can start the simulation from "Temp_alle/ILC.py" if you want.


fweik commented May 15, 2020

Ok, I can see how my comment could be understood as harsh; as I said, this wasn't my intention. With regards to the problem, I've run out of ideas for now; I'll have a look again next week.
@KaiSzuttor yes, that is what we thought first, but with ADDITIONAL_CHECKS it is actually checked that the particles are within the valid domain...


jngrad commented May 15, 2020

@Kischy Thank you for reporting this issue to us. This led us to re-introduce one of the 4.1 ADDITIONAL_CHECKS assertions in the P3M code of the 4.2 dev branch (#3715).

It's unfortunate ADDITIONAL_CHECKS didn't catch the issue. This means something else is wrong, but from our side we cannot tell whether the issue is a bug in espresso or an issue with your script, as the backtrace alone doesn't provide enough information. As explained in our contributing guidelines about bug reports (GitHub, espressomd.org), we kindly ask users to submit a minimal working example that reproduces the error. This is standard practice in open source projects, as large scripts are difficult to investigate and often contain confidential code that is not relevant to reproduce the bug. In my experience, debugging MPI parallel code usually takes several weeks of investigation and requires the expertise of multiple developers. Sometimes we also need the input of external developers when it comes to bugs from libraries used by espresso (e.g. OpenMPI and the ecosystem of libraries around it). This can only happen if everyone has access to a bisected version of the script. I hope this clarifies our policy.


Kischy commented May 15, 2020

Ok, I can see how my comment could be understood as harsh; as I said, this wasn't my intention. With regards to the problem, I've run out of ideas for now; I'll have a look again next week.
@KaiSzuttor yes, that is what we thought first, but with ADDITIONAL_CHECKS it is actually checked that the particles are within the valid domain...

Okay, no worries; the tone came across as quite aggressive to me, hence my reaction. If we are back to a helpful discussion, I want to mention again that this is the second time this has happened, and I kindly ask for an improvement on that front.


Kischy commented May 15, 2020

@Kischy Thank you for reporting this issue to us. This led us to re-introduce one of the 4.1 ADDITIONAL_CHECKS assertions in the P3M code of the 4.2 dev branch (#3715).

It's unfortunate ADDITIONAL_CHECKS didn't catch the issue. This means something else is wrong, but from our side we cannot tell whether the issue is a bug in espresso or an issue with your script, as the backtrace alone doesn't provide enough information. As explained in our contributing guidelines about bug reports (GitHub, espressomd.org), we kindly ask users to submit a minimal working example that reproduces the error. This is standard practice in open source projects, as large scripts are difficult to investigate and often contain confidential code that is not relevant to reproduce the bug. In my experience, debugging MPI parallel code usually takes several weeks of investigation and requires the expertise of multiple developers. Sometimes we also need the input of external developers when it comes to bugs from libraries used by espresso (e.g. OpenMPI and the ecosystem of libraries around it). This can only happen if everyone has access to a bisected version of the script. I hope this clarifies our policy.

I understand the policy, the files are provided above.


Kischy commented May 15, 2020

If you change the value in line 29 from 1.5 to 1.2 in file "particle_setup_q_2_zc_1_5.py" and then run the simulation, there is no issue. The value changes the position of the virtual particle relative to the Gay-Berne particle. In fact, all values from 0.0 to 1.2 seem to work fine. Only when the charge is at or close to the tip of the GB particle do the simulations break, for some reason.


fweik commented May 19, 2020

@Kischy no harm done. I think the communication was not optimal from both sides. Let's do better next time and move on.


jngrad commented May 19, 2020

I tracked down the source of the NaN to this expression:

/* Taken from "On the numerical integration of motion for rigid polyatomics:
 * The modified quaternion approach", Omeylan, Igor (1998), Eq. 12.*/
auto const lambda =
    1 - S[0] * time_step_squared_half -
    sqrt(1 - time_step_squared *
             (S[0] + time_step * (S[1] + time_step_half / 2. *
                                             (S[2] - S[0] * S[0]))));
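In plain math notation (a transcription of the code above, assuming time_step_squared_half = Δt²/2 and time_step_half = Δt/2, and writing S_0, S_1, S_2 for S[0], S[1], S[2]), this reads

\lambda = 1 - \frac{\Delta t^2}{2} S_0 - \sqrt{1 - \Delta t^2 \left( S_0 + \Delta t \left( S_1 + \frac{\Delta t}{4} \left( S_2 - S_0^2 \right) \right) \right)}

so lambda is only real as long as the expression under the square root stays non-negative.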

We're taking the square root of a negative number, which propagates to the particle quaternion, to the particle director, and finally to the virtual site position via:

auto const new_pos = p_real->r.p + director * p.p.vs_relative.distance;

The lambda expression matches eq. 12 from the Omelyan paper, with S = {q'^2, q'.q'', q''^2}. This probably means we are feeding define_Qdd() with values outside the definition domain of the method in the paper, or the fourth-order approximation we're using is not precise enough. @Kischy I'm not familiar enough with this method to come up with reasonable range checks, do you have any suggestions? The input parameters for define_Qdd() that lead to lambda = NaN are as follows:

p.r.quat = {1, 0, 0, 0}
p.m.omega = {0, 0, 0}
p.f.torque = {103486.05456599266, -1033296.3969197127, 65.09739381012956}
p.p.rinertia = {0.5, 0.5, 0.10000000000000001}

With regards to your script, I think the force cap you're applying during the warmup phase doesn't actually increase linearly. Furthermore the force cap doesn't cap torques, which is why the particle above has a torque of 1 million AU around the y axis while the linear force is less than 9 AU.
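For illustration, a warmup loop with a linearly increasing force cap could look roughly like this (a sketch only; md_system is the System instance from the script, while warmup_loops, cap_increment and warmup_integ_steps are hypothetical names):

# Sketch of a linearly increasing force cap during warmup.
# warmup_loops, cap_increment and warmup_integ_steps are placeholders.
for i in range(1, warmup_loops + 1):
    md_system.force_cap = i * cap_increment      # cap grows linearly each loop
    md_system.integrator.run(warmup_integ_steps)
md_system.force_cap = 0                          # 0 disables capping for production

As noted above, this only caps the translational forces; torques are unaffected.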


Kischy commented May 20, 2020

@Kischy I'm not familiar enough with this method to come up with reasonable range checks, do you have any suggestions? The input parameters for define_Qdd() that lead to lambda = NaN are as follows:

I'll have a look at the paper again and come back if I find anything useful.


p.r.quat = {1, 0, 0, 0}
p.m.omega = {0, 0, 0}
p.f.torque = {103486.05456599266, -1033296.3969197127, 65.09739381012956}
p.p.rinertia = {0.5, 0.5, 0.10000000000000001}

One question about the parameters: From p.p.rinertia = {0.5, 0.5, 0.10000000000000001} I can see that it is a GB particle, but why does it have the starting values of p.r.quat = {1, 0, 0, 0} for the quaternion? The value of the quaternion should change after the first md step, shouldn't it? Or is p.r.quat not what I think it is?


With regards to your script, I think the force cap you're applying during the warmup phase doesn't actually increase linearly. Furthermore the force cap doesn't cap torques, which is why the particle above has a torque of 1 million AU around the y axis while the linear force is less than 9 AU.

I never thought it was a problem that the warmup cap does not increase linearly. I can change that and try again.
About the torques: I naively assumed that a force cap also applies to torques. Is there a possibility to cap the torques? Side note: I can see how the torques can easily get out of hand for this simulation. There is a charge right at the end of the anisotropic Gay-Berne particle, which makes it easy for the particle to rotate. For all my other simulations the charge is closer to the center of the Gay-Berne particle, and therefore the torque is probably not as high.


jngrad commented May 20, 2020

I never thought it was a problem that the warmup cap does not increase linearly. I can change that and try again.

It is not an issue in itself. The line md_system.force_cap = warmup_cap * warmup_integ_steps in the for loop made me think that you were increasing the force cap, as we typically do in the samples, but then I realized the two variables were constants.

why does it have the starting values of p.r.quat = {1, 0, 0, 0} for the quaternion? The value of the quaternion should change after the first md step, shouldn't it?

I got it to fail on the first integration step:

(gdb) print sim_time
$4 = 0

About the torques: I naively assumed that a force cap also applies to torques.

I did too. In fact, the steepest descent algorithm does exactly that. Not sure why the force cap doesn't, though. You could update the code in forcecap_cap() at L46-L55 to also cap torques, although I didn't try it and make no guarantee about the result. Force capping in general should be used with care, especially when there's already a thermostat active (in your case langevin), because they will compete. In particular, langevin corrects the torques of virtual sites.

@fweik I can't really follow the old integrator code nor find where the temperature global is recalculated; could it be that force capping artificially decreases the linear velocities and thus the global temperature before langevin is applied, creating a negative feedback loop where langevin increases both the linear velocity of real particles and torques of real and virtual particles?


Kischy commented May 20, 2020

p.r.quat = {1, 0, 0, 0}
p.m.omega = {0, 0, 0}
p.f.torque = {103486.05456599266, -1033296.3969197127, 65.09739381012956}
p.p.rinertia = {0.5, 0.5, 0.10000000000000001}

I do not know why these lead to a negative value in the sqrt. It might be that the time step is simply too high and I have to use a lower time step for this simulation.


I got it to fail on the first integration step:

The torque values are immediately that high? Even with the particles so far apart?


I have a few questions about the code though. Is the quaternion in espresso defined differently than it is in the papers "On the numerical integration of motion for rigid polyatomics: The modified quaternion approach", Omeylan, Igor (1998), Eq. 12." and "An improved algorithm for molecular dynamics simulation of rigid molecules", Sonnenschein, Roland (1985)?

I ask because I do not understand some of the equations in define_Qdd.
Shouldn't these equations

  /* calculate the first derivative of the quaternion */
  /* Taken from "An improved algorithm for molecular dynamics simulation of
   * rigid molecules", Sonnenschein, Roland (1985), Eq. 4.*/
  Qd[0] = 0.5 * (-p.r.quat[1] * p.m.omega[0] - p.r.quat[2] * p.m.omega[1] -
                 p.r.quat[3] * p.m.omega[2]);

  Qd[1] = 0.5 * (p.r.quat[0] * p.m.omega[0] - p.r.quat[3] * p.m.omega[1] +
                 p.r.quat[2] * p.m.omega[2]);

  Qd[2] = 0.5 * (p.r.quat[3] * p.m.omega[0] + p.r.quat[0] * p.m.omega[1] -
                 p.r.quat[1] * p.m.omega[2]);

  Qd[3] = 0.5 * (-p.r.quat[2] * p.m.omega[0] + p.r.quat[1] * p.m.omega[1] +
                 p.r.quat[0] * p.m.omega[2]);

be

  /* calculate the first derivative of the quaternion */
  /* Taken from "An improved algorithm for molecular dynamics simulation of
   * rigid molecules", Sonnenschein, Roland (1985), Eq. 4.*/
  Qd[0] = 0.5 * (-p.r.quat[2] * p.m.omega[0] - p.r.quat[3] * p.m.omega[1] +
                 p.r.quat[1] * p.m.omega[2]);

  Qd[1] = 0.5 * (p.r.quat[3] * p.m.omega[0] - p.r.quat[2] * p.m.omega[1] -
                 p.r.quat[0] * p.m.omega[2]);

  Qd[2] = 0.5 * (p.r.quat[0] * p.m.omega[0] + p.r.quat[1] * p.m.omega[1] +
                 p.r.quat[3] * p.m.omega[2]);

  Qd[3] = 0.5 * (-p.r.quat[1] * p.m.omega[0] + p.r.quat[0] * p.m.omega[1] -
                 p.r.quat[2] * p.m.omega[2]);

?
Maybe my math is incorrect. My matrix multiplication skills are a bit rusty. See here:
(image attachment: first_der_quaternion)

If my math is correct I have other issues like this one as the code continues, but can someone first check my math in context of equation 4 in "An improved algorithm for molecular dynamics simulation of rigid molecules", Sonnenschein, Roland (1985)?

Here is the equation from the paper:
(two image attachments, not reproduced here)


Kischy commented May 20, 2020

I did too. In fact, the steepest descent algorithm does exactly that. Not sure why the force cap doesn't, though. You could update the code in forcecap_cap() at L46-L55 to also cap torques, although I didn't try it and make no guarantee about the result. Force capping in general should be used with care, especially when there's already a thermostat active (in your case langevin), because they will compete. In particular, langevin corrects the torques of virtual sites.

Are torques calculated for the virtual sites? I think I turned rotation off for the virtual sites. Only the GB and LJ particles need to rotate.
Thank you, that is good to know. Maybe I will turn off the thermostat during warmup.


jngrad commented May 20, 2020

Are torques calculated for the virtual sites? I think I turned rotation off for the virtual sites.

I'm not exactly sure what happens with virtual site torques and thermostats, but after removing both the force cap and thermostat, the NaN issue persists, so the issue has a different origin.

I got it to fail on the first integration step:

The torque values are immediately that high? Even with the particles so far apart?

Yes, although I'm not sure why.

I have a few questions about the code though. Is the quaternion in espresso defined differently than it is in the papers

Yes, the papers use a different formalism for quaternions. This already caused issues once (#2964). It's now clear to me we need to write down in rotation.cpp which formalism is used. The espresso quaternion uses scalar-first notation, while Sonnenschein 1985 and Omeylan 1998 use scalar-last notation. Looking at equation (4), I'd do the matrix multiplication as follows:
\mathbf{\dot{q}} = \frac{1}{2}\mathbf{A}\begin{pmatrix} \omega_x \\ \omega_y \\ \omega_z \\ 0 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} -\zeta\omega_x - \chi\omega_y + \eta\omega_z \\ \chi\omega_x - \zeta\omega_y - \xi\omega_z \\ \xi\omega_x - \eta\omega_y - \chi\omega_z \\ -\eta\omega_x + \xi\omega_y - \zeta\omega_z \end{pmatrix}

Qd[0] = 0.5 * (-p.r.quat[3] * p.m.omega[0] - p.r.quat[0] * p.m.omega[1] + p.r.quat[2] * p.m.omega[2]);
Qd[1] = 0.5 * (p.r.quat[0] * p.m.omega[0] - p.r.quat[3] * p.m.omega[1] - p.r.quat[1] * p.m.omega[2]);
Qd[2] = 0.5 * (p.r.quat[1] * p.m.omega[0] - p.r.quat[2] * p.m.omega[1] - p.r.quat[0] * p.m.omega[2]);
Qd[3] = 0.5 * (-p.r.quat[2] * p.m.omega[0] + p.r.quat[1] * p.m.omega[1] - p.r.quat[3] * p.m.omega[2]);

If my math is correct I have other issues like this one as the code continues, but can someone first check my math in context of equation 4

I think you have substituted the greek letters by the wrong quaternion indices in the expression of matrix A. The substitution should be \mathbf{q} = (\xi, \eta, \zeta, \chi) = (q_1, q_2, q_3, q_0) if I'm not mistaken.


Kischy commented May 20, 2020

Yes, the papers use a different formalism for quaternions. This already caused issues once (#2964). It's now clear to me we need to write down in rotation.cpp which formalism is used. The espresso quaternion uses scalar-first notation, while Sonnenschein 1985 and Omeylan 1998 use scalar-last notation. Looking at equation (4), I'd do the matrix multiplication as follows:
\mathbf{\dot{q}} = \frac{1}{2}\mathbf{A}\begin{pmatrix} \omega_x \\ \omega_y \\ \omega_z \\ 0 \end{pmatrix} = \frac{1}{2}\begin{pmatrix} -\zeta\omega_x - \chi\omega_y + \eta\omega_z \\ \chi\omega_x - \zeta\omega_y - \xi\omega_z \\ \xi\omega_x - \eta\omega_y - \chi\omega_z \\ -\eta\omega_x + \xi\omega_y - \zeta\omega_z \end{pmatrix}

Qd[0] = 0.5 * (-p.r.quat[3] * p.m.omega[0] - p.r.quat[0] * p.m.omega[1] + p.r.quat[2] * p.m.omega[2]);
Qd[1] = 0.5 * (p.r.quat[0] * p.m.omega[0] - p.r.quat[3] * p.m.omega[1] - p.r.quat[1] * p.m.omega[2]);
Qd[2] = 0.5 * (p.r.quat[1] * p.m.omega[0] - p.r.quat[2] * p.m.omega[1] - p.r.quat[0] * p.m.omega[2]);
Qd[3] = 0.5 * (-p.r.quat[2] * p.m.omega[0] + p.r.quat[1] * p.m.omega[1] - p.r.quat[3] * p.m.omega[2]);

If my math is correct I have other issues like this one as the code continues, but can someone first check my math in context of equation 4

I think you have substituted the greek letters by the wrong quaternion indices in the expression of matrix A. The substitution should be \mathbf{q} = (\xi, \eta, \zeta, \chi) = (q_1, q_2, q_3, q_0) if I'm not mistaken.

That is confusing and not clear at all. The comment

  /* Taken from "On the numerical integration of motion for rigid polyatomics:
   * The modified quaternion approach", Omeylan, Igor (1998), Eq. 12.*/

suggests that the formalism of the paper is used, and I thought it was scalar-first. Thank you for clearing that up. Please note this somewhere! I would suggest updating the comment.


Kischy commented May 20, 2020

I think you have substituted the greek letters by the wrong quaternion indices in the expression of matrix A. The substitution should be \mathbf{q} = (\xi, \eta, \zeta, \chi) = (q_1, q_2, q_3, q_0) if I'm not mistaken.

@jngrad I think the signs in your matrix might be incorrect. The third row should be all positive signs. But even with this, the result does not match the code, does it?


jngrad commented May 20, 2020

The third row should be all positive signs.

Nice catch, it was a copy-paste error. The third line of the code should also have only positive signs.

I would suggest updating the comment.

We definitely need to update the comment to reflect the formalism used in the function.

But even with this, the result does not match the code, does it?

Updating the code with my solution (with the correct signs) fails the rotational_inertia.py test. To move forward, I think we need to come up with a simple case, e.g. a particle with angular velocity [1, 0, 0] and torque [5, 0, 0], for which we can work out an analytical solution of the quaternion first time-derivative. This would allow us to write a unit test for the function. This means first refactoring define_Qdd() to pass Vector3d/4d arguments instead of a Particle. This function doesn't need to know anything about the other properties of particles anyway.

@RudolfWeeber
Contributor

First of all, I just took a very brief look at the script, so the comments below may not be exactly applicable here. I'll still comment, since I'm probably one of the people who has spent significant time with the rotation and virtual sites stuff.

I think the torque is just too big for the simulation to be stable.
If we pretend for a second that rotation is linear and use the Euler scheme, then in a single time step
omega(t=dt) = omega(t=0) + dt * torque(t=0) / inertia
The second summand should be less than one or so. Why exactly the rotational integrator produces wrong results if the term is much bigger may not be that important.
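As a back-of-the-envelope illustration of this criterion with the torque and inertia values quoted earlier in this thread (the time step below is a placeholder, not the value from the actual script):

import numpy as np

# torque and rotational inertia reported above for the failing particle
torque = np.array([103486.05456599266, -1033296.3969197127, 65.09739381012956])
rinertia = np.array([0.5, 0.5, 0.1])
dt = 0.01  # hypothetical time step

# change of the angular velocity in a single Euler step
delta_omega = dt * torque / rinertia
print(np.abs(delta_omega).max())  # ~2.1e4, far above the "less than one or so" criterion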

I'd suggest experimenting with the following changes to the simulation protocol; a rough sketch follows below (again, I just had time for a quick glance, so I might have overlooked stuff):

  • Split the setup and warmup into several parts:
  1. Put in particles with GB and LJ interaction, turn on virtual sites and rotation on the central particles
  2. Perform steepest descent without a thermostat, until the energy per degree of freedom is less than 1kT or so
  3. Resize the box in a loop, 0.25% per iteration, then integrate for 100 steps or so, before reshaping again. That's the only way I got it to work with my magnetic gel simulations. At the time, steepest descent wasn't available, so I used velocity Verlet here. Again, the energy needs to be watched. I think I also used a smaller time step.
  4. As long as the charge is within the GB ellipsoid, it should be possible to turn on electrostatics now without further warmup.

Steepest descent and force capping with electrostatics on are tricky, because the configuration in which two charges are very close is always the energetically most favorable one if they only come close enough, and the force capping allows for that.
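A rough sketch of the staged protocol described above, assuming the espressomd 4.1 Python API (box sizes, gammas, kT, thresholds and P3M parameters are placeholders, and the actual particle setup is omitted):

import espressomd
import espressomd.electrostatics

system = espressomd.System(box_l=3 * [30.0])  # start from a dilute box
system.time_step = 0.001
system.cell_system.skin = 0.4

# 1. add GB/LJ particles, virtual sites and rotation on the central particles (omitted)

# 2. steepest descent without a thermostat until the energy has relaxed
system.integrator.set_steepest_descent(f_max=0, gamma=0.1, max_displacement=0.01)
for _ in range(100):
    system.integrator.run(20)
    # rough stand-in for "energy per degree of freedom below 1 kT or so"
    if system.analysis.energy()["total"] / len(system.part) < 1.0:
        break

# 3. shrink the box slowly, ~0.25% per iteration, integrating in between
system.integrator.set_vv()
system.thermostat.set_langevin(kT=1.0, gamma=1.0, seed=42)
target_box_l = 20.0
while system.box_l[0] > target_box_l:
    system.change_volume_and_rescale_particles(0.9975 * system.box_l[0], dir="xyz")
    system.integrator.run(100)

# 4. only now switch on electrostatics, without further force capping
p3m = espressomd.electrostatics.P3M(prefactor=1.0, accuracy=1e-4)
system.actors.add(p3m)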

Hope that helps


Kischy commented May 20, 2020

@jngrad
Are we absolutely sure, then, that the formalism and therefore the new equations are correct? Are we sure that the old code is incorrect? I am asking again because I did not expect an error in such a fundamental part of the code, and we should be absolutely sure not to introduce new mistakes.

I think your suggestion for a test is perfect. Would it be possible to calculate the derivatives using a gradient, as we did in #3091? How was this tested until now?

Unfortunately I do not have much time for this at the moment. I will maybe find time on Friday or in two weeks.

@Kischy
Copy link
Contributor Author

Kischy commented May 20, 2020

@RudolfWeeber Thank you for your suggestions. I will test them.

@RudolfWeeber
Contributor

With regards to errors in the rotation code:
Never say never, but I'd be somewhat surprised. Quite a few people have looked at it in detail, Bogdan for instance. Also, rather sophisticated time-dependent stuff was done with it. E.g., producing a correct Debye spectrum for a dipolar particle. That involves time auto-correlation functions of the particle orientation. A bug would have to be quite subtle to allow for those simulations to work (compare against theory).
But it is clear that that is not a validation as per the use of the word in software engineering...


jngrad commented May 20, 2020

How was this tested until now?

It's tested indirectly by spinning a particle with anisotropic inertial moment around a principal axis and checking for angular velocity conservation, if I read the code correctly.

Would it be possible to calculate the derivatives using a gradient

It's probably easier to work out an analytical result. With a particle rotating around its principal axis and a torque applied on a perpendicular axis, the solution might just be a product of two quantities.

A bug would have to be quite subtle to allow for those simulations to work

I think the test failed when I substituted the quaternion indices because I didn't check if Qd and Qdd were scalar-first or scalar-last. The code is probably fine, however writing unit tests will help us get a better understanding of the Lagrangian in the paper and hopefully we can then figure out the domain of definition of the fourth-order solution.

Unfortunately I do not have much time for this at the moment. I will maybe find time on Friday or in two weeks.

I also don't have much time to spend on this issue this week.


RudolfWeeber commented May 20, 2020 via email


Kischy commented May 20, 2020

I think the test failed when I substituted the quaternion indices because I didn't check if Qd and Qdd were scalar-first or scalar-last. The code is probably fine, however writing unit tests will help us get a better understanding of the Lagrangian in the paper and hopefully we can then figure out the domain of definition of the fourth-order solution.

Hopefully you are right, but I do not understand the equations then. If espresso is using the scalar-first convention and the paper uses the scalar-last convention, then your equations are correct and the code is incorrect. I do not see how the code can be correct, because there must be a line in the resulting equations that has all positive signs (according to the paper) and there is no such line in the code.
Edit: I didn't mean to sound harsh here. Sorry if I did. It just bothers me that I do not understand these lines.


Kischy commented May 20, 2020

With regards to errors in the rotation code:
Never say never, but I'd be somewhat surprised. Quite a few people have looked at it in detail, Bogdan for instance. Also, rather sophisticated time-dependent stuff was done with it. E.g., producing a correct Debye spectrum for a dipolar particle. That involves time auto-correlation functions of the particle orientation. A bug would have to be quite subtle to allow for those simulations to work (compare against theory).
But it is clear that that is not a validation as per the use of the word in software engineering...

I think a bug would only matter for systems that are out of equilibrium. The argument would go as follows: Monte Carlo and MD give the same equilibrium result. Monte Carlo does not need equations of motion, therefore it should not matter in MD if the equations of motion are incorrect, as long as the particles are allowed to rotate and move. Only if a system is out of equilibrium, or one needs the trajectories, do the correct equations of motion become relevant. Do you agree? (Therefore it is also completely unnecessary for me to set the rotational inertia of the particles to anything other than 1. The result must be the same.)


Kischy commented May 20, 2020

There is also a test for anisotropic inertia moments. If we really want peace of mind with this, one can compare Espresso's trajectory for a particle with an external torque and anisotropic inertia moments at T=0 with a numerical solution of the Euler equations, e.g., using scipy.integrate.solve_ivp.

Thanks for the suggestion!
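A minimal sketch of such a reference solution, assuming a constant torque given in the body frame and the rigid-body Euler equations (the inertia values match the ones quoted earlier in this thread; torque, initial condition and time span are placeholders):

import numpy as np
from scipy.integrate import solve_ivp

rinertia = np.array([0.5, 0.5, 0.1])  # principal moments of inertia
torque = np.array([1.0, 0.0, 0.5])    # constant body-frame torque (placeholder)

def euler_equations(t, omega):
    # rigid-body Euler equations: I * d(omega)/dt = (I * omega) x omega + torque
    return (np.cross(rinertia * omega, omega) + torque) / rinertia

omega0 = np.array([0.0, 1.0, 0.0])    # initial angular velocity (placeholder)
sol = solve_ivp(euler_equations, (0.0, 10.0), omega0,
                rtol=1e-10, atol=1e-12, dense_output=True)

# sol.sol(t) can then be compared against the body-frame angular velocity that
# Espresso reports for a single particle with the same rotational inertia, a
# constant external torque and zero temperature.
print(sol.sol(10.0))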


RudolfWeeber commented May 20, 2020 via email


Kischy commented May 20, 2020

The diffusion coefficients and Debye spectra use correlations between omega now and omega in a while, though (or orientation vectors, when we look at Debye spectra). That requires an actual trajectory and is not an ensemble average that could be calculated with MC or an equation of motion with an incorrect inertia term.

Okay, nice. Then I assume for now that the equations in the code are correct and I just do not understand where they come from at the moment.

Regarding my simulation: as you suggested, I will try a different warmup and see if I can make it work.


jngrad commented Jun 3, 2020

Just had another look at the issue today. In a work-in-progress branch, I've clarified which formalism is used (a037bf5) and added an assertion to catch square roots of negative values (3a6ee73). If you tinker with torques and angular velocities, a pattern seems to emerge: when the largest component of the angular velocity multiplied by the time step gives 2.0 or more, the solution for lambda involves the square root of a negative number. For an MWE, replace src/core/unit_tests/grid_test.cpp with grid_test.cpp.txt and do:

make -j$(nproc) check_unit_tests ARGS='-R grid_test -V'

kodiakhq bot added a commit that referenced this issue Jun 5, 2020
Partial fix for #3713

Description of changes:
- document quaternion formalism in the core
- add an assertion to catch invalid square roots
- explain that force capping doesn't affect torques
- use vector operations

Kischy commented Jun 10, 2020

Just had another look at the issue today. In a work-in-progress branch, I've clarified which formalism is used (a037bf5)

Thank you, that is much better.

I still do not understand the equations. Even with the correct notation, the equations in the code should be different. As also calculated by @jngrad, they should be:

Qd[0] = 0.5 * (-p.r.quat[3] * p.m.omega[0] - p.r.quat[0] * p.m.omega[1] + p.r.quat[2] * p.m.omega[2]);
Qd[1] = 0.5 * (p.r.quat[0] * p.m.omega[0] - p.r.quat[3] * p.m.omega[1] - p.r.quat[1] * p.m.omega[2]);
Qd[2] = 0.5 * (p.r.quat[1] * p.m.omega[0] + p.r.quat[2] * p.m.omega[1] + p.r.quat[0] * p.m.omega[2]);
Qd[3] = 0.5 * (-p.r.quat[2] * p.m.omega[0] + p.r.quat[1] * p.m.omega[1] - p.r.quat[3] * p.m.omega[2]);

As said before, we should have a test to verify whether there is an error in the code or in the calculations done here. I do not have the time to do this right now. Sorry about that.


I also tried the simulation with different settings:

  • much lower timestep -> still fails
  • two warmups, starting the first warmup with no thermostat and decreasing the box size very slowly -> fails during first warmup

@KaiSzuttor
Member

@jngrad are you still working on that?


RudolfWeeber commented Aug 3, 2020 via email


jngrad commented Aug 3, 2020

I tried to come up with an analytical solution to write a simple unit test, but it's a second order differential equation... The Euler integration method suggested by Rudolf is a better approach.
