Segmentation fault with Espresso 4.1.1 using p3m, Gay-Berne potential, Lennard-Jones potential and virtual sites #3713
Comments
Might be connected to issue #3663, but I am not sure. I do not remove any particles during the simulation. |
This looks like espresso 4.1.2, failing in function |
In 4.1.1 there are additional checks available to generate meaningful error messages before a segfault: espresso/src/core/electrostatics_magnetostatics/p3m.cpp Lines 557 to 570 in 35e18f7
Could you please recompile your version of espresso with the extra feature ADDITIONAL_CHECKS in your myconfig.hpp and share with us if there is a runtime error instead of a segfault? This would spare us from running GDB in MPI-parallel code.
|
I will try that tomorrow and come back with the output. It might take a while, since the simulation does not fail at startup. |
I have never tried without MPI, because the tuning of the p3m algorithm never finishes. Or at least not in a reasonable time.
What exactly do you mean by this question? |
You mentioned the system crashes depending on the charge and position of the virtual particle. The backtrace shows the failure happened on MPI rank 8. The MPI ranks are assigned to slices of the cellsystem. If you re-run the same script, the virtual particle will be in the same domain. If MPI rank 8 fails repeatedly, we can attach GDB to ranks 0 and 8 to get an opportunity to print the values of all local variables in every function shown in the backtrace. However it's quite tricky, and there is no need to do that if |
It only depends on the position of the virtual particle, the value of the charge does not change between different simulation runs. I will try This is another run:
and yet another run:
|
So I compiled using
|
It's not obvious to me what is happening here from the traces. Can you please provide a (minimal) script that reproduces the issue? |
doesn't that look like a blow up of forces if the top of the stack is |
I can send you the whole script. I can't post it here, because it's not published yet. It is not minimal, because the issue is hard to reproduce. |
I just sent you the script. Thank you for looking into this. If there is anything I can do, please let me know. |
Might be possible. In this simulation the charge is relatively exposed in comparison with the other simulations. |
Did you set the |
Almost, I would say. rel_virt_pos[2] holds the maximal distance. I had problems with the exact value, which I assumed to stem from floating point inaccuracies, so I added a tiny bit.
|
This is an open source project, we cannot provide private support to you. If you want our help, the discussion has to be public, so that it is also helpful to others. |
I think this must be some kind of misunderstanding: I never asked you for private support. I simply wanted to report a bug, which I will definitely not do again in such a hostile environment. You asked me for a simulation script and I sent it to you. I understand your need for a public discussion, but with a pending paper I simply cannot publish the script here yet. I can provide the script here once the paper is published. |
By the way: This is the second time I feel like I am being attacked for trying to contribute. For me, that is unacceptable for any kind of open source project. |
I'm sorry you feel that way, that was not my intention. But referring to a private mail in a public forum just isn't very helpful. |
That is right, but as I said, I can provide it later. And I did not refer to the private mail, I simply answered your question in a meaningful way. I could have simply stated "yes", which is not as helpful once the scripts are provided. The reference was a mere bonus. |
I just realized that if this simulation does not work, it will not be included in the publication anyway. So here you go: You can start the simulation from "Temp_alle/ILC.py" if you want. |
Ok, I can see how my comment could be understood as harsh, as I was saying, this wasn't my intention. With regards to the problem, I've run out of ideas for now, I'll have a look again next week. |
@Kischy Thank you for reporting this issue to us. This led us to re-introduce one of the 4.1 It's unfortunate |
Okay, no worries, the tone came across as quite aggressive to me, hence my reaction. If we are back to a helpful discussion, I want to mention again that this is the second time this has happened, and kindly ask for an improvement on that front. |
I understand the policy, the files are provided above. |
If you change the value in line 29 from 1.5 to 1.2 in file "particle_setup_q_2_zc_1_5.py" and then run the simulation, there is no issue. The value changes the position of the virtual particle relative to the Gay-Berne particle. In fact, all values from 0.0 to 1.2 seem to work fine. Only when the charge is at or close to the tip of the GB particle do the simulations break for some reason. |
@Kischy no harm done. I think the communication was not optimal from both sides. Let's do better next time and move on. |
I tracked down the source of NaN to this expression: espresso/src/core/rotation.cpp Lines 197 to 203 in 35e18f7
We're taking the square root of a negative number, which propagates to the particle quaternion, to the particle director, and finally to the virtual site position via:
The lambda expression matches eq. 12 from the Omelyan paper, with
With regards to your script, I think the force cap you're applying during the warmup phase doesn't actually increase linearly. Furthermore the force cap doesn't cap torques, which is why the particle above has a torque of 1 million AU around the y axis while the linear force is less than 9 AU. |
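The NaN propagation chain described at the top of this comment (square root of a negative number, then quaternion, then director, then virtual site position) can be reproduced in isolation. The following is only a toy sketch: `toy_virtual_site_position` and its quaternion update are hypothetical stand-ins, not the ESPResSo integrator, and the director formula assumes a scalar-first unit quaternion whose director is the rotated z-axis.

```python
import numpy as np


def director_from_quaternion(quat):
    """Body z-axis in the lab frame for a scalar-first unit quaternion."""
    q0, q1, q2, q3 = quat
    return np.array([
        2.0 * (q1 * q3 + q0 * q2),
        2.0 * (q2 * q3 - q0 * q1),
        q0 * q0 - q1 * q1 - q2 * q2 + q3 * q3,
    ])


def toy_virtual_site_position(discriminant, quat, real_pos, distance):
    """Hypothetical illustration: a correction factor computed from a square
    root is applied to the quaternion, and the virtual site is placed along
    the resulting director at a fixed distance from the real particle."""
    # A negative discriminant yields NaN (the warning is silenced on purpose).
    with np.errstate(invalid="ignore"):
        lam = 1.0 - np.sqrt(discriminant)
    # Toy update only; the real integrator uses a different expression.
    new_quat = np.asarray(quat, dtype=float) * (1.0 - lam)
    director = director_from_quaternion(new_quat)
    return np.asarray(real_pos, dtype=float) + distance * director


# Valid discriminant: the site sits 1.5 along z.
print(toy_virtual_site_position(1.0, [1, 0, 0, 0], [0, 0, 0], 1.5))
# Negative discriminant: every coordinate becomes NaN.
print(toy_virtual_site_position(-0.5, [1, 0, 0, 0], [0, 0, 0], 1.5))
```

Once a single NaN enters the quaternion, every quantity derived from it is NaN as well, which is consistent with the failure surfacing far from its origin.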
I'll have a look at the paper again and come back if I find anything useful.
One question about the parameters: From
I never thought it would be a problem that the warmup cap does not increase linearly. I can change that and try again. |
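For reference, a warmup schedule whose force cap does increase linearly could look like the sketch below. The helper name `linear_force_caps` and the numbers are hypothetical; the original script is not reproduced here. As noted earlier in the thread, such a cap limits forces only, not torques.

```python
import numpy as np


def linear_force_caps(cap_start, cap_end, n_stages):
    """Return n_stages force-cap values rising linearly from cap_start to cap_end."""
    if n_stages < 2:
        raise ValueError("need at least two warmup stages")
    return np.linspace(cap_start, cap_end, n_stages)


caps = linear_force_caps(10.0, 1000.0, 100)
# Assumed usage inside a warmup loop (not executed here):
# for cap in caps:
#     system.force_cap = cap
#     system.integrator.run(100)
```

Each stage raises the cap by the same increment, so the difference between consecutive entries of `caps` is constant.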
It is not an issue in itself. The line
I got it to fail on the first integration step:
I did too. In fact, the steepest descent algorithm does exactly that. Not sure why the force cap doesn't, though. You could update the code in @fweik I can't really follow the old integrator code nor find where the temperature global is recalculated; could it be that force capping artificially decreases the linear velocities and thus the global temperature before langevin is applied, creating a negative feedback loop where langevin increases both the linear velocity of real particles and torques of real and virtual particles? |
Are torques calculated for the virtual sites? I think I turned rotation off for the virtual sites. Only the GB and LJ particles need to rotate. |
I'm not exactly sure what happens with virtual site torques and thermostats, but after removing both the force cap and thermostat, the NaN issue persists, so the issue has a different origin.
Yes, although I'm not sure why.
Yes, the papers use a different formalism for quaternions. This already caused issues once (#2964). It's now clear to me we need to write down in
Qd[0] = 0.5 * (-p.r.quat[3] * p.m.omega[0] - p.r.quat[0] * p.m.omega[1] + p.r.quat[2] * p.m.omega[2]);
Qd[1] = 0.5 * (p.r.quat[0] * p.m.omega[0] - p.r.quat[3] * p.m.omega[1] - p.r.quat[1] * p.m.omega[2]);
Qd[2] = 0.5 * (p.r.quat[1] * p.m.omega[0] - p.r.quat[2] * p.m.omega[1] - p.r.quat[0] * p.m.omega[2]);
Qd[3] = 0.5 * (-p.r.quat[2] * p.m.omega[0] + p.r.quat[1] * p.m.omega[1] - p.r.quat[3] * p.m.omega[2]);
I think you have substituted the greek letters by the wrong quaternion indices in the expression of matrix A. The substitution should be if I'm not mistaken. |
That is confusing and not clear at all. The comment
suggests that the formalism of the paper is used, and I thought it was scalar first. Thank you for clearing that up. Please note this somewhere! I would suggest updating the comment. |
@jngrad I think the signs in your matrix might be incorrect. The third row should be all positive signs. But even with this, the result does not match the code, does it? |
Nice catch, it was a copy-paste error. The third line of the code should also have only positive signs.
We definitely need to update the comment to reflect the formalism used in the function.
Updating the code with my solution (with the correct signs) fails the |
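One way to sanity-check any candidate sign pattern without relying on a particular paper's notation is to test two properties that hold in every convention: for a unit quaternion, the derivative is orthogonal to the quaternion (the norm is preserved), and its squared norm equals a quarter of the squared angular velocity. The sketch below uses the textbook Hamilton, scalar-first form with the angular velocity expressed in the body frame. That choice is an assumption for illustration, and the function is not a transcription of the ESPResSo code or of the paper.

```python
import numpy as np


def quaternion_derivative(quat, omega_body):
    """dq/dt = 0.5 * q * (0, omega) for a scalar-first quaternion (Hamilton
    product) with the angular velocity given in the body frame."""
    q0, q1, q2, q3 = quat
    wx, wy, wz = omega_body
    return 0.5 * np.array([
        -q1 * wx - q2 * wy - q3 * wz,
        q0 * wx - q3 * wy + q2 * wz,
        q3 * wx + q0 * wy - q1 * wz,
        -q2 * wx + q1 * wy + q0 * wz,
    ])


rng = np.random.default_rng(42)
q = rng.normal(size=4)
q /= np.linalg.norm(q)          # unit quaternion
omega = rng.normal(size=3)
qd = quaternion_derivative(q, omega)

# Convention-independent checks for a unit quaternion.
print(np.isclose(np.dot(q, qd), 0.0))
print(np.isclose(np.dot(qd, qd), 0.25 * np.dot(omega, omega)))
```

A sign pattern that fails either check cannot be a valid quaternion derivative, whichever index holds the scalar part.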
First of all, I just took a very brief look at the script, so the comments below may not be exactly applicable here. I'll still comment, since I'm probably one of the people who has spent significant time with the rotation and virtual sites stuff. I think the torque is just too big for the simulation to be stable. I'd suggest experimenting with the following changes to the simulation protocol (again, I just had time for a quick glance, so I might have overlooked stuff):
Steepest descent and force capping with electrostatics on is tricky, because the configuration in which two charges are very close is always the energetically most favorable one, if they only come close enough. The force capping allows for that. Hope that helps |
@jngrad I think your suggestion for a test is perfect. Would it be possible to calculate the derivatives using a gradient as we did in #3091? How was this tested until now? Unfortunately I do not have much time for this at the moment. I will maybe find time on Friday or in two weeks. |
@RudolfWeeber Thank you for your suggestions. I will test them. |
With regards to errors in the rotation code: |
It's tested indirectly by spinning a particle with anisotropic inertial moment around a principal axis and checking for angular velocity conservation, if I read the code correctly.
It's probably easier to work out an analytical result. With a particle rotating around its principal axis and a torque applied on a perpendicular axis, the solution might just be a product of two quantities.
I think the test failed when I substituted the quaternion indices because I didn't check if
I also don't have much time to spend on this issue this week. |
There is also a test for anisotropic inertia moments.
If we really want peace of mind with this, one can compare ESPResSo's trajectory for a particle with an external torque and anisotropic inertia moments at T=0 with a numerical solution of the Euler equations, e.g., using scipy.integrate.solve_ivp.
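A minimal sketch of such a reference solution is given below. It integrates Euler's rigid-body equations in the body frame for a constant body-frame torque. The function names and parameter values are hypothetical, and the comparison against an ESPResSo trajectory is left out.

```python
import numpy as np
from scipy.integrate import solve_ivp


def euler_rhs(t, omega, inertia, torque):
    """Euler's equations in the principal-axis (body) frame:
    I_1 dw_1/dt = tau_1 + (I_2 - I_3) w_2 w_3, and cyclic permutations."""
    i1, i2, i3 = inertia
    w1, w2, w3 = omega
    return [
        (torque[0] + (i2 - i3) * w2 * w3) / i1,
        (torque[1] + (i3 - i1) * w3 * w1) / i2,
        (torque[2] + (i1 - i2) * w1 * w2) / i3,
    ]


def reference_omega(omega0, inertia, torque, t_end, n_samples=101):
    """Return sample times and body-frame angular velocities, shape (n_samples, 3)."""
    times = np.linspace(0.0, t_end, n_samples)
    sol = solve_ivp(euler_rhs, (0.0, t_end), omega0, t_eval=times,
                    args=(inertia, torque), rtol=1e-10, atol=1e-12)
    return sol.t, sol.y.T


# Anisotropic particle, initial spin about axis 1, torque about axis 2.
t, w = reference_omega([1.0, 0.0, 0.0], [2.0, 3.0, 5.0], [0.0, 0.5, 0.0], 5.0)
```

The array `w` could then be compared, sample by sample, against the body-frame angular velocity recorded from an MD run with the thermostat switched off, using the same inertia tensor, torque and initial condition.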
|
Hopefully you are right, but I do not understand the equations then. If espresso is using the scalar first convention and the paper uses the scalar last convention, then your equations are correct and the code is incorrect. I do not see how the code can be correct, because there must be a line in the resulting equations that has all positive signs (according to the paper) and there is no such line in the code. |
I think a bug would only matter for systems that are out of equilibrium. The argument would go as follows: Monte Carlo and MD give the same equilibrium result. Monte Carlo does not need equations of motion, so it should not matter in MD if the equations of motion are incorrect, as long as the particles are allowed to rotate and move. Only if a system is out of equilibrium, or one needs the trajectories, do the correct equations of motion become relevant. Do you agree? (Therefore it is also completely unnecessary for me to set the rotational inertia of the particles to anything other than 1. The result must be the same.) |
Thanks for the suggestion! |
> I think a bug would only matter for systems that are out of equilibrium. The argument would go as follows: Monte Carlo and MD give the same equilibrium result. Monte Carlo does not need equations of motion, so it should not matter in MD if the equations of motion are incorrect, as long as the particles are allowed to rotate and move. Only if a system is out of equilibrium, or one needs the trajectories, do the correct equations of motion become relevant. Is this argument correct? (Therefore it is also completely unnecessary for me to set the rotational inertia of the particles to anything other than 1. The result must be the same.)
Yes, if it were only ensemble averages such as <omega^2>.
The diffusion coefficients and Debye spectra use correlations between omega now and omega in a while, though (or orientation vectors, when we look at Debye spectra). That requires an actual trajectory and is not an ensemble average that could be calculated with MC or an equation of motion with an incorrect inertia term.
|
Okay, nice. Then I assume for now, that the equations in the code are correct and I just do not understand where they come from at the moment. Regarding my simulation: As you suggested I will try a different warmup and see if I can make it work. |
Just had another look at the issue today. In a work-in-progress branch, I've clarified which formalism is used (a037bf5) and added an assertion to catch square roots of negative values (3a6ee73). If you tinker with torques and angular velocities, a pattern seems to emerge: when the largest component of the angular velocity multiplied by the time step gives 2.0 or more, the solution for lambda involves the square root of a negative number. For a MWE, replace make -j$(nproc) check_unit_tests ARGS='-R grid_test -V' |
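The empirical pattern reported in this comment suggests a cheap pre-flight check on a configuration. The sketch below encodes only that observation (largest angular velocity component times the time step reaching about 2.0). The helper name and the threshold default are hypothetical, and this is not an ESPResSo API.

```python
import numpy as np


def rotation_step_is_suspect(omega, time_step, threshold=2.0):
    """Return True when max(|omega_i|) * time_step reaches the empirical
    threshold at which the square root was observed to go negative."""
    return bool(np.max(np.abs(omega)) * time_step >= threshold)


print(rotation_step_is_suspect([0.0, 300.0, 0.0], 0.01))  # product is 3.0
print(rotation_step_is_suspect([0.0, 100.0, 0.0], 0.01))  # product is 1.0
```

Since torques are not affected by the force cap, a check like this on the angular velocities of rotating particles could flag a configuration before the integrator reaches the failing expression.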
Partial fix for #3713
Description of changes:
- document quaternion formalism in the core
- add an assertion to catch invalid square roots
- explain that force capping doesn't affect torques
- use vector operations
Thank you, that is much better. I still do not understand the equations. Even with the correct notation the equations in code should be different. As also calculated by @jngrad they should be:
As said before we should have a test to verify if there is an error in code or in the calculations done here. I do not have the time to do this right now. Sorry about that. I also tried the simulation with different settings:
|
@jngrad are you still working on that? |
To my understanding
* The segfault was triggered when the rotational dynamics become unstable. This now produces an appropriate error message, AFAIK.
* A boundary-related issue in P3M was fixed, which was not directly involved in the report.
* Since the quaternion stuff in the rotational integrator is hard to understand and we are not totally sure, the rotational dynamics of a particle with an external torque should be compared against a numerical solution of the Euler eqn. I opened a separate ticket for that.
This ticket can then probably be closed.
|
I tried to come up with an analytical solution to write a simple unit test, but it's a second order differential equation... The Euler integration method suggested by Rudolf is a better approach. |
I have a very weird segmentation fault in one of my simulations. The curious thing is that I run multiple simulations in exactly the same manner, and the only thing I change between them is the position of a virtual particle that carries a charge. For one particular charge position the simulation fails with a segmentation fault; all the others work just fine. The point at which it fails is different in every run, but it fails consistently. I also tested it on a different machine and using a different version of espresso, but the error persists. I can give more information if needed.
Here is an example of the error message: