QUIP/LAMMPS different energies for different number of cores #669
Comments
Really sounds like a bug in the LAMMPS-QUIP interface. Have you compared to using |
Can you try the same experiment with the SW parameter file at
|
If I calculate the energy for the system with the following command
quip E=T atoms_filename=test.xyz param_filename=GAP.xml | grep AT | sed 's/AT//' > gap.xyz
I get an energy of -2768.55 eV. This is closer to the single-core LAMMPS result of -2768.9 eV; the parallel LAMMPS run yields -2771.475 eV. The output files are attached to the initial question, and I used the initial configuration. If all atoms are of the same type, all three energies agree.
When I use the same LAMMPS input script with the SW parameter file ip.parms.SW_SiC_CASTEP_elastic.xml, the energy does not depend on the number of cores. I have seen the same pattern with my trained potentials: some of them depend on the number of cores and some don't. Do you think this is related to the QUIP version, and could I get rid of it by installing an older version? I don't think it is related to the LAMMPS version. Thank you in advance. |
Those are way larger errors than can be explained by anything other than a bug. If we can get a complete setup to reproduce the issue, that'd help. Can you reproduce this behavior with GAP xml (+sparsepoints) files we have access to (e.g. one of the publicly available published ones), or are you willing to share the one you created? Also, can you figure out how small a system you can reproduce this issue with? |
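To put the reported numbers in perspective, here is a minimal Python sketch that converts the three energies quoted above into per-atom discrepancies. It assumes the 385-atom cell mentioned in the original question is the configuration these energies belong to; the function name is made up for illustration. Differences that come only from a different floating-point summation order are normally many orders of magnitude smaller than the meV/atom values this produces.

```python
# Energies quoted in the thread (eV).
energies = {
    "quip": -2768.55,
    "lammps_serial": -2768.9,
    "lammps_mpi": -2771.475,
}

# Assumption: the energies refer to the 385-atom cell from the original question.
N_ATOMS = 385


def per_atom_discrepancy_mev(e_a, e_b, n_atoms):
    """Absolute energy difference per atom, converted from eV to meV."""
    return abs(e_a - e_b) / n_atoms * 1000.0


if __name__ == "__main__":
    reference = energies["quip"]
    for name, value in energies.items():
        diff = per_atom_discrepancy_mev(value, reference, N_ATOMS)
        print(f"{name:>14}: {value:.3f} eV, {diff:.3f} meV/atom vs. quip")
```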
I uploaded all GAP files to the following git repository: https://github.com/robertstella111/SiC_GAP_bug.git Do you also need the training data, or is this enough? I trained the potential with the commands given in the previous messages. For a system size of 5x5x10 Angstrom (48 atoms) the error is still there, but it scales down with the system size. |
Well, I've definitely reproduced the error, FWIW. I'll see if I can figure out what's going on. |
@albapa @jameskermode do you remember how the lammps interface is supposed to behave with |
I don't think so. I think we should only calculate descriptors of atoms with |
Something weird is going on. I have two input files: one for which LAMMPS serial and MPI agree, and a slightly different one (one atom type changed) for which they do not. Neither agrees with pure QUIP. And in both cases atoms with local mask false have non-zero energy, which appears to contribute to the PE that LAMMPS reports. More debugging in a few hours, but I have to go out now. |
And if I only add the ones where the mask is True, I get energies that are very different.
"good" is the input configuration where LAMMPS serial and mpi agree, "bad" is the one where they don't. "TF" is adding all the local energies, "T only" is only the ones where mask is True. These are manual sums of the local energies from QUIP, and the LAMMPS ones agree with the potential energy that LAMMPS prints out in its log file. |
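The two manual sums described above ("TF" and "T only") can be sketched in Python as follows. This is only an illustration of the bookkeeping: the energies and mask values are made-up placeholders, not data from this issue, and the function name is hypothetical. It does not say which of the two sums the interface ought to report.

```python
def summed_energies(local_energies, local_mask):
    """Return (sum over all atoms, sum over atoms whose mask is True)."""
    total_all = sum(local_energies)  # "TF": every atom contributes
    total_masked = sum(
        e for e, m in zip(local_energies, local_mask) if m
    )  # "T only": only atoms flagged as local contribute
    return total_all, total_masked


if __name__ == "__main__":
    # Placeholder values for illustration only (eV); the last atom plays the
    # role of a non-local (ghost) atom that still carries a non-zero energy.
    energies = [-5.0, -5.5, -0.25]
    mask = [True, True, False]
    tf, t_only = summed_energies(energies, mask)
    print(f"TF sum: {tf} eV, T-only sum: {t_only} eV")
```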
Is there a chance that |
A couple of weird things.
any guesses? [added later] I can also read a 2b-only xml file, but not a 3b-only one. |
I had the same idea - to doctor the XML to figure out which descriptor is causing the problem. Happy to look at it. |
I'm trying to debug the xml parsing. BTW, if I remove the SOAP descriptors the energy value mismatch is still there (i.e. 2b+3b). If I do 2b only, there's no mismatch. I think that means there's definitely an issue with the 3b, but I can't prove there isn't one with SOAP yet. |
As far as I remember, I recently trained a potential without the 3b descriptor and there was still a mismatch. I will check that in the next couple of hours when I am home again, and I will let you know afterwards whether I remembered correctly. |
OK, figured out the xml parsing issue - the number of the coordinate is embedded in its label. Let me test further. |
2b seems fine (QUIP and LAMMPS serial and mpi agree). 3b is bad. Just soap also appears to be OK. Note that 2b and 3b are It is, frankly, not entirely obvious to me how one would handle the local mask for |
What I think should be done is to distribute the energy term in I can have a look to see whether this is what happens in practice. Did you test the forces? Are they consistent across the different tests? |
I haven't looked at forces at all. When happens now with [for some reason I don't get any evidence of |
Is it possible there's an issue with compact_clusters? |
sorry about the false alarms - there was a bug extracting a small cluster from the large file that really gives an error. |
OK, some real issues. When I look at 3-body contributions from particular triplets of atoms, and compare what the quip call and lammps call do, I see some differences. For example, in this particular geometry file (modified from the one above), for a particular triplet of atoms (73 217 265) and a particular descriptor (C-C-Si), with quip I get
and with lammps I get
Basically, it looks like it encounters this particular triplet with its atoms in a different order ( |
And the bottom line is that atom 73 ends up with non-trivially different energy contribution between quip and lammps serial. This contribution differs by about 10 meV, and overall there's 19 meV difference in the total site energy of atom 73. There are maybe 10-20 such atoms overall, and a total energy difference of a fraction of an eV. If there's anything else useful I can provide (like these xyz/lammps-data configs and lammps input files), let me know. |
Note that sometimes they encounter the same triplet multiple times, and I see something similar. If it hits the same set of atoms multiple times, if the atoms are in the same order in the triplet I see identical xStar and equal energy contributions, but if they are different, I see permuted xStar and different energy contributions. E.g. 73 217 259
vs.
|
I don't really understand how this works, or why they might hit the same triplet multiple times, given that there's explicit handling of permutations inside the N-body descriptor, so I'm not sure where to go next. |
I'll try to retrace my steps. |
I wonder if the issue has to do with permutations, and in particular how they affect kernels that involve multiple elements. If it's a C-C-Si 3-body kernel, are you only allowed to permute two atoms if they are both C? |
Yes, this is exactly what I am looking at. And yes, for a C-C-Si there are two permutations. I want to see if some problem sneaked into the way we treat that. |
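The permutation question can be illustrated with a toy Python model. It is a sketch under stated assumptions, not the GAP implementation: the descriptor is taken to be the three interatomic distances of a triplet, the kernel is a plain squared exponential, and every name, position, and parameter below is hypothetical. Summing the kernel over the species-preserving permutations makes the result independent of which C atom comes first, but a triplet presented in a different species order (C-Si-C) gives a different value unless it is first brought into a canonical order.

```python
import itertools
import math


def distances(positions, order):
    """Distance vector (d_ab, d_ac, d_bc) for atoms taken in the given order."""
    a, b, c = (positions[i] for i in order)
    return (math.dist(a, b), math.dist(a, c), math.dist(b, c))


def allowed_permutations(species):
    """Permutations of three slots that only exchange atoms of equal species."""
    return [
        p
        for p in itertools.permutations(range(3))
        if all(species[i] == species[p[i]] for i in range(3))
    ]


def se_kernel(x, y, theta=1.0):
    """Squared-exponential kernel between two distance vectors."""
    sq = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-0.5 * sq / theta**2)


def symmetrized_kernel(positions, species, order, x_ref, theta=1.0):
    """Sum the kernel over species-preserving permutations of the triplet."""
    ordered_species = [species[i] for i in order]
    total = 0.0
    for p in allowed_permutations(ordered_species):
        permuted = [order[i] for i in p]
        total += se_kernel(distances(positions, permuted), x_ref, theta)
    return total


# Hypothetical geometry (Angstrom): atoms 0 and 1 are C, atom 2 is Si.
POSITIONS = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (0.0, 2.0, 0.0)]
SPECIES = ["C", "C", "Si"]
X_REF = (1.5, 2.0, 2.5)  # made-up sparse point in C-C-Si ordering

if __name__ == "__main__":
    for order in [(0, 1, 2), (1, 0, 2), (0, 2, 1)]:
        k = symmetrized_kernel(POSITIONS, SPECIES, order, X_REF)
        print(order, [SPECIES[i] for i in order], round(k, 4))
```

In this toy model the first two orderings (both C-C-Si) give the same value and the third (C-Si-C) does not, which mirrors the permuted xStar values and differing energy contributions reported above. It does not show where the mismatch arises in the actual code.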
This should get fixed by libAtoms/GAP#91, right @albapa ? |
Yes! It will break backwards compatibility, but I think potentials fitted with earlier versions ( |
Hello,
I am attempting to create a GAP potential for silicon and carbon. I have installed the QUIP library with the architecture linux_x86_64_gfortran. For the training of the GAP I am using the following command
gap_fit energy_parameter_name=energy force_parameter_name=forces do_copy_at_file=F sparse_separate_file=T gp_file=GAP.xml at_file=train.xyz gap={distance_Nb order=2 cutoff=4.5 n_sparse=15 covariance_type=ard_se delta=5 theta_uniform=2.0 sparse_method=uniform compact_clusters=T : distance_Nb order=3 cutoff=3.5 n_sparse=200 covariance_type=ard_se delta=0.3 theta_uniform=2.0 sparse_method=uniform compact_clusters=T : soap atom_sigma=0.5 l_max=8 n_max=8 cutoff=4.5 cutoff_transition_width=1.0 delta=0.1 covariance_type=dot_product n_sparse=2000 zeta=4} default_sigma={0.005 0.08 0.0 0.0} config_type_sigma={dimer:0.0008:0.02:0.0:0.0:single:0.0001:0.005:0.0:0.0:perfect_crystall:0.00001:0.01:0.0:0.0:interstitials:0.0002:0.05:0.0:0.0} sparse_jitter=1e-8 gp_file=GAP.xml
I now want to run bigger simulations in lammps. For lammps I tried several versions (lammps_stable_sep_2021, stable_5Jun2019,stable_29Aug2024) and installed the QUIP pair_style by using cmake directly and also by referring to the libquip.a library.
When I now run lammps with MPI on multiple cores and on a single core, the resulting energies are quite different. For a cell of 385 atoms, the energy difference is around 3 eV. The pair_coeff command looks as follows
pair_coeff * * GAP/29_08/without_stress/GAP.xml "Potential xml_label=GAP_2024_8_30_120_8_42_16_201" 14 6
If I now change all silicon atoms to carbon or all carbon to silicon the results for the mpi and single core runs are the same. I also encountered the problem for bigger cells and smaller cells, with defects and without defects. The problem with different results for different numbers of cores is also not present for all GAP potentials I have trained so far. The one above is just an example of where the error occurs.
Does someone have an idea where the error comes from, and whether it will affect the physical results I get from longer and more realistic runs?
log_serial.txt
log_parallel.txt
training.txt
h_110_1.txt