
P3M benchmark randomly fails CI #2924

Closed
espresso-ci opened this issue Jun 18, 2019 · 18 comments · Fixed by #3096 or #3358

@espresso-ci

https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/pipelines/7734

@jngrad
Member

jngrad commented Jun 18, 2019

This is a recurring error:
https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/125861
https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/126068
https://gitlab.icp.uni-stuttgart.de/espressomd/espresso/-/jobs/126071
Re-starting the job usually fixes it. It is caused by this line:

energies = system.analysis.energy()

With the error message:

Exception: calc_long_range_energies failed: ERROR: number of cells 1 is smaller than minimum 8 (interaction range too large or min_num_cells too large) in function void dd_create_cell_grid()

The failure happens at random. The P3M benchmark script is MPI-capable, but its test currently runs without MPI. I have never had any issue running this benchmark without MPI, even with a larger number of particles.
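[Editorial note, not part of the original comment: a minimal sketch of the call sequence that triggers the failure, assuming the ESPResSo 4.1-era Python API; the box size, particle numbers, and tuning parameters below are placeholders, not the actual benchmark settings.]

import numpy as np
import espressomd
import espressomd.electrostatics

# Small charged, overall neutral system (placeholder values).
system = espressomd.System(box_l=[12.0, 12.0, 12.0])
system.time_step = 0.01
system.cell_system.skin = 0.4
system.part.add(pos=np.random.random((100, 3)) * system.box_l,
                q=np.resize([1.0, -1.0], 100))

# P3M tuning picks mesh, cao and the real-space cutoff for the requested accuracy.
system.actors.add(espressomd.electrostatics.P3M(prefactor=1.0, accuracy=1e-4))

# Skin tuning runs short test integrations; if the tuned cutoff plus skin exceeds
# half the local box length, the domain decomposition ends up with fewer than the
# minimum 8 cells and the energy call below raises the exception quoted above.
system.cell_system.tune_skin(min_skin=0.2, max_skin=1.0, tol=0.05, int_steps=100)

energies = system.analysis.energy()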

@jngrad jngrad changed the title CI build failed for merged PR P3M benchmark randomly fails CI Jun 18, 2019
@RudolfWeeber
Contributor

Either the P3M tuning does not respect box_l/2 as the maximum real-space cutoff, or tune_skin(), which is called after the P3M tuning, does not.
Maybe some code needs to be inserted to calculate max_skin from the current interaction cutoffs and the local box length.
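[Editorial note, not part of the original comment: a hedged sketch of the max_skin calculation suggested above, assuming the domain decomposition needs at least two cells per direction on each node, i.e. cell size (maximal cutoff + skin) at most half the local box length; the helper name and the example values are made up.]

import numpy as np

def safe_max_skin(box_l, node_grid, max_cut):
    # Largest skin for which each node still fits 2 cells per direction,
    # i.e. (max_cut + skin) <= min(local_box_l) / 2.
    local_box_l = np.asarray(box_l, dtype=float) / np.asarray(node_grid, dtype=float)
    return float(np.min(local_box_l) / 2.0 - max_cut)

# Example: single MPI rank, cubic box of length 12, P3M real-space cutoff 3.2
print(safe_max_skin(box_l=[12.0, 12.0, 12.0], node_grid=[1, 1, 1], max_cut=3.2))  # -> 2.8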

@fweik
Contributor

fweik commented Jun 21, 2019

I'm looking into this, I think there is a bug in the tuning.

@jngrad
Member

jngrad commented Jun 21, 2019

I can't reproduce the bug locally, nor locally inside the Docker container, but I can reproduce it on coyote8 inside the Docker container (after 175 tries). I'll try again while printing the random seed and check whether it's deterministic.

@fweik
Contributor

fweik commented Jun 21, 2019

Please don't spend any more time on this, I will fix it, I know what the error is.

@RudolfWeeber
Contributor

#2961
I assume that the resulting P3M parameters are outside the usual range of values in the CI environment.
Short term, it might be enough to disable the tune_skin() call in the test, but ultimately #2961 has to be fixed.

@RudolfWeeber
Contributor

Actually, the maximum safe skin might be available from s.cell_system.get_state()["max_skin"]
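[Editorial note, not part of the original comment: a sketch of how that value could be used, continuing the system object from the sketch further up and assuming get_state() exposes a "max_skin" entry in this version, as suggested above.]

# Cap the skin tuning range with the value reported by the cell system
# instead of hard-coding max_skin.
state = system.cell_system.get_state()
safe_skin = state["max_skin"]
system.cell_system.tune_skin(min_skin=0.2, max_skin=safe_skin, tol=0.05,
                             int_steps=100, adjust_max_skin=True)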

@RudolfWeeber
Contributor

Has this re-occurred since adjust_max_skin was added to the tune_skin() call?
If not, this can be closed.

@jngrad
Member

jngrad commented Aug 7, 2019

Hasn't yet in CI. But I was able to trigger a similar error today on coyote8 after 493 retries:

[...same as in the original logfile...]
  File "/home/espresso/espresso/build/testsuite/scripts/benchmarks/local_benchmarks/p3m_processed.py", line 164, in <module>
    adjust_max_skin=True)))
  File "cellsystem.pyx", line 308, in espressomd.cellsystem.CellSystem.tune_skin
  File "utils.pyx", line 261, in espressomd.utils.handle_errors
Exception: Error during tune_skin: ERROR: number of cells 1 is smaller than minimum 8 (interaction range too large or min_num_cells too large)

The last three lines changed: the error is now raised in espressomd.cellsystem.CellSystem.tune_skin instead of espressomd.analyze.Analysis.energy.

@RudolfWeeber
Contributor

RudolfWeeber commented Aug 7, 2019 via email

@RudolfWeeber
Contributor

RudolfWeeber commented Aug 7, 2019 via email

@fweik
Contributor

fweik commented Aug 7, 2019

The first one is correct, the second one isn't. You should also maybe have a look at #3053, which clarifies some of these things.

@jngrad
Member

jngrad commented Aug 8, 2019

I've just merged #3053 into my local copy of the python branch and was able to get the same error message, plus a new one:

resulting parameters: mesh: (22 22 22), cao: 7, r_cut_iL: 3.6018e-01,
                      alpha_L: 9.0136e+00, accuracy: 9.9536e-05, time: 12.38

0: rs_mesh overflow! (pos 12.448619, nmp=24)
0: allowed coordinates: -1.600000 - 13.477258
0: rs_mesh overflow! (pos 12.482639, nmp=24)
0: allowed coordinates: -1.600000 - 13.477258
0: rs_mesh overflow! (pos 12.516053, nmp=24)
0: allowed coordinates: -1.600000 - 13.477258
[1617038c16f4:13557] *** Process received signal ***
[1617038c16f4:13557] Signal: Segmentation fault (11)
[1617038c16f4:13557] Signal code:  (128)
[1617038c16f4:13557] Failing at address: (nil)
[1617038c16f4:13557] [ 0] /lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7f2b8a88b390]
[1617038c16f4:13557] [ 1] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x22)[0x7f2b8a534512]
[1617038c16f4:13557] [ 2] /home/espresso/espresso/build3/src/core/EspressoCore.so.4(Particle::~Particle()+0x3c)[0x7f2b88d3ffbc]
[1617038c16f4:13557] [ 3] /home/espresso/espresso/build3/src/core/EspressoCore.so.4(invalidate_ghosts()+0x6f)[0x7f2b88dab89f]
[1617038c16f4:13557] [ 4] /home/espresso/espresso/build3/src/core/EspressoCore.so.4(cells_resort_particles(int)+0x2b)[0x7f2b88d3babb]
[1617038c16f4:13557] [ 5] /home/espresso/espresso/build3/src/core/EspressoCore.so.4(integrate_vv(int, int)+0x25a)[0x7f2b88db6e4a]
[1617038c16f4:13557] [ 6] /home/espresso/espresso/build3/src/core/EspressoCore.so.4(mpi_integrate(int, int)+0x77)[0x7f2b88d51ed7]
[1617038c16f4:13557] [ 7] /home/espresso/espresso/build3/src/core/EspressoCore.so.4(tune_skin(double, double, double, int, bool)+0x27b)[0x7f2b88e24a9b]
[1617038c16f4:13557] [ 8] /home/espresso/espresso/build3/src/python/espressomd/cellsystem.so(+0x1282f)[0x7f2b546cc82f]
[1617038c16f4:13557] [ 9] /home/espresso/espresso/build3/src/python/espressomd/script_interface.so(+0x1999c)[0x7f2b7e6af99c]
[1617038c16f4:13557] [10] /usr/bin/python3(PyObject_Call+0x47)[0x5c20e7]
...
[1617038c16f4:13557] *** End of error message ***
Segmentation fault (core dumped)

@jngrad
Member

jngrad commented Aug 28, 2019

@jngrad
Member

jngrad commented Sep 6, 2019

Even after merging #3132, it is still possible to get the P3M benchmark to fail on coyote7 after 200 trials:

Exception: Error during tune_skin: ERROR: number of cells 1 is smaller than minimum 8 (interaction range too large or min_num_cells too large)

@RudolfWeeber
Contributor

Out of ideas. De-milestoning.

@jngrad
Member

jngrad commented Nov 28, 2019

We might as well disable this test in CI. We already know this benchmark cannot be used due to the non-deterministic nature of the P3M tuning function. It also fails due to the other, less frequent bug reported above. We can re-enable the test once the tuning function gets re-implemented.

@fweik
Contributor

fweik commented Nov 29, 2019 via email

jngrad added a commit that referenced this issue Dec 5, 2019
3358: Fix breaking tests on 4.1.1

Description of changes:
- disable benchmark tests in CI jobs where the P3M benchmark test fails repeatedly (closes #2924)
- increase tolerance of `field_coupling_fields` for i586 builds (partial fix for #3315)
@jngrad jngrad removed this from the Espresso 5 milestone Jun 13, 2022