remove() causes "double free or corruption (fasttop)" when using gnu #2437

Closed
lotuspaperboy opened this issue Jan 14, 2019 · 14 comments

Comments

@lotuspaperboy

I'm running the same Python script on three different machines. On one machine I have pypresso compiled with icc, and on the other two I have pypresso compiled with gcc. My code runs fine on the icc machine. However, the same Python script crashes in the same place on the two gcc machines:

for i in range(self._start[n], self._start[n] + self._size):
    print("DELITING -> {}".format(i))
    print("Does it exist? -> {}".format(self._system.part.exists(i)))
    self._system.part[i].remove()  # <- crashes here

which gives the output:

DELITING -> 2107
Does it exist? -> True
Position -> [  49.28011854  448.51034963   51.89939159]
DELITING -> 2108
Does it exist? -> True
Position -> [  51.18122605  445.49477177   53.87249136]
DELITING -> 2109
Does it exist? -> True
[XPS] *** Process received signal ***
[XPS] Signal: Segmentation fault (11)
[XPS] Signal code: Address not mapped (1)
[XPS] Failing at address: 0x83c0000084b
[XPS] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7fbe5a1f3f20]
[XPS] [ 1] /lib/x86_64-linux-gnu/libc.so.6(+0x943c1)[0x7fbe5a2493c1]
[XPS] [ 2] /lib/x86_64-linux-gnu/libc.so.6(__libc_malloc+0x27d)[0x7fbe5a24c2ed]
....

As you can see, the particle definitely existed before remove() was called. On different runs, the loop gets through between one and three iterations before crashing. This inconsistency makes me suspect some sort of threading issue.

Is there anything that needs to be done differently when compiling with gcc compared with icc? I can reproduce this issue, so let me know if you require any other info.

@mkuron
Member

mkuron commented Jan 14, 2019

GCC is our primary targeted compiler, so this definitely shouldn't be happening there if it doesn't happen on Intel. Removing particles is a supported operation, so please provide your script (or ideally, a minimal script that exhibits the error) so we can try to reproduce and fix the bug.

@RudolfWeeber
Contributor

RudolfWeeber commented Jan 14, 2019 via email

What platform are you on? 32- or 64-bit?
This bug also occurs in the analyze_energy test case of PR #2438 on i386.

@mkuron
Member

mkuron commented Jan 15, 2019

@RudolfWeeber, I seem to recall we fixed that as part of the i386 support, so we must have forgotten to cherry-pick that to 4.0.1. I can't find the pull request though.

@mkuron
Member

mkuron commented Jan 15, 2019

Ah, @fweik's #2410. Seems like that was incompletely cherry-picked.

@lotuspaperboy
Author

I'm using a 64-bit Intel processor on both machines:

Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              8
Model name:          Intel(R) Core(TM) i7-3770 CPU @ 3.40GHz

I'm using espresso-4.0.0 (I downloaded the zip file from the website).

I'll try to get a minimal working script up here by the end of the day.

Thanks,
David

@RudolfWeeber
Contributor

RudolfWeeber commented Jan 15, 2019 via email

@mkuron
Member

mkuron commented Jan 15, 2019

We never saw it on 64-bit, but it was an actual bug that might also surface on non-32-bit for certain parameters.

@lotuspaperboy
Author

I've been able to reproduce this behavior with a simple script (attached). I run the script with:

/usr/bin/mpirun -n 4 pypresso main.py

At some point during the run, I see the error:

Setting up
Adding particles
Bonding particles
Warming
Integrating
Does particle 0 exist? -> True
Removing 0
Does particle 1 exist? -> True
Removing 1
double free or corruption (fasttop)
[kotsis-XPS-8500:28892] *** Process received signal ***
[kotsis-XPS-8500:28892] Signal: Aborted (6)
[kotsis-XPS-8500:28892] Signal code:  (-6)

It doesn't always fail on id 0; sometimes it takes a few iterations.

I've also tried this with the espresso version 4.0.1 from the link above, but I still see this issue.

Playing around with my test script, I notice that if I comment out the section where I set up bonded interactions between pairs of particles, I no longer experience a crash. Do I have to remove/disable a bonded interaction before I remove a particle?

I'm using a 64-bit OS (Ubuntu 18.04 and 16.04). I've also attached myconfig.hpp, in case the error is related to a specific module I'm using.

myconfig.txt
script.txt
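
The attached script is not reproduced in this thread. As a rough illustration only, a reproduction along the lines described above (pairs of bonded particles, then alternating integration and removal) might look like the sketch below. It assumes the ESPResSo 4.0-era Python API; the function names, particle counts, bond parameters and positions are all made up for illustration and are not taken from the original script.

```python
import random


def build_bonded_system(n_pairs=10, box=10.0, seed=42):
    """Create a small system in which particles 2k and 2k+1 are bonded.

    Assumes the ESPResSo 4.0-era API; every numeric value here is hypothetical.
    """
    # Imported lazily: espressomd is only importable when run through pypresso.
    import espressomd
    from espressomd.interactions import HarmonicBond

    rng = random.Random(seed)
    system = espressomd.System(box_l=[box, box, box])
    system.time_step = 0.01
    system.cell_system.skin = 0.4

    bond = HarmonicBond(k=1.0, r_0=1.0)
    system.bonded_inter.add(bond)

    for pair in range(n_pairs):
        first, second = 2 * pair, 2 * pair + 1
        pos = [rng.uniform(1.0, box - 2.0) for _ in range(3)]
        system.part.add(id=first, pos=pos)
        # Place the bonded partner one length unit away along x.
        system.part.add(id=second, pos=[pos[0] + 1.0, pos[1], pos[2]])
        system.part[first].add_bond((bond, second))
    return system


def integrate_and_remove(system, ids, batch=2, steps=100):
    """Alternate between integrating and removing `batch` particles by id."""
    removed = []
    for start in range(0, len(ids), batch):
        print("Integrating")
        system.integrator.run(steps)
        for pid in ids[start:start + batch]:
            print("Does particle {} exist? -> {}".format(
                pid, system.part.exists(pid)))
            print("Removing {}".format(pid))
            system.part[pid].remove()
            removed.append(pid)
    return removed
```

Under those assumptions, a driver would call `system = build_bonded_system()` followed by `integrate_and_remove(system, list(range(20)))`, and be launched through pypresso as shown above. Whether this particular sketch triggers the crash on an affected build has not been verified.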

@RudolfWeeber
Contributor

This seems to be working on the current development branch. The script also passes with PR #2441 for the 4.0.1 candidate branch.

@lotuspaperboy
Author

Do you have any other suggestions of things I could try to get to the bottom of this? It must be something to do with how I am compiling the binary. To build I do:

mkdir build
cd build
cp <store>/myconfig.hpp .
cmake <path_to_espresso>
make
sudo make install

I also tried building the latest version from GitHub, but I am still seeing the issue. Even running pypresso on a single core, I can see the issue:

user@XPS:~/espresso-bug-2437$ /usr/bin/mpirun -n 1 pypresso main.py
Setting up
Adding particles
Bonding particles
Warming
Integrating
Does particle 0 exist? -> True
Removing 0
Does particle 1 exist? -> True
Removing 1
Integrating
Does particle 2 exist? -> True
Removing 2
double free or corruption (fasttop)
[XPS] *** Process received signal ***
[XPS] Signal: Aborted (6)
[XPS] Signal code:  (-6)
[XPS] [ 0] /lib/x86_64-linux-gnu/libc.so.6(+0x3ef20)[0x7f65feb67f20]
[XPS] [ 1] /lib/x86_64-linux-gnu/libc.so.6(gsignal+0xc7)[0x7f65feb67e97]
[XPS] [ 2] /lib/x86_64-linux-gnu/libc.so.6(abort+0x141)[0x7f65feb69801]
[XPS] [ 3] /lib/x86_64-linux-gnu/libc.so.6(+0x89897)[0x7f65febb2897]
[XPS] [ 4] /lib/x86_64-linux-gnu/libc.so.6(+0x9090a)[0x7f65febb990a]
[XPS] [ 5] /lib/x86_64-linux-gnu/libc.so.6(cfree+0x6b4)[0x7f65febc1004]
[XPS] [ 6] /home/dpower/Espresso/espresso/build-gnu/src/core/libEspressoCore.so.4(_ZN5Utils4ListIijE4copyERKS1_+0x7d)[0x7f65fb05739d]
....

My gcc version is:

gcc (Ubuntu 7.3.0-27ubuntu1~18.04) 7.3.0
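
When triaging a trace like the one above, it can help to find the first frame that lies outside libc, since that is usually where the offending call originates. The sketch below is a small, stdlib-only parser for backtrace lines in the format printed here; the regular expression and the helper names `parse_frames` and `first_frame_outside` are illustrative assumptions, not part of ESPResSo or Open MPI.

```python
import re

# Matches lines such as:
#   [XPS] [ 6] /path/libEspressoCore.so.4(_ZN...+0x7d)[0x7f65fb05739d]
FRAME_RE = re.compile(
    r"\[\s*(?P<index>\d+)\]\s+"                  # frame number, e.g. "[ 6]"
    r"(?P<library>[^\s(]+)"                      # path of the shared object
    r"\((?P<symbol>[^+)]*)"                      # symbol name (may be empty)
    r"\+(?P<offset>0x[0-9a-fA-F]+)\)"            # offset within the symbol
    r"\[(?P<address>0x[0-9a-fA-F]+)\]"           # absolute address
)


def parse_frames(text):
    """Return one dict per backtrace frame found in `text`."""
    frames = []
    for line in text.splitlines():
        match = FRAME_RE.search(line)
        if match:
            frame = match.groupdict()
            frame["index"] = int(frame["index"])
            frames.append(frame)
    return frames


def first_frame_outside(frames, library_fragment="libc.so"):
    """Return the first frame whose library path lacks `library_fragment`."""
    for frame in frames:
        if library_fragment not in frame["library"]:
            return frame
    return None
```

Applied to the trace above, this picks out frame 6 in libEspressoCore.so.4. Passing its symbol to a demangler such as `c++filt` should give `Utils::List<int, unsigned int>::copy(Utils::List<int, unsigned int> const&)`, which suggests the abort is raised while the library's own list type is copying data rather than in the user script.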

@mkuron
Member

mkuron commented Jan 15, 2019

It's an Espresso bug, fixed by #2410. Please either use the current version from Git or wait a few days until we release version 4.0.1, which contains that patch.

@RudolfWeeber
Contributor

RudolfWeeber commented Jan 15, 2019 via email

@lotuspaperboy
Author

Thanks Rudolf, I downloaded and compiled the 4.0.1 branch and I'm no longer seeing the issue. I'll run my main script again overnight, but it looks like the issue has been solved. Thanks!

@fweik fweik closed this as completed Jan 23, 2019