
Parallel Simulations #576

Closed
costashatz opened this issue Dec 29, 2015 · 9 comments

@costashatz
Contributor

Hello,

Merry Christmas to everyone.

I am running multiple simulations in parallel (around 24) and I am getting segfaults. This is the backtrace:

#0  0x00007ffff667c5ab in std::_Rb_tree_rebalance_for_erase(std::_Rb_tree_node_base*, std::_Rb_tree_node_base&) () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#1  0x00007ffff76b5cce in dart::dynamics::Frame::changeParentFrame(dart::dynamics::Frame*) () from /usr/lib/libdart-core.so.5.1
#2  0x00007ffff76b638d in dart::dynamics::Frame::~Frame() () from /usr/lib/libdart-core.so.5.1
#3  0x00007ffff7769ac9 in dart::dynamics::BodyNode::~BodyNode() () from /usr/lib/libdart-core.so.5.1
#4  0x00007ffff7769ba1 in dart::dynamics::BodyNode::~BodyNode() () from /usr/lib/libdart-core.so.5.1
#5  0x00007ffff77ae8ee in dart::dynamics::Skeleton::~Skeleton() () from /usr/lib/libdart-core.so.5.1
#6  0x00007ffff77ba17d in std::_Sp_counted_ptr<dart::dynamics::Skeleton*, (__gnu_cxx::_Lock_policy)2>::_M_dispose() () from /usr/lib/libdart-core.so.5.1
#7  0x000000000040685b in _M_release (this=0x7fffe00016e0) at /usr/include/c++/4.9/bits/shared_ptr_base.h:149
#8  ~__shared_count (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:666
#9  ~__shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:914
#10 ~shared_ptr (this=<optimized out>, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr.h:93
#11 ~Hexapod (this=<optimized out>, __in_chrg=<optimized out>) at /home/kchatzil/Workspaces/ResiBots/include/hexapod_dart/hexapod.hpp:8
#12 destroy<hexapod_dart::Hexapod> (this=<optimized out>, __p=<optimized out>) at /usr/include/c++/4.9/ext/new_allocator.h:124
#13 _S_destroy<hexapod_dart::Hexapod> (__p=<optimized out>, __a=...) at /usr/include/c++/4.9/bits/alloc_traits.h:282
#14 destroy<hexapod_dart::Hexapod> (__a=..., __p=<optimized out>) at /usr/include/c++/4.9/bits/alloc_traits.h:411
#15 std::_Sp_counted_ptr_inplace<hexapod_dart::Hexapod, std::allocator<hexapod_dart::Hexapod>, (__gnu_cxx::_Lock_policy)2>::_M_dispose (this=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:524
#16 0x00000000004063c6 in _M_release (this=0x7fffe001e5e0) at /usr/include/c++/4.9/bits/shared_ptr_base.h:149
#17 ~__shared_count (this=0x7fffed4c9c38, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:666
#18 ~__shared_ptr (this=0x7fffed4c9c30, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr_base.h:914
#19 ~shared_ptr (this=0x7fffed4c9c30, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/shared_ptr.h:93
#20 foo () at ../clone.cpp:16
#21 0x00007ffff66c0e30 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6
#22 0x00007ffff73966aa in start_thread (arg=0x7fffed4ca700) at pthread_create.c:333
#23 0x00007ffff5e24eed in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109

The code I am using is extremely simple:

auto robot = global::global_robot->clone();
hexapod_dart::HexapodDARTSimu simu = hexapod_dart::HexapodDARTSimu(ctrl, robot);
simu.run(5);

This runs in parallel (roughly as in the sketch below). The full code can be found here (not the parallel driver itself; I tried both plain pthreads and TBB and I get the same segfault).
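Roughly, the parallel driver looks like the sketch below (plain std::thread just for illustration; the real code uses pthreads/TBB but crashes the same way). The hexapod_dart include is omitted and the type and contents of ctrl are placeholders; the real values are in the linked code.

#include <thread>
#include <vector>

// Each worker clones the shared robot and runs its own simulation.
void run_one(std::vector<double> ctrl)
{
    auto robot = global::global_robot->clone();      // clone of the shared robot
    hexapod_dart::HexapodDARTSimu simu(ctrl, robot); // independent simulation
    simu.run(5);
}   // robot's shared_ptr is released here; the crash happens in this destructor chain

int main()
{
    std::vector<double> ctrl(36, 0.0); // placeholder controller parameters
    std::vector<std::thread> workers;
    for (int i = 0; i < 24; ++i)       // ~24 simulations in parallel
        workers.emplace_back(run_one, ctrl);
    for (auto& w : workers)
        w.join();
}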

The simulations are obviously sharing something in the Frame class, but I cannot find exactly what. Also, the segfaults only appear when the objects are being deleted.

Any ideas?

@mxgrey
Member

mxgrey commented Dec 29, 2015

I have a strong feeling I know what would cause this since it's occurring in Frame during deletions.

There is a special frame called the WorldFrame, or Frame::World, which is static (i.e. a singleton). The World frame is meant to be read-only, which should make it safe to use as a singleton; however, in the current implementation it tracks all of its child frames (because the Frame class is designed to keep track of its children). If child frames of the shared World frame are deleted simultaneously in different threads, it's easy to imagine a segfault occurring.
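To make the failure mode concrete, here is a minimal, self-contained sketch (toy code illustrating my guess above, not DART's actual Frame implementation): a singleton parent tracks its children in a std::set, and each child's destructor erases it from that set without any synchronization. With enough threads this corrupts the set's red-black tree and crashes inside _Rb_tree_rebalance_for_erase, which is exactly where the backtrace points.

#include <set>
#include <thread>
#include <vector>

struct ToyFrame
{
    // Stand-in for the singleton World frame's child registry: shared,
    // mutable, and completely unsynchronized.
    static std::set<ToyFrame*> world_children;

    ToyFrame()  { world_children.insert(this); } // register with the "World"
    ~ToyFrame() { world_children.erase(this); }  // unregister: the data race
};

std::set<ToyFrame*> ToyFrame::world_children;

int main()
{
    std::vector<std::thread> threads;
    for (int t = 0; t < 24; ++t)
        threads.emplace_back([] {
            for (int i = 0; i < 10000; ++i)
            {
                ToyFrame frame; // construct and immediately destroy a child
            }
        });
    for (auto& t : threads)
        t.join();
}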

I think the solution will be to simply alter the behavior of the Frame class so that the World frame does not keep track of its children. That should make the World frame strictly read-only and therefore safe for use as a singleton. I should be able to make a patch for this a bit later today, and then you can check to see if it fixes your problem.

@jslee02
Member

jslee02 commented Dec 29, 2015

I think it would also be good to have some regression tests in DART for parallel simulations like this.

@mxgrey
Member

mxgrey commented Dec 29, 2015

I agree in principle, but it's worth noting that unit tests for race conditions are unreliable: races are triggered by the operating system's scheduling, which we don't control. We can (and probably should) still have unit tests for concurrency, but we need to recognize that a passing test may simply mean the race was not triggered on that run.
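For example, a concurrency test could look roughly like the sketch below. This is only an illustration, not actual DART test code; the DART 5.x header paths, Skeleton::create/clone, createJointAndBodyNodePair, and the gtest harness are assumptions on my part, and (as noted above) a pass would never prove the race is gone.

#include <thread>
#include <vector>

#include <gtest/gtest.h>
#include <dart/dynamics/Skeleton.h>
#include <dart/dynamics/FreeJoint.h>
#include <dart/dynamics/BodyNode.h>

// Build a small skeleton to clone; a single free-floating body is enough.
static dart::dynamics::SkeletonPtr makeTestSkeleton()
{
    auto skel = dart::dynamics::Skeleton::create("test_skeleton");
    skel->createJointAndBodyNodePair<dart::dynamics::FreeJoint>();
    return skel;
}

TEST(Concurrency, CloneAndDestroySkeletonsInParallel)
{
    dart::dynamics::SkeletonPtr original = makeTestSkeleton();

    std::vector<std::thread> threads;
    for (int t = 0; t < 8; ++t)
        threads.emplace_back([&original] {
            for (int i = 0; i < 1000; ++i)
            {
                // Each clone attaches frames under the shared World frame and
                // detaches them again when it is destroyed at the end of this
                // iteration, which is the pattern suspected of triggering the race.
                dart::dynamics::SkeletonPtr copy = original->clone();
                EXPECT_TRUE(copy != nullptr);
            }
        });

    for (auto& t : threads)
        t.join();
}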

@jslee02
Member

jslee02 commented Dec 29, 2015

I agree with you on that point; reliably reproducing a race condition is probably not always possible. But it would be better than nothing. 😄 We might want to start with Konstantinos's case.

@costashatz
Contributor Author

Thanks for the prompt answer!

When you have the patch, I'll gladly check whether it solves the problem. I could also write a small test for parallel simulations (not a unit test; an evolutionary algorithm seems like a better fit).

@mxgrey
Member

mxgrey commented Dec 29, 2015

I managed to create a unit test that reliably reproduces a race condition which I believe is analogous to the one you were experiencing. You can find it in pull request #577 on the branch grey/fix_world_concurrency. Whenever you get a chance to test it out, please let me know if it resolves your issue.

@costashatz
Contributor Author

It seems to be working.

I will leave a big job running so that we can be sure, and I will let you know in around 2 days whether it works.

Thanks!

@costashatz
Contributor Author

Happy New Year to all of you!

It seems to work quite well!

@costashatz
Contributor Author

Fixed in #577

@jslee02 added this to the DART 6.0.0 milestone on Jan 3, 2016