ParticleCache_test fails on Fedora 31 #2985
Note: the fix will have to be applied to both 4.0 and 4.1 (devel) branches.
@jngrad will there be a patch release? Otherwise, can you point me at the patch, so I can include it in the rpm?
Unfortunately, there is no patch yet. Since this ticket was assigned to the 4.1 release, I left a reminder that any fix will have to be cherry-picked into the 4.0 branch.
I can reproduce the error message in Fedora 31 (replacing
As we did for #2507, we could patch boost::mpi in F31, if you have a fix.
When running
I recompiled espresso in a new container of the same image but with
so UB (undefined behavior)
F30 and F31 actually have the same Boost version. So it must be due to different compiler or MPI versions. |
On Fri, Jul 19, 2019 at 06:45:19AM -0700, Jean-Noël Grad wrote:
Note: the fix will have to be applied to both 4.0 and 4.1 (devel) branches.
I don't think we'll do another 4.0 bugfix release.
Efforts need to be focused on getting 4.1 ready.
@mkuron Fedora 31 uses OpenMPI 4.0.1, while Fedora 30 and Ubuntu <= eoan use <= 3.1.4.
@KaiSzuttor We probably should de-milestone this, since OpenMPI 4 is not fully backward-compatible with OpenMPI 3 source code. It is unclear to me how much work will be involved to support both v3 and v4 in our codebase, or when our user base will start the transition to v4.
How is that possible? MPI is a standard. We also support other implementations like MPICH (tested in the Intel container). If OpenMPI does not comply with the standard, we won't be able to support it, but I am pretty sure that it is compliant.
This is how I interpreted the language in "Open MPI: Version Number Methodology":
If the issue at hand is indeed a bug in our codebase, we should address it, but if it comes from an API change in OpenMPI, it will be harder to estimate the amount of effort to invest in the v4 migration.
This is a bit misleading, as it is about ABI compatibility and not about API compatibility. This means you cannot combine libmpi.so and mpi.h from different major versions, only from different minor versions. API compatibility is guaranteed between all MPI implementations (MPICH, OpenMPI 3, OpenMPI 4, ...).
Any news on this?
Tracing the source of the failure got me to this line in the code, which crashes only on MPI rank 0: espresso/src/core/ParticleCache.hpp Lines 247 to 248 in 5322996
The boost::container::flat_set<Particle, detail::IdCompare> remote_parts object has correct iterators, and even when using a temporary object as the output of boost::mpi::reduce to exclude any possibility of iterator invalidation, the issue persists (using 10 particles instead of 10000):
/* Reduce data to the master by merging the flat_sets from
* the nodes in a reduction tree. */
fprintf(stderr, "%d: before >> ", m_cb.comm().rank());
for(auto &p : remote_parts) {
fprintf(stderr, "%d ", p.identity());
}
fprintf(stderr, "<<\n");
if (m_cb.comm().rank() == 0) {
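/* On the root, reduce into a temporary to rule out iterator
 * invalidation from using remote_parts as both input and output. */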
map_type remote_parts_tmp{};
boost::mpi::reduce(m_cb.comm(), remote_parts, remote_parts_tmp,
detail::Merge<map_type, detail::IdCompare>(), 0);
remote_parts = remote_parts_tmp;
} else {
boost::mpi::reduce(m_cb.comm(), remote_parts, remote_parts,
detail::Merge<map_type, detail::IdCompare>(), 0);
}
fprintf(stderr, "%d: after >> ", m_cb.comm().rank());
for(auto &p : remote_parts) {
fprintf(stderr, "%d ", p.identity());
}
fprintf(stderr, "<<\n");
}
In a non-failing environment (OpenMPI 3.1, Fedora 30), using the same boost version (1.69.0):
In Fedora 30, I was finally able to reproduce the same bug (it occurs at the same line) when running ctest -V -R ParticleCache_test
It is not fully reproducible and sometimes requires 4 threads (
But in OpenMPI v4, the test doesn't show these warnings anymore; instead, it crashes and shows a list of new warnings. Only the
The vader issue is not limited to Docker; #2271 saw it on a desktop computer. However, since @junghans's build logs do not contain any
Bug not reproducible on Fedora 31 with MPICH. Running the test with rank 0 under GDB:
mpiexec -n 1 gdb src/core/unit_tests/ParticleCache_test : -n 1 src/core/unit_tests/ParticleCache_test
GNU gdb (GDB) Fedora 8.3.50.20190702-20.fc31
Copyright (C) 2019 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from src/core/unit_tests/ParticleCache_test...
(gdb) run
Starting program: /home/espresso/build/src/core/unit_tests/ParticleCache_test
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff6c79700 (LWP 2765)]
[New Thread 0x7ffff62aa700 (LWP 2766)]
[New Thread 0x7fffeffff700 (LWP 2767)]
Running 1 test case...
0: before >> 0 1 2 3 4 5 6 7 8 9 <<
Running 1 test case...
1: before >> 10 11 12 13 14 15 16 17 18 19 <<
1: after >> 10 11 12 13 14 15 16 17 18 19 <<
unknown location(0): fatal error: in "update": Throw location unknown (consider using BOOST_THROW_EXCEPTION)
Dynamic exception type: boost::wrapexcept<boost::mpi::exception>
std::exception::what: MPI_Recv: MPI_ERR_TAG: invalid tag
/local/es/src/core/unit_tests/ParticleCache_test.cpp(145): last checkpoint
*** No errors detected
*** 1 failure is detected in the test module "ParticleCache test"
[1564932286.293320] [ded9866126f7:2761 :0] mpool.c:37 UCX WARN object 0x7ffff467dee0 was not returned to mpool mm_recv_desc
[Thread 0x7fffeffff700 (LWP 2767) exited]
[Thread 0x7ffff62aa700 (LWP 2766) exited]
[Thread 0x7ffff6c79700 (LWP 2765) exited]
[Inferior 1 (process 2761) exited with code 0311]
Missing separate debuginfos, use: dnf debuginfo-install zlib-1.2.11-16.fc31.x86_64
(gdb) bt
No stack.
I can't get a backtrace in gdb.
Did you try catch throw?
Thanks! We can finally work on something:
(gdb) catch throw
Catchpoint 1 (throw)
(gdb) run
Starting program: /home/espresso/build/src/core/unit_tests/ParticleCache_test
Missing separate debuginfos, use: dnf debuginfo-install glibc-2.30-1.fc31.x86_64
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
[New Thread 0x7ffff6c79700 (LWP 2765)]
[New Thread 0x7ffff62aa700 (LWP 2766)]
[New Thread 0x7fffeffff700 (LWP 2767)]
Running 1 test case...
0: before >> 0 1 2 3 4 5 6 7 8 9 <<
Running 1 test case...
1: before >> 10 11 12 13 14 15 16 17 18 19 <<
1: after >> 10 11 12 13 14 15 16 17 18 19 <<
Thread 1 "ParticleCache_t" hit Catchpoint 1 (exception thrown), 0x00007ffff75e4a22 in __cxxabiv1::__cxa_throw (obj=0x61db90,
tinfo=0x474dd8 <typeinfo for boost::wrapexcept<boost::mpi::exception>>,
dest=0x45325c <boost::wrapexcept<boost::mpi::exception>::~wrapexcept()>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
78 PROBE2 (throw, obj, tinfo);
Missing separate debuginfos, use: dnf debuginfo-install zlib-1.2.11-16.fc31.x86_64
(gdb) bt
#0 0x00007ffff75e4a22 in __cxxabiv1::__cxa_throw (obj=0x61db90,
tinfo=0x474dd8 <typeinfo for boost::wrapexcept<boost::mpi::exception>>,
dest=0x45325c <boost::wrapexcept<boost::mpi::exception>::~wrapexcept()>)
at ../../../../libstdc++-v3/libsupc++/eh_throw.cc:78
#1 0x0000000000450f50 in boost::throw_exception<boost::mpi::exception> (e=...)
at /usr/include/boost/throw_exception.hpp:70
#2 0x00007ffff78776e8 in boost::mpi::detail::packed_archive_recv (
comm=0x7ffff79a1b60 <ompi_mpi_comm_world>, source=<optimized out>,
tag=<optimized out>, ar=..., status=...)
at libs/mpi/src/point_to_point.cpp:93
#3 0x000000000045b76b in boost::mpi::detail::tree_reduce_impl<boost::container::flat_set<Particle, detail::IdCompare, boost::container::new_allocator<Particle> >, detail::Merge<boost::container::flat_set<Particle, detail::IdCompare, boost::container::new_allocator<Particle> >, detail::IdCompare> > (comm=...,
in_values=0x7fffffffb370, n=1, out_values=0x7fffffffb0d0, op=..., root=0)
at /usr/include/boost/mpi/collectives/reduce.hpp:134
#4 0x0000000000459338 in boost::mpi::detail::reduce_impl<boost::container::flat_set<Particle, detail::IdCompare, boost::container::new_allocator<Particle> >, detail::Merge<boost::container::flat_set<Particle, detail::IdCompare, boost::container::new_allocator<Particle> >, detail::IdCompare> > (comm=...,
in_values=0x7fffffffb370, n=1, out_values=0x7fffffffb0d0, op=..., root=0)
at /usr/include/boost/mpi/collectives/reduce.hpp:292
#5 0x000000000045724a in boost::mpi::reduce<boost::container::flat_set<Particle, detail::IdCompare, boost::container::new_allocator<Particle> >, detail::Merge<boost::container::flat_set<Particle, detail::IdCompare, boost::container::new_allocator<Particle> >, detail::IdCompare> > (comm=..., in_value=..., out_value=..., op=..., root=0) at /usr/include/boost/mpi/collectives/reduce.hpp:358
#6 0x000000000044a2d4 in ParticleCache<update::test_method()::<lambda()>, Utils::NoOp, const std::vector<Particle, std::allocator<Particle> >, Particle>::m_update(void) (this=0x7fffffffb330) at /local/es/src/core/ParticleCache.hpp:259
#7 0x000000000044aa1f in ParticleCache<update::test_method()::<lambda()>, Utils::NoOp, const std::vector<Particle, std::allocator<Particle> >, Particle>::update(void) (this=0x7fffffffb330) at /local/es/src/core/ParticleCache.hpp:408
#8 0x0000000000449c0a in ParticleCache<update::test_method()::<lambda()>, Utils::NoOp, const std::vector<Particle, std::allocator<Particle> >, Particle>::size(void) (this=0x7fffffffb330) at /local/es/src/core/ParticleCache.hpp:428
#9 0x0000000000449510 in update::test_method (this=0x7fffffffb5ef) at /local/es/src/core/unit_tests/ParticleCache_test.cpp:145
#10 0x0000000000449026 in update_invoker () at /local/es/src/core/unit_tests/ParticleCache_test.cpp:99
#11 0x0000000000458ecf in boost::detail::function::void_function_invoker0<void (*)(), void>::invoke (function_ptr=...) at /usr/include/boost/function/function_template.hpp:117
#12 0x00007ffff7766582 in boost::function0<void>::operator() (this=<optimized out>) at ./boost/function/function_template.hpp:677
#13 boost::detail::forward::operator() (this=<optimized out>) at ./boost/test/impl/execution_monitor.ipp:1312
#14 boost::detail::function::function_obj_invoker0<boost::detail::forward, int>::invoke (function_obj_ptr=...) at ./boost/function/function_template.hpp:137
#15 0x00007ffff77655ed in boost::function0<int>::operator() (this=0x7fffffffcaa0) at ./boost/function/function_template.hpp:677
#16 boost::detail::do_invoke<boost::shared_ptr<boost::detail::translator_holder_base>, boost::function<int ()> >(boost::shared_ptr<boost::detail::translator_holder_base> const&, boost::function<int ()> const&) (F=..., tr=...) at ./boost/test/impl/execution_monitor.ipp:286
#17 boost::execution_monitor::catch_signals(boost::function<int ()> const&) (this=0x7ffff77e1f80 <boost::unit_test::unit_test_monitor_t::instance()::the_inst>, F=...) at ./boost/test/impl/execution_monitor.ipp:875
#18 0x00007ffff7765678 in boost::execution_monitor::execute(boost::function<int ()> const&) (this=0x7ffff77e1f80 <boost::unit_test::unit_test_monitor_t::instance()::the_inst>, F=...) at ./boost/test/impl/execution_monitor.ipp:1214
#19 0x00007ffff776574e in boost::execution_monitor::vexecute(boost::function<void ()> const&) (this=this@entry=0x7ffff77e1f80 <boost::unit_test::unit_test_monitor_t::instance()::the_inst>, F=...) at /usr/include/c++/9/new:174
#20 0x00007ffff778f99f in boost::unit_test::unit_test_monitor_t::execute_and_translate(boost::function<void ()> const&, unsigned int) (this=0x7ffff77e1f80 <boost::unit_test::unit_test_monitor_t::instance()::the_inst>, func=..., timeout=timeout@entry=0) at ./boost/test/impl/unit_test_monitor.ipp:49
#21 0x00007ffff7775680 in boost::unit_test::framework::state::execute_test_tree (this=this@entry=0x7ffff77e1ae0 <boost::unit_test::framework::impl::(anonymous namespace)::s_frk_state()::the_inst>, tu_id=tu_id@entry=65536, timeout=0, p_random_generator=p_random_generator@entry=0x7fffffffcce0) at ./boost/test/utils/class_properties.hpp:58
#22 0x00007ffff7775914 in boost::unit_test::framework::state::execute_test_tree (this=0x7ffff77e1ae0 <boost::unit_test::framework::impl::(anonymous namespace)::s_frk_state()::the_inst>, tu_id=tu_id@entry=1, timeout=timeout@entry=0, p_random_generator=p_random_generator@entry=0x0) at ./boost/test/impl/framework.ipp:737
#23 0x00007ffff776cc74 in boost::unit_test::framework::run (id=1, id@entry=4294967295, continue_test=continue_test@entry=true) at ./boost/test/impl/framework.ipp:1631
#24 0x00007ffff778e822 in boost::unit_test::unit_test_main (init_func=<optimized out>, argc=<optimized out>, argv=<optimized out>) at ./boost/test/impl/unit_test_main.ipp:247
#25 0x00000000004498c7 in main (argc=1, argv=0x7fffffffd118) at /local/es/src/core/unit_tests/ParticleCache_test.cpp:250
It is our impression that this is caused by an incompatibility between OpenMPI 4 and boost-mpi. We're giving up for now.
We need a minimum working example without Espresso so we can file a bug with OpenMPI or Boost.MPI as appropriate. Otherwise this bug is going to haunt us once Ubuntu 20.04 LTS ships with OpenMPI 4.0... |
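A hypothetical sketch of such a standalone reproducer (illustrative only, not the actual MWE from this thread): it drives boost::mpi::reduce with a serialized container and a user-defined merge operation, which is the same Boost.MPI code path as the failing ParticleCache reduction.
#include <boost/mpi.hpp>
#include <boost/serialization/vector.hpp>
#include <cstdio>
#include <vector>

/* User-defined merge operation standing in for detail::Merge. */
struct Merge {
  std::vector<int> operator()(std::vector<int> a, std::vector<int> const &b) const {
    a.insert(a.end(), b.begin(), b.end());
    return a;
  }
};

int main(int argc, char **argv) {
  boost::mpi::environment env(argc, argv);
  boost::mpi::communicator comm;

  /* Each rank contributes 10 ids, like the reduced-size test above. */
  std::vector<int> local(10);
  for (int i = 0; i < 10; ++i)
    local[i] = 10 * comm.rank() + i;

  /* std::vector<int> is not a plain MPI datatype, so Boost.MPI serializes it
   * and takes the tree_reduce_impl / packed_archive_recv path seen in the
   * backtrace above. */
  std::vector<int> result;
  boost::mpi::reduce(comm, local, result, Merge{}, 0);

  if (comm.rank() == 0)
    std::printf("rank 0 merged %zu ids\n", result.size());
  return 0;
}
Built against Boost.MPI and run with two or more ranks, this should reach the same MPI_Recv call in packed_archive_recv if the tag bug is present.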
@mkuron I was able to factor out the boost unit test logic, most of the Particle structure and most of the ParticleCache logic while preserving the backtrace in an MWE, but it still has the following dependencies:
I'm not familiar enough with our
@mkuron finally got the
That's still quite a lot of code... is there any chance you could reduce it further? I still have no idea where the issue might be coming from.
Well, the first overloaded function of
So do we now know whether it's OpenMPI's or Boost.MPI's fault? If not, we should probably reach out to Boost.MPI first and open an issue with them.
I still can't tell. Manually following the sample.cpp GDB trace (file followed by the corresponding code):
/usr/include/boost/mpi/collectives/reduce.hpp:292
detail::tree_reduce_impl(comm, in_values, n, out_values, op, root, is_commutative<Op, T>());
/usr/include/boost/mpi/collectives/reduce.hpp:134
detail::packed_archive_recv(comm, child, tag, ia, status);
libs/mpi/src/point_to_point.cpp:93
BOOST_MPI_CHECK_RESULT(MPI_Recv, (ar.address(), count, MPI_PACKED, status.MPI_SOURCE, status.MPI_TAG, comm, &status));
/usr/include/boost/mpi/exception.hpp:100
boost::throw_exception(boost::mpi::exception(#MPIFunc, _check_result));
/usr/include/boost/throw_exception.hpp:70
throw boost::exception_detail::enable_both( e );
/usr/include/boost/exception/exception.hpp:517
return boost::exception_detail::wrapexcept<typename remove_error_info_injector<boost::mpi::exception>::type>( enable_error_info( x ) );
/usr/include/boost/exception/exception.hpp:486
boost::exception_detail::clone_impl<typename exception_detail::enable_error_info_return_type<remove_error_info_injector<boost::mpi::exception>::type>::type> ( x ) {}
/usr/include/boost/exception/exception.hpp:436
boost::exception_detail::copy_boost_exception(x)
/* after this point it looks really complex, probably a setup for std::exception */
It seemed the throw was triggered by the return value
Modifying
Both of these values are equal to
Adding a printf in
The authors of Boost.MPI will probably be able to interpret the GDB backtrace.
Try reporting it at https://github.com/boostorg/mpi/issues. Tell them that we are unsure whether this is due to Boost.MPI incorrectly using MPI or OpenMPI 4.0 violating the MPI standard.
New datapoint: the error is not reproducible on Ubuntu 18.04 with OpenMPI 4.0.1 and boost 1.69. I now think the issue comes from Fedora 31. For example the
I can't tell for sure which of the
@junghans if this can be of any help, here is my Dockerfile with the OpenMPI build procedure. I didn't look into compiling
Where does the value of MPI_TAG_UB come from? It seems like it can be set from the outside too: https://github.com/open-mpi/ompi/blob/7962a8e40b132172488c8f3a38f531af44097b76/ompi/attribute/attribute_predefined.c#L132
I'm using code extracted from boost::mpi::environment::max_tag() |
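For reference, a minimal sketch of the attribute lookup that max_tag() builds on (plain MPI calls only, not the exact Boost.MPI source; Boost.MPI subtracts a small number of reserved tags from this value):
#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
  MPI_Init(&argc, &argv);

  /* MPI_TAG_UB is a predefined cached attribute on MPI_COMM_WORLD; the MPI
   * standard requires it to be at least 32767. */
  int *tag_ub = nullptr;
  int flag = 0;
  MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_TAG_UB, &tag_ub, &flag);
  if (flag)
    std::printf("MPI_TAG_UB = %d\n", *tag_ub);

  MPI_Finalize();
  return 0;
}
Comparing its output across MPI builds would show directly where a differing upper bound comes from.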
Sure, but how does OpenMPI decide what value to use? |
@opoplawski, any idea what is special about rawhide's openmpi package?
@jngrad Do you guys have a mini-reproducer I can give to Fedora's openmpi maintainer? |
As the problem is Fedora-specific, just open a bug report here: https://bugzilla.redhat.com/
We're currently unsure which package the issue comes from. The GDB backtrace is incomplete, and when I tried to fill in the gaps by manually inspecting the boost::mpi header files, I ended up with a path that wasn't actually visited, because commenting out parts of that path had no effect on the GDB backtrace, except for showing nonexistent filenames. Without a complete backtrace, it's difficult to find which part of boost::mpi is calling
I discovered that the issue is somewhere in the UCX support of OpenMPI. If you rebuild OpenMPI without UCX, the problem goes away and the tag reports the correct value, as I mentioned on the Fedora bug report.
Not a bug in Open MPI. There are no guarantees in the MPI standard on what the tag upper bound is. Any code using MPI must use tags below the tag upper bound.
MPI tag issue reported on Red Hat Bugzilla under Bug 1746564. The root cause was an incorrect value for MPI_TAG_UB.
Yup. pml/ucx had a bug. Sorry for the noise. Max tag is something that should be covered by our tests but apparently not. |
Reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1728057
Build logs here: https://koji.fedoraproject.org/koji/taskinfo?taskID=36164517
This only seems to be an issue on 64-bit archs (x86_64, aarch64, ppc64le); the other archs (i686, s390x, armv7hl) pass.