Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

revert followup: Figure out why custom_test fails intermittently due to #11416 #11424

Closed
EricCousineau-TRI opened this issue May 9, 2019 · 23 comments · Fixed by #12095 or #12589
Closed

Comments

@EricCousineau-TRI
Copy link
Contributor

EricCousineau-TRI commented May 9, 2019

Relevant Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1557346698044300

I believe the repro recipe is using GCC, for #11416 @ cabd2ab:

bazel test -c dbg --runs_per_test=50 //bindings/pydrake/systems:py/custom_test

EDIT: Er, I guess, you have to run it on Xenial, most likely with GCC 5.4, per repro in #11421.

EDIT 2: From original Slack convo, failure was here:
https://drake-jenkins.csail.mit.edu/job/linux-xenial-gcc-bazel-experimental-debug/3638/consoleText

[2019-05-08T19:34:51.034Z] F.
[2019-05-08T19:34:51.034Z] ======================================================================
[2019-05-08T19:34:51.034Z] FAIL: test_leaf_system_overrides (custom_test.TestCustom)
[2019-05-08T19:34:51.034Z] ----------------------------------------------------------------------
[2019-05-08T19:34:51.034Z] Traceback (most recent call last):
[2019-05-08T19:34:51.034Z]   File "/media/ephemeral0/ubuntu/workspace/linux-xenial-gcc-bazel-experimental-debug/_bazel_ubuntu/74486129db018e3881280473f1d38471/sandbox/linux-sandbox/11992/execroot/drake/bazel-out/k8-dbg/bin/bindings/pydrake/systems/py/custom_test.runfiles/drake/bindings/pydrake/systems/test/custom_test.py", line 314, in test_leaf_system_overrides
[2019-05-08T19:34:51.034Z]     self.assertTrue(system.called_publish)
[2019-05-08T19:34:51.034Z] AssertionError: False is not true
[2019-05-08T19:34:51.034Z] 

\cc @RussTedrake @jwnimmer-tri

@EricCousineau-TRI
Copy link
Contributor Author

Huh, tried reproducing it in our Xenial Docker build recipe @ cabd2ab, but to no avail - at least per the above command...

@jwnimmer-tri Will see if I can pick your brain tomorrow.

@EricCousineau-TRI
Copy link
Contributor Author

Per f2f with Jamie, this test failed again on CI on master.

If it helps at all, instructions for valgrind:
https://gist.github.com/EricCousineau-TRI/ce79d3265bb72934267e24ddc8c623bc

I build it, use virtualenv to point to that, install deps, and add --python_path and --action_env=DRAKE_PYTHON_BIN_PATH in user.bazelrc.

@EricCousineau-TRI
Copy link
Contributor Author

Per f2f, @BetsyMcPhail can repro on machine with Xenial installed directly. Should could not repro for running custom_test multiple times (even 1000x), but she found that running test //... for the above recipe does cause the custom_test to fail...

@EricCousineau-TRI
Copy link
Contributor Author

Still very difficult to reproduce, but we should be able to check the results of running it under valgrind to snag the memory leak, or whatever it is.

@BetsyMcPhail
Copy link
Contributor

I ran valgrind per Eric's instructions above - currently sorting through the output to determine which, if any, reported errors are real.

@EricCousineau-TRI
Copy link
Contributor Author

Sweet! Can you post a Gist of the output?

@EricCousineau-TRI
Copy link
Contributor Author

Per f2f, Betsy used the Python+NumPy suppressions; may try one more time to run with Drake suppressions before posting output.

@BetsyMcPhail
Copy link
Contributor

valgrind output for custom_test: https://gist.github.com/BetsyMcPhail/c2e223f65b7fff11c44f04e0253e21cb

@EricCousineau-TRI
Copy link
Contributor Author

Hm... That's quite a few errors, and but they all look like just leaks.

Just check, can you post the relevant reproduction recipe, including this stuff?

  • The Drake source SHA, with mods included (e.g. if including other suppressions)
  • Any mods to the Python+valgrind setup, if necessary
  • The full command-line for your bazel run command-line

@BetsyMcPhail
Copy link
Contributor

  1. Checkout cabd2a
  • Update tools/dynamic_analysis/bazel.rc so we can run memcheck on Python tests:
-build:memcheck --test_lang_filters=-sh,-py
+build:memcheck --test_lang_filters=-sh
  • Update tools/dynamic_analysis/valgrind.sh to include any extra suppression files (e.g. python-debug/cpython/Misc/valgrind-python.supp and customtest_valgrind.supp)
  1. Follow Python+Valgrind setup, modifications:

    • Need a 'make' / 'make install' after the 'configure' in the first step
    • Skip GDB setup
  2. bazel test --config=memcheck //bindings/pydrake/systems:py/custom_test

@EricCousineau-TRI
Copy link
Contributor Author

Per f2f, trying to run all tests under DRD + Hellgrind.

@jwnimmer-tri
Copy link
Collaborator

Just as my 2c: this doesn't smell like a memory or race error to me. It smells more like a multiple-inheritance / ODR / python imp kind of error. If I were debugging this, my next step would be to add more instrumentation to the test to figure out where and why the assertTrue(called_publish) is ending up false, and if anything else is broken, etc.

@EricCousineau-TRI
Copy link
Contributor Author

EricCousineau-TRI commented Jun 27, 2019

FTR Most likely relates workaround in here: #11719
Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1561403942037200

Failure reproducibility is now almost 100%.

@BetsyMcPhail
Copy link
Contributor

BetsyMcPhail commented Jul 2, 2019

What we know so far:

The 'test_all_systems_leaf_overrides' test is failing because the 'called_publish' flag isn't set. The 'called_publish' flag should be set in the 'TrivialSystem::DoPublish' function that is is implemented as part of the test case.

To reproduce on Bionic:

  • Check out 12a8432
  • CC=clang CXX=clang++ bazel test -c dbg --config=python3 //bindings/pydrake/systems:py/custom_test Fail!
  • In custom_test.py, rename'test_leaf_system_overrides' to 'test_all_leaf_system_overrides'
  • CC=clang CXX=clang++ bazel test -c dbg --config=python3 //bindings/pydrake/systems:py/custom_test Passes!

Following the code through, it's possible to see that in the PYDRAKE_TRY_PROTECTED_OVERLOAD macro, the first call to PYBIND11_OVERLOAD_INT macro doesn't return as expected.

#define PYDRAKE_TRY_PROTECTED_OVERLOAD(RETURN, CLASS, NAME, ...)       \
  PYBIND11_OVERLOAD_INT(RETURN, CLASS, NAME, __VA_ARGS__);             \
  if (py::get_overload<CLASS>(this, "_" NAME)) {                       \
    WarnDeprecated(DeprecatedProtectedAliasMessage(NAME, "override")); \
    PYBIND11_OVERLOAD_INT(RETURN, CLASS, "_" NAME, __VA_ARGS__);       \
  }

Other observations:

@BetsyMcPhail
Copy link
Contributor

I found that for both cabd2a and 12a8432, if 'DoPublish' is removed from AddDeprecatedProtectedAliases, the test behaves as expected. Per comments in the source, this function is set to be deprecated soon.

@EricCousineau-TRI
Copy link
Contributor Author

Lowering priority due to deprecation, but would still be nice to root-cause given reproduciblity at a given SHA.

@EricCousineau-TRI
Copy link
Contributor Author

@BetsyMcPhail
Copy link
Contributor

In the code snippet linked above, Pybind11 caches functions that aren't overloaded in Python to avoid costly Python dictionary lookups. The cache is a map from (type.ptr, function name) to function. The cache is NOT cleared during execution of the tests.

Debug output from test_deprecated_protected_aliases. Note that 'DoPublish' is added to the 'inactive_overload_cache' for 'OldSystem'

****************************** get_type_overload ***************************************
 *** self =  <custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem object at 0x7fe586057f48>
 *** self.ptr() = 0x7fe586057f48
 *** type =  <class 'custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem'>
 *** type.ptr() = 0x2746c58
 *** LOOKING FOR 0x2746c58 DoPublish
+++++++++++++++ CONTENTS OF CACHE ++++++++++++++++++++++
0x2719c38 _DoHasDirectFeedthrough
0x2719c38 DoHasDirectFeedthrough
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 *** overload is a c++ function, updating cache.
 
****************************** get_type_overload ***************************************
 *** self =  <custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem object at 0x7fe586057f48>
 *** self.ptr() = 0x7fe586057f48
 *** type =  <class 'custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem'>
 *** type.ptr() = 0x2746c58
 *** LOOKING FOR 0x2746c58 DoPublish
+++++++++++++++ CONTENTS OF CACHE ++++++++++++++++++++++
0x2746c58 DoPublish
0x2719c38 DoHasDirectFeedthrough
0x2719c38 _DoHasDirectFeedthrough
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 *** found key in non-overload cache.

Debug output from test_all_leaf_system_overrides. Note we're checking for 'DoPublish' with the same key as above - even though it's a different class!!

 ****************************** get_type_overload ***************************************
 *** self =  <custom_test.TestCustom.test_leaf_system_overrides.<locals>.TrivialSystem object at 0x7fe5860ae048>
 *** self.ptr() = 0x7fe5860ae048
 *** type =  <class 'custom_test.TestCustom.test_leaf_system_overrides.<locals>.TrivialSystem'>
 *** type.ptr() = 0x2746c58
 *** LOOKING FOR 0x2746c58 DoPublish
+++++++++++++++ CONTENTS OF CACHE ++++++++++++++++++++++
0x2746c58 DoPublish
0x2719c38 DoHasDirectFeedthrough
0x2719c38 _DoHasDirectFeedthrough
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 *** found key in non-overload cache.

@BetsyMcPhail
Copy link
Contributor

High-level overview and proposed solution

Pybind11 caches functions that aren't overloaded in Python to avoid costly Python dictionary lookups.

The non-overload cache is a set which uses (type.ptr, function name) as a key. Once an item is added to the cache, it is never removed.

The bug occurs when a type that has been added to the cache (e.g type = <class 'custom_test.TestCustom.test_deprecated_protected_aliases..OldSystem'>
type.ptr() = 0x2746c58) is destroyed and a new type is created with the exact same pointer (e.g. type = <class 'custom_test.TestCustom.test_leaf_system_overrides..TrivialSystem'>, type.ptr() = 0x2746c58) that should not be in the cache.

The first attempt at fixing the issue was to use self.ptr instead of type.ptr in the cache key. Testing demonstrated that this was not enough to solve the issue.

In the end, removing all associated cached items when an instance is deregistered seems to fix the issue.

@EricCousineau-TRI
Copy link
Contributor Author

EricCousineau-TRI commented Sep 4, 2019

Gotcha, nice! Another option would be to just extend the key (type.ptr, type_full_name, function name), where type_full_name is something like f"{cls.__module__}.{cls.__qualname__}".

@EricCousineau-TRI
Copy link
Contributor Author

FTR pybind PR (xref'd via GitHub, but doesn't show via ZenHub):
RobotLocomotion/pybind11#32

@EricCousineau-TRI
Copy link
Contributor Author

EricCousineau-TRI commented Sep 12, 2019

Ended up accidentally finding a secondary bug, so I've gone ahead and filed this upstream bug here: pybind/pybind11#1922

Will then file the secondary bug.

EDIT: Filed: pybind/pybind11#1923

@EricCousineau-TRI
Copy link
Contributor Author

EricCousineau-TRI commented Sep 26, 2019

Per #12105, seems like we have new a new issue with (potentially non-deterministic) segfaults, seemingly on Mac only?

Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1569507075015400?thread_ts=1569499007.013200&cid=C270MN28G

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment