revert followup: Figure out why `custom_test` fails intermittently due to #11416 #11424

EricCousineau-TRI · 2019-05-09T00:37:57Z

Relevant Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1557346698044300

I believe the repro recipe is using GCC, for #11416 @ cabd2ab:

bazel test -c dbg --runs_per_test=50 //bindings/pydrake/systems:py/custom_test

EDIT: Er, I guess, you have to run it on Xenial, most likely with GCC 5.4, per repro in #11421.

EDIT 2: From original Slack convo, failure was here:
https://drake-jenkins.csail.mit.edu/job/linux-xenial-gcc-bazel-experimental-debug/3638/consoleText

[2019-05-08T19:34:51.034Z] F.
[2019-05-08T19:34:51.034Z] ======================================================================
[2019-05-08T19:34:51.034Z] FAIL: test_leaf_system_overrides (custom_test.TestCustom)
[2019-05-08T19:34:51.034Z] ----------------------------------------------------------------------
[2019-05-08T19:34:51.034Z] Traceback (most recent call last):
[2019-05-08T19:34:51.034Z]   File "/media/ephemeral0/ubuntu/workspace/linux-xenial-gcc-bazel-experimental-debug/_bazel_ubuntu/74486129db018e3881280473f1d38471/sandbox/linux-sandbox/11992/execroot/drake/bazel-out/k8-dbg/bin/bindings/pydrake/systems/py/custom_test.runfiles/drake/bindings/pydrake/systems/test/custom_test.py", line 314, in test_leaf_system_overrides
[2019-05-08T19:34:51.034Z]     self.assertTrue(system.called_publish)
[2019-05-08T19:34:51.034Z] AssertionError: False is not true
[2019-05-08T19:34:51.034Z]

\cc @RussTedrake @jwnimmer-tri

EricCousineau-TRI · 2019-05-09T01:03:03Z

Huh, tried reproducing it in our Xenial Docker build recipe @ cabd2ab, but to no avail - at least per the above command...

@jwnimmer-tri Will see if I can pick your brain tomorrow.

EricCousineau-TRI · 2019-05-16T15:17:38Z

Per f2f with Jamie, this test failed again on CI on master.

If it helps at all, instructions for valgrind:
https://gist.github.com/EricCousineau-TRI/ce79d3265bb72934267e24ddc8c623bc

I build it, use virtualenv to point to that, install deps, and add --python_path and --action_env=DRAKE_PYTHON_BIN_PATH in user.bazelrc.

EricCousineau-TRI · 2019-05-23T14:56:05Z

Per f2f, @BetsyMcPhail can repro on a machine with Xenial installed directly. She could not repro by running custom_test multiple times (even 1000x), but she found that running test //... for the above recipe does cause custom_test to fail...

EricCousineau-TRI · 2019-06-06T14:51:17Z

Still very difficult to reproduce, but we should be able to check the results of running it under valgrind to snag the memory leak, or whatever it is.

BetsyMcPhail · 2019-06-13T14:10:02Z

I ran valgrind per Eric's instructions above - currently sorting through the output to determine which, if any, reported errors are real.

EricCousineau-TRI · 2019-06-13T14:10:38Z

Sweet! Can you post a Gist of the output?

EricCousineau-TRI · 2019-06-13T14:39:36Z

Per f2f, Betsy used the Python+NumPy suppressions; may try one more time to run with Drake suppressions before posting output.

BetsyMcPhail · 2019-06-14T16:20:38Z

valgrind output for custom_test: https://gist.github.com/BetsyMcPhail/c2e223f65b7fff11c44f04e0253e21cb

EricCousineau-TRI · 2019-06-14T16:33:52Z

Hm... That's quite a few errors, but they all look like just leaks.

Just to check, can you post the relevant reproduction recipe, including this stuff?

The Drake source SHA, with mods included (e.g. if including other suppressions)
Any mods to the Python+valgrind setup, if necessary
The full command line for your bazel run

BetsyMcPhail · 2019-06-20T14:01:27Z

Checkout cabd2a

Update tools/dynamic_analysis/bazel.rc so we can run memcheck on Python tests:

-build:memcheck --test_lang_filters=-sh,-py
+build:memcheck --test_lang_filters=-sh

Update tools/dynamic_analysis/valgrind.sh to include any extra suppression files (e.g. python-debug/cpython/Misc/valgrind-python.supp and customtest_valgrind.supp)

Follow Python+Valgrind setup, modifications:
- Need a 'make' / 'make install' after the 'configure' in the first step
- Skip GDB setup
bazel test --config=memcheck //bindings/pydrake/systems:py/custom_test

EricCousineau-TRI · 2019-06-20T14:56:53Z

Per f2f, trying to run all tests under DRD + Helgrind.

jwnimmer-tri · 2019-06-27T14:10:47Z

Just as my 2c: this doesn't smell like a memory or race error to me. It smells more like a multiple-inheritance / ODR / python imp kind of error. If I were debugging this, my next step would be to add more instrumentation to the test to figure out where and why the assertTrue(called_publish) is ending up false, and if anything else is broken, etc.

EricCousineau-TRI · 2019-06-27T14:21:49Z

FTR Most likely relates to the workaround in here: #11719
Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1561403942037200

Failure reproducibility is now almost 100%.

BetsyMcPhail · 2019-07-02T19:16:03Z

What we know so far:

The 'test_leaf_system_overrides' test is failing because the 'called_publish' flag isn't set. The 'called_publish' flag should be set in the 'TrivialSystem::DoPublish' function that is implemented as part of the test case.

To reproduce on Bionic:

Check out 12a8432
CC=clang CXX=clang++ bazel test -c dbg --config=python3 //bindings/pydrake/systems:py/custom_test Fail!
In custom_test.py, rename 'test_leaf_system_overrides' to 'test_all_leaf_system_overrides'
CC=clang CXX=clang++ bazel test -c dbg --config=python3 //bindings/pydrake/systems:py/custom_test Passes!

Following the code through, it's possible to see that in the PYDRAKE_TRY_PROTECTED_OVERLOAD macro, the first call to PYBIND11_OVERLOAD_INT macro doesn't return as expected.

#define PYDRAKE_TRY_PROTECTED_OVERLOAD(RETURN, CLASS, NAME, ...)       \
  PYBIND11_OVERLOAD_INT(RETURN, CLASS, NAME, __VA_ARGS__);             \
  if (py::get_overload<CLASS>(this, "_" NAME)) {                       \
    WarnDeprecated(DeprecatedProtectedAliasMessage(NAME, "override")); \
    PYBIND11_OVERLOAD_INT(RETURN, CLASS, "_" NAME, __VA_ARGS__);       \
  }
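
To make the lookup order in the macro easier to follow, here is a rough Python model of it. This is only a sketch: `LeafSystemBase`, `_find_override`, and `dispatch` are hypothetical names invented for illustration, not pydrake or pybind11 APIs, and the real lookup happens in C++ through `py::get_overload`.

```python
import warnings


class LeafSystemBase:
    """Hypothetical stand-in for the C++ trampoline class."""

    def __init__(self):
        self.log = []

    def _find_override(self, name):
        # An attribute counts as an override only if a subclass defined it.
        attr = getattr(type(self), name, None)
        base_attr = getattr(LeafSystemBase, name, None)
        return attr if attr is not None and attr is not base_attr else None

    def dispatch(self, name):
        # Step 1: look for a Python override under the public name.
        override = self._find_override(name)
        if override is not None:
            return override(self)
        # Step 2: fall back to the deprecated "_"-prefixed alias and warn.
        override = self._find_override("_" + name)
        if override is not None:
            warnings.warn(
                f"_{name} is deprecated; override {name} instead",
                DeprecationWarning,
            )
            return override(self)
        # Step 3: no Python override at all; run the base implementation.
        self.log.append(f"base {name}")
        return None


class NewStyle(LeafSystemBase):
    def DoPublish(self):
        self.log.append("python DoPublish")


class OldStyle(LeafSystemBase):
    def _DoPublish(self):
        self.log.append("python _DoPublish")
```

In this model the failure described above corresponds to step 1 wrongly reporting "no override" for a class like `NewStyle`, so the call falls through to step 3 and the Python `DoPublish` never runs.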

Other observations:

If TrivialSystem::DoPublish is renamed to TrivialSystem::_DoPublish (note leading underscore), the test passes.
~~If PYDRAKE_TRY_PROTECTED_OVERLOAD is replaced with PYBIND11_OVERLOAD_INT in PyLeafSystemBase::DoPublish, the test passes~~
Possibly relates to Deprecate and replace DoPublish() and other event dispatchers #10445?
Can't reproduce on current master - was "fixed" after Remove deprecated methods (2019-07) #11757 merged

BetsyMcPhail · 2019-07-18T14:07:36Z

I found that for both cabd2a and 12a8432, if 'DoPublish' is removed from AddDeprecatedProtectedAliases, the test behaves as expected. Per comments in the source, this function is set to be deprecated soon.

EricCousineau-TRI · 2019-08-01T14:52:38Z

Lowering priority due to deprecation, but would still be nice to root-cause given reproducibility at a given SHA.

EricCousineau-TRI · 2019-08-08T14:38:07Z

Per f2f with Betsy and Bill, may be due to caching of overload lookup:
https://github.com/RobotLocomotion/pybind11/blob/4b8e231e7c209e5483b7ca2407a8212a30507277/include/pybind11/pybind11.h#L2399-L2454

BetsyMcPhail · 2019-08-13T15:01:02Z

In the code snippet linked above, Pybind11 caches functions that aren't overloaded in Python to avoid costly Python dictionary lookups. The cache is a map from (type.ptr, function name) to function. The cache is NOT cleared during execution of the tests.

Debug output from test_deprecated_protected_aliases. Note that 'DoPublish' is added to the 'inactive_overload_cache' for 'OldSystem'

****************************** get_type_overload ***************************************
 *** self =  <custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem object at 0x7fe586057f48>
 *** self.ptr() = 0x7fe586057f48
 *** type =  <class 'custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem'>
 *** type.ptr() = 0x2746c58
 *** LOOKING FOR 0x2746c58 DoPublish
+++++++++++++++ CONTENTS OF CACHE ++++++++++++++++++++++
0x2719c38 _DoHasDirectFeedthrough
0x2719c38 DoHasDirectFeedthrough
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 *** overload is a c++ function, updating cache.
 
****************************** get_type_overload ***************************************
 *** self =  <custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem object at 0x7fe586057f48>
 *** self.ptr() = 0x7fe586057f48
 *** type =  <class 'custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem'>
 *** type.ptr() = 0x2746c58
 *** LOOKING FOR 0x2746c58 DoPublish
+++++++++++++++ CONTENTS OF CACHE ++++++++++++++++++++++
0x2746c58 DoPublish
0x2719c38 DoHasDirectFeedthrough
0x2719c38 _DoHasDirectFeedthrough
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 *** found key in non-overload cache.

Debug output from test_leaf_system_overrides. Note we're checking for 'DoPublish' with the same key as above - even though it's a different class!!

 ****************************** get_type_overload ***************************************
 *** self =  <custom_test.TestCustom.test_leaf_system_overrides.<locals>.TrivialSystem object at 0x7fe5860ae048>
 *** self.ptr() = 0x7fe5860ae048
 *** type =  <class 'custom_test.TestCustom.test_leaf_system_overrides.<locals>.TrivialSystem'>
 *** type.ptr() = 0x2746c58
 *** LOOKING FOR 0x2746c58 DoPublish
+++++++++++++++ CONTENTS OF CACHE ++++++++++++++++++++++
0x2746c58 DoPublish
0x2719c38 DoHasDirectFeedthrough
0x2719c38 _DoHasDirectFeedthrough
+++++++++++++++++++++++++++++++++++++++++++++++++++++++
 *** found key in non-overload cache.

BetsyMcPhail · 2019-09-04T16:58:20Z

High-level overview and proposed solution

Pybind11 caches functions that aren't overloaded in Python to avoid costly Python dictionary lookups.

The non-overload cache is a set which uses (type.ptr, function name) as a key. Once an item is added to the cache, it is never removed.

The bug occurs when a type that has been added to the cache (e.g. type = <class 'custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem'>, type.ptr() = 0x2746c58) is destroyed and a new type is created with the exact same pointer (e.g. type = <class 'custom_test.TestCustom.test_leaf_system_overrides.<locals>.TrivialSystem'>, type.ptr() = 0x2746c58) that should not be in the cache.

The first attempt at fixing the issue was to use self.ptr instead of type.ptr in the cache key. Testing demonstrated that this was not enough to solve the issue.

In the end, removing all associated cached items when an instance is deregistered seems to fix the issue.
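
A small Python model of the failure and the fix, for anyone reading along. It is a sketch under assumptions, not the pybind11 code: `InactiveOverloadCache`, `has_python_override`, `deregister`, and `run_scenario` are hypothetical names, the "pointer" is just an integer taken from the debug output, OldSystem is assumed to define only the deprecated `_DoPublish` alias, and deregistration is keyed on the type pointer here for simplicity.

```python
class InactiveOverloadCache:
    """Toy model of a cache of (type pointer, name) pairs with no override."""

    def __init__(self, clear_on_deregister):
        self._inactive = set()
        self._clear_on_deregister = clear_on_deregister

    def has_python_override(self, type_ptr, overrides, name):
        key = (type_ptr, name)
        if key in self._inactive:
            # Cache hit: skip the lookup and report "no override".
            return False
        if name in overrides:
            return True
        # Remember that this (pointer, name) pair has no Python override.
        self._inactive.add(key)
        return False

    def deregister(self, type_ptr):
        # The fix: drop every entry tied to the pointer being released.
        if self._clear_on_deregister:
            self._inactive = {k for k in self._inactive if k[0] != type_ptr}


def run_scenario(clear_on_deregister):
    cache = InactiveOverloadCache(clear_on_deregister)
    ptr = 0x2746C58  # same address for both types, as in the debug output

    # OldSystem does not define DoPublish, so the pair is cached as inactive.
    cache.has_python_override(ptr, {"_DoPublish"}, "DoPublish")
    cache.deregister(ptr)  # OldSystem goes away

    # TrivialSystem lands on the same address and does define DoPublish.
    return cache.has_python_override(ptr, {"DoPublish"}, "DoPublish")
```

Without clearing, `run_scenario(False)` reports no override for TrivialSystem (the false cache hit); with clearing, `run_scenario(True)` finds the override.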

EricCousineau-TRI · 2019-09-04T17:01:44Z

Gotcha, nice! Another option would be to just extend the key (type.ptr, type_full_name, function name), where type_full_name is something like f"{cls.__module__}.{cls.__qualname__}".
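
A quick sketch of what that extended key could look like in Python. The helper names (`make_old_system`, `make_trivial_system`, `type_full_name`, `cache_key`) are made up for illustration and the pointer is a plain integer; the actual cache lives in pybind11's C++ code.

```python
def make_old_system():
    # Local class, like the ones defined inside the test methods.
    class OldSystem:
        pass

    return OldSystem


def make_trivial_system():
    class TrivialSystem:
        pass

    return TrivialSystem


def type_full_name(cls):
    # Module plus qualified name, e.g. "...make_old_system.<locals>.OldSystem".
    return f"{cls.__module__}.{cls.__qualname__}"


def cache_key(type_ptr, cls, function_name):
    # Extended key: the name disambiguates types that reuse an address.
    return (type_ptr, type_full_name(cls), function_name)
```

One limit of this approach: two distinct classes with the same module and qualified name (for example, from calling the same factory function twice) would still produce identical keys if they also reuse an address.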

EricCousineau-TRI · 2019-09-12T13:46:36Z

FTR pybind PR (xref'd via GitHub, but doesn't show via ZenHub):
RobotLocomotion/pybind11#32

EricCousineau-TRI · 2019-09-12T21:25:59Z

Ended up accidentally finding a secondary bug, so I've gone ahead and filed this upstream bug here: pybind/pybind11#1922

Will then file the secondary bug.

EDIT: Filed: pybind/pybind11#1923

EricCousineau-TRI · 2019-09-26T14:22:56Z

Per #12105, seems like we have a new issue with (potentially non-deterministic) segfaults, seemingly on Mac only?

Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1569507075015400?thread_ts=1569499007.013200&cid=C270MN28G

EricCousineau-TRI added priority: high configuration: python labels May 9, 2019

EricCousineau-TRI self-assigned this May 9, 2019

jwnimmer-tri mentioned this issue May 9, 2019

Re-apply "add missing bindings for State<T> abstract values" #11427

Merged

EricCousineau-TRI assigned jamiesnape May 9, 2019

sherm1 added the unused team: kitware label May 11, 2019

EricCousineau-TRI assigned BetsyMcPhail and unassigned jamiesnape and EricCousineau-TRI May 16, 2019

This was referenced Jun 24, 2019

primitives: Remove templates from RandomSource APIs #11670

Merged

Do not exclude all sh tests when running under Valgrind Memcheck #11716

Merged

EricCousineau-TRI added priority: medium and removed priority: high labels Aug 1, 2019

BetsyMcPhail mentioned this issue Sep 4, 2019

Remove deregistered objects from the inactive overload cache RobotLocomotion/pybind11#32

Merged

EricCousineau-TRI mentioned this issue Sep 12, 2019

get_type_overload may encounter false cache hits if derived instances are GC'd pybind/pybind11#1922

Closed

EricCousineau-TRI mentioned this issue Sep 25, 2019

pybind11: Use workaround for false cache hits for overloads #12095

Merged

2 tasks

EricCousineau-TRI closed this as completed in #12095 Sep 25, 2019

EricCousineau-TRI reopened this Sep 26, 2019

EricCousineau-TRI mentioned this issue Sep 26, 2019

Revert "pybind11: Use workaround for false cache hits for overloads" #12105

Merged

This was referenced Jan 15, 2020

Inactive overload cache key RobotLocomotion/pybind11#37

Merged

pybind11: Prevent false cache hits for overloads #12589

Merged

ggould-tri closed this as completed in #12589 Jan 23, 2020

BetsyMcPhail mentioned this issue Feb 19, 2021

geometry_test:test_unimplemented_rendering fails sporadically on Mac #14686

Closed
