revert followup: Figure out why custom_test fails intermittently due to #11416 #11424
Comments
Huh, tried reproducing it in our Xenial Docker build recipe @ cabd2ab, but to no avail - at least per the above command... @jwnimmer-tri Will see if I can pick your brain tomorrow. |
Per f2f with Jamie, this test failed again on CI on If it helps at all, instructions for I build it, use |
Per f2f, @BetsyMcPhail can repro on a machine with Xenial installed directly. She could not repro for running |
Still very difficult to reproduce, but we should be able to check the results of running it under |
I ran valgrind per Eric's instructions above - currently sorting through the output to determine which, if any, reported errors are real. |
Sweet! Can you post a Gist of the output? |
Per f2f, Betsy used the Python+NumPy suppressions; may try one more time to run with Drake suppressions before posting output. |
valgrind output for custom_test: https://gist.github.com/BetsyMcPhail/c2e223f65b7fff11c44f04e0253e21cb |
Hm... That's quite a few errors, but they all look like just leaks. Just to check, can you post the relevant reproduction recipe, including this stuff?
|
|
Per f2f, trying to run all tests under DRD + Helgrind. |
Just as my 2c: this doesn't smell like a memory or race error to me. It smells more like a multiple-inheritance / ODR / python |
FTR Most likely relates to the workaround in here: #11719 Failure reproducibility is now almost 100%. |
What we know so far: The 'test_all_leaf_system_overrides' test is failing because the 'called_publish' flag isn't set. The 'called_publish' flag should be set in the 'TrivialSystem::DoPublish' function that is implemented as part of the test case. To reproduce on Bionic:
Following the code through, it's possible to see that in the PYDRAKE_TRY_PROTECTED_OVERLOAD macro, the first call to
Other observations:
|
I found that for both |
Lowering priority due to deprecation, but would still be nice to root-cause given reproducibility at a given SHA. |
Per f2f with Betsy and Bill, may be due to caching of overload lookup: |
In the code snippet linked above, Pybind11 caches functions that aren't overloaded in Python to avoid costly Python dictionary lookups. The cache is a map from (type.ptr, function name) to function. The cache is NOT cleared during execution of the tests. Debug output from test_deprecated_protected_aliases. Note that 'DoPublish' is added to the 'inactive_overload_cache' for 'OldSystem'
Debug output from test_all_leaf_system_overrides. Note we're checking for 'DoPublish' with the same key as above - even though it's a different class!!
|
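For context on how two different classes can end up with the same key: a class defined inside a test method becomes garbage once the method returns, and the interpreter is then free to hand its address to a later object. Below is a minimal sketch in plain CPython (no pybind11 involved; `OldSystem` is only a stand-in name, and the helper functions are hypothetical) showing that such a local class really is freed. Whether the address is actually reused afterwards depends on the allocator, so the sketch does not assert that part.

```python
import gc
import weakref


def define_local_class():
    """Mimic a test method that defines a class in its own body."""

    class OldSystem:  # stand-in only; not the real pydrake class
        pass

    # Return a weak reference so this function does not keep the class alive.
    return weakref.ref(OldSystem)


def local_class_is_freed():
    """Return True if the locally defined class is gone after a GC pass."""
    ref = define_local_class()
    # Classes participate in reference cycles, so force a collection.
    gc.collect()
    return ref() is None
```

Once the class object is gone, any cache entry that still holds its old address as part of a key refers to nothing in particular, and a later class allocated at that address would match it.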
High-level overview and proposed solution
Pybind11 caches functions that aren't overloaded in Python to avoid costly Python dictionary lookups. The non-overload cache is a set which uses (type.ptr, function name) as a key. Once an item is added to the cache, it is never removed.
The bug occurs when a type that has been added to the cache (e.g. type = <class 'custom_test.TestCustom.test_deprecated_protected_aliases.<locals>.OldSystem'>
The first attempt at fixing the issue was to use self.ptr instead of type.ptr in the cache key. Testing demonstrated that this was not enough to solve the issue. In the end, removing all associated cached items when an instance is deregistered seems to fix the issue. |
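A toy model of the behavior described above, for anyone skimming the thread. This is a sketch under assumptions, not pybind11's actual code: `FakeType`, `InactiveOverloadCache`, `publish_override_found`, and the address `0x1000` are all hypothetical, and the model assumes the second class lands at the same address as the first.

```python
class FakeType:
    """Stand-in for a Python type object; `address` plays the role of type.ptr."""

    def __init__(self, name, address, overrides=()):
        self.name = name
        self.address = address
        self.overrides = frozenset(overrides)


class InactiveOverloadCache:
    """Toy model of a 'no Python override here' cache keyed on (address, name)."""

    def __init__(self, purge_on_deregister):
        self.purge_on_deregister = purge_on_deregister
        self.inactive = set()  # keys are (address, function name)

    def has_python_override(self, type_, func_name):
        key = (type_.address, func_name)
        if key in self.inactive:
            return False  # cached negative result; no real lookup happens
        if func_name in type_.overrides:
            return True
        self.inactive.add(key)
        return False

    def deregister(self, type_):
        # Fix modeled here: drop every cached key tied to this address.
        if self.purge_on_deregister:
            self.inactive = {k for k in self.inactive if k[0] != type_.address}


def publish_override_found(purge_on_deregister):
    cache = InactiveOverloadCache(purge_on_deregister)
    shared_address = 0x1000  # hypothetical address reused by both types

    old = FakeType("OldSystem", shared_address)  # does not override DoPublish
    cache.has_python_override(old, "DoPublish")  # caches the negative result
    cache.deregister(old)

    new = FakeType("TrivialSystem", shared_address, overrides=["DoPublish"])
    return cache.has_python_override(new, "DoPublish")
```

With `purge_on_deregister=False` the second lookup hits the stale entry and reports no override, which corresponds to 'called_publish' never being set. With `purge_on_deregister=True` the stale entry is gone and the override is found.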
Gotcha, nice! Another option would be to just extend the key |
FTR pybind PR (xref'd via GitHub, but doesn't show via ZenHub): |
Ended up accidentally finding a secondary bug, so I've gone ahead and filed this upstream bug here: pybind/pybind11#1922 Will then file the secondary bug. EDIT: Filed: pybind/pybind11#1923 |
Per #12105, seems like we have a new issue with (potentially non-deterministic) segfaults, seemingly on Mac only? Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1569507075015400?thread_ts=1569499007.013200&cid=C270MN28G |
Relevant Slack convo: https://drakedevelopers.slack.com/archives/C270MN28G/p1557346698044300
I believe the repro recipe is using GCC, for #11416 @ cabd2ab:
EDIT: Er, I guess, you have to run it on Xenial, most likely with GCC 5.4, per repro in #11421.
EDIT 2: From original Slack convo, failure was here:
https://drake-jenkins.csail.mit.edu/job/linux-xenial-gcc-bazel-experimental-debug/3638/consoleText
\cc @RussTedrake @jwnimmer-tri