
Segfault due to pydrake Simulator.reset_context #12811

Open
mfeng9 opened this issue Mar 4, 2020 · 21 comments
Labels: component: pydrake (Python API and its supporting Starlark macros), priority: backlog, type: bug

mfeng9 commented Mar 4, 2020

I am having trouble resetting the context of my simulator object to a different one (potentially from another time step).

Following the pydrake documentation, it seems the right way is the following:

# clone the context at a particular time step
cloned_context = simulator.get_mutable_context().Clone()
# reset to this context later
simulator.reset_context(cloned_context)

When I then continue to run the simulation, I receive a segmentation fault with no Python exception.

My platform:
- latest pydrake binary for Mac (downloaded from drake.mit.edu, not a local build)
- macOS Mojave 10.14.6

My particular problem instance deals with the ManipulationStation class, with a set of custom manipulands and initial conditions, all loaded via the AddManipulandFromFile function.

jwnimmer-tri commented:

Apologies for the delay in responding. The on-call issue triage last week fell through.

It is definitely a bug when pydrake segfaults, even if the APIs were used incorrectly.

However, I am unable to reproduce your problem. I tried using https://github.com/RobotLocomotion/drake/releases/tag/v0.15.0 on Ubuntu 18.04 with the following cobbled-together code:

import numpy as np

from pydrake.common import FindResourceOrThrow
from pydrake.examples.manipulation_station import ManipulationStation
from pydrake.math import RigidTransform
from pydrake.multibody.plant import MultibodyPlant
from pydrake.multibody.parsing import Parser
from pydrake.systems.analysis import Simulator

station = ManipulationStation()
parser = Parser(station.get_mutable_multibody_plant(),
                station.get_mutable_scene_graph())
plant = station.get_mutable_multibody_plant()

# Add models for iiwa and wsg
iiwa_model_file = FindResourceOrThrow(
    "drake/manipulation/models/iiwa_description/iiwa7/"
    "iiwa7_no_collision.sdf")
iiwa = parser.AddModelFromFile(iiwa_model_file, "iiwa")
X_WI = RigidTransform.Identity()
plant.WeldFrames(plant.world_frame(),
                 plant.GetFrameByName("iiwa_link_0", iiwa),
                 X_WI)
wsg_model_file = FindResourceOrThrow(
    "drake/manipulation/models/wsg_50_description/sdf/"
    "schunk_wsg_50.sdf")
wsg = parser.AddModelFromFile(wsg_model_file, "gripper")
X_7G = RigidTransform.Identity()
plant.WeldFrames(
    plant.GetFrameByName("iiwa_link_7", iiwa),
    plant.GetFrameByName("body", wsg),
    X_7G)

# Register models for the controller.
station.RegisterIiwaControllerModel(
    iiwa_model_file, iiwa, plant.world_frame(),
    plant.GetFrameByName("iiwa_link_0", iiwa), X_WI)
station.RegisterWsgControllerModel(
    wsg_model_file, wsg,
    plant.GetFrameByName("iiwa_link_7", iiwa),
    plant.GetFrameByName("body", wsg), X_7G)

# Finalize
station.Finalize()
assert station.num_iiwa_joints() == 7

# Simulate.
zero = [0.0] * 7
simulator = Simulator(station)
context = simulator.get_mutable_context()
station.GetInputPort("iiwa_position").FixValue(context, zero)
station.GetInputPort("iiwa_feedforward_torque").FixValue(context, zero)
station.GetInputPort("wsg_position").FixValue(context, 0.);
station.GetInputPort("wsg_force_limit").FixValue(context, 40.);

simulator.AdvanceTo(0.1)
cloned_context = simulator.get_mutable_context().Clone()
simulator.AdvanceTo(0.2)
simulator.reset_context(cloned_context)
jwnimmer@cons:~/tmp$ PYTHONPATH=$HOME/Downloads/drake-20200212-bionic/drake/lib/python3.6/site-packages python3 test.py; echo $?
[2020-03-08 20:19:51.615] [console] [warning] Currently MultibodyPlant does not handle joint limits for continuous models. However some joints do specify limits. Consider setting a non-zero time step in the MultibodyPlant constructor; this will put MultibodyPlant in discrete-time mode, which does support joint limits.
[2020-03-08 20:19:51.615] [console] [warning] Joints that specify limits are: `left_finger_sliding_joint`, `right_finger_sliding_joint`.
[2020-03-08 20:19:51.618] [console] [warning] Currently MultibodyPlant does not handle joint limits for continuous models. However some joints do specify limits. Consider setting a non-zero time step in the MultibodyPlant constructor; this will put MultibodyPlant in discrete-time mode, which does support joint limits.
[2020-03-08 20:19:51.618] [console] [warning] Joints that specify limits are: `iiwa_joint_1`, `iiwa_joint_2`, `iiwa_joint_3`, `iiwa_joint_4`, `iiwa_joint_5`, `iiwa_joint_6`, `iiwa_joint_7`.
0

Are you able to post steps that I can run to reproduce the problem?

See https://www.chiark.greenend.org.uk/~sgtatham/bugs.html for good tips on reporting an actionable bug.


mfeng9 commented Mar 11, 2020

Thanks for the timely reply! Sorry for the confusion.
The segmentation fault only appears when the arm is commanded to move; simply calling simulator.AdvanceTo does not trigger it.
I have created a repo for you to replicate the problem:
https://github.com/mfeng9/drake-manip-station.git

As you can see in iiwa_drake.py, AdvanceTo alone runs fine, but moving the arm causes a segmentation fault.


mfeng9 commented Mar 11, 2020

The code extends the examples from the MIT 6.881 Intelligent Robotic Manipulation class. The core piece uses IK to plan a motion and move the arm. I wrote a few wrappers to make the code a little cleaner, but that is about the only change I added.


jwnimmer-tri commented Mar 11, 2020

Thanks!

The from deepPHA.drakesim.manip_station_sim.robot_plans import * line fails for me, because I don't have any deepPHA code on my path. If I comment it out, then I get NameError: name 'JointSpacePlanRelative' is not defined instead. Looks like I need a bit more code, still?


mfeng9 commented Mar 11, 2020

Sorry for the error; I have pushed a fix. Alternatively, you can simply remove

deepPHA.drakesim.manip_station_sim.

before robot_plans, since it is just the relative path from the root of the package.


jwnimmer-tri commented Mar 12, 2020

OK, the example needs a newer scipy than what's on Ubuntu 18.04, but with some fiddling I was able to obtain an error message:

$ python3 -m venv env
$ source env/bin/activate
$ pip install scipy matplotlib umsgpack zmq ipython pyyaml
$ env PYTHONPATH=/home/jwnimmer/Downloads/drake-20200212-bionic/drake/lib/python3.6/site-packages python3 iiwa_drake.py
...
[2020-03-11 20:43:46.148] [console] [warning] Joints that specify limits are: `iiwa_joint_1`, `iiwa_joint_2`, `iiwa_joint_3`, `iiwa_joint_4`, `iiwa_joint_5`, `iiwa_joint_6`, `iiwa_joint_7`.
works fine before resetting the context
now reset the context
works fine for simple AdvanceTo with no commands
segmentation fault if arm is commanded to move again
Failure at systems/framework/cache.cc:12 in GetPathDescription(): condition 'owning_subcontext_!= nullptr' failed.

That's with the 0.15.0 release.

With the 0.16.0 release, I do see a segfault:

$ env PYTHONPATH=/home/jwnimmer/Downloads/drake-20200311-bionic/drake/lib/python3.6/site-packages python3 iiwa_drake.py
...
[2020-03-11 20:49:09.926] [console] [warning] Joints that specify limits are: `iiwa_joint_1`, `iiwa_joint_2`, `iiwa_joint_3`, `iiwa_joint_4`, `iiwa_joint_5`, `iiwa_joint_6`, `iiwa_joint_7`.
works fine before resetting the context
now reset the context
works fine for simple AdvanceTo with no commands
segmentation fault if arm is commanded to move again
Segmentation fault (core dumped)

I'll see what I can learn about the problem.

jwnimmer-tri commented:

Using the instructions at https://drake.mit.edu/python_bindings.html#debugging-with-the-python-bindings and adding in a trace, we get:

...
iiwa_drake.py(1147):   print('segmentation fault if arm is commanded to move again')
segmentation fault if arm is commanded to move again
iiwa_drake.py(1148):   sim.incremental_move_in_world(direction='y')
 --- modulename: iiwa_drake, funcname: incremental_move_in_world
iiwa_drake.py(921):     if event_name is not None:
iiwa_drake.py(922):       self.append_event(event_name, has_data = sensor_dt is not None)
 --- modulename: iiwa_drake, funcname: append_event
iiwa_drake.py(528):     self.event_history['action'].append(name)
iiwa_drake.py(529):     self.event_history['time'].append(
iiwa_drake.py(530):       self.station_context.get_time()
iiwa_drake.py(532):     self.event_history['has_data'].append(has_data)
iiwa_drake.py(923):     T_goal = self.get_arm_pose()
 --- modulename: iiwa_drake, funcname: get_arm_pose
iiwa_drake.py(700):     return copy.deepcopy(self.get_relative_frame_transform(self.world_frame, self.l7_frame).matrix())
 --- modulename: iiwa_drake, funcname: get_relative_frame_transform
iiwa_drake.py(615):     return self.plant.CalcRelativeTransform(
iiwa_drake.py(616):             self.plant_context,
iiwa_drake.py(617):             frame_A=frame_A,
iiwa_drake.py(618):             frame_B=frame_B)
[segmentation fault]

That points us in a good direction, I think.

I suspect that part of the problem is that in init_simulator we dig out objects from the simulator's context and save references to them in member fields:

    simulator = Simulator(self.diagram)
    self.simulator = simulator
    context = self.diagram.GetMutableSubsystemContext(
      self.station, simulator.get_mutable_context())
    self.station_context = context
    self.plant_context = self.diagram.GetMutableSubsystemContext(
      self.plant, self.simulator.get_mutable_context())

However, later when we call reset_context() to change which context the simulator is using, we do not update those references. So methods like append_event, which accesses self.station_context.get_time(), or get_relative_frame_transform, which calls self.plant.CalcRelativeTransform(self.plant_context, ...), are still referring to the non-reset values.

The important point is that reset_context() points the simulator at a new Context object -- like a Python assignment operation.

Here's one suggestion for a fix. The method Context.SetTimeStateAndParametersFrom is able to copy data from one Context into another. So instead of reset_context(), you could call simulator.get_mutable_context().SetTimeStateAndParametersFrom(sim.simulator_init_context) instead. In that case, references like self.plant_context remain valid.
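As a minimal sketch of that idea (assuming saved_context is a Clone() taken earlier, as in the original report):

# Take a snapshot of the simulator's current context.
saved_context = simulator.get_mutable_context().Clone()

# ... advance time, command the arm, etc. ...
simulator.AdvanceTo(0.2)

# Restore the snapshot by copying its time, state, and parameters back into
# the Context the simulator already owns, rather than swapping in a new one.
simulator.get_mutable_context().SetTimeStateAndParametersFrom(saved_context)

Because the simulator keeps the same Context object, saved references such as self.station_context and self.plant_context still point into it.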

Alternatively, you could make sure that all references to sub-contexts always point into the simulator's current context. So every call to GetMutableSubsystemContext whose result was saved needs to be re-done after any reset_context() call.
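A rough sketch of that alternative, reusing the member names from init_simulator above:

# After reset_context(), the simulator owns a brand-new Context, so every
# saved sub-context reference must be looked up again from the new root.
self.simulator.reset_context(cloned_context)
root_context = self.simulator.get_mutable_context()
self.station_context = self.diagram.GetMutableSubsystemContext(
    self.station, root_context)
self.plant_context = self.diagram.GetMutableSubsystemContext(
    self.plant, root_context)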

There is still a bug in pydrake somewhere, though -- accessing the original context's subcontext values in get_relative_frame_transform should result in seeing the original values, not a segfault. I will see if I can track that down.

jwnimmer-tri commented:

Oh, of course. A call to reset_context will delete the prior Context in C++ immediately, without respecting Python's garbage collector. So even though the prior Context still has live references in Python, the C++ object they refer to has been deleted.

We even have a scary unit test about this:

context_default = simulator.get_mutable_context()
# WARNING: Once we call `simulator.reset_context()`, it will delete the
# context it currently owns, which is `context_default` in this case.
# BE CAREFUL IN SITUATIONS LIKE THIS!
# TODO(eric.cousineau): Bind `release_context()`, or migrate context
# usage to use `shared_ptr`.
context = system.CreateDefaultContext()
simulator.reset_context(context)
self.assertIs(context, simulator.get_mutable_context())

Given that, I don't even know why we have a binding for reset_context in pydrake. It seems like a footgun that serves no purpose. Sorry!

@EricCousineau-TRI what do you think about removing that binding? It seems like methods accepting a unique_ptr should generally not be bound using ownership-transfer semantics, since it is super easy to violate the reference counting.

jwnimmer-tri changed the title from "Segmentation Fault When Resetting to a Cloned Simulator Context" to "Segmentation fault due to pydrake Simulator.reset_context" on Mar 12, 2020
jwnimmer-tri changed the title from "Segmentation fault due to pydrake Simulator.reset_context" to "Segfault due to pydrake Simulator.reset_context" on Mar 12, 2020

jwnimmer-tri commented Mar 12, 2020

Oh, we added reset_context on purpose, recently in #12368? The motivation cited in slack was:

... what is the best way to reset the simulator state from python (without creating a new simulator object for each run)? ...

Sounds familiar!

At minimum, if we keep this method around, we need to amend its pydrake doc to explain why it is scary and how to use it safely. (Or really, change the binding to use shared_ptr, not unique_ptr, if we must have it.)

But if simulator.get_mutable_context().SetTimeStateAndParametersFrom(other_context) succeeds, then it seems like we should just advocate for that instead and remove this one from pydrake.


jwnimmer-tri commented Mar 16, 2020

WIP of supporting shared_ptr at https://github.com/jwnimmer-tri/drake/commits/simulator-context-shared_ptr, but shared pointers are apparently more finicky than I realized. I'll have to revisit a little bit later.

A good next step might be to add a unit test for SetTimeStateAndParametersFrom to see if it succeeds.
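For reference, here is a self-contained sketch of what such a check could look like from pydrake, using a trivial Integrator diagram rather than the ManipulationStation (this is only an illustration, not the eventual Drake test):

import numpy as np

from pydrake.systems.analysis import Simulator
from pydrake.systems.framework import DiagramBuilder
from pydrake.systems.primitives import ConstantVectorSource, Integrator

# Build a one-state diagram: a constant input feeding an integrator.
builder = DiagramBuilder()
source = builder.AddSystem(ConstantVectorSource(np.array([1.0])))
integrator = builder.AddSystem(Integrator(1))
builder.Connect(source.get_output_port(0), integrator.get_input_port(0))
diagram = builder.Build()

simulator = Simulator(diagram)
simulator.AdvanceTo(0.5)
# Snapshot the context, keep simulating, then restore the snapshot in place.
snapshot = simulator.get_context().Clone()
simulator.AdvanceTo(1.0)
simulator.get_mutable_context().SetTimeStateAndParametersFrom(snapshot)
# The simulator still owns the same Context object, now holding the old data.
assert simulator.get_context().get_time() == 0.5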

EricCousineau-TRI commented:

Just to check, how are the shared_ptrs finicky? Does it come from the fact that they're greedy (i.e. you can convert unique_ptr -> shared_ptr, but not the other way around)?

jwnimmer-tri commented:

If I run the pydrake unit tests in my branch, I see errors that I'm guessing are because I haven't articulated https://pybind11.readthedocs.io/en/stable/advanced/smart_ptrs.html#std-shared-ptr quite correctly, or because I'm possibly missing a py::reference here or there on other APIs. I haven't dug in yet.

EricCousineau-TRI commented:

Ah, yeah, if those are runtime errors, then yes, it is most likely because you need to change the py cls declaration to DefineTemplateClassWithDefault<Context<T>, shared_ptr<Context<T>>>(...) here:
https://github.com/jwnimmer-tri/drake/blob/75d7e8423cd7e71c65e7742e375f5ad2608154f6/bindings/pydrake/systems/framework_py_semantics.cc#L145
... which will introduce issues later on for any method that takes unique_ptr<Context<T>>, due to the greedy API of shared_ptr.

EricCousineau-TRI commented:

Basically, if any class wants to be used via shared_ptr<> in the API, it should be used everywhere in the API where possible (e.g. like the MathematicalProgram bindings); unique_ptr<> could only be used in raw factory functions that only ever live in C++.

jwnimmer-tri commented:

Yup, trying to get shared_ptr working for Context in pydrake is too difficult.

My new opinion is that we should keep the unique_ptr in the Simulator API for use by C++, but we should also overload on a raw pointer and bind only that overload (not the unique_ptr one) for pydrake -- with appropriate keep_alive, of course. That matches how we pass the System to the simulator. In C++, where the typical lifetime is on the stack or in a unique_ptr, we continue to allow for that. In Python, where we have GC, we'll bind using raw pointers with GC annotations.

EricCousineau-TRI commented:

Do we want to take the time to revisit whether shared_ptr can be used in the Drake API without too much runtime penalty? Otherwise, we will continue to run into these sorts of issues.

(Granted, the overhaul to change the API is probably super high cost at the moment, and less than episodically doing the raw-pointer setup.)


jwnimmer-tri commented Mar 19, 2020

I don't see how runtime penalty plays any part in these trade-offs. The rationale for the current C++ APIs is not that.

I think the way to stomp out these issues globally is to disallow binding unique_ptr C++ parameters as pydrake functions. There's not many of them right now, and you're correct that every one of them is wrong.

EricCousineau-TRI commented:

There's not many of them right now, and you're correct that every one of them is wrong.

I don't think every one is wrong. It's only wrong when you're able to delete the object in C++ explicitly (e.g. via Simulator.reset_context or Value.set_value). Things like DiagramBuilder.AddSystem are currently correct, given that we cannot delete the system.
If we do want to permit adding / removing things, perhaps we should consider shared_ptr / weak_ptr vs. unique_ptr / T*, for simplicity with things like Python (where we wouldn't have to annotate anything).

The rationale for the current C++ APIs is not that.

Can I ask what the rationale was? Was it purely following GSG, or do we like the exclusive-ownership w/o having to do some sort of registration setup? (If it's the exclusive ownership, it feels like T* opens its own can of worms?)


sherm1 commented Mar 19, 2020

I do recall the runtime cost of shared_ptr as being one of the factors in our banning them from common use in Drake. But I think the free-for-all ownership was a bigger factor.

EricCousineau-TRI commented:

I am going to make a broader-scoped issue, since that is where this is headed.

EricCousineau-TRI commented:

Unassigning myself for now (it's easier to tell responsibility when there's only one owner).
