Airsim crashes during training for reinforcement learning #123

Open
ghost opened this issue Nov 11, 2019 · 17 comments

@ghost

ghost commented Nov 11, 2019

I am using reinforcement learning to train my model, but the AirSim engine crashes after a few hundred episodes. Before crashing, it prints the following error multiple times:

error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.
error while optimizing with nlopt: This likely means the optimization aborted early.

and finally I have this error:

Signal 11 caught.
Malloc Size=65538 LargeMemoryPoolOffset=65554 
CommonUnixCrashHandler: Signal=11
Malloc Size=65535 LargeMemoryPoolOffset=131119 
Malloc Size=86336 LargeMemoryPoolOffset=217472 
terminating with uncaught exception of type std::__1::bad_weak_ptr: bad_weak_ptr
Signal 6 caught.
Failed to find symbol file, expected location:
"/home/kaveh/AirSim/AirSim_Training/AirSimExe/Binaries/Linux/AirSimExe.sym"
terminating with uncaught exception of type std::__1::bad_weak_ptr: bad_weak_ptr
Signal 6 caught.
Malloc Size=44187 LargeMemoryPoolOffset=261675 
Engine crash handling finished; re-raising signal 11 for the default handler. Good bye.
Segmentation fault (core dumped)

@madratman
Contributor

Are you using moveOnSpline as your "action" - aka your policy is spitting out target waypoints and/or velocities which are being sent to moveOnSpline APIs?
When this error happens, can you try logging the waypoints being sent to moveOnSpline and the drone odometry (from getMultirotorState()) so we can reproduce it.
Or is it that you are using moveOnSpline for taking off?
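A minimal sketch of the logging requested above, assuming a Python client object that exposes `getMultirotorState(vehicle_name=...)` and returns a state whose `kinematics_estimated.position` and `kinematics_estimated.linear_velocity` carry `x_val`, `y_val`, `z_val` fields, as in the AirSim Python client. The helper names, the default vehicle name `drone_1`, and the CSV row layout are hypothetical.

```python
import csv
import time


def odometry_row(client, vehicle_name="drone_1"):
    """Read the current state and flatten position and velocity into a list."""
    # vehicle_name default is an assumption; use whatever your settings define.
    state = client.getMultirotorState(vehicle_name=vehicle_name)
    pos = state.kinematics_estimated.position
    vel = state.kinematics_estimated.linear_velocity
    return [time.time(), pos.x_val, pos.y_val, pos.z_val,
            vel.x_val, vel.y_val, vel.z_val]


def log_command(writer, client, command, vehicle_name="drone_1"):
    """Append one CSV row: timestamp, odometry, then the command values sent."""
    writer.writerow(odometry_row(client, vehicle_name) + list(command))
```

Calling `log_command(csv.writer(f), client, waypoint)` right before each motion command gives a file in which the last rows before a crash show what was sent and where the drone was.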

@ghost
Author

ghost commented Nov 13, 2019

No, I am using moveByVelocityAsync. I only need to move the drone a little at each step based on the output of my policy function, so there is no need to use moveOnSpline. I send a move action to the drone every 100 milliseconds. I also tried different time steps but always got the same error.
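For reference, the stepping pattern described here could look roughly like the sketch below. It assumes a client whose `moveByVelocityAsync(vx, vy, vz, duration, vehicle_name=...)` matches the AirSim Python client; `run_episode`, the `policy` callable, and the default vehicle name are hypothetical stand-ins, and a real policy would take an observation rather than a step index.

```python
import time


def run_episode(client, policy, max_steps, step_s=0.1,
                vehicle_name="drone_1", sleep=time.sleep):
    """Send one velocity command per step, spaced step_s seconds apart."""
    for step in range(max_steps):
        # policy is a hypothetical callable returning (vx, vy, vz).
        vx, vy, vz = policy(step)
        # The returned future is deliberately not joined; pacing is done
        # with a wall-clock sleep instead.
        client.moveByVelocityAsync(vx, vy, vz, duration=step_s,
                                   vehicle_name=vehicle_name)
        sleep(step_s)
```

The `sleep` parameter is injectable only so the loop can be exercised without waiting in real time.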

@madratman
Contributor

madratman commented Nov 13, 2019

Hmm, so this might be caused by drone_2 then, as that error message is associated with moveOnSpline.
How are you resetting the episode?
Try using this reset function: #94 (comment)

@ghost
Author

ghost commented Nov 13, 2019

I did it but there was no difference. The only way to prevent this error is to load the environment again, but that makes my training very slow and it is impossible to train a reinforcement learning model with that speed.

@madratman
Contributor

@kavehkamali hmm, so to repro this - I'll call the dummy_reset function in a loop by modifying this script : https://github.com/microsoft/AirSim-NeurIPS2019-Drone-Racing/blob/master/tests/test_reset.py. If you have a better way to help us repro this, let me know.
Also, you're on linux and qualification binaries, I assume?
In general, @yannbouteiller did you ever face this post the reset fixes?

@madratman
Contributor

Alright, I am able to repro it with this script https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9#file-airsim_neurips_reset_episode_crash_repro-py. Looking into it

@madratman
Contributor

Alright, you just need to sleep a bit after the call to simResetRace() and before the call to simStartRace. See updated gist here https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9
When reset is called, the drones lie on top of each other at world origin, then when simResetRace() is called, the drone meshes are teleported to the center of the cages. Now, due to gravity the meshes fall down for a fraction of second before settling down at the bottom.
If you call simStartRace before they settle down, I think the spline fitter is perhaps getting something weird as the current position (I need to look a bit more to see exactly why this happens), but with a 0.5 second sleep (could be less) b/w reset and simStartRace, the sim is not crashing.
You can see the diff in the gist from the previous comment to this comment here https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9/revisions

@madratman
Contributor

madratman commented Nov 13, 2019

I updated the gist https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9 one more time and increased the amount of sleep b/w simResetRace and simStartRace to 1.0. There's no sleep needed b/w reset and simResetRace, so I removed it - see diff here https://gist.github.com/madratman/e617b53ec20c5f38a7d10633ba3a42c9/revisions

With 0.5 s of sleep, I did see the crash happen after some time, but 1 second is proving to be stable for more than half an hour.
Made a little screencast https://www.youtube.com/watch?v=UuCm8Sp3P_U&
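The reset sequence described in these comments can be sketched as follows. This is not the linked gist, only an illustration under assumptions: the client is taken to expose `reset()`, `simResetRace()`, and `simStartRace(tier)` as the thread names them, passing `tier` positionally is an assumption, and `reset_episode` and `settle_s` are hypothetical names.

```python
import time


def reset_episode(client, tier=1, settle_s=1.0, sleep=time.sleep):
    """Reset the sim, let the drones settle, then start the race."""
    client.reset()
    # No pause here: the thread reports none is needed between these calls.
    client.simResetRace()
    # Wait for the teleported drone meshes to come to rest before starting
    # the race; 1.0 s is the value reported stable in this thread.
    sleep(settle_s)
    client.simStartRace(tier)
```

Keeping `settle_s` as a parameter makes it easy to try shorter waits and see where crashes start to reappear.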

@ghost

ghost commented Nov 14, 2019

Thank you for the update, I will try this.

@madratman
Contributor

Is it working for you now?

@yannbouteiller

yannbouteiller commented Nov 14, 2019

Oh is this where those crashes come from?

Sleeping for 1.0s is super costly, though.

(Edit: oh, we are apparently not talking about the same crashes)

@ghost
Author

ghost commented Nov 14, 2019

I am running the simulation at a clock speed of 20, so sleeping for 1 second is too much for me.
For now, I removed the competitor drone completely from the API and it works fine when I turn off the graphics in tier 1. Later, I will fine tune for two drones.

@ghost
Author

ghost commented Nov 14, 2019

@madratman Also, the IP of the competitor drone is hard-coded in the API. I had to change the API to be able to set the IP for the competitor drone.

@madratman
Contributor

madratman commented Nov 15, 2019

I've updated the AirSim Linux v0.3 training binaries, the pythonclient to 1.1.1, and the gist.
Now, the drones are reset close to the floor, and reset time is reduced. You can probably go lower than the current 0.1*2 sleeps, but I haven't tried it.
I found that if reset is called before drone_2 has finished taking off (.join() of the takeoff call to drone_2 hasn't returned), the simulator crashes.
So, I am now instead sleeping in the pythonclient 1.1.1.

There seems to be one rare edge case, which happens when reset is called from another thread at the same time as drone_2 is finishing its takeoff and starting the fly_through_all_gates..() call. At that point, I think the sim is freezing.
I saw a sim freeze when reset was called at race time 3.997 seconds.

yep, yann, we were talking about different things. I just tagged you to check if you had seen this

@ghost
Author

ghost commented Nov 15, 2019

Thanks!

@changpowei

Hi @madratman
After switching to AirSim Drone Racing Lab, I still face the same problem as @kavehkamali! When I'm training an agent for RL, the AirSim Unreal Engine crashes unpredictably.
This time, I had already trained 5588 steps in 979 episodes. The Unreal Engine crashed while the quadrotor was taking an action, not during the reset between two episodes. The details are shown below.

image

@kavehkamali May I ask you how to remove the competitor drone in the tier 1 environment? And how to turn off the graphics in tier 1?
Looking forward to your reply. Thanks.

@vespagroup

@changpowei I had a similar issue where the AirSim Unreal engine kept crashing in the middle of an episode. For me I was making frequent updates to moveOnSplineAsync(), and it would randomly crash. I tried the fix suggested here: #94 (def dummy_reset). I placed the reset function at the end of each episode and now my code runs without crashing AirSim. I just ran it for about 1.5 hrs with no issues.
