-
Notifications
You must be signed in to change notification settings - Fork 93
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AirSim randomly freezes #104
Comments
Hmm, can you help us reproduce this? Can you give us a minimal function which replicates the calls you have? |
Also, I modded your reset function slightly. def dummy_reset(baseline_racer, drone_names, tier):
baseline_racer.airsim_client.simPause(True)
baseline_racer.airsim_client.reset()
for drone_name in drone_names:
baseline_racer.airsim_client.enableApiControl(vehicle_name=drone_name)
baseline_racer.airsim_client.arm(vehicle_name=drone_name)
baseline_racer.airsim_client.simPause(False) # unpause sim to simresetrace works as it's supposed to
baseline_racer.airsim_client.simResetRace()
baseline_racer.airsim_client.simStartRace(tier=tier) |
No, I have also made this modification in my original code. I'll try to reproduce this in a minimal example. |
Well, this is super difficult to reproduce, I ran a testing algorithm that performed random steps all night long without any issue, it only seems to happen when I am using my training framework, after a somewhat constant number of time steps. I guess I'd better build a workaround for this. |
What happens is that the process coresponding to the
command keeps eating 108% CPU on my machine, but the window and API become unresponsive and the GPU seems to stop being used. After a while, the pop up appears saying
|
Hmm, this sounds like a lot of fun. Does the CPU usage increase over time, or is it always constant? I think we'll have to log the linux process stats - cpu, ram, gpu usage et cetera. In general, https://github.com/Syllo/nvtop is useful. Let me see how we can log all this info. haven't done this before. I'll keep editing this comment as I find something useful/ Edits:
|
Well, I have some more info but not very useful I'm afraid. Actually the freeze doesn't happen on reset(), it happens before that: the window freezes but the simulator is responsive, until we call reset() or SimResetRace(), which hangs forever. I'm using glances to monitor CPU/memory/Disk usage. The GPU (780m) I use on my laptop to crash-test my algorithms is too old for nvidia-smi, so nvtop doesn't work, I use CPU-X instead for monitoring GPU activity. When the freeze happens, the GPU usage drops to 0, and the CPU usage of the simulator drops from ~ 300% to ~ 150 % when reset() or simResetTrack() are called after this (then the program hangs). Unreal logs and an example output of psrecord (simulator's PID) are attached. The output of psrecord has weird scale for some reason but the shape of the plot corresponds to what I see in glances: CPU usage drops when hanging on reset(). |
So, as a workaround for this issue, when this happens I kill and restart the simulator, which apparently works well, I even use reset() again with no issue, until I make a call to the control API. Then I get the following error (f is the future object returned by the control API):
|
Ah, that seems like the previous Python process is zombieing out in the background. Hence, the tornado error saying ioloop is already running. If you can grab it with a script and kill it along with the unreal binary, that should do it. |
Hi @madratman , the problem is, I don't kill the python process because it is responsible for holding the environment, so it keeps running. How can I reinitialize the API ? Calling directly the stop() method of airsim_client.client._loop doesn't work. When I do this before restarting the simulator and creating a new MultirotorClient, reset() will work but subsenquent call to the control API will hold forever instead of throwing the exception. |
This might feel icky, but I'd try generating a new multirotor client object. I am not immediately sure if this solves your problem - but you can test it by opening a new python process when the freeze happens, instantiate a multirotorclient object, and see if confirmconnection returns true and you're able to control the drone. I am looking at the tornado's API as well - |
I already have a function that instantiates a new MultirotorClient. The problem is that tornado doesn't really allow me to use it several times within the same process, because it keeps a global instance of its IOLoop. |
hmm doing some quick digging for the global ioloop you just mentioned: the following snippet sounds interesting from the tornado's ioloop constructor doc https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop
https://www.tornadoweb.org/en/stable/ioloop.html#tornado.ioloop.IOLoop msgpackrpc/loop.py has the wrapper class over tornado's api. which is calling make_current()
cc @sytelus any insights? |
(btw when I create a new MultirotorClient(), confirmConnection() returns "Connected", but I am not able to control the drone because I get the "RuntimeError: IOLoop is already running" exception when I call join() ) |
Part of an answer for the workaround seems to be setting the timeout_value of MultirotorClient() to a small number (e.g. 10) and to catch the TimeoutError exception on calls to reset() to decide whether to restart the simulator. Doing this almost works. After the simulator has been killed and reinstantiated, with a new MultirotorClient, reset() works, arm() returns True (enableApiControl always returns None so I don't know about it), BUT when I call the control API (moveByRollPitchYawrateZAsync().join() in this case), join() hangs until timeout (10s instead 0.1s which would be the value of duration) and no control is applied. Everything else seems to work, I get the kinematics etc. |
I have updated the description of the freeze. I noticed something weird: when the freeze happens in the simulator window, if I kill the python program that launched the simulator and that runs the API, then the window unfreezes. |
Okay, I have finally succeeded at writing a minimal example script that reproduces the bug. Apparently, this happens when the simulator is launched from the python program with subprocess.Popen (which I need to do because my environment manages several parallel instances or AirSim): For me the freeze happens after ~ 30 episodes |
I think I found the problem, rather than:
I should use:
Seems like I got rid of the freeze like this. I suppose AirSim was printing a ton of garbage that was caught in this subprocess.PIPE which I don't really know what it is, and at some point there was an overflow? I still cannot succeed at restarting the simulator, but hopefully without this freeze I won't need to do this anymore. |
(updated)
Hello again.
Sorry this never ends, but there is another recurrent bug with AirSim that makes the simulator freeze usually after maybe ~ 10000 - 20000 calls to the control API. I thought this was related to the bug with west when one drone crossed the last gate but apparently it is not.
The problem with this one is that the simulator doesn't crash, the image freezes, the window becomes unresponsive, but subsequent calls to the API seem to work until we call reset(), which hangs until timeout.
Even weirder, if we don't catch the TimeoutError from msgpckrpc (and so the python program exits with an error), then the window unfreezes.
The text was updated successfully, but these errors were encountered: