Robot controller error due to Vulkan crash #18

Closed
cheneeheng opened this issue Mar 22, 2021 · 13 comments

@cheneeheng

Hi there,

I have just installed BenchBot successfully on a machine with an RTX 2080 (8GB), 32GB RAM, and an i7-9700K CPU.

But when I try to run
benchbot_run --robot carter --env miniroom:1 --task semantic_slam:passive:ground_truth
I keep getting a robot controller error (small snippet below; the full log is in the attached file).

I'm wondering if you guys ever encountered this.

Thanks!

Chen.


...
Supervisor is now available @ 'http://0.0.0.0:10000' ...

Waiting until a robot controller is found @ 'http://benchbot_robot:10000' ...
Found
Sending environment data & robot config to controller ...
Ready

################################################################################
####################### BENCHBOT ROBOT CONTROLLER ERROR ########################
################################################################################

ERROR: The BenchBot Robot Controller container has exited unexpectedly. This
should not happen under normal operating conditions. Please see the complete
log below for a dump of the crash output:

Robot controller is now available @ 'http://0.0.0.0:10000' ...
Waiting to receive valid config data...
172.20.0.102 - - [2021-03-22 15:04:04] "GET // HTTP/1.1" 200 152 0.000542
172.20.0.102 - - [2021-03-22 15:04:05] "POST //configure HTTP/1.1" 200 137 0.066839
Starting the requested real robot ROS stack ...
THE PROCESS STARTED BY THE FOLLOWING COMMAND HAS CRASHED:
sed -i "0,/"pose":/{s/("pose": )(.)/\1[0.7, 0, 0, -0.7, 1.2, 1.5, 0.3]/}" /benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full_config.json && perl -0777 -i -pe 's/"static_mesh".?]/"static_mesh":[{"name": "bottle"}, {"name": "cup"}, {"name": "knife"}, {"name": "bowl"}, {"name": "wine glass"}, {"name": "fork"}, {"name": "spoon"}, {"name": "banana"}, {"name": "apple"}, {"name": "orange"}, {"name": "cake"}, {"name": "potted plant"}, {"name": "mouse"}, {"name": "keyboard"}, {"name": "laptop"}, {"name": "cell phone"}, {"name": "book"}, {"name": "clock"}, {"name": "chair"}, {"name": "table"}, {"name": "couch"}, {"name": "bed"}, {"name": "toilet"}, {"name": "tv"}, {"name": "microwave"}, {"name": "toaster"}, {"name": "refrigerator"}, {"name": "oven"}, {"name": "sink"}, {"name": "person"}]/s' /benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full_config.json && cd "/benchbot/addons/benchbot_addons/benchbot-addons/envs_isaac_develop/environments" && .sim_package/IsaacSimProject.sh "/Game/AI_vol3_03_base/Maps/AI_vol3_scene_03" -isaac_sim_config_json= "/benchbot/isaac_sdk/apps/carter/carter_sim/bridge_config/carter_full.json" -windowed -ResX=960 -ResY=540 -vulkan -game

...

log.txt

@btalb btalb self-assigned this Mar 22, 2021
@btalb
Collaborator

btalb commented Mar 22, 2021

Thanks for reporting @cheneeheng .

We've seen this arbitrary Segmentation fault (core dumped) issue occur before when running on machines with non-standard graphics configurations. Unfortunately, it comes from Isaac Sim and doesn't give us much detail regarding the cause (everything that says "Error" in that log is part of a normal working run...).

Can we confirm how you are using the machine specified:

  1. Are you directly on the machine, and does it have a physical screen attached?
  2. Are you connecting via SSH with window forwarding?
  3. Are you using other remote software like remote desktop or alternatives?

@cheneeheng
Author

Hi @btalb,

aiks that does not sound good.

  • But what do you mean by non-standard ?
  • Setup other than the ones mentioned in the prerequisite ?
  • What did you do the last time you encountered this issue?

As for your questions:

1. Are you directly on the machine, and does it have a physical screen attached?
Yes and yes.

2. Are you connecting via SSH with window forwarding?
No.

3. Are you using other remote software like remote desktop or alternatives?
No. Although the plan is to do so once everything is running.

Thanks !

@btalb
Collaborator

btalb commented Mar 23, 2021

The core of the issue is Vulkan only seems to be happy when it is using a discrete GPU to render to a physical screen.

The reason I ask all of those questions is that we have had these issues when using configurations that tamper with that relationship. For example:

  • I have a working system next to me (see log_success.txt),
  • but when I instead SSH into it with window forwarding I get the same SegFault (see log_failure.txt)

I can see Vulkan is the cause as I see the following extra lines in the failure when I diff the logs:

[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkDestroySurfaceKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceSupportKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceCapabilitiesKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfaceFormatsKHR
[2021.03.23-20.25.55:954][  0]LogRHI: Warning: Failed to find entry point for vkGetPhysicalDeviceSurfacePresentModesKHR
[2021.03.23-20.25.55:959][  0]LogLinux: Warning: MessageBox: Failed to find all required Vulkan entry points! Try updating your driver.: No Vulkan entry points found!: 
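
As a quick check for these warnings in a saved log, here is a minimal sketch. The function name count_vulkan_entry_failures is hypothetical (it is not part of BenchBot), and it assumes the crash output has been saved to a plain-text file such as the attached log.txt.

```shell
#!/usr/bin/env bash

# Hypothetical helper: count the Vulkan entry-point warnings in a log file ($1).
count_vulkan_entry_failures() {
  # grep -c prints the number of matching lines. It exits non-zero when
  # that number is 0, so "|| true" keeps the function usable under "set -e".
  grep -c 'Failed to find entry point for vk' "$1" || true
}

# Example usage (uncomment to run against a saved log):
# count_vulkan_entry_failures log.txt
```

A non-zero count suggests the same missing-entry-point failure shown in the lines above; a count of 0 means the crash probably has a different cause.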

I can also see those lines in the log you provided me. We need to dig a little deeper though to try and figure out why Vulkan is throwing those errors for the simulator:

  1. Run a barebones Vulkan command to show a spinning cube (see vkcube.mp4). Here's the command you need:

     docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vkcube'

  2. If that fails, can you show me the output of the following diagnostic command for Vulkan:

     docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vulkaninfo'
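
The two steps above could be wrapped into one script. This is a minimal sketch and not part of BenchBot: run_in_backend and vulkan_diagnose are hypothetical names, and the docker arguments are copied from the commands above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: run one command ($1) inside the benchbot/backend:base
# image, using the same docker arguments as the commands above.
run_in_backend() {
  docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all \
    -it benchbot/backend:base /bin/bash -c "$1"
}

# Hypothetical helper: try vkcube first, and only fall back to vulkaninfo
# if vkcube exits with an error.
# $1 (optional): name of the runner function, defaults to run_in_backend.
vulkan_diagnose() {
  runner="${1:-run_in_backend}"
  if "$runner" vkcube; then
    echo "vkcube OK"
  else
    echo "vkcube failed; collecting vulkaninfo"
    "$runner" vulkaninfo
  fi
}

# Example usage (uncomment to run against the real container):
# vulkan_diagnose
```

Calling vulkan_diagnose with no arguments uses the real container. Passing the name of another runner function allows a dry run without docker.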

Thanks; I wish these things were simpler...

@btalb btalb added the more-information Need more information from reporter label Mar 23, 2021
@cheneeheng
Author

I'll try them out once I get back to the lab on Friday.

I regularly work with programs using CUDA, so this is not the worst I have seen 😃

@tyou1

tyou1 commented Mar 25, 2021

Hi,

I encountered the same error log as @cheneeheng today when I tried to run

benchbot_run --robot carter --env miniroom:1 --task semantic_slam:passive:ground_truth

Last week when I ran this command, there was no error, but no simulator window appeared after this output:

Supervisor is now available @ 'http://0.0.0.0:10000' ...
Waiting until a robot controller is found @ 'http://benchbot_robot:10000' ...
Found
Sending environment data & robot config to controller ...
Ready

I tried this command as suggested:

docker run -e DISPLAY --volume /tmp/.X11-unix:/tmp/.X11-unix --rm --gpus all -it benchbot/backend:base /bin/bash -c 'vkcube'

Unfortunately, I get this error:

'Cannot find a compatible Vulkan installable client driver (ICD)'

I am using x2goclient via SSH to connect to a remote machine with an RTX 3080. I wonder whether the remote desktop is the reason the simulator window doesn't appear?

Thanks in advance!

@btalb btalb changed the title Robot controller error Robot controller error due to Vulkan crash Mar 25, 2021
@btalb
Collaborator

btalb commented Mar 25, 2021

Thanks for the information @tyou1 .

The behaviour you experienced last week with the window showing up and disappearing was a bug which was hiding the error log. We weren't correctly bubbling the crash log up to benchbot_run's stdout. That's been fixed in v2.1 (benchbot_run --version), so you should always see the crash log now when the simulator crashes.

What's important to understand with any remote access system is how it actually performs the rendering. I don't know much about x2go, but here are some examples I know of where a laptop is the client and the GPU machine is the server:

  • SSH with X forwarding: this passes the X rendering commands to your client machine (i.e. if I SSH into my GPU server from my laptop, my laptop does the rendering)
  • VNC servers & clients: this forwards whatever the server renders on its screen to your client, which just shows an image of what's been rendered by the server
  • Remote desktop systems with virtual screens: these often create virtual X servers & forward their contents to the client. Where things get tricky is the nvidia driver can only be tied to a single X server & can't be swapped without killing the entire X server. So when you turn your computer on & sign in, your GPU is locked away in that default X server and can't be used by the virtual X server created for remote desktops. TL;DR: remote desktop software almost never provides hardware-accelerated rendering, and my guess is x2go falls in this bucket

How does this tie in with BenchBot? NVIDIA's Isaac Simulator relies on hardware-accelerated rendering powered by Vulkan. If the system doing the rendering doesn't meet those requirements, then we get a crash from the simulator (a crash that should be much more verbose & explicit.... but a crash nonetheless).

So from those requirements, there are only a couple of solutions I would expect to work for cases where your GPU is on a remote machine:

  • VNC: set up a VNC server on your GPU machine & connect to it via a VNC client. Make sure the VNC server is attached to the same X server as your GPU. This is the easiest option to use BenchBot on a remote machine, and is extremely simple if the remote machine has a physical screen attached (@david2611 uses a solution like this one all the time).
  • SSH with tweaked X forwarding: you essentially tell benchbot_run to render the simulator on your GPU machine's screen, not your laptop's. The downside is you won't see the simulator remotely on your laptop, but it will run successfully (I use this solution all the time). To tell benchbot_run where to render you simply adjust the DISPLAY environment variable. For example, terminals opened on my GPU machine show this:
    ben@gpu-machine:~$ echo $DISPLAY
    :1
    
    Then when I'm SSHing from home, I manually set the DISPLAY target to :1 via:
    ben@home-machine:~$ ssh -X ben@gpu-machine
    ben@gpu-machine:~$ echo $DISPLAY
    localhost:10.0
    ben@gpu-machine:~$ export DISPLAY=:1
    ben@gpu-machine:~$ benchbot_run ...
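
To guard against running with the forwarded display by accident, a small check could sit in front of benchbot_run. This is only a sketch: display_kind is a hypothetical helper, and treating localhost:N values as SSH-forwarded is a heuristic based on the output shown above.

```shell
#!/usr/bin/env bash

# Hypothetical helper: classify a DISPLAY value ($1).
#   ":1"             -> local      (X server on this machine)
#   "localhost:10.0" -> forwarded  (typical of "ssh -X", as shown above)
#   ""               -> unset
#   anything else    -> remote
display_kind() {
  case "$1" in
    "") echo "unset" ;;
    localhost:*|127.0.0.1:*) echo "forwarded" ;;
    :*) echo "local" ;;
    *) echo "remote" ;;
  esac
}

# Example usage (uncomment to warn before starting the simulator):
# [ "$(display_kind "$DISPLAY")" = "local" ] || \
#   echo "DISPLAY=$DISPLAY is not a local X server; try: export DISPLAY=:1" >&2
```

The display number itself (:1 here) is specific to each machine, so the check only reports the kind of display and leaves the choice of value to you.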
    

Hope this helps. I know it's not an ideal solution, but hardware-accelerated rendering under Linux with Vulkan support is something that's traditionally caused enough challenges by itself. Crisp solutions for remote use on top of this unfortunately aren't quite there yet.

We're always interested in better solutions though. If anyone knows of better ways to enable remote hardware-accelerated rendering, especially on headless machines, we'd love to hear them. Unfortunately, it's not something I have time to dig too far into at the moment.

@cheneeheng
Author

cheneeheng commented Mar 26, 2021

@btalb the vulkaninfo command is returning this error:

No protocol specified
WARNING: [Loader Message] Code 0 : loader_icd_scan: Can not find 'ICD' object in ICD JSON file /usr/share/vulkan/icd.d/nvidia_layers.json. Skipping ICD JSON
error: XDG_RUNTIME_DIR not set in the environment.
No protocol specified
XCB failed to connect to the X server due to error:1.
ERROR at /build/vulkan-tools-1.2.162.1~rc1-1lunarg18.04/vulkaninfo/vulkaninfo.h:847: AppCreateXcbSurface failed to establish connection


Update 1:
Reinstalled all nvidia drivers and cuda just to be safe.
Added root access to the X server with xhost local:root, and both Vulkan debug commands now work, but the original error still persists.


Update 2:
So I ran xhost local:root, and commented out the line xhost -local:root > /dev/null and it works. 😃
Somehow running the original script removes the root access to the X-server and causes the error at the beginning of this comment to occur again.

Update 2.1:
Worked through the tutorials, everything is working fine. (though some commands seem to be outdated 😃 )

Update 2.2:
It seems that removing this line xhost -local:root > /dev/null makes the error go away.

The issue can be closed if @btalb doesn't need anything more from my side.

@btalb
Collaborator

btalb commented Mar 29, 2021

That's excellent @cheneeheng, great to hear!

I'm not sure how that series of errors is related (I've never seen the first error before, even before xhost local:root was added into the scripts). Did a reboot fix that error?

It's a little odd that this line is causing issues with the containers, as it runs after all of the containers have started and so shouldn't affect them. But maybe there is some asynchronous behaviour causing race conditions. Thanks for pointing that out though, that's a really good find.

I'll close this issue here, but feel free to open a new issue with any outdated commands you find in the documentation / tutorials. I'm always keen to fix those when they're found. Unfortunately, I'm a little documentation blind by this point.

@btalb btalb closed this as completed Mar 29, 2021
@cheneeheng
Author

Reboot (x3) did not fix the error. Only the xhost command did.

@tyou1

tyou1 commented Apr 6, 2021

Hi

May I ask where the line xhost -local:root > /dev/null that you commented out is, @cheneeheng? Unfortunately, I still get this crash log after switching to Remmina to connect to the remote machine (RDP). So I wonder if something other than the remote access problem is causing the crash.

Or does the simulator window only appear successfully with a VNC server and client? May I ask which specific VNC server and client @david2611 uses to run BenchBot smoothly?

Thanks a lot! :)

@cheneeheng
Author

@tyou1

Here is the line :

xhost -local:root > /dev/null

You could try running xhost +local:root in a terminal and then run the vkcube command to check whether the access rights are the problem. Also make sure the DISPLAY environment variable is set correctly when you do this 😃
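
To see whether the access rights are in place before and after a run, something like the following could help. It is a rough sketch: has_local_x_access is a hypothetical name, and the entries it looks for (LOCAL:, SI:localuser:root, or a line starting with "access control disabled") are assumptions about what xhost prints, which may differ on your system.

```shell
#!/usr/bin/env bash

# Hypothetical helper: read the output of "xhost" on stdin and succeed if it
# suggests that local root clients (such as docker containers) may connect.
# Assumed output formats, not verified for every xhost version:
#   "LOCAL:"                 after "xhost +local:root"
#   "SI:localuser:root"      after "xhost +si:localuser:root"
#   "access control disabled..." when all clients are allowed
has_local_x_access() {
  grep -Eq '^(LOCAL:|SI:localuser:root)$|^access control disabled'
}

# Example usage (uncomment to check the current X server):
# if xhost | has_local_x_access; then
#   echo "local root access to the X server looks enabled"
# else
#   echo "no local root access; try: xhost +local:root" >&2
# fi
```

Running the check once before benchbot_run and again after it exits would show whether the script's xhost -local:root line has revoked the access.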

@btalb
Collaborator

btalb commented Apr 6, 2021

Hi @tyou1 , good question.

Only VNC will work as RDP generally creates a virtual X server which won't have the hardware accelerated rendering.

@david2611 uses NoMachine; just make sure it's not using a virtual screen.

There's plenty of simple VNC options out there also like:

  • TigerVNC (I had success with this many years ago)
  • TightVNC
  • RealVNC
  • Xvnc
  • etc.

The crucial thing is just to make sure it is mirroring a physical screen, and not creating a virtual one.

@btalb
Collaborator

btalb commented Apr 6, 2021

Remmina should also be fine as a VNC client to connect to a server.
