Cloudsim startup and total runtime vs simulation time #412
Can you define what you mean by "ready to go"? For example, does that include all of the robot images entering some state? You will have to do some amount of checking, because there is no feature that says a robot is ready. The simulation docker container starts first, so some simulation time will elapse before all of the robot docker images are pulled from ECR and launched. It looks like 3afb2077-590d-4f45-90a4-068d2bea84d5 ran for 344 seconds of simulation time. There could be a number of reasons why a simulation takes a long time to complete, including pulling images, lack of machines, and uploading log files.
I don't have an answer regarding the emails. I'll look into the email system.
I am aware there is currently no such feature; I am asking for such a feature to be added, for the reasons below. Some amount of checking is ok.

The simulation 3afb2077-590d-4f45-90a4-068d2bea84d5 was in the "Running" state (as seen on the website) for more than 12 hours. I am not sure what "Running" means exactly, but to an outsider it would seem to exclude the things you mention, like pulling images, lack of machines, or uploading log files. I would have expected the first two to happen before entering the "Running" state and the last one to happen after exiting it. Your description of the startup seems to suggest that images are being pulled while the simulation is already running. Is that correct? So the bigger the image, the bigger the delay? That might explain part of the problem, since this test was run with a new CUDA base image which is larger than the original OpenGL image.

If I understand correctly, the start of the solution containers can be delayed by hours (days?) from each other and from the start of the simulation, depending on, for example, how big the image is. So if one robot is controlled by a small image, it might start immediately, while a multi-gigabyte image could start hours later. This could be further intensified when EC2 machines are not available: the simulation is running, one solution is running, and the other solutions are waiting for EC2 availability.

If my interpretation above is correct, we definitely need some way for the solution containers to check whether it is ok to start competing. Currently, there is no way to find out if all the solution machines have been started and all solution containers are instantiated and ready to go. Or is there?
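Lacking an official readiness signal, the checking has to be done client-side. A minimal sketch of one workaround (not part of cloudsim; the check callables are placeholders for whatever the solution actually needs, e.g. topic-subscriber checks):

```python
import time

def wait_until_ready(checks, timeout=600.0, poll=1.0,
                     clock=time.monotonic, sleep=time.sleep):
    """Poll every named readiness check until all pass or `timeout` elapses.

    `checks` maps a label to a zero-argument callable that returns True
    once that resource (e.g. a topic subscriber) is available.  Returns
    the sorted list of labels that never became ready (empty on success).
    """
    deadline = clock() + timeout
    pending = dict(checks)
    while pending and clock() < deadline:
        # Collect labels first so we don't mutate the dict while iterating.
        for label in [l for l, check in pending.items() if check()]:
            del pending[label]
        if pending:
            sleep(poll)
    return sorted(pending)
```

The injectable `clock`/`sleep` arguments are only there to make the helper easy to test; in a real solution container the checks would wrap calls such as "does `/clock` publish" or "does `/rosout` have a subscriber".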
I'll try to answer your questions:
Thanks for the answers.
According to our observations, it takes around 2 or 3 minutes to start the robot controller with a 9 GB image.
That is more involved than it seems, because for that to work the robots need to know the size of the group. This information can be hard-coded into the docker image, which precludes using one image for testing different scenarios. The only other way to pass this info to the robots is through the robot names, which allow only ASCII characters (correct me if I am wrong).
There is a ROS parameter, set in subt/subt_ign/launch/cloudsim_bridge.ign at line 84 (commit 2f5dd1c).
This is per rosmaster, and on cloudsim each robot has its own rosmaster, so the parameter contains the name of only that one robot.
Oh, you're right, there will only be a single robot on each bridge instance... We're actually building separate images for each team configuration, but it doesn't have to be very painful. We do this for each robot:

```shell
echo "FROM ${DOCKER_IMAGE_NAME}:latest" >"${script_dir}/tmp.Dockerfile"
echo "ENTRYPOINT [\"/ros_entrypoint.sh\", \"env\", \"roslaunch\", \"--wait\", \"subt_virtual\", \"${robot_config}.launch\", \"robot:=${robot_name}\", \"robots:=${robots}\", \"rec_subt_bag:=true\"]" >>"${script_dir}/tmp.Dockerfile"
docker build --rm -t ${DOCKER_IMAGE_NAME}:${robot_name} -f "${script_dir}/tmp.Dockerfile" --progress=plain "${script_dir}/.."
```

As this temporary Dockerfile is really lightweight (it only changes the entrypoint), the build is instant. But I agree it'd be nicer to have the list of robots passed to the bridge as a ROS parameter.
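Hand-escaping the quotes in that ENTRYPOINT line is error-prone. A sketch of generating the same two-line Dockerfile from Python instead (the variable names mirror the shell snippet above; everything else is an assumption, not part of the team's actual tooling):

```python
import json

def entrypoint_dockerfile(base_image, robot_name, robots, robot_config):
    """Return a two-line Dockerfile that only swaps the entrypoint.

    json.dumps emits the exec-form ENTRYPOINT array with correct quoting,
    so no manual backslash escaping is needed.
    """
    entrypoint = ["/ros_entrypoint.sh", "env", "roslaunch", "--wait",
                  "subt_virtual", f"{robot_config}.launch",
                  f"robot:={robot_name}", f"robots:={robots}",
                  "rec_subt_bag:=true"]
    return f"FROM {base_image}:latest\nENTRYPOINT {json.dumps(entrypoint)}\n"
```

The returned string can be written to `tmp.Dockerfile` and fed to the same `docker build` command as above.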
It sounds like this issue has transitioned to a feature request that would involve making each robot name available to each robot's ROS master. I'll mark this issue appropriately.
Thanks for considering this. To be a little more useful: it'd be great if our solution container would get the same command line as the sim container. The image's command line is a bit more helpful than ROS parameters, because it would allow us to run a bash script that decides which launch file to launch based on the robot type (and possibly name). That would allow us to have a single docker image for all robots and any configuration of them. Command-line parameters would also be backwards compatible, in the sense that if some team decides not to use them in their entrypoint, no change would be needed (it would just be a bash script called with more arguments than it actually reads).
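The kind of entrypoint dispatch described here could look like the following sketch. The `robot:=NAME` argument format mirrors the roslaunch-style entrypoint shown earlier in the thread; the type-to-launch-file mapping and the name convention are purely illustrative assumptions, not the actual cloudsim interface:

```python
def choose_launch(argv):
    """Pick a launch file from roslaunch-style command-line arguments.

    Expects arguments of the form key:=value, e.g. robot:=X1_1.
    """
    params = dict(arg.split(":=", 1) for arg in argv if ":=" in arg)
    robot = params.get("robot", "")
    # Hypothetical convention: the robot type is the prefix before "_".
    robot_type = robot.split("_", 1)[0]
    launch_by_type = {"X1": "ugv.launch", "X4": "uav.launch"}
    return launch_by_type.get(robot_type, "default.launch")
```

An entrypoint script would then exec `roslaunch` with whatever file this returns, so one image can serve any team configuration.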
I am experiencing a couple of strange behaviors on cloudsim.
First, I am still fighting with how to know when everything is ready to start. The first suggestion was to wait for /clock, but that is not enough (even when /clock is available, other stuff is not). In the end, I am waiting for almost everything the solution might need, one by one. This is an example of rosout from one such run (it is short, so I include it in full):
One of the things I started to wait for is at least one subscriber on /rosout, because without it, our /rosout is not logged. Here the subscriber surfaced only after 341 s of simulation time (in another run it was 375 s). Before that I was waiting only for a subscriber on /robot_data, and that usually happened between 80 and 100 s of simulation time. This is the contents of the resulting summary.yml:
But it took over 12 hours to get this result! The ID of the run is 3afb2077-590d-4f45-90a4-068d2bea84d5; it was started 5/3/20, 9:50 AM (local time), and the summary email was received at Sun, 3 May, 22:09 (again local time). I have started some other simulations that finished overnight, but I got no summary email about them, so I don't have the exact time when they finished.
I am right now running almost the same thing under the ID 0d63b81a-a066-4538-aa9d-81fa4b2437f1, which started 5/4/20, 9:48 AM; it is now 14:13 and the run is still going. The same test running locally takes about 5 seconds.
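The per-topic waiting described above (subscribers on /rosout, /robot_data, and so on) can be factored into a small helper. A sketch, written rospy-independently so it can be exercised without ROS: `pub` stands for any object with a `get_num_connections()` method, which `rospy.Publisher` provides.

```python
import time

def wait_for_subscriber(pub, timeout=600.0, poll=0.5,
                        clock=time.monotonic, sleep=time.sleep):
    """Block until `pub` reports at least one subscriber.

    `pub` is anything with get_num_connections(), e.g. a rospy.Publisher.
    Returns True once a subscriber appears, False if `timeout` elapses first.
    """
    deadline = clock() + timeout
    while clock() < deadline:
        if pub.get_num_connections() > 0:
            return True
        sleep(poll)
    return False
```

In a solution container this would be called once per publisher the solution depends on, before it starts competing.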
To summarize: