Cloudsim startup and total runtime vs simulation time #412

Open
zbynekwinkler opened this issue May 4, 2020 · 9 comments

@zbynekwinkler

I am experiencing a couple of strange behaviors on cloudsim.

First, I am still fighting with how to know when everything is ready to start. The first suggestion was to wait for /clock, but that is not enough (even when /clock is available, other things are not). In the end, I am waiting for almost everything the solution might need, one by one. This is an example of rosout from one such run (it is short, so I include it in full):

0.000000000  Node Startup
95.872000000 INFO /rosbag_robot_data [/tmp/binarydeb/ros-melodic-rosbag-1.14.5/src/recorder.cpp:394(Recorder::startWriting)] [topics: /rosout] Recording to '/home/developer/.ros/robot_data_0.bag'.
0.000000000 INFO /tf_world_static [/tmp/binarydeb/ros-melodic-tf2-ros-0.6.5/src/static_transform_broadcaster_program.cpp:142(main)] [topics: /rosout, /tf_static] Spinning until killed publishing world to simple_urban_01
0.000000000 DEBUG /cudatest [tcpros_base.py:559(TCPROSTransport.connect)] [topics: /clock, /rosout] connecting to 10.44.0.2 52869
341.904000000 INFO /cudatest [entrypoint.py:17(main)] [topics: /clock, /rosout] got rosout logger
341.904000000 INFO /cudatest [entrypoint.py:19(main)] [topics: /clock, /rosout] waiting for clock
341.908000000 INFO /cudatest [entrypoint.py:21(main)] [topics: /clock, /rosout] rospy.Time[341908000000]
341.908000000 DEBUG /cudatest [tcpros_service.py:112(contact_service)] [topics: /clock, /rosout] connecting to ('10.44.0.2', 38949)
341.912000000 DEBUG /cudatest [tcpros_service.py:112(contact_service)] [topics: /clock, /rosout] connecting to ('10.44.0.2', 38949)
341.912000000 DEBUG /cudatest [tcpros_base.py:559(TCPROSTransport.connect)] [topics: /clock, /rosout] connecting to 10.44.0.2 38949
341.920000000 INFO /cudatest [entrypoint.py:27(main)] [topics: /clock, /rosout] /subt/start: True
344.384000000 INFO /cudatest [entrypoint.py:32(main)] [topics: /clock, /rosout] torch ok
344.388000000 INFO /cudatest [entrypoint.py:32(main)] [topics: /clock, /rosout] tf ok
344.388000000 DEBUG /cudatest [tcpros_base.py:559(TCPROSTransport.connect)] [topics: /clock, /rosout] connecting to 10.44.0.2 38949
344.392000000 INFO /cudatest [entrypoint.py:36(main)] [topics: /clock, /rosout] /subt/finish: True

One of the things I started to wait for is at least one subscriber on /rosout, because without it our /rosout is not logged. Here the subscriber surfaced only after 341s of simulation (in another run it was 375s). Before that I was waiting only for a subscriber on /robot_data, and that usually happened between 80 and 100s of simulation time.
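
For illustration, this is roughly the kind of one-by-one waiting the entrypoint ends up doing (a minimal sketch, not the actual entrypoint.py; the /robot_data message type is a placeholder, and the assumption that /subt/start is a std_srvs/SetBool service follows the SubT virtual testbed):

#!/usr/bin/env python
# Minimal sketch of waiting for each piece of infrastructure one by one.
import rospy
from rosgraph_msgs.msg import Clock
from std_msgs.msg import String
from std_srvs.srv import SetBool

def main():
    rospy.init_node('cudatest')

    # 1. Wait for simulation time to start flowing.
    rospy.loginfo('waiting for clock')
    rospy.wait_for_message('/clock', Clock)
    rospy.loginfo('clock: %s', rospy.Time.now())

    # 2. Wait until the rosbag recorder subscribes to our data topic;
    #    before that, anything we publish is not recorded.
    data_pub = rospy.Publisher('/robot_data', String, queue_size=10)
    while data_pub.get_num_connections() == 0 and not rospy.is_shutdown():
        rospy.sleep(1.0)

    # 3. Only then ask the competition infrastructure to start the run.
    rospy.wait_for_service('/subt/start')
    start = rospy.ServiceProxy('/subt/start', SetBool)
    rospy.loginfo('/subt/start: %s', start(True).success)

if __name__ == '__main__':
    main()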

This is the contents of the resulting summary.yml:

was_started: 1
sim_time_duration_sec: 3
real_time_duration_sec: 2
model_count: 1

But it took over 12h to get this result! The ID of the run is 3afb2077-590d-4f45-90a4-068d2bea84d5; it was started on 5/3/20 at 9:50 AM (local time), and the summary email was received on Sun, 3 May at 22:09 (again local time).

I have started some other simulations that finished overnight, but I got no summary emails about them, so I don't have the exact times when they finished.

Right now I am running almost the same thing under the ID 0d63b81a-a066-4538-aa9d-81fa4b2437f1, which started on 5/4/20 at 9:48 AM, and it is now 14:13. The test takes about 5 seconds when run locally, and on cloudsim it is still running.

To summarize:

  • Do we really have to explicitly wait for and check every single piece of the infrastructure ourselves? Could there be one topic or service (or something similar) that would notify us that everything should now be ready to go, so that if it is not, it is a reportable error?
  • Why did rosout logging attach only after 341s of simulation in run 3afb2077-590d-4f45-90a4-068d2bea84d5?
  • Why did it take so long for 3afb2077-590d-4f45-90a4-068d2bea84d5 to finish?
  • Why am I missing most of the emails about simulations ending? They are not in spam.
@nkoenig
Contributor

nkoenig commented May 6, 2020

Can you define what you mean by "ready to go"? For example, does that include all of the robot images entering some state? You will have to do some amount of checking because there is no feature that says a robot is ready.

The simulation docker container starts first. There will be some elapsed simulation time before all of the robot docker images are pulled from ECR and launched.

It looks like 3afb2077-590d-4f45-90a4-068d2bea84d5 ran for 344 seconds of simulation time. There could be a number of reasons why a simulation takes a long time to complete, including:

  1. Time to pull ECR images
  2. Lack of EC2 machines
  3. Time to upload log files

I don't have an answer regarding the emails. I'll look into the email system.

@zbynekwinkler
Author

Can you define what you mean by "ready to go"? For example, does that include all of the robot images entering some state? You will have to do some amount of checking because there is no feature that says a robot is ready.

I am aware there is currently no such feature. I am asking for such a feature to be added because... see below. Some amount of checking is ok.

The simulation 3afb2077-590d-4f45-90a4-068d2bea84d5 was in the "Running" state (as seen on the website) for more than 12h. I am not sure what "Running" means exactly, but to an outsider it would seem to exclude the things you mention, like pulling images, lack of machines, or uploading log files. I would have expected the first two to happen before entering the "Running" state and the last one to happen after exiting it.

Your description of the startup seems to suggest that images are being pulled while the simulation is already running. Is that a correct statement? So the bigger the image, the bigger the delay? That might explain part of the problem, since the test I was running used a new CUDA base image, which is larger than the original OpenGL image.

If I understand correctly, the start of the solution containers can be delayed by hours (days?) relative to each other and to the start of the simulation, depending, for example, on how big the image is. So if one robot is controlled by a small image, it might start immediately, while a multi-gigabyte image could start hours later. This could be further intensified when EC2 machines are not available - the simulation is running, one solution is running, and the other solutions are waiting for EC2 availability.

If my interpretation above is correct, we definitely need some way for the solution containers to check whether it is OK to start competing. Currently, there is no way to find out whether all the solution machines have been started and all solution containers are instantiated and ready to go. Or is there?

@nkoenig
Contributor

nkoenig commented Aug 17, 2020

I'll try to answer your questions:

  1. The "Running" state as seen on the SubT Portal indicates that the simulation and all the robot images have been pulled, deployed, and are running.

  2. All machines are provisioned, and then the simulation is launched, followed by the bridges and robots. The bridges and robots launch in parallel. Launch means that the docker images are pulled and run. All of the images are pulled within AWS, so the time required should not be significant. Also, all robots are spawned into the simulation during the same simulation step.

  3. The start of a simulation can be delayed due to machine availability. This is usually not hours or days.

  4. You can use communication between robots to know when they are ready.
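
A purely illustrative sketch of the bookkeeping such a readiness handshake needs (how the READY messages actually travel between robots, e.g. over the SubT comms client, is left out here, and the message format is made up):

# Hypothetical readiness handshake: collect "READY:<name>" announcements from the
# other robots until the whole group has reported in.
def all_peers_ready(my_name, group_size, incoming_messages):
    ready = {my_name}
    for msg in incoming_messages:
        if msg.startswith('READY:'):
            ready.add(msg.split(':', 1)[1])
        if len(ready) >= group_size:
            return True
    return False

# Example: a three-robot group where the other two robots have announced readiness.
print(all_peers_ready('X1', 3, ['READY:X2', 'READY:X3']))  # True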

@nkoenig nkoenig self-assigned this Aug 17, 2020
@nkoenig nkoenig added the bug, enhancement and major labels Aug 17, 2020
@zbynekwinkler
Author

Thanks for the answers.

All of the images are pulled within AWS, so the time required should not be significant.

According to our observations, it takes around 2 or 3 minutes to start the robot controller with a 9 GB image.

You can use communication between robots to know when they are ready.

That is more involved than it seems, because for that to work the robots need to know the size of the group. This information can be hard-coded into the docker image, which precludes using one image for testing different scenarios. The only other way to pass this info to the robots is via the robot names, which allow only ASCII characters (correct me if I am wrong).
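
For example (with a purely hypothetical naming scheme), the group size could be encoded in the ASCII robot name and parsed back out:

# Hypothetical: encode "robot 2 of 5" in the robot name itself, e.g. "EXPLORER2OF5".
import re

robot_name = 'EXPLORER2OF5'
match = re.match(r'.*?(\d+)OF(\d+)$', robot_name)
if match:
    index, group_size = int(match.group(1)), int(match.group(2))
    print('robot %d of %d' % (index, group_size))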

@nkoenig nkoenig removed the enhancement label Nov 18, 2020
@peci1
Collaborator

peci1 commented Jan 6, 2021

for that to work the robots need to know the size of the group

There is a ROS parameter /robot_names available in each solution container. It is a comma-separated list of all robots in the current run.

ign launch cloudsim_bridge.ign -v 4 $@

<command>roslaunch subt_ros competition_init.launch world_name:=<%=$worldName%> enable_ground_truth:=<%=($enableGroundTruth)?"1":"0"%> robot_names:=<%= robotNames.join(",") %></command>

<rosparam param="/robot_names" subst_value='true'>$(arg robot_names)</rosparam>
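
From a solution container, reading the parameter is then just (a minimal sketch; the node name is arbitrary):

# Read the comma-separated /robot_names parameter set by the bridge launch file.
import rospy

rospy.init_node('robot_names_check')
names = rospy.get_param('/robot_names', '').split(',')
rospy.loginfo('robots in this run: %s', names)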

@zbynekwinkler
Author

This is per rosmaster, and on cloudsim each robot has its own rosmaster, so it contains only the name of that one robot.

@peci1
Collaborator

peci1 commented Jan 6, 2021

Oh, you're right, there will only be a single robot on each bridge instance...

We're actually building separate images for each team configuration, but it doesn't have to be very painful... We do this for each robot:

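# Write a throwaway Dockerfile that reuses the team's base image, override only the
# ENTRYPOINT with this robot's name, config and the full robot list, then build it
# tagged with the robot's name: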
echo "FROM ${DOCKER_IMAGE_NAME}:latest" >"${script_dir}/tmp.Dockerfile"
echo "ENTRYPOINT [\"/ros_entrypoint.sh\", \"env\", \"roslaunch\", \"--wait\", \"subt_virtual\", \"${robot_config}.launch\", \"robot:=${robot_name}\", \"robots:=${robots}\", \"rec_subt_bag:=true\"]" >>"${script_dir}/tmp.Dockerfile"
docker build --rm -t ${DOCKER_IMAGE_NAME}:${robot_name} -f "${script_dir}/tmp.Dockerfile" --progress=plain "${script_dir}/.."

As this temporary Dockerfile is really lightweight (it only changes the entrypoint), the build is instant. But I agree it would be nicer to have the list of robots passed to the bridge as a ROS parameter.

@nkoenig
Contributor

nkoenig commented Jan 13, 2021

It sounds like this issue has transitioned to a feature request that would involve making each robot name available to each robot's ROS master. I'll mark this issue appropriately.

@nkoenig nkoenig added the proposal label and removed the bug and major labels Jan 13, 2021
@peci1
Collaborator

peci1 commented Jan 13, 2021

Thanks for considering this. For this to be a little more useful to us, it would be great if our solution container got the same command line as the sim container (all the robotNameX and robotConfigX options), plus obviously one more parameter telling it which robot to select from this list.

The command line of the image is a bit more helpful than ROS parameters because it would allow us to run a bash script that decides which launch file to launch based on the robot type (and possibly name). That would allow us to have a single docker image for all robots and any configuration of them.
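
A rough sketch of what such an entrypoint could look like (in Python rather than bash, with made-up argument names that simply follow the proposal above):

#!/usr/bin/env python
# Hypothetical entrypoint: the solution container is started with the same
# robotNameX:=... / robotConfigX:=... arguments as the sim container, plus one
# extra "thisRobot:=" argument selecting which robot this container controls.
import os
import re
import sys

def main():
    robots = {}        # index -> [name, config]
    this_robot = None
    for arg in sys.argv[1:]:
        m = re.match(r'robot(Name|Config)(\d+):=(.+)$', arg)
        if m:
            entry = robots.setdefault(m.group(2), [None, None])
            entry[0 if m.group(1) == 'Name' else 1] = m.group(3)
        elif arg.startswith('thisRobot:='):
            this_robot = arg.split(':=', 1)[1]

    name_to_config = dict(robots.values())
    config = name_to_config[this_robot]
    all_names = ','.join(sorted(name_to_config))

    # Pick the launch file from the robot's config (type), so a single image can
    # serve every robot in any team configuration.
    os.execvp('roslaunch', ['roslaunch', 'subt_virtual', config + '.launch',
                            'robot:=' + this_robot, 'robots:=' + all_names])

if __name__ == '__main__':
    main()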

Command-line parameters should be "backwards compatible" in the sense that if some team decides not to use them in their entrypoint, no change would be needed (it would just be a bash script called with more arguments than it actually reads).

@nkoenig nkoenig removed their assignment Oct 6, 2022