
Cloudsim: state.tlog files are much larger now #766

Closed
peci1 opened this issue Jan 25, 2021 · 11 comments

peci1 (Collaborator) commented Jan 25, 2021

This is the size of our cloudsim state.tlog files from the last month or so:

4.8G    ./12-11/weeklyTest201211run01/ign-logs/state.tlog
4.7G    ./12-11/weeklyTest201211run02/ign-logs/state.tlog
985M    ./12-11/weeklyTest201211run03/ign-logs/state.tlog
1.6G    ./12-11/weeklyTest201211run04/ign-logs/state.tlog
2.4G    ./12-11/weeklyTest201211run05/ign-logs/state.tlog
2.9G    ./21-01-09/weeklyTest210109run01/ign-logs/state.tlog
4.8G    ./21-01-09/weeklyTest210109run02/ign-logs/state.tlog
27G     ./21-01-22/weeklyTest210122run01/ign-logs/state.tlog
50G     ./21-01-22/weeklyTest210122run02/ign-logs/state.tlog
37G     ./21-01-22/weeklyTest210122run03/ign-logs/state.tlog

Notice the 5- to 10-fold increase in the size of state.tlog from this week's simulations. All the simulations were run with the same team of robots, and all of them in cave circuit competition worlds.

What's the reason for this increase? It's inconvenient to play back such large files...

peci1 (Collaborator) commented Jan 25, 2021

It seems the simulator did not quit when the simulation finished. This is a plot of time_recv from the large state.tlog (it's not complete, as LibreOffice only allowed me to import 1M rows, but the tlog file has 2.35M):

[plot of time_recv from the large state.tlog]

This is from an older cloudsim run:

[plot of time_recv from an older cloudsim run]
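
(For reference, a minimal sketch for extracting the plotted data, assuming the tlog is the SQLite database that ign-transport writes; the messages table and time_recv column match the queries in the next comment:)

$ sqlite3 -csv state.tlog "select id, time_recv from messages" > time_recv.csv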

peci1 (Collaborator) commented Jan 25, 2021

Some more info (assume large.tlog is the large (30 GB) tlog file and small.tlog is an older (5 GB) one).

$ sqlite3 large.tlog "select count(*),max(topic_id) from messages group by topic_id"
2361757|1
1|2
90|3
$ sqlite3 large.tlog "select * from topics"
1|/world/cave_circuit_07/dynamic_pose/info|1
2|/logs/sdf|2
3|/world/cave_circuit_07/changed_state|3
$ sqlite3 large.tlog "select id,time_recv from messages where ROWID % 100000 = 0"
100000|893928000000
200000|1705692000000
300000|2458224000000
400000|3229700000000
500000|3616004000000
600000|3616004000000
700000|3616004000000
800000|3616004000000
900000|3616004000000
1000000|3616004000000
1100000|3616004000000
1200000|3616004000000
1300000|3616004000000
1400000|3616004000000
1500000|3616004000000
1600000|3616004000000
1700000|3616004000000
1800000|3616004000000
1900000|3616004000000
2000000|3616004000000
2100000|3616004000000
2200000|3616004000000
2300000|3616004000000
$ sqlite3 small.tlog "select count(*),max(topic_id) from messages group by topic_id"
1|1
480535|2
65|3
$ sqlite3 small.tlog "select * from topics"
1|/logs/sdf|1
2|/world/cave_circuit_05/dynamic_pose/info|2
3|/world/cave_circuit_05/changed_state|3
$ sqlite3 small.tlog "select id,time_recv from messages where ROWID % 100000 = 0"
100000|894756000000
200000|1664876000000
300000|2383568000000
400000|3081344000000
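
A quick way to quantify the stall from these dumps - a sketch against the same messages/time_recv schema queried above - is to count how many trailing messages share the final time_recv:

$ sqlite3 large.tlog "select count(*) from messages where time_recv = (select max(time_recv) from messages)"

Judging from the sampled rows above (rows 500000 through 2300000 all report 3616004000000), that count should be on the order of two million here, i.e. the recorder kept appending messages long after sim time stopped advancing.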

zbynekwinkler commented

On a possibly related note, the simulation does not respond to SIGINT or SIGTERM, which is the default way to stop containers. It needs to be killed instead, which is surprising and might be related to the problem at hand.

In order to respond to signals something like this needs to be there:

https://github.com/robotika/osgar/blob/9f76ea1c6621239110edca0c9c43d8af01fd5744/subt/docker/robotika/run_solution.bash#L104-L106

The solution is based on https://linuxconfig.org/how-to-propagate-a-signal-to-child-processes-from-a-bash-script. The main point is that if the process with PID 1 is a bash script, it needs to be inside a wait call in order to even receive the signals, and it further needs a trap installed to forward the signals to its child processes.
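
A minimal sketch of that pattern (the child command name is illustrative, not the actual Cloudsim entrypoint):

#!/bin/bash
# Start the long-running child in the background so PID 1 can sit in `wait`.
roslaunch my_solution solution.launch &   # illustrative child process
child=$!
# Forward SIGINT/SIGTERM to the child; without the trap, PID 1 ignores them.
trap 'kill -TERM "$child"' SIGINT SIGTERM
# Signals are only delivered while the script is blocked in `wait`.
wait "$child"
# If a signal interrupted the first wait, reap the child with a second wait.
trap - SIGINT SIGTERM
wait "$child"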

The same problem has been worked around by creating a custom rosbag recorder in the bridge container that responds to ros::ok(), but the root problem is higher up the stack, with signal forwarding inside the container.

nkoenig self-assigned this Feb 1, 2021
nkoenig (Contributor) commented Feb 2, 2021

Cloudsim now waits for the recording_complete message from the GameLogicPlugin instead of automatically tearing down the machines. This is done in order to guarantee that ROS bag files finish properly. We noticed a few instances where Cloudsim did not respond properly to the new signal, which left machines running longer than normal. We are looking into the problem.

peci1 (Collaborator) commented Feb 3, 2021

Could this be the reason?

[Msg] Time limit[2000s] reached.
[Msg] Scoring has finished. Elapsed real time: 15794 seconds. Elapsed sim time: 2000 seconds.
terminate called after throwing an instance of 'rosbag::BagIOException'
  what():  Error seeking
[subt_ros_relay-3] process has died [pid 49650, exit code -6, cmd /home/developer/subt_ws/install/lib/subt_ros/subt_ros_relay __name:=subt_ros_relay __log:=/tmp/ign/logs/ros/7c2f8a28-6595-11eb-9158-0cc47abc8ae4/subt_ros_relay-3.log].

I just saw it when a local simulation ended.

peci1 (Collaborator) commented Mar 1, 2021

We've still been having problems with indefinitely running simulations (killed only after the 48-hour timeout). Examples: 6afe994e-71a0-4f69-a51a-8c4b9402695d and aa43cea4-36ad-4aab-a70e-92a02c1ea515 (whose state.tlog is 92 GB!).

peci1 (Collaborator) commented Mar 1, 2021

I think I have a solution for this problem, see ros/ros_comm#2139. It'd be great if someone could help review the PR so it gets merged ASAP. The next Melodic sync will be in about a month (one happened last week), so it'd be great to get this merged before then.

peci1 (Collaborator) commented Apr 6, 2021

Slowly getting there: ros/ros_comm#2151 (comment) got merged into noetic-devel and is now waiting for a backport to melodic-devel. And then a release of ros_comm... Hmm, I'm really curious whether this will make it before the virtual challenge concludes...

peci1 (Collaborator) commented Apr 6, 2021

Wow, it seems the PR came just in time - there is a backport to Melodic containing the fix: ros/ros_comm#2152.

nkoenig assigned iche033 and unassigned nkoenig Apr 26, 2021
iche033 (Contributor) commented May 19, 2021

The fix should be out in the latest Melodic sync:
https://discourse.ros.org/t/new-packages-for-melodic-2021-05-17/20418

nkoenig closed this as completed Jun 24, 2021