This is related to #117 in that, with large bags (1-100 GB), the default chunk size of 768 KB seems to result in slow opening of the bag, such that even just rosbag info can take a long time.
While we use the --chunksize option to rosbag record to great effect (although there are issues with delays during playback with larger chunk sizes, see the discussion in #117), chunk size and compression are not preserved when manipulating bag files with the Python API and the tools that use it.
One issue is that the corresponding properties in the data structure (the Python Bag class properties chunk_threshold and compression, which are combined in options) are not set when opening a bag file for reading (I think it is similar in C++).
# b is Bag object with uncompressed bag recorded with chunksize 50MB
# b2 is Bag object created from b with `rosbag compress`, i.e. it is compressed and has default chunksize of 768KB
In [57]: b.get_compression_info()
Out[57]: CompressionTuple(compression='none', uncompressed=12647256754, compressed=12647256754)
In [58]: b2.get_compression_info()
Out[58]: CompressionTuple(compression='bz2', uncompressed=276183077, compressed=151300832)
In [59]: b.options
Out[59]: {'chunk_threshold': 786432, 'compression': 'none'}
In [60]: b2.options
Out[60]: {'chunk_threshold': 786432, 'compression': 'none'}
So one issue seems to be that those values are not directly stored as metadata, but they could be determined in the following way:
compression could simply use the result of get_compression_info(), which checks the compression method of all chunks and returns the most frequent.
If chunk_size was set during recording, it is the minimum size before a new chunk is started; the effective size of chunks might be larger, since messages are not split. So one could check all the chunks and select something like the median chunk size. I suggest the median as a simple robust estimate, since there might be small chunks at the end (maybe just one, the final one that is not filled yet, but maybe also more, e.g. if with the current implementation some additional info was added by opening the bag in append mode without specifying a chunk size), and there might also be larger chunks if there are very large messages.
In [53]: sorted(set([x.uncompressed_size for x in b._chunk_headers.values()]))[:3]
Out[53]: [50390666, 52483966, 52484248]
In [54]: sorted(set([x.uncompressed_size for x in b2._chunk_headers.values()]))[:3]
Out[54]: [2911, 1064889, 1066666]
In [55]: np.mean([x.uncompressed_size for x in b._chunk_headers.values()]) / 1024 / 1024
Out[55]: 50.047153275042646
In [56]: np.mean([x.uncompressed_size for x in b2._chunk_headers.values()]) / 1024 / 1024
Out[56]: 1.0169448152932421
In [61]: np.median([x.uncompressed_size for x in b2._chunk_headers.values()]) / 1024 / 1024
Out[61]: 1.0211143493652344
In [62]: np.median([x.uncompressed_size for x in b._chunk_headers.values()]) / 1024 / 1024
Out[62]: 50.055255889892578
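The median heuristic above could be sketched roughly as follows (a hypothetical helper, not part of the rosbag API; it works on a plain list of uncompressed chunk sizes like the ones read from _chunk_headers):

```python
from statistics import median

def guess_chunk_threshold(chunk_sizes):
    """Guess the original chunk threshold from observed chunk sizes.

    The median is robust against the small final chunk and against
    oversized chunks caused by very large messages.
    """
    if not chunk_sizes:
        return 768 * 1024  # rosbag's default chunk threshold
    return int(median(chunk_sizes))

# Sizes resembling the compressed bag above: one small trailing
# chunk, the rest close to the effective chunk size.
sizes = [2911, 1064889, 1066666, 1070000, 1072000]
print(guess_chunk_threshold(sizes))  # 1066666, the median of the five sizes
```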
One theoretical downside of guessing the chunk size like this is that if you process a bag many times, the chunk size will potentially keep increasing. I don't think it is a big concern in practice, but one could also implement some heuristic counter-measures, like rounding down to the nearest megabyte.
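The rounding counter-measure could look like this (again just a sketch; the function name and the sub-megabyte fallback are my assumptions):

```python
MIB = 1024 * 1024

def round_down_chunk_guess(guessed_size):
    # Round down to the nearest mebibyte so that repeated
    # guess-and-rewrite cycles do not creep the chunk size upward;
    # fall back to the raw guess for sub-megabyte values.
    if guessed_size >= MIB:
        return (guessed_size // MIB) * MIB
    return guessed_size

print(round_down_chunk_guess(52_483_966))  # ~50 MB chunk -> 52428800 (50 MiB)
print(round_down_chunk_guess(1_066_666))   # ~1 MB chunk -> 1048576 (1 MiB)
```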
Tools that process bag files (e.g. compress, decompress, filter, fix, reindex) should then attempt to preserve these two options and set them for the output file according to the input, unless they are explicitly overridden. Additionally, they would ideally all take command line options to explicitly set these two parameters. It seems there have been attempts to preserve these options for rosbag fix, but even if that worked at some point, I suspect it cannot work any more:
ros_comm/tools/rosbag/src/rosbag/migration.py
Lines 158 to 159 in 94aaaec
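The precedence logic the tools would need (explicit command line options win over options guessed from the input bag, which win over the defaults) could be sketched as plain dictionary merging; the helper name and structure are hypothetical:

```python
# rosbag's current defaults for a newly created Bag.
DEFAULTS = {'chunk_threshold': 768 * 1024, 'compression': 'none'}

def output_options(guessed, overrides):
    """Merge defaults, options guessed from the input bag, and
    explicit command line overrides, in increasing precedence."""
    opts = dict(DEFAULTS)
    opts.update({k: v for k, v in guessed.items() if v is not None})
    opts.update({k: v for k, v in overrides.items() if v is not None})
    return opts

# E.g. rosbag compress would pass compression explicitly but leave
# the chunk threshold to be inherited from the input bag:
print(output_options({'chunk_threshold': 50 * 1024 * 1024, 'compression': 'none'},
                     {'compression': 'bz2', 'chunk_threshold': None}))
# {'chunk_threshold': 52428800, 'compression': 'bz2'}
```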
Any thoughts on this?
Would a PR implementing some of this likely be merged? What about backporting to kinetic?