
Preserving chunk size and compression when manipulating bags #1342

Open
NikolausDemmel opened this issue Mar 5, 2018 · 4 comments
NikolausDemmel (Contributor) commented Mar 5, 2018

This is related to #117 in that, with large bags (1–100 GB), the default chunk size of 768 KB seems to make opening the bag slow, such that even just rosbag info can take a long time.

While we use the --chunksize option of rosbag record to great effect (although larger chunk sizes cause delays during playback, see the discussion in #117), chunk size and compression are not preserved when manipulating bag files with the Python API and the tools built on it.

One issue is that the corresponding properties in the data structure (the Python Bag class properties chunk_threshold and compression, which are combined in options) are not set when opening a bag file for reading (I think the situation is similar in C++).

# b is a Bag object for an uncompressed bag recorded with chunksize 50MB
# b2 is a Bag object created from b with `rosbag compress`, i.e. it is compressed and has the default chunksize of 768KB

In [57]: b.get_compression_info()
Out[57]: CompressionTuple(compression='none', uncompressed=12647256754, compressed=12647256754)

In [58]: b2.get_compression_info()
Out[58]: CompressionTuple(compression='bz2', uncompressed=276183077, compressed=151300832)

In [59]: b.options
Out[59]: {'chunk_threshold': 786432, 'compression': 'none'}

In [60]: b2.options
Out[60]: {'chunk_threshold': 786432, 'compression': 'none'}

So one issue seems to be that those values are not stored directly as metadata, but they could be determined in the following way:

  1. compression could simply use the result of get_compression_info(), which checks the compression method of all chunks and returns the most frequent one.
  2. If chunk_size was set during recording, it is the minimum size before a new chunk is started; the effective size of chunks can be larger, since messages are not split across chunks. So one could inspect all the chunks and select something like the median chunk size. I suggest the median as a simple robust estimate: there may be small chunks at the end (often just the final, partially filled one, but possibly more, e.g. if with the current implementation some additional info was added by opening the bag in append mode without specifying a chunk size), and there may also be larger chunks when there are very large messages.
In [53]: sorted(set([x.uncompressed_size for x in b._chunk_headers.values()]))[:3]
Out[53]: [50390666, 52483966, 52484248]

In [54]: sorted(set([x.uncompressed_size for x in b2._chunk_headers.values()]))[:3]
Out[54]: [2911, 1064889, 1066666]

In [55]: np.mean([x.uncompressed_size for x in b._chunk_headers.values()]) / 1024 / 1024
Out[55]: 50.047153275042646

In [56]: np.mean([x.uncompressed_size for x in b2._chunk_headers.values()]) / 1024 / 1024
Out[56]: 1.0169448152932421

In [61]: np.median([x.uncompressed_size for x in b2._chunk_headers.values()]) / 1024 / 1024
Out[61]: 1.0211143493652344

In [62]: np.median([x.uncompressed_size for x in b._chunk_headers.values()]) / 1024 / 1024
Out[62]: 50.055255889892578

One theoretical downside of guessing the chunk size like this is that if you process a bag many times, the guessed chunk size will potentially keep increasing. I don't think this is a big concern in practice, but one could also implement some heuristic countermeasures, like rounding down to the nearest megabyte or something.
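To make this concrete, here is a minimal sketch of such a heuristic (guess_options is a hypothetical helper name; it relies on the private _chunk_headers attribute as well as on get_compression_info(), so treat it as a proof of concept rather than a proposed API):

import numpy as np
import rosbag

MB = 1024 * 1024

def guess_options(bag):
    """Guess chunk_threshold and compression for a bag opened for reading."""
    chunk_sizes = [h.uncompressed_size for h in bag._chunk_headers.values()]
    # Median is robust against the small final chunk and against oversized
    # chunks that hold very large messages.
    median_size = int(np.median(chunk_sizes))
    # Round down to the nearest megabyte so that repeated processing does not
    # let the guessed chunk size creep upwards; clamp to the rosbag default.
    chunk_threshold = max(median_size // MB * MB, 768 * 1024)
    # get_compression_info() reports the most frequent compression method.
    compression = bag.get_compression_info().compression
    return {'chunk_threshold': chunk_threshold, 'compression': compression}

Applied to the bags above, this would yield 50 MB for b and 1 MB for b2 (rather than b2's original 768 KB, which illustrates the creep that the rounding only partially mitigates).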

Tools that process bag files (e.g. compress, decompress, filter, fix, reindex) should then attempt to preserve these two options and set them for the output file according to the input, unless they are explicitly overridden. Additionally, they would ideally all take command line options to set these two parameters explicitly. There seem to have been attempts to preserve these options for rosbag fix, but even if it worked at some point, I suspect it cannot work any more:

bag = rosbag.Bag(inbag, 'r')
rebag = rosbag.Bag(outbag, 'w', options=bag.options)
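If the options were instead guessed as sketched above, a compress-like tool could propagate them explicitly, e.g. (a sketch only: guess_options is the hypothetical helper from above, and the raw read/write round trip mirrors what the rosbag tools already do internally):

import rosbag

def rewrite_bag(inbag_path, outbag_path, compression=None, chunk_threshold=None):
    """Copy a bag, preserving chunk size and compression unless overridden."""
    with rosbag.Bag(inbag_path, 'r') as inbag:
        opts = guess_options(inbag)  # hypothetical helper sketched above
        with rosbag.Bag(outbag_path, 'w',
                        compression=compression or opts['compression'],
                        chunk_threshold=chunk_threshold or opts['chunk_threshold']) as outbag:
            # raw=True copies serialized bytes without deserializing messages
            for topic, msg, t in inbag.read_messages(raw=True):
                outbag.write(topic, msg, t, raw=True)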

Any thoughts on this?

Would a PR implementing some of this likely be merged? What about backporting to Kinetic?

racko (Contributor) commented Mar 6, 2018

Which version did you work with? Did it already include #1223?

(The changes you propose might be relevant anyway.)

NikolausDemmel (Contributor, Author)

> Which version did you work with? Did it already include #1223?

Hm... no, this was with Kinetic, which maybe does not include all recent patches, but for rosbag the changeset seems to be quite small.

I don't quite see how #1223 is relevant?

NikolausDemmel (Contributor, Author)

> I don't quite see how #1223 is relevant?

Ah, you mean this might improve the performance issue I was talking about here:

> (although there are issues with delays during playback with larger chunksize, see the discussion #117)

or something else?

racko (Contributor) commented Mar 7, 2018

Yes, that was my point. But I see now that it's not central to your issue.
