
Preserving chunk size and compression when manipulating bags #1342

Open
NikolausDemmel opened this issue Mar 5, 2018 · 4 comments
NikolausDemmel (Contributor) commented Mar 5, 2018

This is related to #117 in that, with large bags (1–100 GB), the default chunk size of 768 KB seems to make opening the bag slow, such that even just rosbag info can take a long time.

While we use the --chunksize option of rosbag record to great effect (although larger chunk sizes cause delays during playback, see the discussion in #117), chunk size and compression are not preserved when manipulating bag files with the Python API and the tools built on it.

One issue is that the corresponding properties in the data structure (the Python Bag class properties chunk_threshold and compression, which are combined in options) are not set when opening a bag file for reading (I think the situation is similar in C++).

# b is a Bag object for an uncompressed bag recorded with chunksize 50MB
# b2 is a Bag object created from b with `rosbag compress`, i.e. it is compressed and has the default chunksize of 768KB

In [57]: b.get_compression_info()
Out[57]: CompressionTuple(compression='none', uncompressed=12647256754, compressed=12647256754)

In [58]: b2.get_compression_info()
Out[58]: CompressionTuple(compression='bz2', uncompressed=276183077, compressed=151300832)

In [59]: b.options
Out[59]: {'chunk_threshold': 786432, 'compression': 'none'}

In [60]: b2.options
Out[60]: {'chunk_threshold': 786432, 'compression': 'none'}

So one issue seems to be that those values are not stored directly as metadata, but they could be determined in the following way:

  1. compression could simply use the result of get_compression_info(), which checks the compression method of all chunks and returns the most frequent one.
  2. If chunk_size was set during recording, it is the minimum size before a new chunk is started; the effective size of chunks can be larger, since messages are not split across chunks. So one could inspect all the chunks and select something like the median chunk size. I suggest the median as a simple robust estimate: there may be small chunks at the end (often just the final, partially filled one, but possibly more, e.g. if with the current implementation some additional info was added by opening the bag in append mode without specifying a chunk size), and there may also be larger chunks when there are very large messages.
In [53]: sorted(set([x.uncompressed_size for x in b._chunk_headers.values()]))[:3]
Out[53]: [50390666, 52483966, 52484248]

In [54]: sorted(set([x.uncompressed_size for x in b2._chunk_headers.values()]))[:3]
Out[54]: [2911, 1064889, 1066666]

In [55]: np.mean([x.uncompressed_size for x in b._chunk_headers.values()]) / 1024 / 1024
Out[55]: 50.047153275042646

In [56]: np.mean([x.uncompressed_size for x in b2._chunk_headers.values()]) / 1024 / 1024
Out[56]: 1.0169448152932421

In [61]: np.median([x.uncompressed_size for x in b2._chunk_headers.values()]) / 1024 / 1024
Out[61]: 1.0211143493652344

In [62]: np.median([x.uncompressed_size for x in b._chunk_headers.values()]) / 1024 / 1024
Out[62]: 50.055255889892578

One theoretical downside of guessing the chunk size like this is that if you process a bag many times, the guessed chunk size will potentially keep increasing. I don't think this is a big concern in practice, but one could also implement some heuristic countermeasures, like rounding down to the nearest megabyte or something.
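To make this concrete, here is a minimal sketch of such a heuristic (guess_options is a hypothetical helper name; it relies on the private _chunk_headers attribute as well as on get_compression_info(), so treat it as a proof of concept rather than a proposed API):

import numpy as np
import rosbag

MB = 1024 * 1024

def guess_options(bag):
    """Guess chunk_threshold and compression for a bag opened for reading."""
    chunk_sizes = [h.uncompressed_size for h in bag._chunk_headers.values()]
    # Median is robust against the small final chunk and against oversized
    # chunks that hold very large messages.
    median_size = int(np.median(chunk_sizes))
    # Round down to the nearest megabyte so that repeated processing does not
    # let the guessed chunk size creep upwards; clamp to the rosbag default.
    chunk_threshold = max(median_size // MB * MB, 768 * 1024)
    # get_compression_info() reports the most frequent compression method.
    compression = bag.get_compression_info().compression
    return {'chunk_threshold': chunk_threshold, 'compression': compression}

Applied to the bags above, this would yield 50 MB for b and 1 MB for b2 (rather than b2's original 768 KB, which illustrates the creep that the rounding only partially mitigates).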

Tools that process bag files (e.g. compress, decompress, filter, fix, reindex) should then attempt to preserve these two options and set them for the output file according to the input, unless they are explicitly overridden. Additionally, they would ideally all take command line options to set these two parameters explicitly. There seem to have been attempts to preserve these options for rosbag fix, but even if it worked at some point, I suspect it cannot work any more:

bag = rosbag.Bag(inbag, 'r')
rebag = rosbag.Bag(outbag, 'w', options=bag.options)
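If the options were instead guessed as sketched above, a compress-like tool could propagate them explicitly, e.g. (a sketch only: guess_options is the hypothetical helper from above, and the raw read/write round trip mirrors what the rosbag tools already do internally):

import rosbag

def rewrite_bag(inbag_path, outbag_path, compression=None, chunk_threshold=None):
    """Copy a bag, preserving chunk size and compression unless overridden."""
    with rosbag.Bag(inbag_path, 'r') as inbag:
        opts = guess_options(inbag)  # hypothetical helper sketched above
        with rosbag.Bag(outbag_path, 'w',
                        compression=compression or opts['compression'],
                        chunk_threshold=chunk_threshold or opts['chunk_threshold']) as outbag:
            # raw=True copies serialized bytes without deserializing messages
            for topic, msg, t in inbag.read_messages(raw=True):
                outbag.write(topic, msg, t, raw=True)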

Any thoughts on this?

Would a PR implementing some of this likely be merged? What about backporting to Kinetic?

racko (Contributor) commented Mar 6, 2018

Which version did you work with? Did it already include #1223?

(The changes you propose might be relevant anyway.)

NikolausDemmel (Contributor, Author)

> Which version did you work with? Did it already include #1223?

Hm... no, this was with Kinetic, which maybe does not include all recent patches, but for rosbag the changeset seems to be quite small.

I don't quite see how #1223 is relevant?

NikolausDemmel (Contributor, Author)

> I don't quite see how #1223 is relevant?

Ah, you mean this might improve the performance issue I was talking about here:

> (although there are issues with delays during playback with larger chunksize, see the discussion #117)

or something else?

racko (Contributor) commented Mar 7, 2018

Yes, that was my point. But I see now that it's not central to your issue.
