Protobuf Any compression #6193

aaronc · 2020-05-11T20:36:03Z

Summary

As a consequence of #6030 and #6081, we migrated to protobuf Any which results in slightly more disk usage because of type URLs embedded in transactions and state. We could potentially mitigate this through compression at the persistence layer without affecting tx or state hashes.

Proposed Approaches

tm-db Compression

A custom compression layer could be introduced that implements the tm-db DB interface and wraps any existing DB. This would allow for the same compression layer to be used for both the state and block stores without modifying any of the existing code in those layers. The compression layer would just need to be inserted when the DB is configured and from then on would function more or less as an integrated compression layer.

Algorithm

The compression layer would:

assign a persistent integer ID to every type URL in the global protobuf type registry at startup time. This integer should get persisted to disk by the compression layer in a separate DB.
cache the type URL <-> integer ID map to memory
compress incoming bytes by scanning for strings that match a type URL (all of them start with the character /) and replace them with a replacement byte sequence (prefixed with /)
escape all / prefix characters with an escape sequence
decompress outgoing bytes by scanning for the prefix

The text was updated successfully, but these errors were encountered:

github-actions · 2020-07-04T00:08:54Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

odeke-em · 2020-07-04T09:48:09Z

Definitely not stale. Prioritizing it just took sometime and I shall work on it.

odeke-em · 2020-07-27T10:59:05Z

I've mailed out PR #6854, with the exception here that am
a) firstly considering the use of protobuf marshaled types.Any since it looks types.Any.Pack(proto.Message) is the main usage of types.Any and so the proposed algorithm doesn't directly translate. However, adding also JSON support might be trivial and will use the same scans and can be transparently implemented too

b) Using a 6 byte suffix of [FNV32Hash(typeURL)]+"\xfe\xff" to uniquely identify the compressed values. To add JSON support, we can repurpose the second last end magic value to take in the type i.e. either proto or JSON

I'll keep developing that PR as the next 2 days proceed.

aaronc · 2020-08-04T17:25:22Z

@odeke-em in preparation for our call tomorrow I want to call out that one of the key use cases I believe is actually being able to compress the Tendermint block history. That may be more of an issue in terms of disk storage than the application state.

We actually haven't even really had a chance to discuss this with anyone from the tendermint core team. @marbar3778 I wonder what your thoughts are on a compression layer that specifically compresses the type_urls on protobuf Anys and where that would fit? Is a tm-db Db wrapper the right part of the stack or is tm using something different now?

tac0turtle · 2020-08-05T07:54:21Z

I wonder what your thoughts are on a compression layer that specifically compresses the type_urls on protobuf Anys and where that would fit?

Some db's are already compressing on writes (leveldb), but not for others (boltdb). I dont really have a preference nor have put any thought to this so I would not be the best to answer.

In Tendermint, we haven't tested any other compression tactics so I would be unable to say if its needed or not. If you are thinking of introducing something to tm-db then moving this conversation there would be best.

I will cc @erikgrinaker to get a second, more experienced take on this.

erikgrinaker · 2020-08-05T09:27:44Z

tm-db treats keys and values as opaque byte slices, and I don't think we should start trying to interpret the values to apply dictionary compression on Protobuf Any fields in tm-db itself. If anything, this should happen at the encoding level, in practice causing us to use a custom variant of Protobuf, but I would advise against this since it makes it harder to debug database entries and use other tooling (see also: Amino).

If necessary, I would just use a standard compression algorithm instead. These generally do not work well on small pieces of data, i.e. individual keys and values, so it should be done at the storage level. Compressed data is not amenable to random access, so e.g. B-tree based database backends (such as BoltDB) have to compress individual blocks, with the same poor results as individual keys/values. However, LSM-tree based backends (such as LevelDB, RocksDB, and BadgerDB) write long runs of immutable data, and are much more susceptible to compression.

tl;dr Use built-in LevelDB/RocksDB compression and call it a day.

aaronc · 2020-08-05T13:33:09Z

I did benchmark the built in leveldb and rocksdb compression with some test data using Any's. The built in compression does help but there is still overhead that could potentially be significant. I can't say that we've done benchmarks with a real world data set to see how the default compression performs with say a million blocks of transactions. Maybe that should really be done first.

We are opposed to using a custom variant of protobuf (like amino) and this is why we are thinking of handling this at the storage layer for each value written to the database. Since we have a pre-populated dictionary, the thought is that this should be much more efficient than the default compression. Regarding tm-db integration, this would be a layer that wraps an existing tm-db in a new tm-db instance.

I'm not understanding your concrete reasons for why this would be a bad idea. Could you maybe say a bit more on this @erikgrinaker ?

erikgrinaker · 2020-08-05T14:05:29Z

I feel like doing this in tm-db breaks separation of concerns, and violates abstraction boundaries. tm-db shouldn't care about what keys and values contain, that's an application concern. If the benefit is large then it's sometimes worth it, but breaking abstraction boundaries comes at a complexity and maintenance cost. I suspect the compression benefit here is minor, and not worth it.

Also, what's being stored in the database is no longer Protobuf messages, they're now a custom format, and so it's harder to access the database entries for e.g. debugging or other tooling. I'd say our past experience with custom formats hasn't been great.

odeke-em · 2020-08-14T17:02:32Z

I mailed out PR #6854 and in there I do comparisons that are reproducible by anyone or see https://gist.github.com/odeke-em/f92f76fe2acfcd566f7e5fa19bf3741e to do a tree listing then comparison of results.

Results

Given 5 millions blocks with exactly ONE kind of typeURL per block:

RocksDB we have savings of 16MiB
BoltDB, we have savings of 16MiB
GoLevelDB, we have savings of 3.078MiB

If we make the blocks heterogenous, we definitely will save even more.

I went strictly with blocks to validate any concerns about real world apples to apples comparison given that we'll be storing Tendermint's state. Now as we add a mixed variety of entries, we start to see results regardless of database.
The argument about inspecting the stored state can be responded to by the fact that if you want to audit, retrieval from the database will return you back the plain bytes transparently.

aaronc · 2020-08-14T17:14:15Z

Thanks for doing those benchmarks @odeke-em !

clevinson · 2021-04-30T14:56:12Z

We're putting this in the Status: Ready call and the next step would be from someone from our team to review this.

tac0turtle · 2022-09-02T13:48:55Z

reopen if still relevant.

aaronc changed the title ~~Protobuf Any compression in tm-db~~ Protobuf Any compression May 11, 2020

clevinson added this to the v0.39 milestone Jun 11, 2020

github-actions bot added the stale label Jul 4, 2020

alexanderbez added pinned and removed stale labels Jul 4, 2020

alexanderbez modified the milestones: v0.40 [Stargate], v0.40.1 Jul 22, 2020

odeke-em mentioned this issue Jul 27, 2020

anycompress: a layer for transparently compressing and caching typeURLs #6854

Closed

9 tasks

aaronc mentioned this issue Aug 10, 2020

Use protobuf public key in BaseAccount #6886

Closed

aaronc added ice-box S:blocked Status: Blocked labels Oct 14, 2020

aaronc modified the milestones: v0.40.1, v0.42 Jan 6, 2021

odeke-em mentioned this issue Jan 26, 2021

anycompress: a layer for transparently compressing and caching typeURLs #8432

Closed

9 tasks

aaronc added Status: Backlog and removed Status: Ice Box labels Apr 30, 2021

clevinson added Status: Ready and removed Status: Backlog labels Apr 30, 2021

aaronc added Status: Backlog and removed Status: Ready labels May 12, 2021

clevinson added Status: Ready and removed Status: Backlog labels May 14, 2021

clevinson removed the Status: Ready label Jun 30, 2021

tac0turtle removed the pinned label May 9, 2022

tac0turtle closed this as completed Sep 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Protobuf Any compression #6193

Protobuf Any compression #6193

aaronc commented May 11, 2020 •

edited

Loading

github-actions bot commented Jul 4, 2020

odeke-em commented Jul 4, 2020

odeke-em commented Jul 27, 2020

aaronc commented Aug 4, 2020

tac0turtle commented Aug 5, 2020

erikgrinaker commented Aug 5, 2020

aaronc commented Aug 5, 2020 •

edited

Loading

erikgrinaker commented Aug 5, 2020

odeke-em commented Aug 14, 2020 •

edited

Loading

aaronc commented Aug 14, 2020

clevinson commented Apr 30, 2021

tac0turtle commented Sep 2, 2022

Protobuf Any compression #6193

Protobuf Any compression #6193

Comments

aaronc commented May 11, 2020 • edited Loading

Summary

Proposed Approaches

tm-db Compression

Algorithm

github-actions bot commented Jul 4, 2020

odeke-em commented Jul 4, 2020

odeke-em commented Jul 27, 2020

aaronc commented Aug 4, 2020

tac0turtle commented Aug 5, 2020

erikgrinaker commented Aug 5, 2020

aaronc commented Aug 5, 2020 • edited Loading

erikgrinaker commented Aug 5, 2020

odeke-em commented Aug 14, 2020 • edited Loading

Results

aaronc commented Aug 14, 2020

clevinson commented Apr 30, 2021

tac0turtle commented Sep 2, 2022

aaronc commented May 11, 2020 •

edited

Loading

aaronc commented Aug 5, 2020 •

edited

Loading

odeke-em commented Aug 14, 2020 •

edited

Loading