Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Protobuf Any compression #6193

Closed
aaronc opened this issue May 11, 2020 · 12 comments
Closed

Protobuf Any compression #6193

aaronc opened this issue May 11, 2020 · 12 comments
Labels
S:blocked Status: Blocked

Comments

@aaronc
Copy link
Member

aaronc commented May 11, 2020

Summary

As a consequence of #6030 and #6081, we migrated to protobuf Any which results in slightly more disk usage because of type URLs embedded in transactions and state. We could potentially mitigate this through compression at the persistence layer without affecting tx or state hashes.

Proposed Approaches

tm-db Compression

A custom compression layer could be introduced that implements the tm-db DB interface and wraps any existing DB. This would allow for the same compression layer to be used for both the state and block stores without modifying any of the existing code in those layers. The compression layer would just need to be inserted when the DB is configured and from then on would function more or less as an integrated compression layer.

Algorithm

The compression layer would:

  • assign a persistent integer ID to every type URL in the global protobuf type registry at startup time. This integer should get persisted to disk by the compression layer in a separate DB.
  • cache the type URL <-> integer ID map to memory
  • compress incoming bytes by scanning for strings that match a type URL (all of them start with the character /) and replace them with a replacement byte sequence (prefixed with /)
  • escape all / prefix characters with an escape sequence
  • decompress outgoing bytes by scanning for the prefix
@aaronc aaronc changed the title Protobuf Any compression in tm-db Protobuf Any compression May 11, 2020
@clevinson clevinson added this to the v0.39 milestone Jun 11, 2020
@github-actions
Copy link
Contributor

github-actions bot commented Jul 4, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale label Jul 4, 2020
@odeke-em
Copy link
Collaborator

odeke-em commented Jul 4, 2020

Definitely not stale. Prioritizing it just took sometime and I shall work on it.

@odeke-em
Copy link
Collaborator

I've mailed out PR #6854, with the exception here that am
a) firstly considering the use of protobuf marshaled types.Any since it looks types.Any.Pack(proto.Message) is the main usage of types.Any and so the proposed algorithm doesn't directly translate. However, adding also JSON support might be trivial and will use the same scans and can be transparently implemented too

b) Using a 6 byte suffix of [FNV32Hash(typeURL)]+"\xfe\xff" to uniquely identify the compressed values. To add JSON support, we can repurpose the second last end magic value to take in the type i.e. either proto or JSON

I'll keep developing that PR as the next 2 days proceed.

@aaronc
Copy link
Member Author

aaronc commented Aug 4, 2020

@odeke-em in preparation for our call tomorrow I want to call out that one of the key use cases I believe is actually being able to compress the Tendermint block history. That may be more of an issue in terms of disk storage than the application state.

We actually haven't even really had a chance to discuss this with anyone from the tendermint core team. @marbar3778 I wonder what your thoughts are on a compression layer that specifically compresses the type_urls on protobuf Anys and where that would fit? Is a tm-db Db wrapper the right part of the stack or is tm using something different now?

@tac0turtle
Copy link
Member

I wonder what your thoughts are on a compression layer that specifically compresses the type_urls on protobuf Anys and where that would fit?

Some db's are already compressing on writes (leveldb), but not for others (boltdb). I dont really have a preference nor have put any thought to this so I would not be the best to answer.

In Tendermint, we haven't tested any other compression tactics so I would be unable to say if its needed or not. If you are thinking of introducing something to tm-db then moving this conversation there would be best.

I will cc @erikgrinaker to get a second, more experienced take on this.

@erikgrinaker
Copy link
Contributor

tm-db treats keys and values as opaque byte slices, and I don't think we should start trying to interpret the values to apply dictionary compression on Protobuf Any fields in tm-db itself. If anything, this should happen at the encoding level, in practice causing us to use a custom variant of Protobuf, but I would advise against this since it makes it harder to debug database entries and use other tooling (see also: Amino).

If necessary, I would just use a standard compression algorithm instead. These generally do not work well on small pieces of data, i.e. individual keys and values, so it should be done at the storage level. Compressed data is not amenable to random access, so e.g. B-tree based database backends (such as BoltDB) have to compress individual blocks, with the same poor results as individual keys/values. However, LSM-tree based backends (such as LevelDB, RocksDB, and BadgerDB) write long runs of immutable data, and are much more susceptible to compression.

tl;dr Use built-in LevelDB/RocksDB compression and call it a day.

@aaronc
Copy link
Member Author

aaronc commented Aug 5, 2020

I did benchmark the built in leveldb and rocksdb compression with some test data using Any's. The built in compression does help but there is still overhead that could potentially be significant. I can't say that we've done benchmarks with a real world data set to see how the default compression performs with say a million blocks of transactions. Maybe that should really be done first.

We are opposed to using a custom variant of protobuf (like amino) and this is why we are thinking of handling this at the storage layer for each value written to the database. Since we have a pre-populated dictionary, the thought is that this should be much more efficient than the default compression. Regarding tm-db integration, this would be a layer that wraps an existing tm-db in a new tm-db instance.

I'm not understanding your concrete reasons for why this would be a bad idea. Could you maybe say a bit more on this @erikgrinaker ?

@erikgrinaker
Copy link
Contributor

I feel like doing this in tm-db breaks separation of concerns, and violates abstraction boundaries. tm-db shouldn't care about what keys and values contain, that's an application concern. If the benefit is large then it's sometimes worth it, but breaking abstraction boundaries comes at a complexity and maintenance cost. I suspect the compression benefit here is minor, and not worth it.

Also, what's being stored in the database is no longer Protobuf messages, they're now a custom format, and so it's harder to access the database entries for e.g. debugging or other tooling. I'd say our past experience with custom formats hasn't been great.

@odeke-em
Copy link
Collaborator

odeke-em commented Aug 14, 2020

I mailed out PR #6854 and in there I do comparisons that are reproducible by anyone or see https://gist.github.com/odeke-em/f92f76fe2acfcd566f7e5fa19bf3741e to do a tree listing then comparison of results.

Results

Given 5 millions blocks with exactly ONE kind of typeURL per block:

  • RocksDB we have savings of 16MiB
  • BoltDB, we have savings of 16MiB
  • GoLevelDB, we have savings of 3.078MiB

If we make the blocks heterogenous, we definitely will save even more.

I went strictly with blocks to validate any concerns about real world apples to apples comparison given that we'll be storing Tendermint's state. Now as we add a mixed variety of entries, we start to see results regardless of database.
The argument about inspecting the stored state can be responded to by the fact that if you want to audit, retrieval from the database will return you back the plain bytes transparently.

@aaronc
Copy link
Member Author

aaronc commented Aug 14, 2020

Thanks for doing those benchmarks @odeke-em !

@aaronc aaronc added ice-box S:blocked Status: Blocked labels Oct 14, 2020
@aaronc aaronc modified the milestones: v0.40.1, v0.42 Jan 6, 2021
@clevinson
Copy link
Contributor

We're putting this in the Status: Ready call and the next step would be from someone from our team to review this.

@tac0turtle
Copy link
Member

reopen if still relevant.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S:blocked Status: Blocked
Projects
None yet
6 participants