
Support compression with external dictionary #272

Open
sluongng opened this issue Sep 22, 2023 · 6 comments · May be fixed by #276

@sluongng
Contributor

Both Brotli and ZSTD support speeding up compression with external dictionaries trained from sample data.

It would be nice if our API were made aware that (a) a dictionary is involved in the compression, and (b) how that dictionary is exchanged between client and server.

@mostynb
Contributor

mostynb commented Sep 22, 2023

One of the main scenarios for cache server compression is the ability to store blobs in compressed form and serve them directly, without recompressing on each request. If the server supported a single dictionary, I think that could still work, as long as the client has the dictionary or knows how to get it. But to do this I think we would need a way to expose it to clients without creating a backwards incompatibility (e.g. add a new Compressor value like ZSTD_DICTIONARY and a way to expose the dictionary). Would that cover your use case?

I have a feeling that supporting multiple dictionaries would add a significant amount of complexity (but I could be wrong).

@bergsieker
Collaborator

I'm not familiar with the details, but from the description it sounds like the dictionaries are used to "speed up" compression by providing hints about the likely format of the data. I'd assume that arbitrary data could still be compressed, but that doing so would be slower than compressing dictionary-like data.

My take on this is that the dictionary can and should be provided via the control plane, and thus doesn't belong in the RE API itself. That is, however you specify your worker shapes and pool sizes, you can also specify a compression dictionary, which will then be used implicitly when data is uploaded or served. The system admin is responsible for making sure that the dictionary is a reasonable match for the expected data.

@sluongng
Contributor Author

My bad for being vague in the original post. This has been in the back of my head for the last two months or so, so I did not include many details. Let me provide a bit more context.

ZSTD / Brotli compression does not work very well on small data. The smaller the input, the harder it is to compress well: most compression algorithms rely on some "base" data and encode subsequent bytes as a "delta" / diff against that base to reduce size. For small files there isn't enough data to build a useful base, or you spend far too much compute for very little gain.

To solve this, you can train a dictionary from sample data sets representative of your data and use that dictionary as the base. This improves compression and decompression speed while also improving the compression ratio. Brotli ships with a hardcoded default dictionary, and ZSTD supports multiple external dictionaries. See http://facebook.github.io/zstd/#small-data and the Brotli documentation for more info.
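
To make this concrete, here is a minimal Go sketch using the klauspost/compress/zstd package, assuming a dictionary has already been trained offline (for example with `zstd --train samples/* -o app.dict`; the file name is just an illustration):

```go
package main

import (
	"bytes"
	"fmt"
	"os"

	"github.com/klauspost/compress/zstd"
)

func main() {
	// A dictionary trained offline from representative sample data,
	// e.g. `zstd --train samples/* -o app.dict` (hypothetical path).
	dict, err := os.ReadFile("app.dict")
	if err != nil {
		panic(err)
	}

	payload := []byte("small source file contents")

	// Compress a small payload using the dictionary as the shared "base".
	enc, err := zstd.NewWriter(nil, zstd.WithEncoderDict(dict))
	if err != nil {
		panic(err)
	}
	compressed := enc.EncodeAll(payload, nil)
	enc.Close()

	// Decompression only works if the receiver has the same dictionary.
	dec, err := zstd.NewReader(nil, zstd.WithDecoderDicts(dict))
	if err != nil {
		panic(err)
	}
	defer dec.Close()
	plain, err := dec.DecodeAll(compressed, nil)
	if err != nil {
		panic(err)
	}
	fmt.Printf("%d bytes -> %d bytes, roundtrip ok: %v\n",
		len(plain), len(compressed), bytes.Equal(plain, payload))
}
```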

Our API is mostly used by build tools, and if you zoom out a bit, most of the input/output files of actions in a build graph are source code and artifacts derived from source code. This means most of the artifacts transferred with our API are small, so the default ZSTD compression may be inefficient. We should investigate using external dictionaries to improve compression efficiency.


> Would using a single dictionary cover our use case?

I think that would work as a first iteration. As a monorepo changes over time, it's very likely that the effective dictionary will also change. Our API should, at a minimum, accommodate migrating between different dictionaries.

Looking at a real-life use of ZSTD's external dictionaries, Meta's Sapling VCS has a PackBlob that stores multiple versions of the same file (blob) in compressed form, using the first version of the file as the dictionary for compressing later versions. That means dictionary usage can be as fine-grained as one dictionary per file path.

Another example is how Git servers optimize their storage today: objects pushed to the server are initially stored as loose objects (compressed individually, but not delta-compressed). Periodically the server runs git-repack (or git-gc, which runs git-repack) to delta-compress the loose objects into a single pack file, and during the repack the delta bases may be recalculated from the input objects. A server that wants to keep its Git storage optimized over time needs to run git-repack regularly, so bases and deltas keep changing. There are some recent optimizations in this space, but the general concept still holds for the GitHub, GitLab, Gitea, etc. implementations. If we start supporting dictionary compression, I think RE API servers would want to implement a similar optimization for each tenant using the server.

> Supporting multiple dictionaries would add a significant amount of complexity

It depends. If we simply treat dictionaries as blobs, then it should be as simple as downloading blobs.
Are you thinking of something more complicated? I would love to hear more!

> Could this be handled by the control plane instead of within the RE API?

It could be, if we only ever need a single hard-coded dictionary.
However, as stated above, it would be more effective and flexible to allow multiple dictionaries.

This means that, ideally, (a) this should be part of the RE API as a new compressor value, and (b) we should provide some guidance on how dictionaries can be discovered and exchanged between client and server. FWIW, I don't think we need a new RPC for this; an agreement on the ordering of operations would work.

As an example: Chrome is adding support for dictionary-based Brotli/Zstd compression of HTTP responses (https://chromestatus.com/feature/5124977788977152), and the specification is being worked on at https://github.com/WICG/compression-dictionary-transport.

@mostynb
Contributor

mostynb commented Oct 3, 2023

> Would using a single dictionary cover our use case?
>
> I think that would work as a first iteration. As a monorepo changes over time, it's very likely that the effective dictionary will also change. Our API should, at a minimum, accommodate migrating between different dictionaries.
> ...
> It depends. If we simply treat dictionaries as blobs, then it should be as simple as downloading blobs. Are you thinking of something more complicated? I would love to hear more!

If I understand correctly, zstandard data compressed with a dictionary can only be decompressed by a receiver that also has that dictionary. Dictionaries are either implicit or specified by a 32-bit dictionary ID. This leads to some obvious questions:

  1. When uploading data, how would clients know which dictionaries are supported?
  2. When uploading data, how would clients know which ID to use for each dictionary?
  3. When serving data, which dictionaries can the server safely use?
  4. When serving data, which dictionary IDs should the server use for each dictionary?

(I am not as familiar with Brotli: does it provide a way to specify which dictionary was used for compression?)

@sluongng
Contributor Author

sluongng commented Oct 5, 2023

(I typed up something on my mobile browser... misclicked and it's gone... So this is going to be my attempt no. 2)

These are great questions, and exactly the things I was unsure about myself when I first created this issue.
However, we can reduce the constraints for an initial version by making the following assumptions:

  1. Because Brotli shared-dictionary compression has a complicated history, let's focus only on ZSTD dictionary support for now.

  2. In zstd dictionary compression the dictionary has an ID, which is written into the header of every compressed frame. Writing the ID can be opted out of, but for the sake of simplicity let's only support the case where the dictionary ID is included in the frame header. This lets us distinguish zstd data compressed without a dictionary from zstd data compressed with one (see the sketch after this list).

  3. We can also assume that only the server implementation should create dictionaries. The client almost always has less data than the remote cache backend, and is thus poorly placed to decide which dictionary(ies) would work best. This means we don't need to worry about client-to-server dictionary upload, only server-to-client dictionary download.
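
As a concrete illustration of assumption 2, the dictionary ID can be read from a frame header without decompressing anything; a minimal sketch using klauspost/compress/zstd (the helper name is mine):

```go
package main

import (
	"fmt"

	"github.com/klauspost/compress/zstd"
)

// dictionaryIDOf parses a zstd frame header and returns its dictionary ID.
// A zero ID means the frame either used no dictionary or omitted the ID,
// which is why assumption 2 requires the ID to always be written.
func dictionaryIDOf(frame []byte) (uint32, error) {
	var h zstd.Header
	if err := h.Decode(frame); err != nil {
		return 0, err
	}
	return h.DictionaryID, nil
}

func main() {
	// `frame` stands in for a compressed blob received over the wire;
	// 0x28 0xb5 0x2f 0xfd is the zstd frame magic number.
	frame := []byte{0x28, 0xb5, 0x2f, 0xfd /* ... rest of a zstd frame ... */}
	id, err := dictionaryIDOf(frame)
	if err != nil {
		fmt.Println("not a parseable zstd frame:", err)
		return
	}
	fmt.Println("dictionary ID:", id)
}
```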

With these assumptions in mind, I was thinking we could advertise a DictionaryTree: <Digest> field in the CacheCapabilities message. This tree object should be shallow, meaning no sub-trees (directories) or symlinks. In this tree, each dictionary file is named after its dictionary ID. We could also include a specially named file such as latest or default to signify the default dictionary clients should use when compressing uploads. Each dictionary file could itself optionally be compressed with the default dictionary, or with any other dictionary available in the tree.

When the client first calls the GetCapabilities RPC, it obtains the digest and downloads the DictionaryTree content accordingly. Then, for each dictionary ID found in the header of a compressed file, the client can download and cache the dictionary locally before performing decompression.
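
To make the flow concrete, here is a hedged Go sketch; the DictionaryTree/FileNode shapes, the latest/default naming convention, and the fetchBlob helper are hypothetical models of the proposal above, not existing REAPI types:

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical shapes modelling the proposal; none of these exist in
// the REAPI protos today.
type Digest struct {
	Hash      string
	SizeBytes int64
}

type FileNode struct {
	Name   string // dictionary ID as a decimal string, or "latest"/"default"
	Digest Digest
}

type DictionaryTree struct {
	Files []FileNode // shallow: no sub-directories or symlinks
}

// fetchBlob stands in for a ByteStream read of a CAS blob.
func fetchBlob(d Digest) []byte { return nil /* placeholder */ }

// loadDictionaries maps each advertised dictionary ID to its bytes,
// downloading and caching them once after GetCapabilities.
func loadDictionaries(tree DictionaryTree) (map[uint32][]byte, []byte) {
	dicts := map[uint32][]byte{}
	var defaultDict []byte
	for _, f := range tree.Files {
		blob := fetchBlob(f.Digest)
		if f.Name == "latest" || f.Name == "default" {
			defaultDict = blob // used when compressing uploads
			continue
		}
		if id, err := strconv.ParseUint(f.Name, 10, 32); err == nil {
			dicts[uint32(id)] = blob // used to decompress downloads
		}
	}
	return dicts, defaultDict
}

func main() {
	tree := DictionaryTree{Files: []FileNode{
		{Name: "latest", Digest: Digest{Hash: "ab...", SizeBytes: 112640}},
		{Name: "3735928559", Digest: Digest{Hash: "cd...", SizeBytes: 112640}},
	}}
	dicts, def := loadDictionaries(tree)
	fmt.Println(len(dicts), def == nil)
}
```

One plausible behaviour on a cache miss: if a downloaded frame names a dictionary ID the client hasn't cached, re-fetch the DictionaryTree, since the server may have rotated dictionaries since the last GetCapabilities call.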

Note that we should leave server implementations the freedom to decide how dictionaries are generated and stored, so I won't discuss that in detail here.

If interested, I could take a crack at submitting a PR.

@sluongng
Contributor Author

sluongng commented Oct 5, 2023

With Brotli support in mind, it should probably be DictionaryTrees: map[Compressor_Value]Digest instead of a single DictionaryTree.
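
A rough sketch of that shape (again hypothetical, modelled as plain Go rather than protobuf):

```go
package main

// Digest mirrors build.bazel.remote.execution.v2.Digest.
type Digest struct {
	Hash      string
	SizeBytes int64
}

// Hypothetical extension of CacheCapabilities: one shallow dictionary
// tree per compressor that supports external dictionaries. The string
// key stands in for a Compressor.Value enum, e.g. "ZSTD" or "BROTLI".
type CacheCapabilities struct {
	// ... existing capability fields elided ...
	DictionaryTrees map[string]Digest
}

func main() {}
```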
