[Help Wanted, Questions] Improving Dictionary Training Process #4127
Hey @Cyan4973! Apologies for the direct ping; just wondering if you had any insight on our questions, and whether our process is sensible or you see room for improvement? Thanks!
Hi @neiljohari, it's a complex use case; unfortunately, it will require more than one sentence to answer.
That's a starting point. That being said, this statement is valid in general, and it would be relatively simple to manufacture a counter-example: while the scenario described above is a trope, it wouldn't be too hard to end up in a somewhat related scenario. And now I wonder if that's what's happening for your use case. One of the problems of the dictionary builder is that this is not the use case it's aiming for. Hence the "ideal" dictionary builder for this use case doesn't exist, or at least is not (yet) provided by this repository. One suggestion could be to use one complete JSON file as a point of reference; that file would become the dictionary. It's a relatively simple test, and should give some information on how to move forward with more ideas.
Hi team,
We are integrating zstd + shared compression dictionaries at Roblox for serving feature-flag payloads! We think this is a good use case because the payload looks similar over time (people add new flags, flip them, or delete them, but week over week the data is very similar), and we control the client, so we can ship it with the dictionary.
I've been playing around with various training parameters and set up a few harnesses to try out different parameter combinations (and found previous insight here), and found a few approaches that seem to work well. However, it feels a bit like blind guessing, and we aren't sure whether this is the best we can do, so we were wondering if the maintainers/community have insight into how we can improve our process.
Our payloads are currently ~435 KB of JSON and differ by client type, though the majority of flags are the same across clients. Examples:
We currently artificially expand our payload into a bunch of training files:
We validate the compression ratio the dictionary achieves on historical payloads to gauge its effectiveness over time.
Example of our training dir structure:
Some things we've noticed that we don't quite understand:
Thanks in advance for your time, we're really excited to roll this out soon and would love your insights on how we can do even better!