Zstandard Compression and the application/zstd Media Type #105
Is there a particular context in which you're interested in it being supported? As an HTTP …? I think @indygreg has some familiarity with Zstandard, and maybe @ddragana or @martinthomson might have opinions? |
@dbaron I'm interested in supporting zstd in … |
I believe that the Facebook folks indicated that they didn't intend for this to be used on the web, on the basis that brotli was sufficiently performant. If that is indeed the case, this isn't that interesting. We'd probably want to see performance numbers that justify the costs (which would include changes to the HTTP/QUIC header compression static tables, if we were serious). |
Caveat emptor: I'm not familiar with the implications of commenting on this matter, and my words here reflect my personal opinion as someone familiar with the technology, not an official Mozilla position. I'm also not familiar with the nuances involved in making a decision to support zstandard in a web browser.

When I wrote https://gregoryszorc.com/blog/2017/03/07/better-compression-with-zstandard/ in March 2017, my opinion would have been "zstandard on the web doesn't make much sense because the web has brotli and zstandard isn't sufficiently different from brotli." This was also the conclusion we reached in https://bugzilla.mozilla.org/show_bug.cgi?id=1352595 when investigating zstandard for omni.ja compression (we ended up supporting brotli compression because it was readily available). Had zstandard been added to the web before brotli, I would have said the same thing at the time were someone to propose adding brotli to an already zstandard-enabled web.

Fast forward ~1.5 years. In the time since, zstandard has continued to improve substantially. A lot of work has gone into dictionary compression, which could have positive benefits for the web. Read more at https://github.com/facebook/zstd/releases/tag/v1.2.0, https://github.com/facebook/zstd/releases/tag/v1.3.5, and https://github.com/facebook/zstd/releases/tag/v1.3.6. Another major new feature is support for lower/faster compression ratios (exposed as negative compression levels). This allows zstandard to approach lz4's compression/decompression speed. Read more at https://github.com/facebook/zstd/releases/tag/v1.3.4. While I have doubts it will be useful for web scenarios (due to high memory usage requirements), the "long distance matching" or "long range mode" has also improved a bit, allowing faster compression at ultra-high compression ratios.

A potentially killer feature for the web is "adaptive compression," where zstandard can dynamically adjust compression parameters to account for line speed. E.g. if the receiver isn't consuming data as fast as zstandard can generate it, zstandard can throw more CPU at compression and reduce the amount of data going over the wire. Or if things are sender-side bottlenecked, zstandard can reduce CPU/compression and send more bits over the wire. The good news is zstandard has this feature and it is actively being improved (see https://github.com/facebook/zstd/releases/tag/v1.3.6). The bad news is it isn't part of the libzstd C API. I'm not sure if this feature will ever be part of the C API, nor do I know how much work it would be to port this feature to the web.

When I wrote my aforementioned blog post about zstandard, I lauded the flexibility of zstandard's compression/performance settings. You could go from low CPU/memory and very fast but poor-ratio compression all the way to high CPU/memory and slow but high-ratio compression. In the time since, the introduction of negative compression levels and long distance matching has broadened the use cases for zstandard. I believe it is without a doubt the best general-purpose compression format available today.

I should add the caveat that I haven't been following brotli's development super closely. But its development velocity is slower than zstandard's, and a quick perusal of its release notes doesn't seem to reveal anything too exciting. It kind of looks like it is in maintenance mode or only looking for iterative improvements.
(This could be a good thing for web technologies, I dunno.)

One aspect of zstandard that is important for web consideration is its memory usage. Different compression settings have vastly different memory requirements on both producer and receiver. Obviously not all devices are able or willing to use all available memory settings, so some consideration must be given to what "acceptable" memory use should be. RFC 8478 recommends limiting decoder-side memory usage to 8 MB. Should zstandard be exposed to the web, some thought should go into more formally expressing memory limits. My (pretty naive about web matters) opinion is that it would be wrong to limit to 8 MB across the board, because some exchanges could benefit from using the extra memory, and 8 MB could be really small in the future (just like zlib/deflate's 32 KB max window size is absurdly small in 2018). I think it would be better for peers to advertise memory limits and to negotiate an appropriate setting. Maybe 8 MB is the default, and adaptive compression is used to increase it if allowed. I suspect a media type parameter could be leveraged to express memory requirements. I'm not sure if this was discussed as part of publishing RFC 8478...

It's also worth noting that RFC 8478 and the application/zstd media type only begin to scratch the surface of what's possible with (zstandard) compression on the web. Compression contexts in existing web technologies seem to map to single requests/responses. E.g. an HTTP Content-Encoding or HTTP/2 stream has a lifetime of the HTTP message payload. But you can do so much more with zstandard. For example, you can keep the compression context alive across logical units and "flush" data at those logical boundaries. This would allow the compressor/decompressor to reference already-sent data in a future send, reducing bytes over the wire. This was recently discussed at facebook/zstd#1360, and Mercurial's new wire protocol leverages this to minimize bytes over the wire. You can also "chain" logical units so the compression context for item N+1 is seeded with the content of item N, allowing zstandard to effectively generate deltas between logical units. I have an API in python-zstandard for this: https://github.com/indygreg/python-zstandard#prefix-dictionary-chain-decompression. Both these "flushing" and "chaining" concepts can be implemented in the form of a custom media type (and I don't believe they are unique to zstandard). But I believe web technologies could potentially benefit by promoting these ideas to first-class citizens, where appropriate, e.g. in "stream" APIs that send N discrete logical units between peers. (There are obviously security/performance considerations to keeping long-running compression contexts alive in memory, potentially across logical requests.) A rough sketch of the flushing idea appears after this comment.

What I'm trying to say is that it feels like web technologies are only scratching the surface of what's possible with compression, and there are potentially some significant performance wins to be realized by leveraging "modern compression" on the web. But I digress.

All that being said, I'm not sure if there's enough here to justify both brotli and zstandard on the web. I do believe zstandard is the superior technology. But a strong case can be made that brotli is "good enough," especially if we're limiting the web's use of compression to "simple" use cases, such as traditional one-shot or streaming compression using mostly-fixed compression settings.
Zstandard's case grows stronger if you want to explore dictionary compression, adaptive compression, negotiation of compression levels/settings, and flushing/chaining scenarios. With regard to Martin's comment about Facebook's prior indications, I would encourage reaching out to Yann Collet (the primary author of zstandard, and of lz4) for his thoughts. He's Cyan4973 on GitHub. I hope this information is useful in making a decision! |
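A minimal sketch of the "flush at logical boundaries" idea described above, assuming the python-zstandard package referenced in the comment. The messages, compression level, and framing are invented for illustration; a real protocol would need its own framing around the compressed chunks.

```python
# Sketch: keep one compression context alive across logical messages and
# flush at each boundary, so later messages can reference earlier data.
# Assumes the python-zstandard package (`pip install zstandard`).
import zstandard as zstd

# Hypothetical sequence of logical units sent over one connection.
messages = [b'{"user": "alice", "items": [1, 2, 3]}',
            b'{"user": "alice", "items": [1, 2, 3, 4]}',
            b'{"user": "alice", "items": [1, 2, 3, 4, 5]}']

cobj = zstd.ZstdCompressor(level=3).compressobj()
dobj = zstd.ZstdDecompressor().decompressobj()

for msg in messages:
    # Compress this message and flush a complete block so the receiver can
    # decode it immediately, without closing the frame (context stays alive).
    chunk = cobj.compress(msg) + cobj.flush(zstd.COMPRESSOBJ_FLUSH_BLOCK)
    assert dobj.decompress(chunk) == msg  # decodable as soon as it arrives

# End the frame once the stream of logical units is done.
dobj.decompress(cobj.flush(zstd.COMPRESSOBJ_FLUSH_FINISH))
```

Because the frame is never closed between messages, repeated content across messages compresses against the shared history, which is the "reducing bytes over the wire" effect described above.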
Well, maybe @Cyan4973 can add thoughts here. I understand that there are advantages, but the disadvantages of having yet another format are not insubstantial. The dictionary-based scheme is where I hold the most hope for zstd. There are several highly motivated people studying this now. Those schemes are considerably more complex than a simple … |
"Normal" For the web, aka the typical webpage displayed by Mozilla's Firefox, the situation is substantially different. A single innocent looking web page is nowadays composed of multiple resources of different nature (and different sources). html proper is merely one small part of it, there is also css, javascript, json, xml, etc. One could imagine to divide typical web traffic into a manageable set of ~5/6 categories, and then use a dedicated dictionary for each category. Seems simple design. Sure, dynamic dictionary fetching can provide even more benefits, but I don't see that happening any time soon (for public environments). I suspect the current scrutiny regarding potential (unknown) security implications will delay any adoption in this area by a number of years, if ever. As a consequence, I am more in favor of baby steps, reducing risks, and delivering some of the benefits of dictionary compression within a manageable timeframe. Introducing a static set of dictionaries, bundled with Of course, it also means that designing this static set of dictionaries becomes a critical operation, since global efficiency will directly depend on it, and eventually it will become a baseline for a number of years. For this critical stage, it's very important to know the web, type of resource, respective share and their evolution, have some available representative sample set, do some training, testing, shadowing, etc. As far as web expertise is concerned, it's hard to imagine any organization better than Mozilla. |
How would you choose a static dictionary for something like JavaScript? Beyond the basic syntax and idioms of the language, anything more seems like it would be Mozilla (or whoever) choosing a winner among JS libraries/frameworks -- something that seems best avoided on an Open Web. |
An efficient dictionary is built on a collection of statistics. A general plan to reach this goal is to grab a sufficiently large and representative sample of the web and pass it to the generator. The generator automatically determines the best fragments and their ranking. There is no manual selection of any content anywhere in the process, so no one gets to pick a winner. One could say that selecting the samples could be an indirect way of favoring a winner, and that's why it's important to collect samples in a way that is as neutral and universal as possible. A few players stand out in this respect and can be trusted, both technically and ethically, to get close to this objective (I obviously think of Mozilla as one of them). I also believe the sampling methodology should be published, to increase trust. There are a few golden rules for creating a good sample set: …
A compact dictionary is merely a few dozen kilobytes, so the final selected fragments truly "stand out" and will appear many times in the sample set. When it comes to text-based sources, the final dictionary can even be visually inspected. (A sketch of this automated training step follows this comment.) One could say that, even with all these safeguards in place, the sample set will be representative of the web as it is now, and therefore will not follow its future evolution. To this, one can answer: …
For discussion. |
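A sketch of the automated training step described above, assuming the python-zstandard package (which wraps zstd's dictionary trainer). The sample directory, target size, and output path are hypothetical; the trainer needs a reasonably large number of samples to produce a useful dictionary.

```python
# Sketch: building a dictionary purely from sample statistics, with no
# manual selection of fragments anywhere in the process.
import pathlib
import zstandard as zstd

# Hypothetical corpus of JavaScript samples collected by a neutral methodology.
samples = [p.read_bytes() for p in pathlib.Path("samples/js").glob("*.js")]

# 64 KB target dictionary; the trainer picks and ranks fragments itself.
trained = zstd.train_dictionary(64 * 1024, samples)

with open("js.dict", "wb") as fh:
    fh.write(trained.as_bytes())   # raw dictionary bytes, inspectable if text-based
```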
I'm skeptical of custom/dynamic dictionaries (and state-based compression) being worth the trouble for most [web] product developers. Web projects are fraught with risks, and many have learned (the hard way) to always choose the safe, zero-config solution over the marginally-better-but-more-complicated one. You should limit your scope with the jQuery approach: focus on a universal, zero-config solution that's easy to use (and with "good enough" performance improvements over zlib), rather than all-out performance at the cost of usability. |
@kaizhu256: you're absolutely right. A solution that requires individual site maintainers to create, configure, and deploy the state that forms the basis for stateful compression will never be (and should never be) widely adopted by site operators. However, I think that objection misses the point. Judging a tool's utility by looking at unweighted operator adoption ignores the fact that actual HTTP traffic is extremely strongly skewed towards a very small number of very large operators (e.g., Akamai, Cloudflare, AWS S3, Google, Facebook...). For these organizations, anything that produces wins will probably be worth deploying, and the engineering cost to do so will be tractable given those organizations' scale. And even if only those organizations were to deploy such a system, internet users as a whole would benefit, since a significant amount of their traffic is terminated by those origins. Beyond that, though, I think we can also pursue some form of state-based compression that can be implemented correctly/safely with no operator oversight (other than enabling/linking some … |
@Cyan4973 I understand how to build a representative dictionary. What I'm asking is whether empowering content that is already popular by making it more compressible -- thereby giving it a competitive advantage over its competition, including new libraries -- is good for the long-term health of the Web. |
I suspect there is a difference in projected timeline. We have been looking at long-term pattern evolution, and our current understanding is that there is no such thing as a "universal, timeless referential" immune to framework/coding trends. If one looks at the code of websites from 10 years ago, one sees little in common with today's. All referentials decay. Artificially undervaluing present samples mainly produces a dictionary that is less relevant today without getting any better in the future (we tried). A dictionary is expected to have a short lifespan; how short is the good question. In private environments, dictionaries can be updated every week, even faster for special cases. For the web "at large", that wouldn't work. I suspect that just discussing the update mechanism could take several months. Targeting a lifespan of a few years, including overlapping periods with older/newer dictionaries, feels more comfortable. Even at such a slower pace, the main property remains: this scheme does not stay "stuck in time", it evolves with the web. Also: …
|
It would be nice to learn about the streaming abilities of zstd. With some of our previous compressor designs (like gipfeli) we used formats that were faster but less streamable than brotli. In brotli, nearly every byte you receive brings you new decodable data, i.e., there is no hidden buffer, and you can start decoding from the first bytes. If further processing of the decompressed data is CPU-heavy (such as parsing and DOM reconstruction), being able to start it earlier during the transfer can lead to substantial savings. If further processing depends on high-latency events like fetching new data, being able to issue those fetches earlier is rather important. How many bytes do you need to receive before you can emit the first byte? Does making it acceptable for web use need special settings that impact compression density or decoding speed? |
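A small sketch of incremental zstd decoding, assuming the python-zstandard package. The payload and the 1 KiB "network chunks" are arbitrary; this only illustrates that output can be taken as input arrives, not how early the first byte appears for a given encoder configuration.

```python
# Sketch: feed the decoder whatever bytes have arrived so far and hand any
# available output to downstream processing immediately.
import zstandard as zstd

payload = b"<html>" + b"<p>hello world</p>" * 10000 + b"</html>"
frame = zstd.ZstdCompressor(level=3).compress(payload)

dobj = zstd.ZstdDecompressor().decompressobj()
emitted = 0
for offset in range(0, len(frame), 1024):          # simulate 1 KiB network chunks
    out = dobj.decompress(frame[offset:offset + 1024])
    emitted += len(out)                             # ready for the parser right away

assert emitted == len(payload)
```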
It would be nice to see a comparative benchmark that compresses and decompresses payloads typical of the internet with the planned decoding memory use (something like a 256 kB to 4 MB range for the backward reference window). Even better if I can run that benchmark myself on my favorite architecture. zstd used to be in this benchmark, but the benchmark author removed it; perhaps we could convince them to add it back. https://sites.google.com/site/powturbo/home/web-compression |
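For a rough do-it-yourself comparison while a proper suite is unavailable, a micro-benchmark along these lines can be run locally. It assumes the python-zstandard package; the corpus path and levels are arbitrary, and it does not constrain the reference window, so it is only a starting point rather than the benchmark requested above.

```python
# Sketch: compare ratio and wall-clock time of zstd vs. zlib on one local file.
import time
import zlib
import zstandard as zstd

data = open("corpus.html", "rb").read()   # hypothetical local web payload

def measure(name, compress, decompress):
    t0 = time.perf_counter(); blob = compress(data)
    t1 = time.perf_counter(); out = decompress(blob)
    t2 = time.perf_counter()
    assert out == data
    print(f"{name}: ratio {len(data) / len(blob):.2f}, "
          f"compress {t1 - t0:.3f}s, decompress {t2 - t1:.3f}s")

measure("zlib level 6", lambda d: zlib.compress(d, 6), zlib.decompress)
cctx, dctx = zstd.ZstdCompressor(level=3), zstd.ZstdDecompressor()
measure("zstd level 3", cctx.compress, dctx.decompress)
```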
(adding for later use) -- Bugzilla bug is https://bugzilla.mozilla.org/show_bug.cgi?id=1301878 |
What I'm seeing in this thread is that, while everyone who has weighed in so far thinks zstandard is a neat technology, it is of somewhat limited utility in a general-purpose web browser (given that we're shipping brotli already), and (as @martinthomson points out) adding more formats without significant improvements is generally something we want to avoid (due, among other things, to increased maintenance cost). This calculus may change as standardized dictionaries are published and we get a feel for performance with such dictionaries. My proposal is to mark this as … Any final comments before I close this issue accordingly? |
One year ago (which is when this thread was started), I would have agreed with this conclusion. But since then, our experience with HTTP traffic has grown, and as a consequence our position has shifted a bit. Facebook is a big user of brotli. We will continue to use brotli, but we had to dial it down a bit, for the following reasons: …
Therefore, if Facebook found it advantageous for its architecture to deploy … That's a new element, which wasn't available one year ago, and it may be relevant to the outcome of this RFP. On the topic of dictionary compression for the web, there is progress too, and early experiments show excellent results. But it's early days, it's not widely deployed yet, and we hope to share more details in the future. A set of static dictionaries would be a nice way to bridge that gap, since it removes the issue of dynamic fetching, where most of the security risks are concentrated. Even as a temporary solution bridging a few years while waiting for standardization of fully dynamic dictionaries, it can bring valuable benefits for the Internet ecosystem. Plus, it's not new, since brotli already ships a standard dictionary as part of its library. I believe this is a different topic, and it may deserve opening another RFP, keeping this one for … |
I have built a POC infrastructure at work with custom-tweaked … In our end … |
Tagging @indygreg, @ddragana, @martinthomson, @dbaron, @lukewagner, @bholley, and @annevk for input, taking the conversation so far into account. It would be ideal if you could weigh in with your proposed disposition of this topic (from among … |
I classify this as: … |
I lean towards … To reconsider, I'd want to see a more comprehensive and quantitative analysis of the benefits zstd would unlock on the web today - either dictionary-less, with static dictionaries, or with dynamic dictionaries. For either of the latter two options, we'd also need a credible plan for generating and delivering those dictionaries that aligns with our values. Compression schemes can mature and flourish without ground-floor support from web browsers. There are lots of organizations with strong economic incentives to use the best technology between endpoints under their control - so if zstd is truly the superior choice, I expect we'll see other players starting to deploy it. Repeated and diverse success stories would certainly bolster the case for inclusion in the web platform. |
@brunoais There is wget2, with upstream support for zstd and brotli (and others). You can adjust the number of parallel threads for testing (for lists of URLs or for recursive downloads) to quickly compare the impact of different compression types. Wget2 also has --stats-* options to write timings and payload sizes (compressed and uncompressed) and more as CSV, which is easy to feed into most stats/graphics tools. https://gitlab.com/gnuwget/wget2 |
I sincerely appreciate the input from advocates of the zstd scheme here, and I thank you for taking the time and effort to make your case. In parsing out the positions of Mozilla community members, I'm seeing a pretty clear signal that we want to place this in … I plan to close this as …
|
Do you know of a public HTTP/HTTPS server with zstd compression enabled? |
wget2 -d 'https://de-de.facebook.com/unsupportedbrowser' shows that the server uses zstd over HTTP/2. From the debug log: …
|
I run … successfully. |
@gvollant, most of Facebook's services now support the …
There's a pretty good diversity of implementations, which exercise the various features of the spec pretty well. We do chunked, streaming transfers (this should be visible by requesting https://www.facebook.com/, for example). We also have paths that stream a sequence of independent frames, which is an exciting feature of the spec that lets the server and client drop the compression context between chunks, rather than having to hold a window buffer open. This feature let us significantly increase the connection tenancy on one of our servers that streams updates over time (which had previously been memory-bound holding these contexts open). |
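A sketch of how one might probe such an endpoint and decode a zstd-encoded response as it streams in, assuming the python-zstandard package and the standard-library urllib. Whether a given server actually negotiates zstd for this client is not guaranteed; the URL and User-Agent are just illustrative, following the example above.

```python
# Sketch: request with "Accept-Encoding: zstd" and stream-decompress the body.
import urllib.request
import zstandard as zstd

req = urllib.request.Request(
    "https://www.facebook.com/",                      # endpoint mentioned above
    headers={"Accept-Encoding": "zstd", "User-Agent": "zstd-probe/0.1"},
)

with urllib.request.urlopen(req) as resp:
    if resp.headers.get("Content-Encoding") == "zstd":
        # stream_reader wraps the socket-backed response and yields
        # decompressed bytes on demand, without buffering the whole body.
        reader = zstd.ZstdDecompressor().stream_reader(resp)
        chunks = []
        while True:
            chunk = reader.read(16384)
            if not chunk:
                break
            chunks.append(chunk)
        body = b"".join(chunks)
    else:
        body = resp.read()                             # server chose another encoding

print(len(body), "decoded bytes")
```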
#771 seems like a potential path to getting Compression Dictionaries, which could resolve some of the concerns that left this Request with a … |
Request for Mozilla Position on an Emerging Web Specification
Other information
https://facebook.github.io/zstd/