Design Doc: Resource Cache Extension
Joshua Marantz, 2010-04-06
One of the high-value rewrites we can do is to extend the cache TTL of resources, so that there is less traffic between the browser and the server. This can be done safely by signing the URL based on the content (e.g. with sha1 or md5). This idea raises the question: why doesn't the site owner set a high TTL on their resources in the first place? The answer is that setting a high cache TTL is challenging for site owners who are trying to incrementally evolve their JavaScript and CSS files.
While it's possible for site owners to version those files when referencing them in their HTML, this is often not done, and in some cases cannot be done. One Google service, for example, serves a 100k JavaScript file with a cache TTL of only 5 minutes. This file is referenced from an HTML "snippet" that is pasted on end-user sites throughout the web, so it's impossible to put a version number there.
However, the Apache rewriter can extend the cache lifetime of any resource on a web page, regardless of where the origin is. Let's look at an example. Embedded in every site is the following snippet:
<script type="text/javascript" src="http://www.google.com/.../example.js"></script>
served with these HTTP headers:
Cache-Control:public, max-age=300
Content-Encoding:gzip
Content-Type:text/javascript; charset=UTF-8
Date:Tue, 06 Apr 2010 15:21:59 GMT
Expires:Tue, 06 Apr 2010 15:26:59 GMT
Last-Modified:Tue, 06 Apr 2010 15:21:59 GMT
Transfer-Encoding:chunked
X-Content-Type-Options:nosniff
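The 5-minute TTL in this example comes from the max-age=300 directive in the Cache-Control header. As a rough sketch of how a rewriter might read that value, the Python below extracts max-age from a Cache-Control string. The function name max_age_seconds is hypothetical and is not part of the design; a real implementation would also need to consider directives such as no-cache and the Expires header.

```python
import re
from typing import Optional


def max_age_seconds(cache_control: str) -> Optional[int]:
    """Return the max-age value in seconds, or None if the directive is absent.

    Hypothetical helper for illustration; it only looks at max-age and
    ignores every other caching directive.
    """
    match = re.search(r"(?:^|[,\s])max-age\s*=\s*(\d+)", cache_control, re.IGNORECASE)
    return int(match.group(1)) if match else None


ttl = max_age_seconds("public, max-age=300")  # header value from the example above
```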
We can rewrite this resource in the HTML as:
<script type="text/javascript" src=
"http://www.my_apache_server.com/r?908709898079_http_www.google.com_.../example.js"
></script>
The "908709898079" is a signature computed from the content of example.js. The HTML rewriter respects the original TTL by polling the original URL every 5 minutes and checking whether the content signature has changed. If it has, the rewriter changes the signature that it embeds in the rewritten URL, which busts browser and proxy caches. As long as the rewritten HTML itself is not cached, this avoids serving content that is more than 5 minutes stale, per the original Cache-Control header.
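The content-signed URL can be sketched in Python as follows. This is only an illustration under stated assumptions: the document does not specify the hash function, the signature length, or exactly how the origin URL is flattened into the rewritten URL, so the truncated SHA-1 digest, the 12-character length, and the names content_signature and rewritten_url are all hypothetical.

```python
import hashlib


def content_signature(body: bytes, length: int = 12) -> str:
    """Return a short hex signature derived only from the resource bytes.

    Assumption: a truncated SHA-1 digest; the design only says "sha1 or md5".
    """
    return hashlib.sha1(body).hexdigest()[:length]


def rewritten_url(proxy_prefix: str, origin_url: str, body: bytes) -> str:
    """Build a URL that embeds the signature ahead of the origin URL.

    Assumption: "scheme://rest" is flattened to "scheme_rest"; the real
    encoding used by the rewriter is not given in this document.
    """
    scheme, _, rest = origin_url.partition("://")
    return f"{proxy_prefix}{content_signature(body)}_{scheme}_{rest}"


# Hypothetical hosts and content, used only to show the shape of the result.
url_v1 = rewritten_url("http://proxy.example/r?", "http://origin.example/a.js", b"var x = 1;")
url_v2 = rewritten_url("http://proxy.example/r?", "http://origin.example/a.js", b"var x = 2;")
```

Because the signature depends only on the bytes, unchanged content always maps to the same URL, while any change to the content produces a new URL that browsers and proxies have not cached.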
This improves on simply serving the resource with its original short TTL because, although the rewriter must re-poll the origin content every 5 minutes, it can continue to serve the same URLs to browsers as long as the content doesn't change, so end-user browser caches remain valid.
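The re-polling behavior can be sketched as a small cache that refetches a resource only once its original TTL has elapsed. This is a minimal sketch, not the rewriter's actual implementation: the class name SignatureCache, the injected fetch and clock callables, and the truncated SHA-1 signature are all assumptions, and it refreshes lazily on demand rather than on a background timer.

```python
import hashlib
import time
from typing import Callable, Dict, Tuple


class SignatureCache:
    """Remember each resource's signature and refresh it once the TTL expires."""

    def __init__(
        self,
        fetch: Callable[[str], bytes],
        ttl_seconds: float,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch          # hypothetical: returns the resource bytes for a URL
        self._ttl = ttl_seconds      # the origin's TTL, e.g. 300 for max-age=300
        self._clock = clock
        self._entries: Dict[str, Tuple[float, str]] = {}

    def signature(self, url: str) -> str:
        """Return the current signature, refetching if the entry is missing or stale."""
        now = self._clock()
        entry = self._entries.get(url)
        if entry is None or now - entry[0] >= self._ttl:
            body = self._fetch(url)
            entry = (now, hashlib.sha1(body).hexdigest()[:12])
            self._entries[url] = entry
        return entry[1]
```

If the refetched content is unchanged, the signature is identical, so the rewritten HTML keeps pointing at the same URL and browser caches stay warm.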