Consider shared caching #22

Closed
joelweinberger opened this issue Dec 21, 2015 · 63 comments

Comments

@joelweinberger
Contributor

We've had a lot of discussions about using SRI for shared caching (see https://lists.w3.org/Archives/Public/public-webappsec/2015May/0095.html for example). An explicit issue was filed at w3c/webappsec#504 suggesting a sharedcache attribute to imply that shared caching is OK. We should consider leveraging SRI for more aggressive caching.

@btrask

btrask commented Apr 24, 2016

I hope this is a reasonable place to comment. (If not please tell me where to go.)

I've been working on content addressing systems for several years. I understand that content addresses, which are "locationless," are inherently in conflict with the same-origin policy, which is location-based.

An additional/alternate solution is for a list of acceptable hashes to be published by the server at a well-known location.

For example, the user agent could request https://example.com/.well-known/sri-list, which would return a plain text file with a list of acceptable hashes, one per line. Hashes on this list would be treated as if they were hosted by the server itself, and thus could be fetched from a shared cache while being treated for all intents and purposes like they were fetched from the server in question.
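A rough sketch of how a user agent might consume such a list, assuming the one-hash-per-line format described above. The function names and the sample hashes are hypothetical, and matching on any listed token is a simplification of how an integrity attribute with several hashes would really be evaluated.

```python
def parse_sri_list(text: str) -> set[str]:
    """Parse a hypothetical /.well-known/sri-list body: one hash per line."""
    hashes = set()
    for line in text.splitlines():
        entry = line.strip()
        if entry:  # skip blank lines
            hashes.add(entry)
    return hashes


def may_use_shared_cache(integrity: str, allowed: set[str]) -> bool:
    """True if any hash in the integrity attribute appears on the site's list."""
    return any(token in allowed for token in integrity.split())


# Placeholder hashes, not real digests.
sri_list_body = "sha384-AAAA\nsha256-BBBB\n"
allowed = parse_sri_list(sri_list_body)
```

The list itself would still have to be fetched (and cached) per site before the first lookup.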

This does add some complexity both for user agents and for site admins. On the other hand, the security implications are well understood, and wouldn't require new permission logic.

Thanks for your work on SRI.

@joelweinberger
Contributor Author

An interesting idea (although I know many folks who are vehemently against well-known location solutions, but I won't pretend to fully grasp why). If implemented, though, it would still require a round trip to get .well-known/sri-list, right? Which seems to lose a lot of the benefit of these acting as libraries.

Another suggestion, that I think I heard somewhere, is, if the page includes a CSP, only use an x-origin cache for an integrity attribute resource if the CSP includes the integrity value in the script-hash whitelist. I think this would address @mozfreddyb's concerns listed in Synzvato/decentraleyes#26, but I haven't thought too hard about it. On the other hand, it also starts to look really weird and complicated :-/
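To make the CSP idea concrete, here is a minimal sketch that only allows the cross-origin cache when the integrity value also appears as a hash source in script-src. The helper names are hypothetical and the header parsing is deliberately simplified; it is not how any browser implements CSP.

```python
def csp_script_hashes(csp_header: str) -> set[str]:
    """Collect the quoted hash sources from the script-src directive."""
    hashes = set()
    for directive in csp_header.split(";"):
        parts = directive.split()
        if parts and parts[0].lower() == "script-src":
            for source in parts[1:]:
                # Hash sources look like 'sha256-<base64>', quotes included.
                is_hash = source.startswith(("'sha256-", "'sha384-", "'sha512-"))
                if is_hash and source.endswith("'"):
                    hashes.add(source.strip("'"))
    return hashes


def cross_origin_cache_allowed(integrity: str, csp_header: str) -> bool:
    """True if the page's CSP explicitly lists one of the integrity hashes."""
    allowed = csp_script_hashes(csp_header)
    return any(token in allowed for token in integrity.split())


# Placeholder policy with a made-up hash value.
csp = "default-src 'self'; script-src 'self' 'sha384-abc123' https://cdn.example"
```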

Also, these solutions don't address timing attacks with x-origin caches. Although, as a side note, someone recently pointed out to me that history timing attacks in this case are probably not too concerning from a security perspective since it's a "one-shot" timing attack. That is, the resource is definitively loaded after the attack happens, so you can't attempt the timing again, and that makes the timing attack much more difficult to pull off, since timing attacks usually rely on repeated measurement.

@btrask

btrask commented Apr 26, 2016

Using a script-hash whitelist in the HTTP headers (as part of CSP or separately) is better for a small number of hashes, since it doesn't require an extra round trip. Using a well-known list is better for a large number of hashes, since it can be cached for a long time.

I agree that well-known locations are ugly. Although it works for /robots.txt and /favicon.ico, there is a high cost for introducing new ones.

The privacy problem is worse than timing attacks: if you control the server, you can tell that no request is ever made. This seems insurmountable for cross-origin caching.

Perhaps the gulf between hashes and locations is too large to span. For true content-addressing systems (like what I'm working on), my preference is to treat all hashes as a single origin (so they can't reference or be referenced by location-based resources).

Thanks for your quick reply!

@mozfreddyb
Collaborator

I'd be slightly more interested in blessing the hashes for cross-origin caches by mentioning them in the CSP. .well-known would add another roundtrip. I'm not sure if that's going to hamper the performance benefit that we wanted in the first place.

The idea to separate hashed resources into their own origin is interesting, but I don't feel comfortable drilling holes that deep into the existing weirdness of origins.

@btrask

btrask commented Apr 26, 2016

To be clear, giving hashes their own origin only makes sense if you are loading top-level resources by hash. In that case, you can give access to all other hashes, but prohibit access to ordinary URLs. But that is a long way off for any web browsers and far from the scope of SRI.

@mozfreddyb
Collaborator

For the record, @hillbrad wrote a great document outlining the privacy and security risks of shared caching: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html

@kevincox

That document doesn't appear to consider an opt-in approach. While this would reduce the number of people who do it, it could be quite useful.

<script src=jquery.js integrity="..." public/>

This tag should only be put on scripts for which timing is not an issue. Of course, deciding what is public is now the responsibility of the website. However, since the benefit would be negligible for anything that is website-specific, this might be pretty clear. For example, a script specific to my site has a single URL anyway, so I may as well not mark it public; otherwise malicious sites can figure out who has been to my site recently, even though I don't get any benefit from the content-addressed cache. However, if I am including jQuery there will be a benefit, because there are many different copies on the internet, and at the same time knowing whether a user has jQuery in their cache is much less identifying.
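As an illustration of the opt-in, the sketch below scans markup for script tags that carry both an integrity value and the proposed public flag. The public attribute is only a proposal from this thread, and the class name and sample markup are hypothetical.

```python
from html.parser import HTMLParser


class PublicScriptFinder(HTMLParser):
    """Collect <script> tags that carry both integrity and the proposed 'public' flag."""

    def __init__(self):
        super().__init__()
        self.shareable = []  # list of (src, integrity) pairs

    def handle_starttag(self, tag, attrs):
        if tag != "script":
            return
        attr_map = dict(attrs)  # valueless attributes map to None
        if "public" in attr_map and attr_map.get("integrity"):
            self.shareable.append((attr_map.get("src"), attr_map["integrity"]))


finder = PublicScriptFinder()
finder.feed(
    '<script src="jquery.js" integrity="sha384-abc" public></script>'
    '<script src="/my-site.js" integrity="sha384-def"></script>'
)
```

Only the first tag ends up in finder.shareable; the site-specific script stays out of any shared cache because it never opted in.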

That being said if FF had a way to turn this on now I would enable it, I don't see the privacy hit to be large and the performance would be nice to have.

@hillbrad
Contributor

hillbrad commented Dec 21, 2016 via email

@kevincox

kevincox commented Dec 21, 2016 via email

@btrask

btrask commented Dec 21, 2016

A "public" flag seems like a good solution to me. It seems to encapsulate both the benefits and the drawbacks of shared caching. It says, "yes, you can share files publicly, but that means anyone can see them."

That said, if it's opt-in, there's the question of how many sites would actually use it, and whether it's worth the trouble. Especially if it has to be set in HTML, rather than say by CDNs automatically. Maybe it would work better as an HTTP header?

@ScottHelme

ScottHelme commented Dec 22, 2016

Setting in the HTML doesn't seem to be a big problem. If large CDN providers include this in their example script/style tags then sites will copy and paste support for this. A similar approach is currently being used for SRI and although it's not as fast as I'd like, usage will slowly grow. Sites that are also looking for those extra performance boosts would be keen to implement it.

@kevincox

kevincox commented Jan 2, 2017

The idea of a public header (or even another key in Cache-Control) sounds quite interesting and elegant. However, I think it would make it more difficult to use, as one significant use case of this is to let each site point to its own copy of a script, rather than a centrally hosted one. This means that each site would have to add headers to some of their scripts rather than just a modification in HTML. Not that either is a huge barrier, but static site hosting often makes it difficult to set headers, especially for a subset of paths.

At the end of the day I have no major objections to either option though.

@btrask

btrask commented Jan 3, 2017

@kevincox Yes, I was suspecting that Cache-Control: public might be appropriate. It seems like the HTTP concept of a "shared cache" is fundamentally equivalent to SRI shared caching. See here for definitions of public and private: https://tools.ietf.org/html/rfc7234#section-5.2.2.5

The Cache-Control security concerns (cache poisoning, accidentally caching sensitive information) are prevented by hashing. The only remaining security consideration is information leaks, which Cache-Control: public seems to address.

I'm not opposed to using an HTML attribute instead, but I think it's good to reuse existing mechanisms when they fit. Caching has traditionally been controlled via HTTP, not HTML.

There are a few other ways to break this down:

  • Does an HTML attribute make more sense for non-HTTP (file:, data:, ftp:, etc.) resources? (There's an argument for shared caching across protocols, which a HTTP header wouldn't really help with; on the other hand, caching doesn't make much sense for some protocols)
  • Is publicness a property of the resource itself, or the use of that resource? (My intuition says the resource, since the point is that it can be shared between different contexts)
  • Which is better for third party resources (e.g. hotlinking)? (Either approach can be limiting)

I think that thinking about it in terms of "which method is easier for non-expert webmasters to deploy?" is likely to lead to a suboptimal solution. Yes some people don't know how to set HTTP headers, and some hosts don't let users set them, but in that case they are already stuck with limited caching options. Unless we're going to expose all of Cache-Control via HTML.

@brillout

brillout commented Mar 8, 2017

@btrask A website highly concerned about privacy and loading <script src='/uncommon-datepicker.jquery.js' integrity="sha....." /> will want to make sure that uncommon-datepicker.jquery.js is never loaded from the shared cache. Whether the shared cache should be used or not is to be controlled by the website using the resource and not by the server who first delivered the resource.

@btrask

btrask commented Mar 8, 2017

@brillout: Yes, good point. Using a mechanism not in the page source defeats the purpose, when the page source is the only trusted information. Thanks for the tip!

@brillout

brillout commented Mar 8, 2017

@metromoxie
@mozfreddyb
@kevincox
@ScottHelme

Are we missing any pieces?

The two concerns are:

  • CSP
  • Privacy / "history attacks"

Solution to privacy: We can make the shared cache an opt-in option via an HTML attribute. I'd say that is enough. (But if we want more protection, browsers could add a resource to the shared cache only when many domains use that resource, as described in https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html#solution and w3c/webappsec#504 (comment).)

Solution to CSP: UA should treat scripts with enabled shared cache as inline scripts. (As described here w3c/webappsec#504 (comment).)

It would be super exciting to be able to use a bunch of web components using different frontend frameworks behind the web component curtain: a date picker using Angular, an infinite scroll using React, and a video player using Vue. This is currently prohibitive KB-wise, but a shared cache would allow it.

And with WebAssembly the sizes of libraries will get bigger increasing the need of such shared cache.

@nomeata Funny to see you on this thread, the world is small

@annevk
Member

annevk commented Mar 8, 2017

An opt-in privacy leak isn't a great feature to have.

@brillout

brillout commented Mar 8, 2017

An opt-in privacy leak isn't a great feature to have.

How about opt-in + a resource is added to the shared cache only after the resource has been loaded by several domains?

@kevincox

kevincox commented Mar 8, 2017 via email

@brillout

brillout commented Mar 8, 2017

I don't think that really helps as the attacker can purchase two domains quite easily.

Yes it can't be n domains where n is predefined. But making n probabilistic makes it considerably more difficult for an attack to be successful. (E.g. last comment at w3c/webappsec#504 (comment).)
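One way to read "making n probabilistic" is to draw a secret threshold per resource, so an attacker cannot know how many domains they would need to register. The sketch below assumes exactly that; the class name, the range of n, and the sample domains are all hypothetical.

```python
import random


class ProbabilisticGate:
    """Share a resource once it was seen on a secret, per-resource number of domains."""

    def __init__(self, low: int, high: int, rng: random.Random):
        self.low, self.high, self.rng = low, high, rng
        self.thresholds = {}  # resource hash -> secret n drawn for that resource
        self.domains = {}     # resource hash -> set of domains that referenced it

    def record(self, resource_hash: str, domain: str) -> bool:
        """Record a reference; return True once the resource may be shared."""
        if resource_hash not in self.thresholds:
            self.thresholds[resource_hash] = self.rng.randint(self.low, self.high)
        seen = self.domains.setdefault(resource_hash, set())
        seen.add(domain)
        return len(seen) >= self.thresholds[resource_hash]


# Seeded only to make the example reproducible; a real gate would not be.
gate = ProbabilisticGate(2, 5, random.Random(0))
results = [gate.record("sha384-abc", f"site{i}.example") for i in range(5)]
```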

@strugee

strugee commented Mar 10, 2017

CSP has (is getting?) a nonce-based approach. IIUC the concern with CSP is that an attacker would be able to inject a script that loaded an outdated/insecure library through the cache, thus bypassing controls based on origin. However requiring nonces for SRI-based caching seems to solve this issue as the attacker wouldn't know the nonce; it also creates a performance incentive for websites to move to nonces, which are more secure than domain whitelists for the same reason[1].

I think it's possible that we could solve the privacy problem by requiring a certain number of domains to reference the script... it'd be really useful to have some metrics from browser telemetry here. For example if we determined that enough users encountered e.g. a reference to jQuery in >100 domains for that to be the minimum, it might be that we could load things from an SRI cache if they had been encountered in 100+ distinct top-level document domains (i.e. domains the user explicitly browsed to, not that were loaded in a frame or something). The idea being that because of the top-level document requirement, the attacker would have to socially engineer the user into visiting 100 domains, which would be very, very difficult. However if telemetry told us that 100 is too high a number and it's actually more like 20 for a particular jQuery version, that'd be a different story.

[1]: consider e.g. being able to load an insecure Angular version from the Google CDN because the site loaded jQuery from the Google CDN

@zrm

zrm commented Apr 5, 2017

For example, the user agent could request https://example.com/.well-known/sri-list, which would return a plain text file with a list of acceptable hashes, one per line.

For some domains that file could be too large and change too often. Consider Tumblr's image hosting (##.media.tumblr.com) where each of the domain names hosts billions of files and the list changes every second.

How about something similar to HTTP ETag but with a client-specified hash algorithm? If the hash is correct you only get a response affirming as much instead of the entire file, which the browser can cache. It doesn't save you the round trip but it saves you the data.
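A minimal server-side sketch of that ETag-like check, assuming the client sends an algorithm name and a base64 digest. The function name is hypothetical, and reusing status 304 for the "hash is correct" answer is borrowed from ETag revalidation purely for illustration.

```python
import base64
import hashlib


def conditional_response(body: bytes, algorithm: str, client_digest_b64: str):
    """Return (status, payload): an empty 304-style reply if the client's digest matches."""
    if algorithm not in ("sha256", "sha384", "sha512"):
        return 200, body  # unknown algorithm: fall back to sending the file
    digest = base64.b64encode(hashlib.new(algorithm, body).digest()).decode("ascii")
    if digest == client_digest_b64:
        return 304, b""  # affirm the hash; the client reuses the copy it already has
    return 200, body


body = b"console.log('hi');"
good_digest = base64.b64encode(hashlib.sha384(body).digest()).decode("ascii")
```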

@MatthewSteeples

From a privacy perspective, could we make it so that the resource is loaded from each origin at least once (if for no other reason than to verify that the SRI hash is valid)? The browser could then still cache only one instance of it (and re-use whatever compilation cache etc. it deems relevant), storing that information only once (and with various weightings etc. the file may persist in cache for longer).

This removes some of the benefit that user agents could get from a "first load" perspective, but solves the privacy issue and keeps some of the other benefits.

As a side note, this could actually be implemented without the use of SRI hashes. If the browser links together identical files based on contents (eg stored against a hash), then it could perform this kind of optimisation irrespective of whether the website declares SRI hashes.
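The deduplication described here can be sketched as a store that keeps one blob per content hash while still requiring every origin to fetch the file itself once. The class and method names are hypothetical, and SHA-256 is just one possible choice of hash.

```python
import hashlib


class DedupCache:
    """Per-origin cache entries that share a single stored copy of identical bodies."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes, stored once
        self.entries = {}  # (origin, url) -> content hash

    def store(self, origin: str, url: str, body: bytes) -> None:
        key = hashlib.sha256(body).hexdigest()
        self.blobs.setdefault(key, body)  # keep the first copy, skip duplicates
        self.entries[(origin, url)] = key

    def lookup(self, origin: str, url: str):
        # An origin only gets a hit for entries it fetched itself.
        key = self.entries.get((origin, url))
        return None if key is None else self.blobs[key]


cache = DedupCache()
library = b"/* same library bytes */"
cache.store("https://a.example", "/js/lib.js", library)
cache.store("https://b.example", "/vendor/lib.min.js", library)
```

Both origins get hits for their own URLs, but a third origin that has not fetched the file yet still misses, which is what keeps the first load observable by the server.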

@ArneBab

ArneBab commented Feb 5, 2020

@MatthewSteeples which benefits remain? If the browser only downloads but skips compilation, the privacy problems resurface via timing attacks.

@MatthewSteeples

@ArneBab while theoretically possible, we're talking about a one-shot attempt to time how long it took the browser to compile something. You couldn't do repeated measurements to benchmark the speed of the device, or know what else was happening at the same time, so I'm not sure how reliable the numbers would be unless you're targeting a significantly large JS file. Would the same be true for CSS files?

If it's still too much of a privacy risk, you could still have the battery benefit by just sleeping for as long as the compilation took last time.

@ArneBab

ArneBab commented Feb 6, 2020

@MatthewSteeples they could provide other files with intentional changes to benchmark the browser during the access, and sleeping can be detected, because it can speed up the other compiles.

So you don’t really win much in exchange for giving up the benefits of not accessing the site at all. For CSS files this is true, too. As an example you can take this page with minimal resources which shows significant parse-time in Firefox.

But it would be possible to provide real privacy with a browser-provided whitelist and canonical URLs. That keeps the benefit of already having the file locally most of the time.

So the core question is: if you download (and compile, because otherwise this is detectable), even though you have the file locally, which benefits remain? Are there benefits that remain?

@Cristy94

Cristy94 commented May 29, 2020

A shared cache does definitely bring a lot of advantages (faster sites, less data usage for the user, less network usage for the ISPs, browsers could cache the compiled/interpreted files, etc).

From what I read in this thread, the main pushback is the privacy concern that a specific user could be tracked by checking whether he has a specific file cached or not, meaning that we can know if the user visited a site (or same site) before that had the same file included.

The solutions I see for the privacy concerns:

  • A browser flag, similar to DoNotTrack, where users can opt-out (or in) of using the shared cache
  • The requests to the "public" files in the shared cache send reduced user information (eg. don't send cookies, but hiding the IP for example might not be possible).
  • Browsers could only cache files in the shared cache if they are included on a lot of domains with a lot of traffic. That way the tracking is effectively worthless, as you cannot say that user X visited site Y; you could only say that user X probably visited one of those 100,000 domains. Alternatively, only add the resource to the cache with a small chance: the more domains it's on, the higher the chance.
  • This is a bit crazy, but browsers could try to download the resource in the background from a random domain/server which is close to the user, acting as a CDN by itself. So I go to X.com, there is a miss of the shared cache file, browser has a list of domains that have that resources, downloads it from a random one Y.com. One issue with this is that the traffic usage might be shifted towards servers who don't really want that traffic, but maybe this could be the compromise you have to do in order to use the "public" shared cache on your site: you allow users from other sites to load that resource from you.

I think that the shared cache is a lot better from a privacy point of view than including the resources from a 3rd party domain. So, although it allows some sort of tracking, it is still a step forward from just having all the websites linking to the same file on a CDN.

@mozfreddyb
Collaborator

You're misreading. The main pushback is the security concern.

The privacy concern is already existing for CDNs and browsers are fighting it.
Safari is doing it, Firefox will: Resources (like CDN stuff) will land in a per-top-level-website ("first party") cache, that will make the bandwidth and speed wins from a CDN void.

Safari calls it "partitioned cache"
Firefox calls it First Party Isolation.
whatwg/fetch#904 has some standards-specific context.
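A toy model of what partitioning means for a CDN-hosted file, assuming the cache key is simply the pair (top-level site, resource URL). Real browsers use more elaborate keys; the class name and the URLs below are hypothetical.

```python
class PartitionedCache:
    """HTTP cache keyed by the top-level site in addition to the resource URL."""

    def __init__(self):
        self.entries = {}  # (top-level site, resource URL) -> body

    def put(self, top_level_site: str, url: str, body: bytes) -> None:
        self.entries[(top_level_site, url)] = body

    def get(self, top_level_site: str, url: str):
        return self.entries.get((top_level_site, url))


cdn_url = "https://cdn.example/jquery.js"
http_cache = PartitionedCache()
http_cache.put("https://a.example", cdn_url, b"/* jquery */")
```

A later visit to b.example that references the same CDN URL misses, so the file is downloaded again and the cross-site reuse a CDN used to provide is gone.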

I'm afraid this will never be.

@mozfreddyb
Collaborator

@hillbrad wrote a great document outlining the privacy and security risks of shared caching: https://hillbrad.github.io/sri-addressable-caching/sri-addressable-caching.html

@w3c w3c locked as resolved and limited conversation to collaborators Jun 2, 2020
@w3c w3c unlocked this conversation Jun 2, 2020
@mozfreddyb
Collaborator

(@annevk asked me to unlock the conversation. I'm not too hopeful about seeing new information in this 5 year old thread.)

@brillout

brillout commented Jan 2, 2023

I don't think there is a fundamental problem here that makes this impossible?

For example, if browsers were to cache popular libraries such as React and Vue, then this wouldn't pose any problems, correct?

If we can find a technique ensuring that only popular library code is cached (instead of unique app code), then we solve the problem, right? (I'm assuming that Subresource Integrity Addressable Caching covers all known issues).

Could we maybe reopen this ticket? I'd argue that as long as we don't find a fundamental blocker, then having a cross-origin shared cache is still open for consideration.

The benefits would be huge... it seems very well worth it to further dig.

@mozfreddyb
Collaborator

(I'm assuming that Subresource Integrity Addressable Caching covers all known issues).

The other comment just before my last one has a newish - imho fundamental - blocker. Browsers are already partitioning their cache per first-level site (eventually more granular. Maybe per origin, or per frame-tree).

This issue just turned 7 years old. I'll leave this issue closed because nobody has managed to come up with an idea since.

New issues are cheap. I'm still happy to discuss new and specific proposals - I just currently do not believe those to exist.

@mozfreddyb mozfreddyb pinned this issue Jan 3, 2023
@brillout

brillout commented Jan 3, 2023

The other comment just before my last one has a newish - imho fundamental - blocker.

Do you mean https://terjanq.github.io/Bug-Bounty/Google/cache-attack-06jd2d2mz2r0/index.html?

In other words, the consequences are worse than initially thought: it's not only a privacy concern, but also a security concern. For example, it enables attackers to use brute-force attacks to guess private data such as a password saved in Google Keep. (Because Google Keep loads different assets depending on whether Google Keep's search returns 0 results, as explained in VI. Google Keep > Vulnerable resource.)

I also think it's a fundamental blocker for small assets such as individual images of an icon library.

That said, I can still see it to be possible to have a shared cache for widespread assets such as React or Angular. Just for the sake of the argument and regardless of feasibility, if websites can declare dependencies in a global manner (e.g. "Google Search" being able to say "I depend on Angular"), then AFAICT this doesn't pose any problems.

Another more interesting example: "Google Keep" can declare that it uses the font "Inter". The interesting thing here is that this doesn't suffer the security issue that I described above, because the dependency is defined globally instead of being defined page-by-page.

As for privacy, it's paramount that only widespread assets (e.g. React, Angular, Inter, ...) can be shared-cached. No code unique to a website/page should ever be shared-cached.

All in all, I can't see any problems with such a high-level goal of enabling websites to globally declare dependencies on widespread assets. Or am I missing something?

While it's challenging to find a concrete technique for implementing this high-level goal (e.g. I'm not sure how a "website" can "globally declare" its "dependency" on an "asset"), I think there is still hope.

Thanks for the discussion, I'm glad if we can bring everyone interested in this on the same page.

@annevk
Member

annevk commented Jan 3, 2023

This has been investigated in depth by both Google and Mozilla. You're welcome to try again, but the bar at this point is indeed a concrete proposal. This was essentially only found to work if you bundle the libraries with the browser, which creates all kinds of ecosystem problems.

@alexshpilkin

alexshpilkin commented Jan 3, 2023

@mozfreddyb I’m sorry, I’m confused. From latest to earliest of your comments, there is the one I’m replying to, then a procedural one, then one that links to Brad Hill’s writeup (which @brillout already mentioned, although without addressing the cross-origin laundering issue—the mention of Want-Digest upthread does that), then one that corrects a previous commenter, says security is the problem, and otherwise amounts to “browsers are switching to separated caches” (which is the one you linked, and it’s true, but I can’t see how it constitutes a “newish blocker”—or did you indeed mean to refer to cache probing with error events as a new security-relevant point?).

@mozfreddyb
Collaborator

@alexshpilkin

@mozfreddyb I’m sorry, I’m confused.

Fair enough, I agree that I haven't been super clear throughout this thread and that I did not re-read through all of it for every comment I submitted.

did you indeed mean to refer to cache probing with error events as a new security-relevant point?).

By "new-ish" I meant "not captured in Brad Hill's doc". Does that answer your question?

-- @annevk said:

This has been investigated in depth by both Google and Mozilla. You're welcome to try again, but the bar at this point is indeed a concrete proposal. This was essentially only found to work if you bundle the libraries with the browser, which creates all kinds of ecosystem problems.

Indeed, the bar is "come up with a proposal that addresses all of these issues" AND either avoids bundling a library into the browser - or does it but then addresses the (imho) significant ecosystem issues that Anne mentioned. For those new to this, I also found Alex Russell's blog post Cache and Prizes a decent summary which involves the concerns with bundling.

To be extra clear, it's not my intention to be gatekeeping or blocking any progress here. I just want to share what we've discussed and considered, because we thought about it for a very long time. All in all, it could be nice if it was solved well.

For new, concrete proposals please open up a new issue.

@brillout

brillout commented Jan 4, 2023

I'm glad to hear that there is still interest.

You're welcome to try again, but the bar at this point is indeed a concrete proposal.
For new, concrete proposals please open up a new issue.

Sounds good. Challenge accepted!

I'll be mulling over all of this in the coming days. I already have a couple of design ideas.

I'll report back.

Thanks for giving me the opportunity to (maybe) make a dent here. I'd be honored.

Alex Russell's blog post Cache and Prizes a decent summary which involves the concerns with bundling

Yes, I agree. As an "underdog OSS developer" (I'm the author of vite-plugin-ssr), I'm particularly attached to fostering innovation.

a shared cache would suddenly create disincentives to adopt anything but the last generation of "winner" libraries

That's an inherent drawback of any shared cache, but, considering other aspects, I actually see a shared cache to be a net win for innovation. (I'll elaborate if I manage to find a viable design.)

flat distribution of use among their top versions, the breadth of a shared cache will be much smaller than folks anticipate.

Yes, it's on my radar as well. While I doubt we can do much about the flat distribution aspect (since "don't break the web" is Job #1 for browsers), maybe we can do something about reducing disk usage.

Browser teams aggressively track and manage browser download size

Yes. I strongly believe a shared cache shouldn't be included in the initial browser download. (Or it should include very few things that have like a 99% chance of being downloaded upon the first few websites the user visits.) A shared cache should grow organically, otherwise it becomes a governance mess.

@brillout

The following design (tentatively) addresses all timing attack concerns (regarding both privacy and security).

If the feedback is positive, we can create a proper RFC in a new GitHub ticket.

The Assets Manifest

A website https://my-domain.com can define an "assets manifest" at https://my-domain.com/assets.json declaring assets for the entire domain my-domain.com.

// https://my-domain.com/assets.json
{
  "assets": {
    "react": {
      "src": "https://some-cdn.com/react/18.2.0",
      "integrity": "sha384-oqVuAfXRKap7fdgcCY5uykM6+R9GqQ8K/uxy9rx7HNQlGYl1kPzQho1wx4JwY8wC",
      "type": "module"
    }
  }
}

<html>
  <head>
    <script name="react"></script>
  </head>
</html>

Having shared assets defined in a per-domain fashion solves the problem of timing attacks determining the user's activity on a given website. (Assets defined in /assets.json are loaded regardless of which page the user visits. Therefore, the shared cache is completely decoupled from the user's activity on a given website.)
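For illustration, here is how a browser-side check of one manifest entry could look: parse the manifest, then verify the downloaded bytes against the entry's integrity value before anything is cached. The manifest format is the one proposed above; the helper name and the stand-in file contents are hypothetical, and only a single "alg-digest" integrity value is handled.

```python
import base64
import hashlib
import json


def verify_integrity(body: bytes, integrity: str) -> bool:
    """Check body against a single 'alg-base64digest' integrity value."""
    algorithm, _, expected = integrity.partition("-")
    if algorithm not in ("sha256", "sha384", "sha512"):
        return False
    actual = base64.b64encode(hashlib.new(algorithm, body).digest()).decode("ascii")
    return actual == expected


# Stand-in bytes, not the real library; the digest is computed to match them.
body = b"export const version = '18.2.0';"
digest = base64.b64encode(hashlib.sha384(body).digest()).decode("ascii")
manifest_text = json.dumps({
    "assets": {
        "react": {
            "src": "https://some-cdn.com/react/18.2.0",
            "integrity": "sha384-" + digest,
            "type": "module",
        }
    }
})

entry = json.loads(manifest_text)["assets"]["react"]
```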

Protect user privacy

In order to protect users from timing attacks retrieving the user's browsing history, the shared cache should behave as in the following example.

  1. User installs Firefox (the shared cache is empty).
  2. User visits facebook.com which defines facebook.com/assets.json that has the entry https://some-cdn.com/react/18.2.0.

    The browser loads react@18.2.0 and adds it to the local cache of facebook.com but doesn't add it to the shared cache (yet).

  3. User visits netflix.com which defines netflix.com/assets.json that also has the entry https://some-cdn.com/react/18.2.0.

    The browser loads react@18.2.0 again and adds it to both the local cache and the shared cache. The idea here is that, at this point, the browser knows that react@18.2.0 is being used by two different domains (facebook.com and netflix.com). This doesn't seem like a sufficient privacy protection at first but I'll elaborate in a moment why it actually is.

  4. User visits discord.com, which also defines discord.com/assets.json with the entry https://some-cdn.com/react/18.2.0.

    The browser uses the shared cache to get react@18.2.0

While it may seem surprising at first, the guarantee that a resource is added to the shared cache only after the browser knows it's used by two distinct domains is actually enough to protect users from privacy attacks. Let me elaborate.
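The walk-through above can be simulated with a small model. This is only a sketch of the proposed rule (promote after two distinct domains); the class name and the returned labels are made up for the example.

```python
class Browser:
    """Toy model: per-domain local caches plus a shared cache with an admission rule."""

    def __init__(self, min_domains: int = 2):
        self.min_domains = min_domains
        self.local = set()   # (domain, asset) pairs held in per-domain caches
        self.seen_on = {}    # asset -> domains whose manifest listed it
        self.shared = set()  # assets promoted to the shared cache

    def visit(self, domain: str, asset: str) -> str:
        """Return where the asset was served from for this visit."""
        if asset in self.shared:
            return "shared cache"
        if (domain, asset) in self.local:
            return "local cache"
        # Cache miss: fetch over the network and remember which domain asked.
        self.local.add((domain, asset))
        domains = self.seen_on.setdefault(asset, set())
        domains.add(domain)
        if len(domains) >= self.min_domains:
            self.shared.add(asset)
        return "network"


browser = Browser()
steps = [
    browser.visit(domain, "react@18.2.0")
    for domain in ("facebook.com", "netflix.com", "discord.com")
]
```

The first two visits both hit the network, and only the third is served from the shared cache, matching steps 2 to 4.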

Let's consider the following example:

  1. User installs Firefox (the shared cache and browser history are empty).
  2. User visits google.com which loads Angular + the Inter font. (And defines these in its /assets.json.)
  3. User visits facebook.com which loads React + the Open Sans font. (And defines these in its /assets.json.)
  4. User visits discord.com which loads React + the Inter font. (And defines these in its /assets.json.)

This means that, at this point, all 4 assets (Angular, React, Open Sans font, Inter font) are in the shared cache, which seems like a glaring privacy leak. But it's actually not.

While it's true that the combination React + Inter uniquely identifies discord.com, a malicious website cannot use that fact to determine the user's browsing history. While the malicious website can use timing attacks to determine that the shared cache contains React and Inter, it cannot determine why React and Inter are in the shared cache. Are React and Inter in the shared cache because the user visited discord.com (loading both React + Inter), or because the user visited facebook.com (loading React) and google.com (loading Inter)? The malicious website cannot determine that.
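The ambiguity argument can be checked by brute force. The sketch below assumes, as the example does, that an asset counts as cached once any visited site has loaded it (so it ignores the two-domain admission rule), and that the attacker only probes React and Inter. The function name is hypothetical.

```python
from itertools import combinations

# Which assets each site in the example declares.
SITE_ASSETS = {
    "google.com": {"Angular", "Inter"},
    "facebook.com": {"React", "Open Sans"},
    "discord.com": {"React", "Inter"},
}


def consistent_histories(observed: set) -> list:
    """All sets of visited sites that would leave every observed asset cached."""
    sites = sorted(SITE_ASSETS)
    result = []
    for size in range(1, len(sites) + 1):
        for history in combinations(sites, size):
            cached = set().union(*(SITE_ASSETS[site] for site in history))
            if observed <= cached:
                result.append(frozenset(history))
    return result


histories = consistent_histories({"React", "Inter"})
```

Under these assumptions both the history "discord.com only" and the history "facebook.com plus google.com" appear in the result, so the observation alone does not single out discord.com.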

To be on the safe side, I still think that the browser should be slightly more conservative and add assets to the shared cache only after the user has visited n domains that share the same resource, where n is a number we deem appropriate.

Increase shared cache efficiency

For a library https://some-cdn.com/react to be eligible for the shared cache, it needs to serve https://some-cdn.com/react/18.2.0/diff/18.1.0, which returns the diff between 18.2.0 and 18.1.0, enabling browsers to quickly update React from 18.1.0 to 18.2.0.

Edge platforms such as Cloudflare Workers make it relatively easy to implement a performant https://some-cdn.com/${some-library}/${versionA}/diff/${versionB}.
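As a rough illustration of the URL pattern, a browser holding an older version could build the diff URL like this. The function name and parameter names are hypothetical; only the path layout comes from the example above.

```typescript
// Hypothetical helper: builds a diff URL following the pattern
// <cdn>/<library>/<targetVersion>/diff/<cachedVersion>.
function diffUrl(
  cdnBase: string,
  library: string,
  targetVersion: string, // version the page asks for
  cachedVersion: string, // version already in the shared cache
): string {
  const base = cdnBase.replace(/\/+$/, ""); // strip trailing slashes
  return `${base}/${library}/${targetVersion}/diff/${cachedVersion}`;
}
```

For example, `diffUrl("https://some-cdn.com", "react", "18.2.0", "18.1.0")` yields the URL from the paragraph above.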

Conclusion

I expect questions (especially around privacy) that I'm happy to answer.

If, after discussing this we come to the conclusion it's worthwhile to pursue this direction, we can create a proper RFC.

I'm very excited about the impact we may achieve, especially for low-end devices and low-end networks.

I'm very much looking forward to (critical) feedback.

@ArneBab

ArneBab commented Mar 19, 2023

@brillout I’m not sure I like seeing that bound to a domain, but I do see two upsides:

  • This matches the usual deployment of service workers.
  • Whether you visit a domain can already be found out by your ISP, but not which pages you visit. Therefore this approach prevents leaking information that is properly hidden with HTTPS.

@arjunindia

If a set of companies/domains, for example alibaba.com, tencent.com and so on, uses a library like antdesign-custom.js that only they use, we could maybe identify whether a user has visited that specific set of domains.
So a solution I thought of: how about adding an allowed domains section? The resource antdesign-custom.js would be cached only for those domains.

@Summertime

In that example, across 2 domains (perhaps in an iframe or redirection chain), I can do a timing attack to work out whether facebook was visited (the first domain fully loads Open Sans, the second domain may or may not), which in turn discloses whether google or discord was visited, based on the prior timing attack on React + Inter.

This works for up to n domains, however large n needs to be for the caching to take hold, and it discloses how many matches already existed. And n itself can be directly recovered too: just use a library no one else uses.

@brillout

@Summertime n should only count visits with meaningful user activity. For example, a mere HTTP request to facebook.com isn't counted, while the user spending a couple of hours on facebook.com with many mouse/keyboard events clearly counts as a visit.
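A rough sketch of what "meaningful user activity" could look like as a predicate. The field names and both thresholds are made up for illustration; picking real values would need careful study.

```typescript
// Hypothetical per-domain visit statistics; field names are assumptions.
interface VisitStats {
  dwellSeconds: number; // time the site spent in a foreground tab
  inputEvents: number; // mouse/keyboard/touch events observed
}

// Decide whether a visit counts toward n.
// The default thresholds are placeholders, not recommendations.
function countsTowardN(
  visit: VisitStats,
  minDwellSeconds: number = 60,
  minInputEvents: number = 20,
): boolean {
  return (
    visit.dwellSeconds >= minDwellSeconds &&
    visit.inputEvents >= minInputEvents
  );
}
```

A bare request made from an iframe or a redirect chain has no dwell time and no input events, so it would not count under any positive thresholds.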

@arjunindia Yes, we should take protective measures against this. Example of a very aggressive strategy:

  • Whitelisting of shared-cache CDNs. (I.e. browsers only add resources to the shared cache that are served by a predetermined list of CDNs.)
  • Shared-cache CDNs clearly communicate that only library authors should add resources.
  • The first bytes of the resource need to be:
    DANGER: DON'T EVER WRITE ME WITHOUT READING https://www.w3.org/warning-about-using-the-shared-cache
    So that website authors who may be tempted to add their assets to the CDN need to prepend that message to their assets and are therefore further warned.
  • Shared-cache CDNs automatically remove any resource that doesn't get a massive amount of requests.
  • When such a resource is removed from the CDN, analyze what went wrong: what kind of resource was it? What website used it? Was it a malicious attempt to make a timing attack? Was it successful?

This is a very conservative strategy and I don't think we need to go that far, but it shows that it's in the realm of the possible to address the issue.

I'd even argue that such a strategy can be made so effective that we can skip the whole n technique (i.e. setting n to 1).
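For illustration, the browser-side part of that strategy (the CDN whitelist and the mandatory warning prefix) could be checked roughly as follows. The allowlist contents and the function name are hypothetical, and the sketch treats the resource body as a plain string.

```typescript
// The warning text proposed in the list above.
const WARNING_PREFIX =
  "DANGER: DON'T EVER WRITE ME WITHOUT READING https://www.w3.org/warning-about-using-the-shared-cache";

// Hypothetical allowlist; a real list would ship with the browser.
const ALLOWED_CDN_HOSTS: ReadonlySet<string> = new Set(["some-cdn.com"]);

// Returns true only if the resource comes from an allowlisted CDN
// and its body starts with the warning prefix.
function isEligibleForSharedCache(resourceUrl: string, body: string): boolean {
  let host: string;
  try {
    host = new URL(resourceUrl).hostname;
  } catch {
    return false; // not a valid absolute URL
  }
  return ALLOWED_CDN_HOSTS.has(host) && body.startsWith(WARNING_PREFIX);
}
```

The request-volume and post-removal analysis measures would live on the CDN side and are not covered here.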

I chose the n technique for my "RFC seed" comment in order to clearly communicate the key and most important insight of the RFC seed: privacy concerns can be considered in isolation on a per resource basis (making the problem much easier to reason about and much easier to address).

The motivation of this RFC seed is to move the problem from the realm of "very unlikely to ever happen" towards the realm of "possible to implement and worth further investigation".

How about adding an allowed domains section? The resource antdesign-custom.js would be cached only for those domains.

I like that idea, although I'm thinking maybe we should discuss it in a separate "RFC extension". I'm inclined to keep the conversation focused on the RFC's core propositions and, later, if we can establish confidence in the RFC, extend the scope of the conversation.

@AviKav

AviKav commented Mar 23, 2023

  • Shared-cache CDNs automatically remove any resource that doesn't get a massive amount of requests.

A weakness here is resources correlating with user interests (a clear-cut example being peer-to-peer libraries). What if a large site starts pinning a version? What if websites around a topic start pinning a version? What's the diversity of requests?

This means that, at this point, all 4 assets (Angular, React, Open Sans font, Inter font) are in the shared cache, which seems like a glaring privacy leak. But it's actually not.

While it's true that the combination React + Inter uniquely identifies discord.com, a malicious website cannot use that fact to determine the user's browsing history. While the malicious website can use timing attacks to determine that the shared cache contains React and Inter, it cannot determine why React and Inter are in the shared cache. Are React and Inter in the shared cache because the user visited discord.com (loading both React + Inter), or because the user visited facebook.com (loading React) and google.com (loading Inter)? The malicious website cannot determine that.

Just because you can't pinpoint a site with certainty doesn't mean it doesn't leak information. How do you ensure that any entropy gained from the set of resources in cache is lost among the noise? How do you do this for all light, medium, and heavy users of the web, regardless of the individual?

To be on the safe side, I still think that the browser should be slightly more conservative and add assets to the shared cache only after the user has visited n domains that share the same resource, where n is a number we deem appropriate.

Required reading: Differential Privacy primer by minutephysics and the US Census Bureau: https://www.youtube.com/watch?v=pT19VwBAqKA

@ArneBab

ArneBab commented Mar 24, 2023

There’s still the possibility not to automate it, but instead to have a central list of assets provided by the browser, similar to the Decentraleyes extension, or maybe to just use that extension: https://decentraleyes.org/

This sidesteps all the privacy issues and still brings large parts of the benefits. The only part it doesn’t provide is automatic inclusion of new versions, so it disincentivizes updates of common libraries.

@brillout

@AviKav Yes, the shared cache may indeed leak information about the user if it contains a resource that is (almost) only used by one specific kind of website, e.g. some JavaScript library for peer-to-peer websites such as The Pirate Bay, or some kind of font used primarily on websites aimed at a certain age group (e.g. a Pokemon font).

Even though it's possible to take further protective measures (e.g. the shared-cache CDN should clearly communicate that topic-specific libraries shouldn't be included; the CDN should require websites to specify their topic, e.g. with schema.org, and accordingly remove resources from the CDN if they correlate with a topic; collect further statistics about which resources are used on which websites; etc.), I think at this point we would be putting too many requirements on the shared-cache CDN.

Ideally a shared-cache CDN shouldn't be too complex.

Instead of adding further requirements to shared-cache CDNs, we can tackle the privacy problem from the other side: how can we reduce the opportunities for a malicious entity to mount timing attacks against the shared cache?

For example, we can require the browser to use the shared cache only upon a user activity: if the user navigates to a website by a manual action such as moving her physical mouse over a link or by tapping her touch screen, then and only then are resources loaded from the shared cache. This drastically reduces the opportunity for timing attacks.
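A minimal sketch of that gate, assuming the browser can classify what triggered a navigation. The trigger names and the function name are invented for illustration and don't correspond to an existing browser API.

```typescript
// Hypothetical classification of what caused a navigation.
type NavigationTrigger =
  | "link-click" // user clicked a link with a physical mouse
  | "tap" // user tapped a link on a touch screen
  | "script" // navigation started by page script
  | "redirect" // HTTP or meta redirect
  | "iframe-load"; // embedded frame load

// Only navigations caused by a manual user action may read the shared cache.
function mayReadSharedCache(trigger: NavigationTrigger): boolean {
  return trigger === "link-click" || trigger === "tap";
}
```

Under this rule, loads triggered from an iframe or a redirect chain would always go to the network (or the per-site cache), so they couldn't be used to probe the shared cache.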

I'll elaborate more on this later.

@ArneBab The idea of whitelisting resources has already been brought up and has been disregarded so far (I believe rightfully so).
