
Commit 0d203b6

Add more anti-fingerprinting mitigations
This includes a couple updates to the algorithms, for download status masking and avoiding actual download cancelation, plus extensive discussion of those mitigations and others in the new "Privacy considerations" section. See webmachinelearning/translation-api#3 and webmachinelearning/translation-api#10.
1 parent 0e5c08e commit 0d203b6

File tree

3 files changed

+63
-16
lines changed


Diff for: README.md

+2-12
@@ -427,19 +427,9 @@ Finally, we intend to prohibit (in the specification) any use of user-specific i
 ### Detecting available options

-The [`availability()` API](#testing-available-options-before-creation) specified here provides some bits of fingerprinting information, since the availability status of each option and language can be one of four values, and those values are expected to be shared across a user's browser or browsing profile. In theory, this could be up to ~6.6 bits for the current set of summarizer options, plus an unknown number more based on the number of supported languages, and then this would be roughly tripled by including writer and rewriter. <!-- log_2(4 (availability) * 4 (type) * 3 (length) * 2 (format)) = ~6.6 -->
+The [`availability()` API](#testing-available-options-before-creation) specified here provides some bits of fingerprinting information, since the availability status of each option and language can be one of four values, and those values are expected to be shared across a user's browser or browsing profile.

-In practice, we expect the number of bits to be much smaller, as implementations will likely not have separate, independently-downloadable pieces of collateral for each option value. (For example, in Chrome's case, we anticipate having a single download for all three APIs.) But we need the API design to be robust to a variety of implementation choices, and have purposefully designed it to allow such independent-download architectures so as not to lock implementers into a single strategy.
-
-There are a variety of solutions here, with varying tradeoffs, such as:
-
-* Grouping downloads to reduce the number of bits, e.g. by ensuring that downloading the "formal" tone also downloads the "neutral" and "casual" tones. This costs the user slightly more bytes, but hopefully not many.
-* Partitioning downloads by top-level site, i.e. repeatedly downloading extra fine-tunings or similar and not sharing them across all sites. This could be feasible if the collateral necessary to support a given option is small; it would not generally make sense for the base language model.
-* Adding friction to the download with permission prompts or other user notifications, so that sites which are attempting to use these APIs for tracking end up looking suspicious to users.
-
-We'll continue to investigate the best solutions here. And the specification will at a minimum allow user agents to add prompts and UI, or reject downloads entirely, as they see fit to preserve privacy.
-
-It's also worth noting that a download cannot be evicted by web developers. Thus the availability states can only be toggled in one direction, from `"downloadable"` to `"downloading"` to `"available"`. And it doesn't provide an identifier that is very stable over time, as by browsing other sites, users will gradually toggle more and more of the availability states to `"available"`.
+This privacy threat, and how the API mitigates it, is discussed in detail [in the specification](https://webmachinelearning.github.io/writing-assistance-apis/#privacy-availability).

 ## Stakeholder feedback

Diff for: index.bs

+55
@@ -1118,6 +1118,8 @@ enum RewriterLength { "as-is", "shorter", "longer" };
 1. Abort these steps.

+   <p class="advisement" id="warning-download-cancelation">The user agent should not actually cancel the underlying download, as explained in [[#privacy-availability-cancelation]]. It could fulfill this requirement by "pausing" the download, such that future calls to |getAvailability| given |options| return "{{Availability/downloading}}" instead of "{{Availability/downloadable}}", but this "pause" must persist even across user agent restarts.
+
 1. [=Initialize and return an AI model object=] given |promise|, |options|, a no-op algorithm, |initialize|, and |create|.
 </dl>
@@ -1445,6 +1447,8 @@ enum RewriterLength { "as-is", "shorter", "longer" };
 1. Let |availability| be the result of |compute| given |options|.

+1. <span id="download-masking-step"></span>If |availability| is "{{Availability/available}}" or "{{Availability/downloading}}", the user agent should set |availability| to "{{Availability/downloadable}}" if doing so would improve the user's privacy, per the discussions in [[#privacy-availability-masking]].
+
 1. [=Queue a global task=] on the [=AI task source=] given |global| to perform the following steps:

 1. If |availability| is null, then [=reject=] |promise| with an "{{UnknownError}}" {{DOMException}}.
@@ -1501,3 +1505,54 @@ A <dfn export>quota exceeded error information</dfn> is a [=struct=] with the fo
 <h3 id="supporting-task-source">Task source</h3>

 [=Tasks=] queued by this specification use the <dfn export>AI task source</dfn>.
<h2 id="privacy">Privacy considerations</h2>

<h3 id="privacy-availability">Model availability</h3>

For any of the APIs that use the infrastructure described in [[#supporting]], the exact download status of the AI model or fine-tuning data can present a fingerprinting vector. How many bits this vector provides depends on the options provided at API creation, and how they influence the download.

For example, if the user agent uses a single model, with no separately-downloadable fine-tunings, to support the summarizer, writer, and rewriter APIs, then the download status provides two bits (corresponding to the four {{Availability}} values) across all three APIs. In contrast, if the user agent uses separate fine-tunings for each value of {{SummarizerType}}, {{SummarizerFormat}}, and {{SummarizerLength}} on top of a base model, then the download status for those summarizer fine-tunings alone provides ~6.6 bits of entropy.
<!-- log_2(4 (availability) * 4 (type) * 3 (length) * 2 (format)) = ~6.6 -->
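The arithmetic behind that estimate can be checked directly. A quick sketch, using the enum cardinalities from the comment above:

```javascript
// Entropy estimate for the per-option fine-tuning scenario above, where
// each (type, format, length) combination has its own download status.
const availabilityStates = 4; // "unavailable", "downloadable", "downloading", "available"
const summarizerTypes = 4;    // SummarizerType values
const summarizerLengths = 3;  // SummarizerLength values
const summarizerFormats = 2;  // SummarizerFormat values

const combinations =
  availabilityStates * summarizerTypes * summarizerLengths * summarizerFormats;
const bits = Math.log2(combinations); // log2(96) ≈ 6.58

console.log(bits.toFixed(1)); // "6.6"
```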
<h4 id="privacy-availability-masking">Download masking</h4>

One of the specification's mitigations is to suggest that the user agent mask the current download status, per <a href="#download-masking-step">this step</a> in [=compute AI model availability=], which backs the `availability()` APIs.

Because implementation strategies differ (e.g. in how many bits they expose), and other mitigations such as permission prompts are available, a specific masking scheme is not mandated. For APIs where the user agent believes such masking is necessary, a suggested heuristic is to mask by default, subject to a masking state that is established for each (API, options, [=storage key=]) tuple. This state can be set to "unmasked" once a web page in a given [=storage key=] calls the relevant `create()` method with a given set of options, and successfully starts a download or creates a model object. Since [=create an AI model object=] has stronger requirements (see [[#privacy-availability-creation]]), this ensures that web pages only get access to the true download status after taking a more costly and less-repeatable action.

Implementations which use such a [=storage key=]-based masking scheme should ensure that the masking state is reset when other storage for that origin is reset, e.g. by treating it as a [=storage bucket=].
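A minimal sketch of such a masking scheme, assuming an in-memory set keyed by a serialized (API, options, storage key) tuple. All names here are illustrative, not spec-defined:

```javascript
// Illustrative sketch of the (API, options, storage key)-keyed masking
// state described above. A real user agent would persist this alongside
// other per-storage-key state, and clear it when that storage is cleared.
const unmaskedTuples = new Set();

function tupleKey(api, options, storageKey) {
  // A real implementation would need a canonical, order-independent
  // serialization of the options dictionary.
  return JSON.stringify([api, options, storageKey]);
}

// Applied to the result of the "compute AI model availability" steps.
function maskedAvailability(trueAvailability, api, options, storageKey) {
  if (unmaskedTuples.has(tupleKey(api, options, storageKey))) {
    return trueAvailability; // this site has already paid the create() cost
  }
  // Mask the states that would reveal a prior download.
  return trueAvailability === "available" || trueAvailability === "downloading"
    ? "downloadable"
    : trueAvailability;
}

// Called once create() successfully starts a download or returns a model.
function recordSuccessfulCreate(api, options, storageKey) {
  unmaskedTuples.add(tupleKey(api, options, storageKey));
}
```

Before a successful `create()`, `maskedAvailability("available", …)` reports `"downloadable"`; afterward, the same call from the same storage key reports the true state, while other storage keys remain masked.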
<h4 id="privacy-availability-creation">Creation-time friction</h4>

The mitigation described in [[#privacy-availability-masking]] works against attempts to silently fingerprint using the `availability()` methods. The specification also contains requirements to prevent `create()` from being used for fingerprinting, by introducing enough friction into the process to make it impractical:

* [=Create an AI model object=] both requires and consumes [=user activation=] when it would initiate a download.
* [=Create an AI model object=] allows the user agent to prompt the user for permission, or to implicitly reject download attempts based on previous signals (such as an observed pattern of abuse).
* [=Create an AI model object=] is gated on a per-API [=policy-controlled feature=], which means that only top-level origins and their delegates can use the API.

Also note that, by nature, initiating the download process is more or less a one-time operation, so the availability status will only ever transition from "{{Availability/downloadable}}" to "{{Availability/downloading}}" to "{{Availability/available}}" via these guarded creation operations. That is, while `create()` can be used to read some of these fingerprinting bits, at the cost of the above friction, doing so will destroy the bits as well.
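The one-way progression can be expressed as a simple monotonicity check (a sketch, not spec text):

```javascript
// Availability only moves forward along this chain in response to
// page-triggered operations; it never moves backward because of them.
const chain = ["downloadable", "downloading", "available"];

function isAllowedTransition(from, to) {
  const i = chain.indexOf(from);
  const j = chain.indexOf(to);
  return i !== -1 && j !== -1 && j >= i;
}
```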
<h4 id="privacy-availability-cancelation">Download cancelation</h4>

An important part of making the download status a less-useful fingerprinting vector is ensuring that the website cannot toggle the availability state back and forth by starting and canceling downloads. Doing so would allow sites much more fine-grained control over the possible fingerprinting bits, allowing them to read the bits via the `create()` methods without destroying them.

The part of these APIs which, on the surface, gives developers control over the download process is the {{AbortSignal}} passed to the `create()` methods. This allows developers to signal that they are no longer interested in creating a model object, and immediately causes the promise returned by `create()` to become rejected. The specification has a "should"-level <a href="#warning-download-cancelation">requirement</a> that the user agent not actually cancel the underlying download when the {{AbortSignal}} is aborted. The web developer will still receive a rejected promise, but the download will continue in the background, and the availability status (as seen by future calls to the `availability()` method) will update accordingly.
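One way a user agent could realize this behavior is sketched below, with an illustrative `startBackgroundDownload` helper (a hypothetical name, not a real API): aborting only detaches the caller, while the download itself keeps running.

```javascript
// Sketch: reject the caller's promise on abort, but let the underlying
// download continue so availability() keeps progressing to "available".
// `startBackgroundDownload` is a hypothetical helper returning a promise
// for a download that runs independently of any one page's request.
function createWithAbort(options, signal, startBackgroundDownload) {
  if (signal.aborted) return Promise.reject(signal.reason);
  return new Promise((resolve, reject) => {
    const download = startBackgroundDownload(options); // NOT tied to signal
    signal.addEventListener("abort", () => reject(signal.reason));
    download.then(model => {
      if (!signal.aborted) resolve(model);
    });
  });
}
```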
User agents might be inclined to cancel the download in other situations not covered in the specification, such as when the page is unloaded. This needs to be handled with caution: if the page can initiate these operations using JavaScript (for example, by navigating away to another origin), that would re-open the privacy hole. So, user agents should not cancel the download in response to any page-controlled actions. Canceling in response to user-controlled actions, however, is fine.

<h4 id="privacy-availability-eviction">Download eviction</h4>

Another ingredient in ensuring that websites cannot toggle the availability state back and forth is to ensure that user agents don't use a fixed quota system for the downloaded material. For example, if a user agent implemented the translator API with one download per language arc, supported 100 language arcs, and evicted all but the 30 most-recently-used language arcs, then web pages could toggle the readable-via-`create()` availability state of language arcs from "{{Availability/available}}" back to "{{Availability/downloadable}}" by creating translators for 30 new language arcs.
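The attack is easy to simulate. A sketch with the numbers from the example above (a 30-entry least-recently-used cache of downloaded language arcs; the helper names are illustrative):

```javascript
// Illustrative LRU quota for downloaded language arcs, showing how a page
// could flip an arc's availability back to "downloadable" on demand.
const MAX_ARCS = 30;
const downloadedArcs = []; // least recently used first

function downloadArc(arc) {
  const i = downloadedArcs.indexOf(arc);
  if (i !== -1) downloadedArcs.splice(i, 1); // refresh recency
  downloadedArcs.push(arc);
  if (downloadedArcs.length > MAX_ARCS) downloadedArcs.shift(); // evict LRU
}

function arcAvailability(arc) {
  return downloadedArcs.includes(arc) ? "available" : "downloadable";
}
```

After downloading `en-fr` and then 30 other arcs, `arcAvailability("en-fr")` is back to `"downloadable"`: the page has toggled the bit purely via page-controlled actions.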
The simplest mitigation is to avoid any API-specific quota, and instead rely on a per-user, disk space-based quota. This specification does not mandate that particular solution, but it does require that user agents not implement systems which allow web pages to control the eviction of downloaded material.

<h4 id="privacy-availability-alternatives">Alternate options</h4>

While some of the above requirements, such as those on user activation or permissions policy, are specified using "must" language to ensure interoperability, most are specified using "should". The reason for this is that it's possible for implementations to use completely different strategies to preserve user privacy, especially for APIs that use small models (for example, the language detector API).

The simplest of these is to treat model downloads like all other stored resources, including them in a [=storage bucket=] which is partitioned by [=storage key=]. This lets the origin model's existing privacy protections operate, and obviates the need for anything more complicated. The downside is that this spends more of the user's time, bandwidth, and disk space redundantly downloading the same model across multiple sites.

A slight variant of this is to re-download the model every time it is requested by a new [=storage key=], while re-using the on-disk storage. This still uses the user's time and bandwidth, but at least saves on disk space.

Going further, a user agent could attempt to fake the download for new [=storage keys=] by just waiting for a similar amount of time as the real download originally took. This then only spends the user's time, sparing their bandwidth and disk space. However, this is less private than the above alternatives, due to the presence of network side channels. For example, a web page could attempt to detect the fake downloads by issuing network requests concurrent to the `create()` call, and noting that there is no change to network throughput. The scheme of remembering the time the real download originally took can also be dangerous, as the first site to initiate the download could attempt to artificially inflate this time (using concurrent network requests) in order to communicate information to other sites that will initiate a fake download in the future, from which they can read the time taken. Nevertheless, something along these lines might be useful in some cases, implemented with caution and combined with other mitigations.

Diff for: security-privacy-questionnaire.md

+6-4
@@ -14,7 +14,7 @@ The privacy implications of both of these are discussed [in the explainer](./REA
 > 02. Do features in your specification expose the minimum amount of information
 >     necessary to implement the intended functionality?

-We believe so. It's possible that we could remove the exposure of the after-download vs. readily information. However, it would almost certainly be inferable via timing side-channels. (I.e., if downloading a language model or fine-tuning is required, then the web developer can observe the creation of the summarizer/writer/rewriter object taking longer.)
+We believe so. It's possible that we could remove the exposure of the download status information. However, it would almost certainly be inferable via timing side-channels. (I.e., if downloading a language model or fine-tuning is required, then the web developer can observe the creation of the summarizer/writer/rewriter object taking longer.)
 > 03. Do the features in your specification expose personal information,
 >     personally-identifiable information (PII), or information derived from
@@ -69,7 +69,7 @@ None.

 We use permissions policy to disallow the usage of these features by default in third-party (cross-origin) contexts. However, the top-level site can delegate to cross-origin iframes.

-Otherwise, it's possible that some of the [anti-fingerprinting mitigations](./README.md#privacy-considerations) might involve partitioning download status, which is kind of like distinguishing between first- and third-party contexts.
+Otherwise, some of the possible [anti-fingerprinting mitigations](https://webmachinelearning.github.io/writing-assistance-apis/#privacy-availability) involve partitioning information across sites, which is kind of like distinguishing between first- and third-party contexts.
 > 14. How do the features in this specification work in the context of a browser’s
 >     Private Browsing or Incognito mode?
@@ -81,9 +81,11 @@ Otherwise, we do not anticipate any differences.

 > 15. Does this specification have both "Security Considerations" and "Privacy
 >     Considerations" sections?

-There is no specification yet, but there is a [privacy considerations](./README.md#privacy-considerations) section in the explainer.
+Not quite yet.

-We do not anticipate significant security risks for these APIs at this time.
+We have a [privacy considerations](./README.md#privacy-considerations) section in the explainer, and the start of a [privacy considerations section](https://webmachinelearning.github.io/writing-assistance-apis/#privacy) in the specification. For now it covers only the fingerprinting issue, but we anticipate migrating more content from the explainer over time.
+
+We do not anticipate significant security risks for these APIs at this time, although we will add a section discussing some basics, like how to avoid allowing websites to use up all of the user's disk space.

 > 16. Do features in your specification enable origins to downgrade default
 >     security protections?
