nico-martin
Collaborator

This should not be merged yet.

Instead, it's an experimental implementation of the Cross-Origin Storage API that the Google Chrome team is working on:
https://github.com/explainers-by-googlers/cross-origin-storage

To test it, you need to install the Cross-Origin Storage API extension in your browser:
https://github.com/web-ai-community/cross-origin-storage-extension?tab=readme-ov-file

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@@ -0,0 +1,71 @@
class CrossOriginStorage {
static isAvailable = () => "crossOriginStorage" in navigator;
Collaborator

@xenova xenova Oct 19, 2025

We should use a typeof check here; otherwise we get crashes in Node.js. For example, from the unit tests:

2025-10-17T14:21:29.0544102Z FAIL tests/pipelines.test.js
2025-10-17T14:21:29.0567084Z   ● Pipelines › Audio Classification › should be an instance of AudioClassificationPipeline
2025-10-17T14:21:29.0568045Z 
2025-10-17T14:21:29.0568875Z     The error below may be caused by using the wrong test environment, see https://jestjs.io/docs/configuration#testenvironment-string.
2025-10-17T14:21:29.0569870Z     Consider using the "jsdom" test environment.
2025-10-17T14:21:29.0570210Z 
2025-10-17T14:21:29.0570439Z     ReferenceError: navigator is not defined
2025-10-17T14:21:29.0570742Z 
2025-10-17T14:21:29.0571255Z       1 | class CrossOriginStorage {
2025-10-17T14:21:29.0572641Z     > 2 |   static isAvailable = () => "crossOriginStorage" in navigator;
2025-10-17T14:21:29.0573856Z         |                                                      ^
2025-10-17T14:21:29.0574386Z       3 |
2025-10-17T14:21:29.0575040Z       4 |   match = async (request) => {
2025-10-17T14:21:29.0576260Z       5 |     const hashValue = await this._getFileHash(request);
2025-10-17T14:21:29.0577116Z 
2025-10-17T14:21:29.0577475Z       at Function.isAvailable (src/utils/CrossOriginStorage.js:2:54)
2025-10-17T14:21:29.0578105Z       at getModelFile (src/utils/hub.js:483:27)
2025-10-17T14:21:29.0578631Z       at getModelText (src/utils/hub.js:696:26)
2025-10-17T14:21:29.0579145Z       at getModelJSON (src/utils/hub.js:716:24)
2025-10-17T14:21:29.0579821Z       at Function.from_pretrained (src/models/auto/processing_auto.js:51:42)
2025-10-17T14:21:29.0580496Z       at loadItems (src/pipelines.js:3527:27)
2025-10-17T14:21:29.0581023Z       at pipeline (src/pipelines.js:3465:27)
2025-10-17T14:21:29.0581858Z       at Object.<anonymous> (tests/pipelines/test_pipelines_audio_classification.js:15:20)
2025-10-17T14:21:29.0582434Z 
2025-10-17T14:21:29.0583006Z   ● Pipelines › Audio Classification › batch_size=1 › default (top_k=5)

See

const IS_WEBGPU_AVAILABLE = typeof navigator !== 'undefined' && 'gpu' in navigator;

for example
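
Applied to this class, the guard might look like the minimal sketch below. Only `isAvailable` is shown (the rest of the class body is omitted), and the check simply mirrors the `IS_WEBGPU_AVAILABLE` pattern quoted above.

```javascript
class CrossOriginStorage {
  // `typeof` never throws for an undeclared identifier, so this is safe in
  // environments that have no `navigator` global (e.g. the Jest run above).
  static isAvailable = () =>
    typeof navigator !== "undefined" && "crossOriginStorage" in navigator;
}
```

With this guard, `CrossOriginStorage.isAvailable()` returns `false` instead of throwing a `ReferenceError` when `navigator` is missing.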

if (!hashValue) {
return undefined;
}
const hash = { algorithm: "SHA-256", value: hashValue };

I know I hard-coded it, but future versions of COS may use other hashing algorithms. So to future-proof this, maybe make this stand out more by putting it in a constant at the top.
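
One way this could look, as a sketch only: the names `HASH_ALGORITHM` and `toHashDescriptor` are hypothetical and do not appear in the PR. The helper builds the same `{ algorithm, value }` object as the diff line above.

```javascript
// Hypothetical constant name; "SHA-256" is the value hard-coded in the diff.
// Keeping it at the top of the file makes a future algorithm change a one-line edit.
const HASH_ALGORITHM = "SHA-256";

// Hypothetical helper: returns the hash descriptor, or undefined when no
// hash value is available (matching the early return in the diff).
const toHashDescriptor = (hashValue) =>
  hashValue ? { algorithm: HASH_ALGORITHM, value: hashValue } : undefined;
```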

};

_getFileHash = async (url) => {
if (/\/resolve\/main\/onnx\//.test(url)) {

@tomayac tomayac Oct 19, 2025

This function is essentially scraping the website. Maybe leave the original comment from my code where this was linked to an explanation on the HF docs. Also see the comment above about future-proofing this for possible algorithm changes.
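
For readers without that context, here is a sketch of what such a lookup could extract. It assumes the `/raw/` route returns a Git LFS pointer file, which is a short text file containing an `oid sha256:<hex>` line. The helper name `parseLfsPointerHash` is hypothetical and not part of the PR.

```javascript
// Hypothetical helper: extracts the SHA-256 digest from Git LFS pointer text.
// Returns undefined when the text is not a pointer file (for example, a small
// file stored directly in the repository).
const parseLfsPointerHash = (text) => {
  const match = /^oid sha256:([0-9a-f]{64})$/m.exec(text);
  return match ? match[1] : undefined;
};
```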

await writableStream.close();
};

_getFileHash = async (url) => {

This doesn't "see" the requests for the ORT Wasm files. Those should be 100% cached in COS for guaranteed cache hits, as any app using Transformers.js or ONNX Runtime Web loads the same few files.

I think this needs to happen in ORT, or you can of course do it "by hand".

The Wasm file fetch might happen here (line 12), but not 100% sure.

Collaborator

Yes, we have been meaning to "control" this on the Transformers.js side by loading and caching the binary, then pointing wasmPaths to this buffer.

Just need to get around to adding it :)

_getFileHash = async (url) => {
if (/\/resolve\/main\/onnx\//.test(url)) {
const rawUrl = url.replace(/\/resolve\//, "/raw/");
const text = await fetch(rawUrl).then((response) => response.text());

This runs every time, which means you can't run fully offline. Instead, this should cache the url => hash mapping and return the cached value. I had this in my initial implementation and remember that some trickery was needed to make it work with the actual URLs (I don't remember the details, but it may have had to do with the post-redirect URLs that point at the CDN). Just copy what I had; it worked :-)

Collaborator Author

Yes, I did that deliberately. From my point of view, it's a question of separation of concerns/responsibilities. I don't think it is the responsibility of transformers.js to ensure that everything works offline. It is our responsibility to do our best to keep the download payload as small as possible. But here I don't think we need to cache this request, since it is tiny.
On the other hand, we would risk that new versions of an ONNX file would not be loaded, because the cached SHA value would not change. And it would not be obvious to the user or the app developer why.
In my opinion, if a developer wants a fully offline solution, they should solve the offline caching at the service worker level. We could help with that, but we should not abstract it away by default.

Makes sense. And stale-while-revalidate as a caching strategy for these "get SHA-256 hash" routes would work perfectly both for always being offline-capable and for never missing a new model. This should likely be added somewhere as a best practice in the docs, but for here: LGTM.
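
To illustrate the strategy, here is a minimal sketch in plain JavaScript. It is not part of the PR: `getHashStaleWhileRevalidate` is a hypothetical name, `cache` stands in for any Map-like store, and `fetchHash` for whatever async function resolves a URL to its hash. In an app this logic would more likely live in a service worker.

```javascript
// Hypothetical helper illustrating stale-while-revalidate for hash lookups.
const getHashStaleWhileRevalidate = async (url, cache, fetchHash) => {
  // Always start a fresh lookup and store its result for the next call.
  const fresh = fetchHash(url).then((hash) => {
    if (hash) cache.set(url, hash);
    return hash;
  });

  if (cache.has(url)) {
    // Serve the cached hash immediately; a failed background refresh
    // (e.g. when offline) is ignored and the cached value stays in place.
    fresh.catch(() => {});
    return cache.get(url);
  }

  // Nothing cached yet: the first call has to wait for the network.
  return fresh;
};
```

A call made after a model update still returns the old hash once, but the background refresh stores the new one, so the following call picks it up.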

@xenova xenova marked this pull request as draft October 19, 2025 18:07