Is there any way to split tesseract-core.wasm.js into multiple files? #732
Does this limitation only apply to JavaScript code and not other files? By default, Tesseract.js fetches |
Yes. When I upload my extension to the Firefox add-ons store, Firefox's error message is "This file is not binary and too large to parse: tesseract-core.wasm.js". I think this means that files other than JavaScript are not restricted to 4MB.
Browser extensions do not allow loading JavaScript files over the network, so I had to include tesseract-core.wasm.js in the package; however, |
Thank you for confirming this restriction only impacts the JavaScript file. Tesseract.js-core already includes versions that split up the JavaScript and WebAssembly components.
For example, the SIMD-enabled file we use by default is named tesseract-core-simd.wasm.js. This file contains both the JavaScript and the .wasm code encoded as a string, and it is 4.5MB. However, we also include files named tesseract-core-simd.js and tesseract-core-simd.wasm, which contain only the JavaScript and WebAssembly code, respectively. Those files are 123KB and 3.3MB.
It sounds like switching to the versions that split the .js and .wasm code would resolve your issue. We do not use these by default because other users have reported problems implementing this, as documented in #282. I have never used these files personally; however, another user on that issue reported success, so it looks possible.
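The sizes quoted in this comment can be compared against the store limit with a few lines of JavaScript. This is only a sketch: the 4MB threshold comes from this thread, treating it as 4 × 1024 × 1024 bytes is an assumption, and `findOversizedScripts` is a hypothetical helper, not part of Tesseract.js.

```javascript
// Assumption: the add-on store limit is 4MB, taken here as 4 * 1024 * 1024 bytes.
const MB = 1024 * 1024;
const LIMIT_BYTES = 4 * MB;

// Approximate sizes as quoted in this thread.
const files = [
  { name: 'tesseract-core-simd.wasm.js', bytes: 4.5 * MB },
  { name: 'tesseract-core-simd.js', bytes: 123 * 1024 },
  { name: 'tesseract-core-simd.wasm', bytes: 3.3 * MB },
];

// Hypothetical helper: per this thread, only JavaScript files are subject to the limit,
// so files that do not end in ".js" are ignored.
function findOversizedScripts(list, limit = LIMIT_BYTES) {
  return list
    .filter((f) => f.name.endsWith('.js') && f.bytes > limit)
    .map((f) => f.name);
}

// Only the combined .wasm.js bundle is flagged.
console.log(findOversizedScripts(files));
```

With the split build, neither the 123KB script nor the 3.3MB .wasm file is flagged by this check.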
Thank you for the information. I found three types of files in tesseract.js-core:
Of these, And before that, I was using Then I tried to set the corePath to |
Off the top of my head I'm not sure what modifications would be needed to be able to use the split
Regarding the files with "simd" in the name: at a high level, the only difference is that these will run much faster but may not be supported on all devices. SIMD instructions are low-level instructions for performing basic operations that run significantly faster than their non-SIMD equivalents. Recently, most browsers started allowing WebAssembly to make use of these instructions, which led to a major performance boost for Tesseract.js (for the LSTM model, which is the default). However, we still include a non-SIMD version for browsers/devices that do not support SIMD instructions yet (including iPhones and browsers that have not been updated recently).
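To make the SIMD/non-SIMD choice concrete, here is a minimal sketch of how a page or worker might probe for WebAssembly SIMD support and pick a core file. The probe bytes are assumed to match the ones used by the wasm-feature-detect package; this thread does not show how Tesseract.js itself performs the check, and `supportsSimd` and `pickCoreFile` are hypothetical names.

```javascript
// A tiny WebAssembly module whose only function uses SIMD (v128) instructions.
// Assumption: these bytes mirror the probe used by the wasm-feature-detect package.
const SIMD_PROBE = new Uint8Array([
  0, 97, 115, 109, 1, 0, 0, 0, 1, 5, 1, 96, 0, 1, 123, 3,
  2, 1, 0, 10, 10, 1, 8, 0, 65, 0, 253, 15, 253, 98, 11,
]);

// WebAssembly.validate returns false (it does not throw) when the engine
// rejects the module, which is what happens on engines without SIMD.
function supportsSimd() {
  return typeof WebAssembly === 'object' && WebAssembly.validate(SIMD_PROBE);
}

// Hypothetical helper: choose between the two combined core files named in this thread.
function pickCoreFile(simd) {
  return simd ? 'tesseract-core-simd.wasm.js' : 'tesseract-core.wasm.js';
}

console.log(pickCoreFile(supportsSimd()));
```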
Thanks for the detailed explanation of SIMD, I've figured it out.
Also, I dug deeper into the code of Search for So, the code that works now looks like this:
const worker = await createWorker({
  workerPath: 'my-worker.min.js', // As mentioned above, this file turns 13 into 15
  corePath: 'tesseract-core.js',
  workerBlobURL: false,
})
Some detailed explanations. First, I checked Then, in
switch ((t.prev = t.next)) {
case 0:
if (void 0 !== r.g.TesseractCore) {
t.next = 15
break
}
if (
(o.progress({
status: 'loading tesseract core',
progress: 0,
}),
(f = e))
) {
t.next = 8
break
}
return (t.next = 6), a()
case 6:
;(u = t.sent),
(f = 'https://unpkg.com/tesseract.js-core@v'.concat(
s['tesseract.js-core'].substring(1),
u
? '/tesseract-core-simd.wasm.js'
: '/tesseract-core.wasm.js'
))
case 8:
if (
(r.g.importScripts(f),
void 0 === r.g.TesseractCoreWASM ||
'object' !==
('undefined' == typeof WebAssembly
? 'undefined'
: i(WebAssembly)))
) {
// ---------------------- here -------------------------
t.next = 13
break
}
;(r.g.TesseractCore = r.g.TesseractCoreWASM),
(t.next = 14)
break
case 13:
throw Error('Failed to load TesseractCore')
case 14:
o.progress({
status: 'loading tesseract core',
progress: 1,
})
case 15:
return t.abrupt('return', r.g.TesseractCore)
case 16:
case 'end':
return t.stop()
}
In the code above, |
I guess this part of the code above should be generated from here: https://github.com/naptha/tesseract.js/blob/master/src/worker-script/browser/getCore.js |
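For readers who find the transpiled state machine hard to follow, the sketch below is one possible de-minified reading of it, not the actual source of getCore.js. All parameter names are hypothetical: `scope` stands in for the worker global (`r.g`), `detectSimd` for the minified `a()`, and `version` for the already-trimmed tesseract.js-core version string. It also assumes the split tesseract-core.js script defines a global `TesseractCore`, which is what the "13 into 15" patch relies on.

```javascript
// Hypothetical, readable approximation of the minified loader shown in this thread.
async function getCore(corePath, scope, progress, detectSimd, version) {
  // case 0: reuse an already-loaded core (jumps straight to case 15).
  if (typeof scope.TesseractCore !== 'undefined') return scope.TesseractCore;

  progress({ status: 'loading tesseract core', progress: 0 });

  let url = corePath;
  if (!url) {
    // case 6: no corePath given, so fall back to the CDN build.
    const simd = await detectSimd();
    url = 'https://unpkg.com/tesseract.js-core@v' + version +
      (simd ? '/tesseract-core-simd.wasm.js' : '/tesseract-core.wasm.js');
  }

  scope.importScripts(url); // case 8

  if (typeof scope.TesseractCoreWASM !== 'undefined' && typeof WebAssembly === 'object') {
    scope.TesseractCore = scope.TesseractCoreWASM;
  } else if (typeof scope.TesseractCore === 'undefined') {
    // The unpatched worker throws here whenever TesseractCoreWASM is missing (case 13).
    // This sketch throws only when the script defined neither global, which approximates
    // the effect of the "13 into 15" patch for a script that defines TesseractCore.
    throw Error('Failed to load TesseractCore');
  }

  // Sketch choice: always report completion; the patched jump to case 15 skips this.
  progress({ status: 'loading tesseract core', progress: 1 });
  return scope.TesseractCore;
}

// Usage with a fake worker scope whose importScripts defines TesseractCore.
const fakeScope = {
  importScripts(url) {
    this.loaded = url;
    this.TesseractCore = { id: 'split-core' };
  },
};
getCore('tesseract-core.js', fakeScope, () => {}, async () => false, '4.0.4')
  .then((core) => console.log(core.id, fakeScope.loaded));
```

Checking for either global keeps the combined .wasm.js builds working while also accepting a script that defines `TesseractCore` directly.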
Glad you figured this out. I believe the core issue is not with Tesseract.js but with Tesseract.js-core. Tesseract.js is expecting Tesseract.js-core to create a module named |
@lmk123 Can you clarify whether you were actually able to recognize text using your fixed version? When I tried this using a modified example in this repo the |
I am not experiencing this problem and can confirm that it works correctly for me, perhaps because I am using it in a browser extension. I have put together a minimal example, available at: https://github.com/lmk123/tesseract-wasm
Thanks for confirming. I have updated the master branch (075e918) such that you should be able to use it without modification. This will be reflected in the next release. |
I have upgraded to tesseract.js v4.0.5 and tesseract.js-core v4.0.4 and it works fine. Thanks again for your work.
Is your feature request related to a problem? Please describe.
I developed a browser extension to recognize text in images, but when I uploaded it to the Firefox add-ons store, Firefox rejected my extension because tesseract-core.wasm.js was over 4MB.
A suggestion was made to Firefox to raise the file size limit from 4MB to 5MB, but it doesn't look like this proposal will be accepted anytime soon; see mozilla/addons-linter#4748.
Describe the solution you'd like
Split tesseract-core.wasm.js into two files, so that each file can be smaller than 4MB.