Is there any way to split tesseract-core.wasm.js into multiple files? #732

Closed
lmk123 opened this issue Apr 10, 2023 · 12 comments

Comments

@lmk123

lmk123 commented Apr 10, 2023

Is your feature request related to a problem? Please describe.

I developed a browser extension to recognize text in images, but when I uploaded it to the Firefox Add-ons store, Firefox rejected my extension because tesseract-core.wasm.js was over 4MB.

A suggestion was made to Firefox to raise the file size limit from 4MB to 5MB, but it doesn't look like this proposal will be accepted anytime soon; see mozilla/addons-linter#4748.

Describe the solution you'd like

Split tesseract-core.wasm.js into two files, so that each file can be smaller than 4MB.

@Balearica
Member

Does this limitation only apply to JavaScript code and not other files? By default, Tesseract.js fetches tesseract-core.wasm.js (~4.5MB) from a CDN. However, this file is small compared to the ~10-20MB .traineddata files also downloaded. Therefore, I would want to confirm that the size of tesseract-core.wasm.js is the only bottleneck before looking into anything here.

@lmk123
Author

lmk123 commented Apr 11, 2023

> Does this limitation only apply to JavaScript code and not other files?

Yes.

When I upload my extension to the Firefox add-ons store, Firefox's error message is "This file is not binary and too large to parse: tesseract-core.wasm.js"

I think this means that files other than JavaScript are not restricted to 4MB.

> By default, Tesseract.js fetches tesseract-core.wasm.js (~4.5MB) from a CDN.

Browser extensions do not allow loading JavaScript files over the network, so I had to include tesseract-core.wasm.js in the package; however, .traineddata files are allowed to be loaded over the network.
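The constraint described here can be checked before uploading. The following is only a rough sketch: the helper name `oversizedScripts`, the exact threshold of 4 * 1024 * 1024 bytes, and the idea that only files ending in `.js` are subject to the limit are assumptions inferred from this thread, not taken from the linter's source.

```javascript
// Assumed limit: the thread only says "over 4MB"; the exact byte count is a guess.
const LIMIT_BYTES = 4 * 1024 * 1024;

// Hypothetical helper: returns the names of JavaScript files above the limit.
// Non-JS files (.wasm, .traineddata) are skipped, matching what was observed above.
function oversizedScripts(files, limitBytes = LIMIT_BYTES) {
  return files
    .filter((file) => file.name.endsWith('.js') && file.bytes > limitBytes)
    .map((file) => file.name);
}

// Approximate sizes as quoted in this thread.
const packaged = [
  { name: 'tesseract-core-simd.wasm.js', bytes: 4.5e6 },
  { name: 'tesseract-core-simd.js', bytes: 123e3 },
  { name: 'tesseract-core-simd.wasm', bytes: 3.3e6 },
];

console.log(oversizedScripts(packaged));
```

With these sizes, only the combined `.wasm.js` file is reported.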

@Balearica
Member

Thank you for confirming this restriction only impacts the JavaScript file.

Tesseract.js-core already includes versions that split up the JavaScript and WebAssembly components. For example, the SIMD-enabled file we use by default is named tesseract-core-simd.wasm.js, and this file contains both the JavaScript and the .wasm code encoded as a string. It is 4.5MB. However, we also include files named tesseract-core-simd.js and tesseract-core-simd.wasm, which contain only the JavaScript and WebAssembly code, respectively. Those files are 123KB and 3.3MB.
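The sizes quoted above are consistent with the .wasm binary being embedded as base64 text, which is an assumption here since the thread only says "encoded as a string". Base64 produces 4 characters for every 3 bytes, so a quick estimate looks like this (the helper name `base64Length` is made up for illustration):

```javascript
// Base64 emits 4 output characters for every 3 input bytes, rounded up.
function base64Length(byteCount) {
  return 4 * Math.ceil(byteCount / 3);
}

// Approximate sizes from the comment above, using decimal MB/KB.
const wasmBytes = 3.3e6;
const glueBytes = 123e3;

// Estimated size of a single file holding the JS glue plus the encoded binary.
const combinedMB = (base64Length(wasmBytes) + glueBytes) / 1e6;
console.log(combinedMB.toFixed(1)); // 4.5
```

Roughly a third of the combined file is therefore encoding overhead, which is why the split files together are smaller and each stays under 4MB.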

It sounds like switching to the versions that split the .js and .wasm code would resolve your issue. We do not use these by default because other users have reported problems implementing this, as documented in #282. I have never used these files personally; however, another user on that issue reported success, so it looks possible.

@lmk123
Author

lmk123 commented Apr 13, 2023

Thank you for the information.

I found three types of files in tesseract.js-core:

  • tesseract-core.asm.js: asm is not the version I need, so ignore
  • tesseract-core.js, tesseract-core.wasm, tesseract-core.wasm.js
  • tesseract-core-simd.js, tesseract-core-simd.wasm, tesseract-core-simd.wasm.js

Of these, tesseract-core.wasm.js and tesseract-core-simd.wasm.js are files with a file size of more than 4MB that have wasm embedded into them.

Before that, I was using tesseract-core.wasm.js, which works properly in the browser extension but is over 4MB in size. I also tried the tesseract-core-simd.wasm.js file and it works fine, but I'm not sure what the difference is between it and the tesseract-core.wasm.js file.

Then I tried setting corePath to tesseract-core.js, tesseract-core-simd.js, tesseract-core.wasm, and tesseract-core-simd.wasm, but they all fail with the error "Failed to load TesseractCore".

@Balearica
Member

Off the top of my head, I'm not sure what modifications would be needed to use the split .js and .wasm files. Tesseract.js is indeed not set up to run with these out of the box; however, I'm guessing that there's some change that could be made to make it work.

Regarding the files with "simd" in the name: at a high level, the only difference is that these run much faster but may not be supported on all devices.

SIMD (single instruction, multiple data) instructions are low-level CPU instructions that apply one operation to several values at once, so they run significantly faster than the equivalent non-SIMD code. Recently, most browsers started allowing WebAssembly to make use of these instructions, which led to a major performance boost for Tesseract.js (for the LSTM model, which is the default). However, we still include a non-SIMD version for browsers/devices that do not support SIMD instructions yet (including iPhones and browsers that have not been updated recently).
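To make the file naming concrete, here is a minimal sketch of choosing between the builds listed earlier in this thread. `pickCoreFile` is a hypothetical helper, not part of Tesseract.js, and the `simd` flag is assumed to come from whatever feature detection the caller already has.

```javascript
// Hypothetical helper: maps two choices onto the file names from this thread.
//   simd:  true if the browser supports WebAssembly SIMD (detected elsewhere)
//   split: true to use the JS-only file that loads a separate .wasm binary
function pickCoreFile({ simd, split }) {
  const base = simd ? 'tesseract-core-simd' : 'tesseract-core';
  return split ? `${base}.js` : `${base}.wasm.js`;
}

console.log(pickCoreFile({ simd: true, split: false })); // tesseract-core-simd.wasm.js
console.log(pickCoreFile({ simd: false, split: true })); // tesseract-core.js
```

When `split` is true, the matching .wasm file (for example tesseract-core.wasm) has to be shipped alongside the .js file.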

@lmk123
Author

lmk123 commented Apr 13, 2023

Thanks for the detailed explanation of SIMD, I've figured it out.

Also, I dug deeper into the code of worker.min.js and tesseract-core.js and I finally fixed the issue.

Search for i(WebAssembly))){t.next=13;break} in node_modules/tesseract.js/dist/worker.min.js and change the 13 to 15 and it loads the wasm file properly!

So, the code that works now looks like this:

const worker = await createWorker({
  workerPath: 'my-worker.min.js', // patched copy of worker.min.js (13 changed to 15, as described above)
  corePath: 'tesseract-core.js',
  workerBlobURL: false,
})
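The manual edit to worker.min.js could also be scripted so that it is not lost when dependencies are reinstalled. This is only a sketch: `patchWorkerSource` is a made-up helper, and the search string is the one quoted above, so it will only match builds of worker.min.js that contain exactly that minified text.

```javascript
// The minified fragment quoted above, and the same fragment jumping to case 15.
const NEEDLE = 'i(WebAssembly))){t.next=13;break}';
const PATCHED = 'i(WebAssembly))){t.next=15;break}';

// Hypothetical helper: returns the patched source, or throws if the fragment
// is missing or ambiguous, so a changed upstream build fails loudly.
function patchWorkerSource(source) {
  const first = source.indexOf(NEEDLE);
  if (first === -1) {
    throw new Error('Pattern not found; worker.min.js may have changed');
  }
  if (source.indexOf(NEEDLE, first + 1) !== -1) {
    throw new Error('Pattern found more than once; refusing to patch');
  }
  return source.replace(NEEDLE, PATCHED);
}
```

A build step would read node_modules/tesseract.js/dist/worker.min.js, pass its contents through this function, and write the result to my-worker.min.js.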

Some detailed explanations.

First, I checked tesseract-core.js and I found that it exports a TesseractCore function. Then, I found that although this file was loaded, the TesseractCore function was not executed, so I guessed that the problem was in the file that loaded tesseract-core.js, which is worker.min.js.

Then, in worker.min.js, I found the part of the code that loads tesseract-core.js based on the corePath parameter:

                    switch ((t.prev = t.next)) {
                      case 0:
                        if (void 0 !== r.g.TesseractCore) {
                          t.next = 15
                          break
                        }
                        if (
                          (o.progress({
                            status: 'loading tesseract core',
                            progress: 0,
                          }),
                          (f = e))
                        ) {
                          t.next = 8
                          break
                        }
                        return (t.next = 6), a()
                      case 6:
                        ;(u = t.sent),
                          (f = 'https://unpkg.com/tesseract.js-core@v'.concat(
                            s['tesseract.js-core'].substring(1),
                            u
                              ? '/tesseract-core-simd.wasm.js'
                              : '/tesseract-core.wasm.js'
                          ))
                      case 8:
                        if (
                          (r.g.importScripts(f),
                          void 0 === r.g.TesseractCoreWASM ||
                            'object' !==
                              ('undefined' == typeof WebAssembly
                                ? 'undefined'
                                : i(WebAssembly)))
                        ) {
// ---------------------- here -------------------------
                          t.next = 13
                          break
                        }
                        ;(r.g.TesseractCore = r.g.TesseractCoreWASM),
                          (t.next = 14)
                        break
                      case 13:
                        throw Error('Failed to load TesseractCore')
                      case 14:
                        o.progress({
                          status: 'loading tesseract core',
                          progress: 1,
                        })
                      case 15:
                        return t.abrupt('return', r.g.TesseractCore)
                      case 16:
                      case 'end':
                        return t.stop()
                    }

In the code above, setting t.next to 13 makes the code throw the error "Failed to load TesseractCore", while 15 returns the TesseractCore function defined by tesseract-core.js. So I tried changing 13 to 15, and it worked!
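For readability, the switch statement above can be paraphrased as a plain function. This is a hand-written approximation, not the original source: progress reporting and the default CDN URL are left out, `scope` stands in for `r.g`, `importScripts` is passed in so the sketch can run outside a worker, and `i(...)` is assumed to be a `typeof` helper.

```javascript
// Readable paraphrase of the minified state machine; case numbers refer to it.
function loadCoreStrict(scope, importScripts, corePath) {
  // case 0: already loaded, jump straight to case 15.
  if (scope.TesseractCore !== undefined) {
    return scope.TesseractCore;
  }
  // case 8: run the core script, which should define a global.
  importScripts(corePath);
  if (scope.TesseractCoreWASM === undefined || typeof WebAssembly !== 'object') {
    // case 13: reached even when the script defined TesseractCore instead.
    throw Error('Failed to load TesseractCore');
  }
  // case 14, then case 15: alias the global and return it.
  scope.TesseractCore = scope.TesseractCoreWASM;
  return scope.TesseractCore;
}
```

If the imported script defines only `TesseractCore`, as the split tesseract-core.js does, the check on `TesseractCoreWASM` fails and the error is thrown even though a usable function is present. Jumping to case 15 skips the throw and returns that function.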

@lmk123
Author

lmk123 commented Apr 13, 2023

I guess this part of the code above should be generated from here:

https://github.com/naptha/tesseract.js/blob/master/src/worker-script/browser/getCore.js

@Balearica
Member

Balearica commented Apr 14, 2023

Glad you figured this out.

I believe the core issue is not with Tesseract.js but with Tesseract.js-core. Tesseract.js expects Tesseract.js-core to create a module named TesseractCoreWASM, which is indeed what it is named if you load the .wasm.js file. However, for whatever reason, the module is named TesseractCore (not TesseractCoreWASM) in the .js file, which is why you were able to fix it simply by bypassing the error message. Should be quick for me to fix in the next version.
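One way to tolerate the naming mismatch on the consuming side would be to accept either global. This is purely an illustration of the idea, with a made-up helper name; it is not a description of how the fix was actually implemented.

```javascript
// Hypothetical helper: accept whichever global the core script defined.
// `scope` stands in for the worker's global object.
function resolveCoreFactory(scope) {
  const factory = scope.TesseractCoreWASM ?? scope.TesseractCore;
  if (typeof factory !== 'function') {
    throw Error('Failed to load TesseractCore');
  }
  return factory;
}
```

With this check, both the combined .wasm.js file and the split .js file would pass, while a script that defines neither global would still produce the original error.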

@Balearica
Member

@lmk123 Can you clarify whether you were actually able to recognize text using your fixed version? When I tried this using a modified example in this repo the tesseract-core.js file was unable to load the tesseract-core.wasm file as it did not have the full file path. This could be easily resolved by editing the file path within tesseract-core.js but I'm just wondering if you encountered this.
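For reference, the path problem described above comes down to resolving tesseract-core.wasm relative to the location of tesseract-core.js rather than relative to the page or worker. Emscripten-generated modules generally accept a `locateFile` callback for this; whether it can be passed through in this version of Tesseract.js is an assumption, and `makeLocateFile` and the example URL are made up for illustration.

```javascript
// Hypothetical helper: builds a callback that resolves companion files
// (such as the .wasm binary) next to the core script.
function makeLocateFile(coreScriptUrl) {
  return (fileName) => new URL(fileName, coreScriptUrl).href;
}

const locateFile = makeLocateFile('https://example.com/vendor/tesseract-core.js');
console.log(locateFile('tesseract-core.wasm'));
// https://example.com/vendor/tesseract-core.wasm
```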

@lmk123
Author

lmk123 commented Apr 15, 2023

I am not experiencing this problem, and I can confirm that text recognition works correctly.

Perhaps that is because I am using it in a browser extension. I have put together a minimal example, available at: https://github.com/lmk123/tesseract-wasm

Balearica pushed a commit that referenced this issue Apr 17, 2023
@Balearica
Member

Thanks for confirming. I have updated the master branch (075e918) such that you should be able to use it without modification. This will be reflected in the next release.

@lmk123
Author

lmk123 commented May 8, 2023

I have upgraded to tesseract.js v4.0.5 and tesseract.js-core v4.0.4 and it works fine.

Thanks again for your work.
