-
Notifications
You must be signed in to change notification settings - Fork 662
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Tesseract worker usage improvements #101
Conversation
Add limit to the tesseract workers.
node-zerox/src/utils.ts
Outdated
@@ -22,6 +25,28 @@ const defaultLLMParams: LLMParams = { | |||
topP: 1, // OpenAI defaults to 1 | |||
}; | |||
|
|||
const addWorker = async () => { | |||
const _worker = await Tesseract.createWorker("eng"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have an underscore here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, forgot to change it. Good catch.
node-zerox/src/index.ts
Outdated
@@ -35,6 +39,12 @@ export const zerox = async ({ | |||
const aggregatedMarkdown: string[] = []; | |||
const startTime = new Date(); | |||
|
|||
if (maxTesseractWorkers !== -1 && maxTesseractWorkers < NUM_STARTING_WORKERS) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's move this to below the validation. No reason to start up workers if one of the inputs is missing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe to convertPdfToImages
since that is where it is getting used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll move it below the validation here.
Here is a bit more detail on the workings and the idea behind this:
The worker creation is broken up into two places. Here, the goal is to have X number of workers set up by default so that when it's needed, these workers are ready to go.
I'm not using await here or in the other place because if we wait, then we'll lose any gains we made. Instead, this code will be running in parallel, and there are tasks in between that'll take sufficiently long for it to complete (for eg. pdf download).
If the PDF is larger and requires more workers, then we add those in convertPdfToImages
, though all of the workers might not have been added by the time the Tesseract is called. Thats still not that big of a worry because as soon as a worker is added to the scheduler, it is assigned a job. So, using it earlier just makes sure that we hit the ground running, while saving time by not "awaiting" and if it needs more workers, we'll add those.
Does this make sense?
node-zerox/src/utils.ts
Outdated
} | ||
|
||
// Add more workers if needed | ||
if (numNewWorkers > 0) initTesseractScheduler(numNewWorkers); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need to await
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nope, the goal is to have this run in parallel. The scheduler assigns a job to the worker as soon as its added (if a job is available).
This PR adds functionality that:
maxTesseractWorkers
option.