Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tesseract worker usage improvements #101

Merged
merged 4 commits into from
Nov 21, 2024
Merged

Conversation

ZeeshanZulfiqarAli
Copy link
Contributor

This PR adds functionality that:

  • Reuses tesseract workers, and
  • Add a limit to the number of Tesseract workers, which is set to be disabled by default but can be configured using the maxTesseractWorkers option.

Add limit to the tesseract workers.
@@ -22,6 +25,28 @@ const defaultLLMParams: LLMParams = {
topP: 1, // OpenAI defaults to 1
};

const addWorker = async () => {
const _worker = await Tesseract.createWorker("eng");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have an underscore here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, forgot to change it. Good catch.

@@ -35,6 +39,12 @@ export const zerox = async ({
const aggregatedMarkdown: string[] = [];
const startTime = new Date();

if (maxTesseractWorkers !== -1 && maxTesseractWorkers < NUM_STARTING_WORKERS) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's move this to below the validation. No reason to start up workers if one of the inputs is missing

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe to convertPdfToImages since that is where it is getting used.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll move it below the validation here.

Here is a bit more detail on the workings and the idea behind this:

The worker creation is broken up into two places. Here, the goal is to have X number of workers set up by default so that when it's needed, these workers are ready to go.

I'm not using await here or in the other place because if we wait, then we'll lose any gains we made. Instead, this code will be running in parallel, and there are tasks in between that'll take sufficiently long for it to complete (for eg. pdf download).

If the PDF is larger and requires more workers, then we add those in convertPdfToImages, though all of the workers might not have been added by the time the Tesseract is called. Thats still not that big of a worry because as soon as a worker is added to the scheduler, it is assigned a job. So, using it earlier just makes sure that we hit the ground running, while saving time by not "awaiting" and if it needs more workers, we'll add those.

Does this make sense?

}

// Add more workers if needed
if (numNewWorkers > 0) initTesseractScheduler(numNewWorkers);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to await here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nope, the goal is to have this run in parallel. The scheduler assigns a job to the worker as soon as its added (if a job is available).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants