Accounting for new lines in OCR feature #472

Open
mattakamatsu opened this issue Feb 24, 2024 · 0 comments

The OCR feature is terrific, with one exception: when the source text wraps onto a new line, the OCR output has no space between the last word of one line and the first word of the next. For example:

tilted at +10-20 degrees.Based on the degree of invagination, CCSs were classified into threecategories.

Can we add a space between words that are separated by a line break? I asked GPT-4 how to do this, and here's what it suggested:

// Inside the tesseractImage.onload = async () => { ... }

const {
  data: { text },
} = await worker.recognize(canvas);
await worker.terminate();

const textBullets = text.split("\n");
const bullets = [];
let currentText = "";
for (let b = 0; b < textBullets.length; b++) {
  const s = textBullets[b].trim(); // Trim to remove leading and trailing whitespaces
  if (s) {
    if (currentText && !currentText.endsWith("-")) {
      // Join wrapped lines with a space. Sentence punctuation (".", ",", ...) still needs one,
      // otherwise "degrees." + "Based" stays "degrees.Based"; only a trailing hyphen
      // (a word split across lines) is joined without a space.
      currentText += " ";
    }
    currentText += s;
  } else if (currentText) {
    // Push the currentText into bullets when encountering an empty string (newline), and reset currentText
    bullets.push(
      currentText.startsWith("* ") ||
      currentText.startsWith("- ") ||
      currentText.startsWith("— ")
        ? currentText.substring(2)
        : currentText
    );
    currentText = "";
  }
}
if (currentText) {
  // Ensure any remaining text is also pushed into bullets
  bullets.push(
    currentText.startsWith("* ") ||
    currentText.startsWith("- ") ||
    currentText.startsWith("— ")
      ? currentText.substring(2)
      : currentText
  );
}

// The rest of your logic to create blocks from bullets remains unchanged.
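
To try the joining logic without running Tesseract, here is a minimal standalone sketch of the same idea. The helper name `joinOcrLines` is hypothetical and not part of the plugin, and the positions of the line breaks in the sample string are my assumption, since the example above only shows the OCR output. It adds a space at every line break unless the previous line ends with a hyphen, treats a blank line as the end of a block, and strips a leading bullet marker.

```javascript
// Hypothetical helper (not part of the plugin): merges raw OCR lines into blocks.
function joinOcrLines(text) {
  const blocks = [];
  let current = "";

  // Push the accumulated block, dropping a leading "* ", "- " or em-dash bullet marker.
  const flush = () => {
    if (current) blocks.push(current.replace(/^[*\-\u2014] /, ""));
    current = "";
  };

  for (const rawLine of text.split("\n")) {
    const line = rawLine.trim();
    if (!line) {
      flush(); // a blank line ends the current block
      continue;
    }
    // Insert a space at the line break unless the previous line ended with a
    // hyphen, which is assumed here to mean a word split across lines.
    if (current && !current.endsWith("-")) current += " ";
    current += line;
  }
  flush(); // keep any trailing text
  return blocks;
}

// Assumed line breaks for the example from this issue.
const sample =
  "tilted at +10-20 degrees.\n" +
  "Based on the degree of invagination, CCSs were classified into three\n" +
  "categories.";
console.log(joinOcrLines(sample));
```

With these assumed breaks, the sample comes back as a single block with spaces after "degrees." and "three". One trade-off: a line that legitimately ends in a hyphen or dash is joined to the next line without a space, so that rule may need tuning for real scans.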