Gosseract is much slower than Pytesseract #250

Closed
frytoli opened this issue Jun 16, 2022 · 2 comments

frytoli commented Jun 16, 2022

Summary

Firstly, thank you for the great work on this repo! It made it really easy to incorporate Tesseract into my project.

I'm finding that OCR-ing images with gosseract is much slower than with Pytesseract, whereas I would expect the opposite. One image that I tested took 1m19.203891157s to OCR with gosseract and only 3.1102023124694824 seconds with Pytesseract. My testing shows that, in the Go function I've provided below, the client.Text() call is the culprit (no real surprise there). Has anyone else come across this issue? Is this expected behavior? My code runs in a Docker container, and I'm wondering whether that might also be part of this puzzle.

Reproducible Dockerfile

There's a lot going on in my Dockerfile, but here are the relevant parts:

RUN apt-get install -y \
  libtesseract-dev \
  libleptonica-dev
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
RUN apt-get install -y tesseract-ocr-eng

My Go code:

func OCR(imgs []image.Image) string {
  // New Tesseract client
  client := gosseract.NewClient()
  defer client.Close()

  // Iterate over images and save text
  allText := []string{}
  for _, img := range imgs {
    // Convert image to byte array
    buf := new(bytes.Buffer)
    err := tiff.Encode(buf, img, nil)
    if err != nil {
        ErrorLogger.Printf("Error encoding tiff image and converting to byte array: %s\n", err)
        return ""
    }
    // Send bytes to Tesseract engine
    err = client.SetImageFromBytes(buf.Bytes())
    if err != nil {
        ErrorLogger.Printf("Error sending image byte array to Tesseract: %s\n", err)
        return ""
    }
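    // Get OCRed text; stop on failure, as with the other error paths
    text, err := client.Text()
    if err != nil {
        ErrorLogger.Printf("Error running OCR on image: %s\n", err)
        return ""
    }
    // Save
    allText = append(allText, text)
  }

  // Join text and return
  return strings.Join(allText, " ")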
}

Environment

  • go version go1.17 linux/amd64
  • tesseract 4.0.0-beta.1
  • leptonica-1.75.3

@FriedRiceWithEggs

Some things you can try:

frytoli commented Nov 28, 2022

Thanks for your response! I'll look into this again when I get some time.

frytoli closed this as completed Nov 28, 2022