Gosseract is much slower than Pytesseract #250

frytoli · 2022-06-16T19:00:03Z

Summary

Firstly, thank you for the great work on this repo! It made it really easy to incorporate Tesseract into my project.

I'm finding that OCR-ing images with gosseract is much slower than with Pytesseract, and I would expect the opposite. One image that I tested took 1m19.203891157s to OCR with gosseract and only 3.1102023124694824 seconds with Pytesseract (roughly 79 s versus 3 s). My testing has shown that in the Go function I've provided below, the client.Text() call is the culprit (no real surprise there). Has anyone else come across this issue? Is this expected behavior? My code runs in a Docker container, and I'm wondering if that might also be part of the puzzle.

Reproducible Dockerfile

There's a lot going on in my Dockerfile, but here are the relevant parts:

RUN apt-get install -y \
  libtesseract-dev \
  libleptonica-dev
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
RUN apt-get install -y tesseract-ocr-eng

My Go code:

func OCR(imgs []image.Image) string {
  // New Tesseract client
  client := gosseract.NewClient()
  defer client.Close()

  // Iterate over images and save text
  allText := []string{}
  for _, img := range imgs {
    // Convert image to byte array
    buf := new(bytes.Buffer)
    err := tiff.Encode(buf, img, nil)
    if err != nil {
        ErrorLogger.Printf("Error encoding tiff image and converting to byte array: %s\n", err)
        return ""
    }
    // Send bytes to Tesseract engine
    err = client.SetImageFromBytes(buf.Bytes())
    if err != nil {
        ErrorLogger.Printf("Error sending image byte array to Tesseract: %s\n", err)
        return ""
    }
    // Get OCRed text
    text, err := client.Text()
    if err != nil {
        ErrorLogger.Printf("Error running OCR on image: %s\n", err)
        return ""
    }
    // Save
    allText = append(allText, text)
  }

  // Join text and return
  return strings.Join(allText, " ")
}

Environment

go version go1.17 linux/amd64
tesseract 4.0.0-beta.1
leptonica-1.75.3


FriedRiceWithEggs · 2022-11-28T14:14:48Z

Some things you can try:

Are you running gosseract in a Docker container and pytesseract without Docker? Check your memory limit with docker stats. Maybe pytesseract is running without resource limits and your container has limited resources.
Disable OpenMP. Since Tesseract 4.1.0, OpenMP is disabled by default. See:

frytoli · 2022-11-28T16:06:04Z

Thanks for your response! I'll look into this again when I get some time.

frytoli closed this as completed Nov 28, 2022
