Firstly, thank you for the great work on this repo! It made it really easy to incorporate Tesseract into my project.
I'm finding that OCR-ing images with gosseract is much slower than with Pytesseract, and I would expect the opposite. One image that I tested took 1m19.203891157s to OCR with gosseract and only 3.1102023124694824 seconds with Pytesseract. My testing has shown that, in the Go function I've provided below, the client.Text() call is the culprit (no real surprise there). Has anyone else come across this issue? Is this expected behavior? My code runs in a Docker container, and I'm wondering if that might also be part of the puzzle.
Reproducible Dockerfile
There's a lot going on in my Dockerfile, but here are the relevant parts:
RUN apt-get install -y \
    libtesseract-dev \
    libleptonica-dev
ENV TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata/
RUN apt-get install -y tesseract-ocr-eng
My Go code:
func OCR(imgs []image.Image) string {
	// New Tesseract client
	client := gosseract.NewClient()
	defer client.Close()

	// Iterate over images and save text
	allText := []string{}
	for _, img := range imgs {
		// Convert image to byte array
		buf := new(bytes.Buffer)
		err := tiff.Encode(buf, img, nil)
		if err != nil {
			ErrorLogger.Printf("Error encoding tiff image and converting to byte array: %s\n", err)
			return ""
		}

		// Send bytes to Tesseract engine
		err = client.SetImageFromBytes(buf.Bytes())
		if err != nil {
			ErrorLogger.Printf("Error sending image byte array to Tesseract: %s\n", err)
			return ""
		}

		// Get OCRed text
		text, _ := client.Text()

		// Save
		allText = append(allText, text)
	}

	// Join text and return
	return strings.Join(allText[:], " ")
}
Environment
go version go1.17 linux/amd64
tesseract 4.0.0-beta.1
leptonica-1.75.3
Are you running gosseract in a Docker container and pytesseract without Docker? Check your memory limit with docker stats. Maybe pytesseract is running without resource limits and your container has limited resources.
Disable openmp. Since tesseract 4.1.0, openmp is disabled by default. See: