Page level images #7
Unfortunately, not yet. We are working on something in this direction to align the full texts from the German Text Archive with the corresponding images. Hopefully, I can get back to you soon with a tool.
Thanks, it will be a useful tool. I am trying to use some Ocropus tools to split the page into line images. I will either OCR the line images to create text to be corrected into ground truth, or type it fully.
@Shreeshrii, you could try this approach:
One problem with this approach is that segmentation errors (e.g. a line gets cut in two, a few words at the beginning/end are missing, etc.) lead to false positives.
I want to use it for Devanagari script. I had looked at Ocropus quite some time back. I am not sure whether Ocropus/Kraken supports Devanagari. Do you know if it has support for complex scripts?
@Shreeshrii There are some papers reporting text recognition results with Ocropus on Devanagari script. However, I am not aware of any shared model you could reuse. You can find some models for Ocropus here: https://github.com/tmbdev/ocropy/wiki/Models. Instead of steps 1.+2., you can also use tesseract to create hOCR output and then use hocr-extract-images to create the line images and texts. Moreover, if you have the ground truth in hOCR format, you can use hocr-eval for evaluating your recognition output. Or do you have the ground truth only as text with the geometric information?
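As a rough illustration of the hOCR route suggested here, something like the following should work (a sketch, assuming tesseract and hocr-tools are on the PATH; the filename and the `-p` pattern are placeholders, not part of the original comment):

```shell
# Sketch: turn one page image into per-line images plus matching text files.
page_to_lines() {
    local img="$1"
    local base="${img%.*}"             # image name without its extension
    # OCR the page and write base.hocr (text plus line bounding boxes)
    tesseract "$img" "$base" hocr
    # Cut out one image and one .txt per ocr_line element in the hOCR
    hocr-extract-images -p "${base}-%03d.png" "${base}.hocr"
}
# Usage: page_to_lines scans/page.png
```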
@zuphilip I have also read about Devanagari training for Ocropus, but the models are not available (I had looked a couple of years ago or so). Thank you for the link to the specific hOCR tools; I will give them a try. The ground truth I have consists of plain text files matching the scanned images, without any positional info. I was able to use them to evaluate OCR accuracy by comparing with the recognized output.
Sanskrit language samples in Devanagari script: https://github.com/Shreeshrii/imagessan/tree/master/groundtruthimages
Ping @adnanulhasan, who may still have some sources from the Ocropus training with Devanagari script texts.
@zuphilip Thank you. I was able to use it for Devanagari script files as well. It took a little experimenting to get the commands right.
The other option which I had used was to run tesseract to get the text and then correct it.
In case it is helpful to others looking for a solution, I am posting below a bash script I use. The ground truth needs to be updated manually: if there is an existing page-level ground truth file, copy it line by line into the line-level ground truth files.
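The original script was not preserved in this thread; below is a hypothetical reconstruction of the kind of script described, based on the later comments (the `./myfiles` folder and `lang=san` come from those comments; the function name and the `-p` pattern are my own). It assumes tesseract and hocr-tools are installed.

```shell
#!/usr/bin/env bash
# Sketch: page images -> line images + *.gt.txt pairs for tesstrain.
generate_line_gt() {
    local lang="$1" dir="$2" img base txt
    for img in "$dir"/*.png; do
        base="${img%.png}"
        # 1. OCR the page to hOCR (text plus line bounding boxes)
        tesseract "$img" "$base" -l "$lang" hocr
        # 2. Cut the page into per-line images and .txt files
        hocr-extract-images -p "${base}-%03d.png" "${base}.hocr"
        # 3. Rename the .txt files to .gt.txt as tesstrain expects
        for txt in "$base"-*.txt; do
            mv "$txt" "${txt%.txt}.gt.txt"
        done
    done
}
# Usage: generate_line_gt san ./myfiles
```

The *.gt.txt files then need manual correction, as noted above.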
Occasionally, the line images are a bit taller than the text, and so they catch letters from the preceding or subsequent lines. Is this a problem for the training (i.e. should such images be fixed to ensure that they do not contain the top/bottom of the neighbouring lines)?
I think this is a problem. It would be great if you could provide a corresponding example, maybe in a specific GitHub issue. Many thanks in advance!
Please see tesseract-ocr/tesseract#2231 for the WordStr format box files.
@bertsky: Concerning the comment by @SultanOrazbayev, clipping may help here, right? Is it possible to get polygonal line shapes from tesseract?
It is possible to get polygon-based segmentation from Tesseract. But even without polygon-masked line images you could try clipping to get rid of the intrusions from neighbours, yes. Or, alternatively, do resegmentation (i.e. increase coherence via another line segmentation). Both methods are already available as OCR-D processors, as is Tesseract region segmentation (optionally with polygons). But you want line segmentation with polygons here, right? I am afraid Tesseract's API does not offer that, only for the "block" level! Should I give details (what/where/how) on using clipping and resegmentation?
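For readers unfamiliar with OCR-D, the two methods mentioned here roughly correspond to invocations like the following. This is a sketch only: the fileGrp names are assumptions, and the processors run inside an initialized OCR-D workspace, not on bare image files.

```shell
# Sketch: clean up line shapes in an OCR-D workspace.
ocrd_fix_lines() {
    # Clipping: remove intrusions from neighbouring lines
    ocrd-cis-ocropy-clip -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-CLIP
    # Or resegmentation: recompute more coherent (polygonal) line shapes
    ocrd-cis-ocropy-resegment -I OCR-D-SEG-LINE -O OCR-D-SEG-LINE-RESEG
}
```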
Hi,
@kabilankiruba This is clearly not related to this thread. Please consider contacting the Tesseract user group.
Is there any tool which will display the line images and gt.txt side by side for easy correction after generating the files from hOCR output (as suggested here)? I do not want to run a web server to do this. Can it be done via JavaScript/HTML: show an image and its gt.txt, save the corrected gt.txt, and have an arrow/option to display the next image and gt.txt? Basically, I would like to run this on my Windows 10 desktop.
https://github.com/OpenArabic/OCR_GS_Data/blob/master/_doublecheck_viewer.py creates an HTML5-based webpage for reviewing OCR training/testing data.
Both kraken's and ocropy's transcription interfaces do that. The hocrjs viewer has an option to make items editable.
Thank you. I think the following workflow will do the trick: transfer correction.html to Windows and browse it there, add the ground truth text for each line image, save the HTML as a complete webpage, and transfer the file back to Linux.
@Shreeshrii Could you please clarify how you match the ground truth txt files extracted with ocropy/ocropus to the line-level images obtained with your script? After using ocropus-nlbin, the original filename is "lost" (ocropus uses numerically increasing filenames). Using tesstrain, I assume that you don't train tesseract on the line-level images and ground truth obtained with ocropus? These images are slightly different from the line images obtained with your script (which uses tesseract directly) because of the preprocessing with ocropus-nlbin. But please correct me if I am wrong. I am confused about what the current workflow is to correct the extracted ground truth.
@fjp These are two different approaches.
Hello, I was facing similar requirements for generating training data in a Windows environment, which ended up in a small script that extracts both coordinates and text data from an existing ALTO file and writes training-data pairs.
@M3ssman This would be a great contribution, especially since it opens up a way to use Aletheia-created GT with tesstrain.
@wrznr I must confess: there are some caveats.
Could you please explain what each line does? I want to run it on my system but am confused about what to change. @Shreeshrii
Assuming that you have tesseract and hocr-tools installed, put your image (png) files in the ./myfiles/ folder. Change lang=san in the bash script to whichever language you need, e.g. lang=eng. The script runs the commands for each image file. After this, the *.gt.txt files need to be manually corrected to match the line images.
Thank you. It has solved some issues, but a problem still persists. I'm attaching a screenshot. Please look into the matter, @Shreeshrii.
Do you have tesseract and hocr-tools installed correctly? It is not finding the hocr config file. Is your tessdata_prefix directory set up correctly? Are the hocr-tools working fine? Change the paths based on your setup.
I installed hocr-tools using "sudo pip3 install hocr-tools". As for tesseract, I cloned the tesstrain repo and used "make leptonica tesseract", since I had to train tesseract manually on data.
Take one image file and run tesseract on it; see if you get text output. Try again with pdf at the end of the command and see if you get PDF output. Then try with hocr. Similarly, test the hocr-tools: check that you can run the hocr-extract-images command. Once you can do this for one file, use the appropriate commands in a for loop for all files.
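The step-by-step check described above can be written out as commands like these (the filename is an example, not from the original comment):

```shell
# Sketch: smoke-test tesseract and hocr-tools on a single page image.
smoke_test() {
    local img="$1"
    local base="${img%.*}"
    tesseract "$img" "$base"           # plain text output -> base.txt
    tesseract "$img" "$base" pdf       # searchable PDF   -> base.pdf
    tesseract "$img" "$base" hocr      # hOCR output      -> base.hocr
    # then test hocr-tools on the hOCR file:
    hocr-extract-images -p "${base}-%03d.png" "${base}.hocr"
}
# Usage: smoke_test myfiles/page1.png
```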
Check the files and folders in /usr/share/tessdata. Do you have a newer set of files under /usr/share/tessdata/4.00?
Both tesseract and hocr-tools are working.
@rraina97 Please open an issue at https://github.com/tmbdev/hocr-tools for help on invoking the hocr-tools commands.
In my case, to make it run I made some minor changes to @Shreeshrii's script: I put the page image file in myfiles and ran the script with bash generate_training_data.sh. Special thanks to @Shreeshrii!
Nitpick:
should be
Courtesy of @Shreeshrii (tesseract-ocr/tesstrain#7).
For a simpler and more efficient way, I recommend GNU parallel. The above stuff becomes two lines: first generate the hocr files, then extract the line images. Even faster (around 10% for me) if you recompile tesseract without OpenMP (./configure --disable-openmp).
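The two GNU parallel lines might look something like this (a sketch, not the poster's exact commands; `{}` is each input file and `{.}` the same name without its extension, and the myfiles folder is carried over from earlier comments):

```shell
# Sketch: parallelize hOCR generation and line extraction with GNU parallel.
hocr_all_parallel() {
    # One hOCR file per page image, run in parallel
    parallel tesseract {} {.} hocr ::: myfiles/*.png
    # Line images plus matching .txt from each hOCR file
    parallel hocr-extract-images -p '{.}-%03d.png' {} ::: myfiles/*.hocr
}
```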
If we only have page images and page-level ground truth text, can we use them to train tesseract instead of line images and line ground truth? I imagine page images/text are closer to the tesseract input/output format?
@whisere The question is whether your page images/texts are aligned at line level, i.e. for each text line the coordinates of the corresponding part of the page image have to be annotated. If not, training Tesseract with your data is not possible.
Thanks. That's not good; there is no text line information in the page texts at all, only multiple blocks.
How about block images and block-level ground truth text?
You would have to align them manually or semi-automatically (i.e. you could try to OCR the images to get the line segmentation and then heuristically match the text to the lines) at the line level. Tesseract text recognition has to be trained on the level of lines; there is no other way (cf. e.g. https://ieeexplore.ieee.org/abstract/document/6628705).
Many thanks for the information!
This is a helpful script, thank you. However, it ends with "run ocr-d train to create box and lstmf files". Can someone tell me how to do this? Thanks.
The script works for line-level images. I have a number of scanned page images with ground truth files. Does the OCR-D project have any tools to segment them into line images with corresponding ground truth text?