Is this the correct way to generate tokens for a new example? #121
Comments
@thiagodma Thank you very much for your response. According to this issue reply, we need additional information apart from the text and bounding boxes to make a CSV. Have you been successful with just the text and bounding boxes? Thank you again.
I'm using TATR to get HTML, but I think you need the same additional info for CSV. Here's my code to get it working:

```python
import easyocr

reader = easyocr.Reader(['en'])

# img is a numpy array for your RGB image
ocr_result = reader.readtext(img, width_ths=.03)

tokens = []
for i, res in enumerate(ocr_result):
    tokens.append({
        # res[0] is the detected quadrilateral; take the top-left and
        # bottom-right corners to form an axis-aligned bounding box
        "bbox": list(map(int, [res[0][0][0], res[0][0][1], res[0][2][0], res[0][2][1]])),
        "text": res[1],
        "flags": 0,
        "span_num": i,
        "line_num": 0,
        "block_num": 0
    })
```
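A couple of notes on this snippet: `readtext` returns each detection as a four-point quadrilateral, so keeping only the first (top-left) and third (bottom-right) corners assumes the text is roughly axis-aligned; for rotated or skewed scans the resulting boxes can drift. Lowering `width_ths` from its default also makes EasyOCR less willing to merge horizontally adjacent detections, which pushes the output toward the word-level boxes the token list expects.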
@thiagodma Can you please share the script to extract the bounding boxes for text-based PDFs?
@Nikhilsonawane07 Here's a simplified version of my code. Note: the 'dpi' parameter is the DPI you used to convert your PDF page to an image. You have to pass it here so that the bounding boxes are proportional to the size of the image (which depends on the DPI).

```python
import fitz
from pathlib import Path

# text extraction flags: suppress generated spaces, drop the image-preservation bit
flags = fitz.TEXT_INHIBIT_SPACES & ~fitz.TEXT_PRESERVE_IMAGES
dpi = 100

pdf_path = Path("<< path to your pdf >>")
pdf = fitz.open(stream=pdf_path.read_bytes(), filetype="pdf")
page = pdf[0]  # gets the first page of the PDF

words = page.get_text(option="words", flags=flags)

# converting 'words' to a list of dicts instead of a list of tuples
tokens = []
for word_meta in words:
    tokens.append({
        # times (dpi / 72) is to make sure the bounding boxes are in the same
        # scale as the generated image (PDF coordinates are points, 72 per inch)
        "bbox": list(map(lambda x: int(x * (dpi / 72)), [word_meta[0], word_meta[1], word_meta[2], word_meta[3]])),
        "text": word_meta[4],
        "flags": 0,
        "block_num": word_meta[5],
        "line_num": word_meta[6],
        "span_num": word_meta[7]
    })
```
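For completeness, here is one way the matching image can be produced so the pixels line up with the scaled token boxes. This is a minimal sketch, assuming a recent PyMuPDF where `Page.get_pixmap` accepts a `dpi` keyword; `page` and `dpi` carry over from the snippet above:

```python
import numpy as np

# render the same page at the same dpi so image pixels match the
# (dpi / 72)-scaled token bounding boxes
pix = page.get_pixmap(dpi=dpi)
img = np.frombuffer(pix.samples, dtype=np.uint8).reshape(pix.height, pix.width, pix.n)
```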
Hey @thiagodma, thanks. It's working.
I am trying to generate tokens according to Inference.MD.
Method 1:
I took inspiration from this issue to use CRAFT and get bounding boxes for the text. Given the bounding boxes, I crop each box out of the image, run text recognition, and fill a dictionary with the bounding box and the extracted text.
With this list of dictionaries, I pass the image and token list to TATR, but I end up with empty cells.
Method 2:
I thought maybe the text bounding boxes don't match the output from TATR, so I tried running TATR twice:
first to get bounding boxes, then crop the image and run text recognition to build the token dictionary;
second, passing the token dictionary with the image to TATR to hopefully generate a CSV at the end. But again my output was empty.
Is my logic sound?
Code for method 2 (similar to method 1) is sketched below:
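A minimal sketch of the loop described above, not a verified implementation: it assumes EasyOCR for the recognition step and a hypothetical `cell_bboxes` list holding the `[x1, y1, x2, y2]` boxes from the first TATR pass; variable names are illustrative.

```python
import easyocr

reader = easyocr.Reader(['en'])

# cell_bboxes: hypothetical list of [x1, y1, x2, y2] boxes from the first
# TATR pass; img is the RGB numpy array of the table image
tokens = []
for i, (x1, y1, x2, y2) in enumerate(cell_bboxes):
    crop = img[y1:y2, x1:x2]                 # crop the detected region
    texts = reader.readtext(crop, detail=0)  # recognition only: plain text strings
    tokens.append({
        "bbox": [x1, y1, x2, y2],            # token bbox in full-image coordinates
        "text": " ".join(texts),
        "flags": 0,
        "span_num": i,
        "line_num": 0,
        "block_num": 0
    })
# tokens is then passed to TATR together with img for the second pass
```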
P.S.
Thank you for this very useful library, really looking forward to the progress of this research.