
Is this the correct way to generate tokens for a new example? #121

Open
lionely opened this issue Jun 26, 2023 · 6 comments
lionely commented Jun 26, 2023

I am trying to generate tokens as described in Inference.md.

Method 1:
I got inspiration from this issue to use CRAFT to get bounding boxes around the text. Given the bounding boxes, I crop each box region from the image, run text recognition on it, and fill a dictionary with the bounding box and the extracted text.

With this list of dictionaries, I pass the image and the token list to TATR, but I end up with empty cells.

Method 2:
I thought maybe the text bounding boxes don't match the output from TATR, so I tried running TATR twice:
first to get bounding boxes, crop the image, and do text recognition to build the token dictionaries;
second to pass the token list with the image to TATR and hopefully generate a CSV in the end. But again my output was empty.

Is my logic sound?

Code for method 2 (similar to method 1):

import torch

# Assumes `feature_extractor`/`tatr_model` (TATR) and `processor`/`trocr_model`
# (TrOCR) are already loaded, and that `image` is the PIL page image.
encoding = feature_extractor(image, return_tensors="pt")
with torch.no_grad():
    outputs = tatr_model(**encoding)
target_sizes = [image.size[::-1]]
results = feature_extractor.post_process_object_detection(outputs, threshold=0.6, target_sizes=target_sizes)[0]

tokens = []
for obj in results['boxes']:
    box = tuple(obj.numpy())  # (xmin, ymin, xmax, ymax) in `image` coordinates
    crop_image = resized_image.crop(box=box)
    pixel_values = processor(crop_image, return_tensors="pt").pixel_values
    generated_ids = trocr_model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    tokens.append({'bbox': list(box), 'text': generated_text})
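
One thing I'm unsure about in the snippet above is that the detection boxes are in `image` coordinates while the crops come from `resized_image`; if the two differ in size, the crops won't line up. A minimal sketch of rescaling the boxes first, assuming both are PIL images (`rescale_box` is just an illustrative helper, not part of TATR):

def rescale_box(box, sx, sy):
    # Scale an (xmin, ymin, xmax, ymax) box by per-axis factors.
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

sx = resized_image.width / image.width
sy = resized_image.height / image.height
box = rescale_box(box, sx, sy)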

P.S.

Thank you for this very useful library; I'm really looking forward to the progress of this research.

@thiagodma

@lionely I think you should either use PDF packages to extract the bounding boxes for text-based PDFs, or use an OCR engine for image-based PDFs. EasyOCR is a great OCR that gets you the bounding boxes with the text really easily.


lionely commented Jun 29, 2023

@thiagodma Thank you very much for your response. According to this issue reply, we need additional information apart from the text and bounding boxes to make a CSV. Have you been successful with just the text and bounding boxes? Thank you again.

@thiagodma

I'm using TATR to get HTML but I think you need the same additional info for CSV. Here's my code to get it working:

import easyocr

reader = easyocr.Reader(['en'])

# img is a numpy array for your RGB image
ocr_result = reader.readtext(img, width_ths=.03)

tokens = []
for i, res in enumerate(ocr_result):
    # res[0] is the detected quadrilateral (four corner points), res[1] is the
    # text; take the top-left and bottom-right corners as the bbox.
    tokens.append({
        "bbox": list(map(int, [res[0][0][0], res[0][0][1], res[0][2][0], res[0][2][1]])),
        "text": res[1],
        "flags": 0,
        "span_num": i,
        "line_num": 0,
        "block_num": 0
    })
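
If you'd rather drive the repo's inference.py than call the model yourself, the token list can be dumped to one JSON file per page image. As far as I know the script reads word files from a --words_dir whose filenames mirror the images; treat the exact "<stem>_words.json" naming below as an assumption to verify against your checkout:

import json
from pathlib import Path

# Assumed layout: one token file per page image, e.g. page_0.png -> page_0_words.json
out_path = Path("words") / "page_0_words.json"
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(tokens, indent=2))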


Nikhilsonawane07 commented Sep 20, 2023

@thiagodma Can you please share the script to extract the bounding boxes for text-based PDFs?

@thiagodma

@Nikhilsonawane07 Here's a simplified version of my code.

Note: the 'dpi' parameter is the dpi you used to convert your PDF page to an image. You have to pass it here so that the bounding boxes are scaled to the size of the image (which depends on the dpi).

import fitz  # PyMuPDF
from pathlib import Path

# Keep word-level text, inhibit extra spaces, and drop images.
flags = fitz.TEXT_INHIBIT_SPACES & ~fitz.TEXT_PRESERVE_IMAGES
dpi = 100

pdf_path = Path("<< path to your pdf >>")
pdf = fitz.open(stream=pdf_path.read_bytes(), filetype="pdf")
page = pdf[0]  # gets the first page of the PDF

# Each entry is a tuple: (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = page.get_text(option="words", flags=flags)

# converting 'words' to a list of dicts instead of a list of tuples
tokens = []
for word_meta in words:
    tokens.append({
        # PDF coordinates are in points (72 per inch), so multiplying by
        # (dpi / 72) puts the boxes in the same scale as the generated image
        "bbox": [int(coord * dpi / 72) for coord in word_meta[:4]],
        "text": word_meta[4],
        "flags": 0,
        "block_num": word_meta[5],
        "line_num": word_meta[6],
        "span_num": word_meta[7]
    })
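
To render the page image at that same dpi so the scaled boxes line up, something like this works (the dpi keyword on get_pixmap is available in recent PyMuPDF releases; on older ones you would pass matrix=fitz.Matrix(dpi / 72, dpi / 72) instead):

# Render the same page at the dpi used for the bounding-box scaling above.
pix = page.get_pixmap(dpi=dpi)
pix.save("page_0.png")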

@Nikhilsonawane07

Hey @thiagodma, thanks! It's working.
