
Is this the correct way to generate tokens for a new example? #121

Open
lionely opened this issue Jun 26, 2023 · 6 comments
lionely commented Jun 26, 2023

I am trying to generate tokens as described in Inference.md.

Method 1:
I got inspiration from this issue to use CRAFT to get bounding boxes around the text. Given the bounding boxes, I crop each box region from the image, run text recognition on it, and fill a dictionary with the bounding box and the extracted text.

With this list of dictionaries, I pass the image and the token list to TATR, but I end up with empty cells.

Method 2:
I thought maybe the text bounding boxes don't match the output from TATR, so I tried running TATR twice:
first to get bounding boxes, crop the image, and do text recognition to build the token dictionaries;
second to pass the token list with the image to TATR and hopefully generate a CSV in the end. But again my output was empty.

Is my logic sound?

Code for method 2 (similar to method 1):

import torch

# Assumes `feature_extractor`/`tatr_model` (TATR) and `processor`/`trocr_model`
# (TrOCR) are already loaded, and that `image` is the PIL page image.
encoding = feature_extractor(image, return_tensors="pt")
with torch.no_grad():
    outputs = tatr_model(**encoding)
target_sizes = [image.size[::-1]]
results = feature_extractor.post_process_object_detection(outputs, threshold=0.6, target_sizes=target_sizes)[0]

tokens = []
for obj in results['boxes']:
    box = tuple(obj.numpy())  # (xmin, ymin, xmax, ymax) in `image` coordinates
    crop_image = resized_image.crop(box=box)
    pixel_values = processor(crop_image, return_tensors="pt").pixel_values
    generated_ids = trocr_model.generate(pixel_values)
    generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    tokens.append({'bbox': list(box), 'text': generated_text})
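
One thing I'm unsure about in the snippet above is that the detection boxes are in `image` coordinates while the crops come from `resized_image`; if the two differ in size, the crops won't line up. A minimal sketch of rescaling the boxes first, assuming both are PIL images (`rescale_box` is just an illustrative helper, not part of TATR):

def rescale_box(box, sx, sy):
    # Scale an (xmin, ymin, xmax, ymax) box by per-axis factors.
    x0, y0, x1, y1 = box
    return (x0 * sx, y0 * sy, x1 * sx, y1 * sy)

sx = resized_image.width / image.width
sy = resized_image.height / image.height
box = rescale_box(box, sx, sy)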

P.S.

Thank you for this very useful library; I'm really looking forward to the progress of this research.

@thiagodma

@lionely I think you should either use PDF packages to extract the bounding boxes for text-based PDFs, or use an OCR engine for image-based PDFs. EasyOCR is a great OCR that gets you the bounding boxes with the text really easily.


lionely commented Jun 29, 2023

@thiagodma Thank you very much for your response. According to this issue reply, we need additional information apart from the text and bounding boxes to make a CSV. Have you been successful with just the text and bounding boxes? Thank you again.

@thiagodma

I'm using TATR to get HTML but I think you need the same additional info for CSV. Here's my code to get it working:

import easyocr

reader = easyocr.Reader(['en'])

# img is a numpy array for your RGB image
ocr_result = reader.readtext(img, width_ths=.03)

tokens = []
for i, res in enumerate(ocr_result):
    # res[0] is the detected quadrilateral (four corner points), res[1] is the
    # text; take the top-left and bottom-right corners as the bbox.
    tokens.append({
        "bbox": list(map(int, [res[0][0][0], res[0][0][1], res[0][2][0], res[0][2][1]])),
        "text": res[1],
        "flags": 0,
        "span_num": i,
        "line_num": 0,
        "block_num": 0
    })
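
If you'd rather drive the repo's inference.py than call the model yourself, the token list can be dumped to one JSON file per page image. As far as I know the script reads word files from a --words_dir whose filenames mirror the images; treat the exact "<stem>_words.json" naming below as an assumption to verify against your checkout:

import json
from pathlib import Path

# Assumed layout: one token file per page image, e.g. page_0.png -> page_0_words.json
out_path = Path("words") / "page_0_words.json"
out_path.parent.mkdir(parents=True, exist_ok=True)
out_path.write_text(json.dumps(tokens, indent=2))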


Nikhilsonawane07 commented Sep 20, 2023

@thiagodma Can you please share the script to extract the bounding boxes for text-based PDFs?

@thiagodma

@Nikhilsonawane07 Here's a simplified version of my code.

Note: the 'dpi' parameter is the dpi you used to convert your PDF page to an image. You have to pass it here so that the bounding boxes are scaled to the size of the image (which depends on the dpi).

import fitz  # PyMuPDF
from pathlib import Path

# Keep word-level text, inhibit extra spaces, and drop images.
flags = fitz.TEXT_INHIBIT_SPACES & ~fitz.TEXT_PRESERVE_IMAGES
dpi = 100

pdf_path = Path("<< path to your pdf >>")
pdf = fitz.open(stream=pdf_path.read_bytes(), filetype="pdf")
page = pdf[0]  # gets the first page of the PDF

# Each entry is a tuple: (x0, y0, x1, y1, word, block_no, line_no, word_no)
words = page.get_text(option="words", flags=flags)

# converting 'words' to a list of dicts instead of a list of tuples
tokens = []
for word_meta in words:
    tokens.append({
        # PDF coordinates are in points (72 per inch), so multiplying by
        # (dpi / 72) puts the boxes in the same scale as the generated image
        "bbox": [int(coord * dpi / 72) for coord in word_meta[:4]],
        "text": word_meta[4],
        "flags": 0,
        "block_num": word_meta[5],
        "line_num": word_meta[6],
        "span_num": word_meta[7]
    })
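
To render the page image at that same dpi so the scaled boxes line up, something like this works (the dpi keyword on get_pixmap is available in recent PyMuPDF releases; on older ones you would pass matrix=fitz.Matrix(dpi / 72, dpi / 72) instead):

# Render the same page at the dpi used for the bounding-box scaling above.
pix = page.get_pixmap(dpi=dpi)
pix.save("page_0.png")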

@Nikhilsonawane07

Hey @thiagodma, thanks! It's working.
