
How do I generate a dataframe after identifying the table structure? #112

Open
hyshandler opened this issue Apr 26, 2023 · 15 comments

@hyshandler

hyshandler commented Apr 26, 2023

I'm trying to generate a dataframe from a table in an image. After running the following code I was able to return the original image with (mostly) correct grid lines drawn on the table. How can I turn the coordinates in the image / gridlines and boxes into a pandas dataframe or csv? Thanks in advance!

import torch
from transformers import TableTransformerForObjectDetection, DetrFeatureExtractor

model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
feature_extractor = DetrFeatureExtractor()
encoding = feature_extractor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)
target_sizes = [image.size[::-1]]
results = feature_extractor.post_process_object_detection(outputs, threshold=0.8, target_sizes=target_sizes)[0]

@WalidHadri-Iron

@hyshandler I guess you either use the implementation in this repo directly, since it has all the post-processing steps available up to the construction of the dataframe, or you combine the output of that feature_extractor from huggingface with the post-processing functions you will find in this repo. Unfortunately, the post-processing steps to build the dataframe do not come with the huggingface transformers.

@hyshandler

Thanks @WalidHadri-Iron for the response. Can you point me to the specific functions to construct that process? I'm having trouble finding the right ones in this repo. Thanks!

@WalidHadri-Iron

@hyshandler the post-processing code is here https://github.com/microsoft/table-transformer/blob/main/src/postprocess.py and the steps are grouped here https://github.com/microsoft/table-transformer/blob/main/src/inference.py

What you have in "results" is a bunch of bounding boxes with labels and scores; you need to post-process those bounding boxes based on their score, their label, and the position of the text in the image.

The function objects_to_structures groups almost the whole post-processing

def objects_to_structures(objects, tokens, class_thresholds):

Then you have the three functions to get the output format you want (see the sketch after these signatures):

def structure_to_cells(table_structure, tokens):

def cells_to_html(cells):

def cells_to_csv(cells):
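
Putting those together, a minimal sketch (not from the original comments) of going from the huggingface "results" in the first snippet above to a pandas dataframe. It assumes src/inference.py from this repo is importable (e.g. running from src/), that structure_to_cells returns a (cells, confidence) pair, and that cells_to_csv returns a CSV string; the 0.5 thresholds and variable names are illustrative:

import io
import pandas as pd
from inference import objects_to_structures, structure_to_cells, cells_to_csv

# Convert the huggingface "results" dict into the list-of-dicts format the
# repo's post-processing expects: one {'label', 'score', 'bbox'} per detection.
id2label = model.config.id2label
objects = [
    {"label": id2label[label.item()], "score": score.item(), "bbox": box.tolist()}
    for label, score, box in zip(results["labels"], results["scores"], results["boxes"])
]

tokens = []  # no OCR words available; cells will simply have empty text
class_thresholds = {label: 0.5 for label in id2label.values()}  # illustrative thresholds

# Group row/column/header detections into table structures, then into cells.
for table_structure in objects_to_structures(objects, tokens, class_thresholds):
    cells, _confidence = structure_to_cells(table_structure, tokens)
    csv_string = cells_to_csv(cells)            # assumed to return a CSV string
    df = pd.read_csv(io.StringIO(csv_string))   # pandas dataframe
    print(df)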

@Ashwani-Dangwal

Can the results from the hugging face model be passed into structure_to_cells? If yes, how can I find the tokens that need to be passed as parameters, and how can I get them into the desired data structure?

@amish1706


You can keep the tokens argument as None if you don't have the text data of the cells. It works without tokens too.

@Ashwani-Dangwal

@amish1706
I did keep the tokens as None, but it was throwing an error that tokens cannot be None. What is the workaround for that?
Also, if I don't pass in the tokens, how is the text extracted on its own?
I can't see any OCR in the code that would do the text extraction.

@amish1706

amish1706 commented Jun 21, 2023

import torch
from transformers import TableTransformerForObjectDetection
from transformers import DetrImageProcessor
from inference import *
from PIL import Image
import matplotlib.pyplot as plt

# model and image processor as in the snippet earlier in this thread
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
feature_extractor = DetrImageProcessor()

image = Image.open(img_path).convert("RGB")
encoding = feature_extractor(image, return_tensors="pt")
tokens = []

with torch.no_grad():
    outputs = model(**encoding)
# inference.py expects the DETR-style key name "pred_logits"
outputs['pred_logits'] = outputs['logits']

objects = outputs_to_objects(outputs, image.size, model.config.id2label)
crops = objects_to_crops(image, tokens=tokens, objects=objects, class_thresholds={model.config.id2label[i]: 0.5 for i in range(6)})
structures = objects_to_structures(objects, tokens, class_thresholds={model.config.id2label[i]: 0.5 for i in range(6)})[0]
cells = structure_to_cells(structures, tokens)
visualize_cells(image, cells[0], f"outputs/{img_path.split('/')[-1]}")

Change in original inference code (screenshot)

@Ashwani-Dangwal

Ashwani-Dangwal commented Jun 21, 2023

@amish1706
Maybe I wasn't able to make my question clear.
I wanted to know how I can convert the table in an image into a dataframe / CSV output file without passing tokens, because tokens are used everywhere in the code and it isn't feasible to change the entire codebase accordingly.

I managed to get the text and its bounding boxes using pytesseract, and now I am able to get the results in a CSV file, but sometimes the entries from two rows get concatenated into a single row entry of 2 or 3 columns. I manually looked up the bounding boxes of that text and there was a significant difference in the locations, so I cannot understand why it happened.

If there is a way to do all the work of creating the CSV file without passing the tokens, that would be very helpful. Otherwise, I'd like to know a workaround for the concatenation of some entries described above.

@amish1706

amish1706 commented Jun 21, 2023


From what I understand, and correct me if I'm wrong, you have the bounding boxes of the text and the OCR results from pytesseract, and now you want to use them with the inference code of the table-transformer.

I have only used it to get the table cells and the structure from it, so I might not be able to help properly.

(screenshot of the relevant code)

I found this inside the code; you'll have to provide the tokens file in this format. Check the PubTables-1M dataset that was used in the paper, look for samples, and convert your results into that format.

@Ashwani-Dangwal


I did manage to pass in the tokens too, in the sorted format mentioned by the author.
I am sending you some pictures and their outputs separately, since I cannot share them here, so you can see the issue more clearly.

@lionely

lionely commented Jun 26, 2023

@Ashwani-Dangwal Hi, I'm trying to generate the tokens list in the format described in Inference.md.

There it says we only need the bounding box and the text, not the spans. Were you able to figure this out? I generated tokens in the suggested format, but I end up with empty cells.

@Ashwani-Dangwal

@lionely Other than the bounding box and the text, you also need span_num, block_num and line_num.
If you have scanned images, you can use pytesseract's image_to_data function to get all the values, and if you have a document-format file you can simply use PyMuPDF to get the details.
Hope this helps.
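
For reference, a hedged sketch of what one entry in that tokens list might look like, with the keys taken from the fields named in this thread (the values are made up):

tokens = [
    {
        "bbox": [42.0, 10.0, 118.0, 28.0],  # xmin, ymin, xmax, ymax in image pixels
        "text": "Revenue",
        "span_num": 0,   # word index (pytesseract's word_num, see the next comments)
        "block_num": 0,
        "line_num": 0,
    },
    # ... one dict per word, in reading order
]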

@lionely

lionely commented Jun 27, 2023

@Ashwani-Dangwal Thank you so much for your reply. I see how to get block_num and line_num using pytesseract, but which one would be the span_num? Thanks again!

@Ashwani-Dangwal

@lionely span_num would be the word_num in pytesseract. You would also need to convert the bounding box coordinates into xmin, ymin, xmax, ymax format, since pytesseract gives the bounding boxes in x, y, w, h format.
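
A minimal sketch of that conversion from pytesseract's image_to_data output (the empty-text filtering and the final sort are assumptions, based on the "sorted format" mentioned earlier in this thread):

import pytesseract
from pytesseract import Output

data = pytesseract.image_to_data(image, output_type=Output.DICT)

tokens = []
for i in range(len(data["text"])):
    text = data["text"][i].strip()
    if not text:
        continue  # skip empty OCR entries
    x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
    tokens.append({
        "bbox": [x, y, x + w, y + h],     # x, y, w, h -> xmin, ymin, xmax, ymax
        "text": text,
        "span_num": data["word_num"][i],  # word_num plays the role of span_num
        "block_num": data["block_num"][i],
        "line_num": data["line_num"][i],
    })

# sort into reading order before passing as `tokens` to the inference functions
tokens.sort(key=lambda t: (t["block_num"], t["line_num"], t["span_num"]))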

@mahmoudshaddad

I am also trying to use this, but what is class_thresholds?
I have this:

results_list = []
for i in range(len(results['scores'])):
    label = results['labels'][i].item()
    score = results['scores'][i].item()
    bbox = results['boxes'][i].tolist()
    result_dict = {'label': label, 'score': score, 'bbox': bbox}
    results_list.append(result_dict)

table_structures = objects_to_structures(results_list, results['boxes'], class_thresholds)
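
(For context, the class_thresholds used in the snippets earlier in this thread is a dict mapping each structure label name to a minimum detection score, built from model.config.id2label; the 0.5 values below are illustrative:)

class_thresholds = {model.config.id2label[i]: 0.5 for i in range(6)}
# e.g. {'table': 0.5, 'table column': 0.5, 'table row': 0.5,
#       'table column header': 0.5, 'table projected row header': 0.5,
#       'table spanning cell': 0.5}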
