
How do I generate a dataframe after identifying the table structure? #112

Open
hyshandler opened this issue Apr 26, 2023 · 15 comments

@hyshandler

hyshandler commented Apr 26, 2023

I'm trying to generate a dataframe from a table in an image. After running the following code I was able to return the original image with (mostly) correct grid lines drawn on the table. How can I turn the coordinates in the image / gridlines and boxes into a pandas dataframe or csv? Thanks in advance!

import torch
from transformers import TableTransformerForObjectDetection, DetrFeatureExtractor

model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
feature_extractor = DetrFeatureExtractor()
encoding = feature_extractor(image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoding)
target_sizes = [image.size[::-1]]
results = feature_extractor.post_process_object_detection(outputs, threshold=0.8, target_sizes=target_sizes)[0]

@WalidHadri-Iron

@hyshandler I guess you either use the implementation in this repo directly, since it has all the post-processing steps available up to the construction of the dataframe, or you combine the output of that feature_extractor from huggingface with the post-processing functions you will find in this repo. Unfortunately, the post-processing steps to build the dataframe do not come with the huggingface transformers.

@hyshandler

Thanks @WalidHadri-Iron for the response. Can you point me to the specific functions to construct that process? I'm having trouble finding the right ones in this repo. Thanks!

@WalidHadri-Iron

@hyshandler the post-processing code is here https://github.com/microsoft/table-transformer/blob/main/src/postprocess.py and the steps are grouped here https://github.com/microsoft/table-transformer/blob/main/src/inference.py

What you have in "results" is a bunch of bounding boxes with labels and scores; you need to post-process those bounding boxes based on their score, their label, and the position of the text in the image.

The function objects_to_structures groups almost the whole post-processing

def objects_to_structures(objects, tokens, class_thresholds):

Then you have the three functions to get the output format you want (see the sketch after these signatures):

def structure_to_cells(table_structure, tokens):

def cells_to_html(cells):

def cells_to_csv(cells):
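
Putting those together, a minimal sketch (not from the original comments) of going from the huggingface "results" in the first snippet above to a pandas dataframe. It assumes src/inference.py from this repo is importable (e.g. running from src/), that structure_to_cells returns a (cells, confidence) pair, and that cells_to_csv returns a CSV string; the 0.5 thresholds and variable names are illustrative:

import io
import pandas as pd
from inference import objects_to_structures, structure_to_cells, cells_to_csv

# Convert the huggingface "results" dict into the list-of-dicts format the
# repo's post-processing expects: one {'label', 'score', 'bbox'} per detection.
id2label = model.config.id2label
objects = [
    {"label": id2label[label.item()], "score": score.item(), "bbox": box.tolist()}
    for label, score, box in zip(results["labels"], results["scores"], results["boxes"])
]

tokens = []  # no OCR words available; cells will simply have empty text
class_thresholds = {label: 0.5 for label in id2label.values()}  # illustrative thresholds

# Group row/column/header detections into table structures, then into cells.
for table_structure in objects_to_structures(objects, tokens, class_thresholds):
    cells, _confidence = structure_to_cells(table_structure, tokens)
    csv_string = cells_to_csv(cells)            # assumed to return a CSV string
    df = pd.read_csv(io.StringIO(csv_string))   # pandas dataframe
    print(df)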

@Ashwani-Dangwal

Can the results from the hugging face model be passed into structure_to_cells? If yes, how can I find the tokens that need to be passed as parameters, and how can I get them into the desired data structure?

@amish1706


You can keep the tokens argument as None if you don't have the text data of the cells. It works without tokens too.

@Ashwani-Dangwal

@amish1706
I did keep the tokens as None, but it was throwing an error that tokens cannot be None. What is the workaround for that?
Also, if I don't pass in the tokens, how is the text extracted on its own?
I can't see any OCR in the code that would do the text extraction.

@amish1706

amish1706 commented Jun 21, 2023

import torch
from transformers import TableTransformerForObjectDetection
from transformers import DetrImageProcessor
from inference import *
from PIL import Image
import matplotlib.pyplot as plt

# model and image processor as in the snippet earlier in this thread
model = TableTransformerForObjectDetection.from_pretrained("microsoft/table-transformer-structure-recognition")
feature_extractor = DetrImageProcessor()

image = Image.open(img_path).convert("RGB")
encoding = feature_extractor(image, return_tensors="pt")
tokens = []

with torch.no_grad():
    outputs = model(**encoding)
# inference.py expects the DETR-style key name "pred_logits"
outputs['pred_logits'] = outputs['logits']

objects = outputs_to_objects(outputs, image.size, model.config.id2label)
crops = objects_to_crops(image, tokens=tokens, objects=objects, class_thresholds={model.config.id2label[i]: 0.5 for i in range(6)})
structures = objects_to_structures(objects, tokens, class_thresholds={model.config.id2label[i]: 0.5 for i in range(6)})[0]
cells = structure_to_cells(structures, tokens)
visualize_cells(image, cells[0], f"outputs/{img_path.split('/')[-1]}")

Change in original inference code (screenshot)

@Ashwani-Dangwal

Ashwani-Dangwal commented Jun 21, 2023

@amish1706
Maybe I wasn't able to make my question clear.
I wanted to know how I can convert the table in an image into a dataframe / CSV output file without passing tokens, because tokens are used everywhere in the code and it isn't feasible to change the entire codebase accordingly.

I managed to get the text and its bounding boxes using pytesseract, and now I am able to get the results in a CSV file, but sometimes the entries from two rows get concatenated into a single row entry of 2 or 3 columns. I manually looked up the bounding boxes of that text and there was a significant difference in the locations, so I cannot understand why it happened.

If there is a way to do all the work of creating the CSV file without passing the tokens, that would be very helpful. Otherwise, I'd like to know a workaround for the concatenation of some entries described above.

@amish1706

amish1706 commented Jun 21, 2023


From what I understand, and correct me if I'm wrong, you have the bounding boxes of the text and the OCR results from pytesseract, and now you want to use them with the inference code of the table-transformer.

I have only used it to get the table cells and the structure from it, so I might not be able to help properly.

(screenshot of the relevant code)

I found this inside the code; you'll have to provide the tokens file in this format. Check the PubTables-1M dataset that was used in the paper, look for samples, and convert your results into that format.

@Ashwani-Dangwal


I did manage to pass in the tokens too, in the sorted format mentioned by the author.
I am sending you some pictures and their outputs separately, since I cannot share them here, so you can see the issue more clearly.

@lionely

lionely commented Jun 26, 2023

@Ashwani-Dangwal Hi, I'm trying to generate the tokens list in the format described in Inference.md.

There it says we only need the bounding box and the text, not the spans. Were you able to figure this out? I generated tokens in the suggested format, but I end up with empty cells.

@Ashwani-Dangwal

@lionely Other than the bounding box and the text, you also need span_num, block_num and line_num.
If you have scanned images, you can use pytesseract's image_to_data function to get all the values, and if you have a document-format file you can simply use PyMuPDF to get the details.
Hope this helps.
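
For reference, a hedged sketch of what one entry in that tokens list might look like, with the keys taken from the fields named in this thread (the values are made up):

tokens = [
    {
        "bbox": [42.0, 10.0, 118.0, 28.0],  # xmin, ymin, xmax, ymax in image pixels
        "text": "Revenue",
        "span_num": 0,   # word index (pytesseract's word_num, see the next comments)
        "block_num": 0,
        "line_num": 0,
    },
    # ... one dict per word, in reading order
]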

@lionely

lionely commented Jun 27, 2023

@Ashwani-Dangwal Thank you so much for your reply. I see how to get block_num and line_num using pytesseract, but which one would be the span_num? Thanks again!

@Ashwani-Dangwal

@lionely span_num would be the word_num in pytesseract. You would also need to convert the bounding box coordinates into xmin, ymin, xmax, ymax format, since pytesseract gives the bounding boxes in x, y, w, h format.
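
A minimal sketch of that conversion from pytesseract's image_to_data output (the empty-text filtering and the final sort are assumptions, based on the "sorted format" mentioned earlier in this thread):

import pytesseract
from pytesseract import Output

data = pytesseract.image_to_data(image, output_type=Output.DICT)

tokens = []
for i in range(len(data["text"])):
    text = data["text"][i].strip()
    if not text:
        continue  # skip empty OCR entries
    x, y, w, h = data["left"][i], data["top"][i], data["width"][i], data["height"][i]
    tokens.append({
        "bbox": [x, y, x + w, y + h],     # x, y, w, h -> xmin, ymin, xmax, ymax
        "text": text,
        "span_num": data["word_num"][i],  # word_num plays the role of span_num
        "block_num": data["block_num"][i],
        "line_num": data["line_num"][i],
    })

# sort into reading order before passing as `tokens` to the inference functions
tokens.sort(key=lambda t: (t["block_num"], t["line_num"], t["span_num"]))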

@mahmoudshaddad

I am also trying to use this, but what is class_thresholds?
I have this:

results_list = []
for i in range(len(results['scores'])):
    label = results['labels'][i].item()
    score = results['scores'][i].item()
    bbox = results['boxes'][i].tolist()
    result_dict = {'label': label, 'score': score, 'bbox': bbox}
    results_list.append(result_dict)

table_structures = objects_to_structures(results_list, results['boxes'], class_thresholds)
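
(For context, the class_thresholds used in the snippets earlier in this thread is a dict mapping each structure label name to a minimum detection score, built from model.config.id2label; the 0.5 values below are illustrative:)

class_thresholds = {model.config.id2label[i]: 0.5 for i in range(6)}
# e.g. {'table': 0.5, 'table column': 0.5, 'table row': 0.5,
#       'table column header': 0.5, 'table projected row header': 0.5,
#       'table spanning cell': 0.5}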
