Unpredictable order of data in documents fixed by OCR #279

adamjanovsky · 2022-11-08T11:36:34Z

Describe the bug

On an occasion completely unrelated to this project, I needed to run some OCR. I used the function ocr_pdf_file

Lines 39 to 64 in 4d4a1b6

    
    def ocr_pdf_file(pdf_path: Path) -> str:
        """
        OCR a PDF file and return its text contents, uses `pdftoppm` and `tesseract`.
        :param pdf_path: The PDF file to OCR.
        :return: The text contents.
        """
        with TemporaryDirectory() as tmpdir:
            tmppath = Path(tmpdir)
            ppm = subprocess.run(
                ["pdftoppm", pdf_path, tmppath / "image"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            if ppm.returncode != 0:
                raise ValueError(f"pdftoppm failed: {ppm.returncode}")
            for ppm_path in map(Path, glob.glob(str(tmppath / "image*.ppm"))):
                base = ppm_path.with_suffix("")
                tes = subprocess.run(
                    ["tesseract", "-l", "eng+deu+fra", ppm_path, base], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
                )
                if tes.returncode != 0:
                    raise ValueError(f"tesseract failed: {tes.returncode}")
            contents = ""
            for txt_path in map(Path, glob.glob(str(tmppath / "image*.txt"))):
                with txt_path.open("r", encoding="utf-8") as f:
                    contents += f.read()
        return contents

and noticed that the pages of the resulting documents can come out in a somewhat random order (though not always). I suppose that for the sake of the data processing we don't mind. But as soon as we leverage NLP, the snippets close to page endings / beginnings can make no sense.

Expected behavior

Well, the data should be ordered.

Fix

In my code, I fixed that with

image_paths = [x for x in tmp_dir_path.iterdir() if x.is_file() and x.suffix == ".ppm"]
image_paths = [(x, int(x.stem.split("-")[1].split(".")[0])) for x in image_paths]
image_paths = sorted(image_paths, key=lambda x: x[1])
image_paths = [x[0] for x in image_paths]

The same procedure should be applied to the .txt files. Then iterate over these sorted lists.
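The sorting step can be wrapped in one helper that serves both the .ppm and the .txt files. This is only a minimal sketch: the names page_number and sorted_pages are hypothetical, and it assumes every file name ends in a hyphen followed by the page number (for example image-1.ppm or image-07.txt).

```python
import re
from pathlib import Path
from typing import List

# Assumption: stems look like "image-1" or "image-07" (prefix, hyphen, page number).
_PAGE_RE = re.compile(r"-(\d+)$")


def page_number(path: Path) -> int:
    """Extract the trailing page number from a stem such as 'image-12'."""
    match = _PAGE_RE.search(path.stem)
    if match is None:
        raise ValueError(f"no page number in file name: {path.name}")
    return int(match.group(1))


def sorted_pages(directory: Path, suffix: str) -> List[Path]:
    """Return the files in `directory` with the given suffix, ordered by page number."""
    files = [p for p in directory.iterdir() if p.is_file() and p.suffix == suffix]
    # Sort numerically, so that page 10 comes after page 2 even without zero padding.
    return sorted(files, key=page_number)
```

Inside ocr_pdf_file, both loops could then iterate over sorted_pages(tmppath, ".ppm") and sorted_pages(tmppath, ".txt") instead of the glob.glob results.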

I also notice that we don't reconstruct the OCR into the original pdf file. Maybe that would be of some benefit? For that case, from PyPDF2 import PdfMerger is fairly easy to use (example below).

merger = PdfMerger()
for pdf_path in pdf_paths:
    merger.append(pdf_path)


J08nY · 2022-11-09T12:19:38Z

> I also notice that we don't reconstruct the OCR into the original pdf file. Maybe that would be of some benefit? For that case, from PyPDF2 import PdfMerger is fairly easy to use (example below).

While tesseract allows for creation of PDFs with the OCRed text overlaid on top of the source image, we do not use this feature for a reason. If we did use it and merged the output single-page PDFs back into a whole PDF again we would somehow need to track whether the report/st PDF is the original one or was replaced by the OCR one. Or we would need a way to store the OCRed PDFs separately and track that, and it is just added complexity. I do not see much benefit to justify that.

adamjanovsky added the bug Something isn't working label Nov 8, 2022

adamjanovsky assigned J08nY Nov 8, 2022

J08nY mentioned this issue Nov 9, 2022

Fix OCR page order issue. #280

Merged

J08nY closed this as completed in #280 Nov 9, 2022

J08nY closed this as completed in 3629ade Nov 9, 2022
