Unpredictable order of data in documents fixed by OCR #279

Closed
adamjanovsky opened this issue Nov 8, 2022 · 1 comment · Fixed by #280
Labels
bug Something isn't working

Comments

@adamjanovsky
Collaborator

adamjanovsky commented Nov 8, 2022

Describe the bug

On an occasion completely unrelated to this project, I needed to run some OCR. I used the function ocr_pdf_file:

def ocr_pdf_file(pdf_path: Path) -> str:
    """
    OCR a PDF file and return its text contents, uses `pdftoppm` and `tesseract`.
    :param pdf_path: The PDF file to OCR.
    :return: The text contents.
    """
    with TemporaryDirectory() as tmpdir:
        tmppath = Path(tmpdir)
        ppm = subprocess.run(
            ["pdftoppm", pdf_path, tmppath / "image"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if ppm.returncode != 0:
            raise ValueError(f"pdftoppm failed: {ppm.returncode}")
        for ppm_path in map(Path, glob.glob(str(tmppath / "image*.ppm"))):
            base = ppm_path.with_suffix("")
            tes = subprocess.run(
                ["tesseract", "-l", "eng+deu+fra", ppm_path, base], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
            )
            if tes.returncode != 0:
                raise ValueError(f"tesseract failed: {tes.returncode}")
        contents = ""
        for txt_path in map(Path, glob.glob(str(tmppath / "image*.txt"))):
            with txt_path.open("r", encoding="utf-8") as f:
                contents += f.read()
    return contents

and noticed that the pages in the resulting documents can come out in a somewhat random order (though not always). I suppose that for plain data processing we don't mind. But as soon as we apply NLP, the snippets close to page endings and beginnings can make no sense.

Expected behavior

Well, the data should be ordered.

Fix

In my code, I fixed that with

image_paths = [x for x in tmp_dir_path.iterdir() if x.is_file() and x.suffix == ".ppm"]
image_paths = [(x, int(x.stem.split("-")[1].split(".")[0])) for x in image_paths]
image_paths = sorted(image_paths, key=lambda x: x[1])
image_paths = [x[0] for x in image_paths]

The same procedure should be applied to the .txt files. Then iterate over these sorted lists.
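The ordering step above could be factored into a small helper that serves both the .ppm and the .txt files. This is only a sketch: the names `page_number`, `sorted_by_page` and `concat_pages` are hypothetical and not part of the project, and it assumes the files are named `<prefix>-<page number><suffix>`, which is how `pdftoppm` names its output for the `image` prefix used above.

```python
from pathlib import Path
from typing import Iterable, List


def page_number(path: Path) -> int:
    """Extract the trailing page number from a name such as ``image-07.ppm``."""
    # rsplit from the right so that a prefix containing "-" does not break parsing.
    return int(path.stem.rsplit("-", 1)[1])


def sorted_by_page(paths: Iterable[Path]) -> List[Path]:
    """Return the paths ordered by their numeric page suffix."""
    return sorted(paths, key=page_number)


def concat_pages(directory: Path, suffix: str = ".txt") -> str:
    """Concatenate all files with the given suffix in page order."""
    pages = sorted_by_page(
        p for p in directory.iterdir() if p.is_file() and p.suffix == suffix
    )
    return "".join(p.read_text(encoding="utf-8") for p in pages)
```

Sorting on the parsed integer rather than on the file name keeps the order correct even if the page numbers are not zero-padded, so `image-2` sorts before `image-10`.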

I also notice that we don't reconstruct the OCR into the original pdf file. Maybe that would be of some benefit? For that case, from PyPDF2 import PdfMerger is fairly easy to use (example below).

from PyPDF2 import PdfMerger

merger = PdfMerger()
# pdf_paths: the single-page PDFs, already sorted in page order
for pdf_path in pdf_paths:
    merger.append(pdf_path)
@adamjanovsky adamjanovsky added the bug Something isn't working label Nov 8, 2022
@J08nY
Member

J08nY commented Nov 9, 2022

> I also notice that we don't reconstruct the OCR into the original pdf file. Maybe that would be of some benefit? For that case, from PyPDF2 import PdfMerger is fairly easy to use (example below).

While tesseract allows for creation of PDFs with the OCRed text overlaid on top of the source image, we do not use this feature for a reason. If we did use it and merged the output single-page PDFs back into a whole PDF again we would somehow need to track whether the report/st PDF is the original one or was replaced by the OCR one. Or we would need a way to store the OCRed PDFs separately and track that, and it is just added complexity. I do not see much benefit to justify that.
