While tesseract allows for creation of PDFs with the OCRed text overlaid on top of the source image, we do not use this feature for a reason. If we did use it and merged the output single-page PDFs back into a whole PDF again we would somehow need to track whether the report/st PDF is the original one or was replaced by the OCR one. Or we would need a way to store the OCRed PDFs separately and track that, and it is just added complexity. I do not see much benefit to justify that.
Describe the bug
On an occasion completely unrelated to this project, I needed to run some OCR. I used the function ocr_pdf_file (sec-certs/sec_certs/utils/pdf.py, lines 39 to 64 in 4d4a1b6) and noticed that the pages of the resulting documents can end up in a somewhat random order (though not always). I suppose that for the sake of the data processing we don't mind. But as soon as we leverage NLP, the snippets that are close to page endings / beginnings can make no sense.
Expected behavior
Well, the data should be ordered.
Fix
In my code, I fixed that with
The same procedure is to be applied to the .txt files. Then iterate over these lists.

I also notice that we don't reconstruct the OCR output into the original PDF file. Maybe that would be of some benefit? For that case, from PyPDF2 import PdfMerger is fairly easy to use (example below).
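The snippet originally posted under "Fix" did not survive on this page, so here is only a minimal sketch of the idea, not the code from the issue: collect the per-page files, sort them by the page number embedded in the file name, and iterate over the sorted list. The file naming (image-1.txt, image-2.txt, ...) and the helper names page_number and sorted_pages are assumptions made for illustration.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Return the last run of digits in the file stem as an integer.

    Assumes per-page files are named like 'image-7.txt' (hypothetical naming).
    """
    digits = re.findall(r"\d+", path.stem)
    if not digits:
        raise ValueError(f"no page number found in {path.name!r}")
    return int(digits[-1])


def sorted_pages(directory: Path, pattern: str) -> list[Path]:
    """List files matching `pattern` in `directory`, ordered by page number.

    Path.glob gives no ordering guarantee, and a plain string sort would put
    'image-10' before 'image-2', so the sort key is the numeric page index.
    """
    return sorted(directory.glob(pattern), key=page_number)
```

With this, the same call covers both the intermediate image files and the .txt files, for example sorted_pages(tmp_dir, "image*.txt"), where tmp_dir is whatever temporary directory holds the per-page output.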