-
Why do you want to do that?
-
I just want to speed up the process. I can handle multiple files in parallel, but with a large file with lots of images, I don't see a way to parallelise the workload within this file.
-
I replace the images with compressed versions to shrink the size of the PDF file (example file: example_094.pdf). I would like to process pages, or at least images, in parallel to reduce processing time.
-
Here is a minimal example of my parallelization attempt. It processes multiple files in parallel, but not the pages or images within a single file:

from pypdf import PdfReader, PdfWriter
import os
from time import time
from pathlib import Path
import concurrent.futures


def process_image(img_obj):
    # Re-encode the image at lower quality and swap it in place
    img_obj.replace(img_obj.image, quality=30)


def process_page(page):
    # candidate for multiprocessing/multi-threading?
    for img_obj in page.images:
        process_image(img_obj)


def process_pdf(input_pdf):
    reader = PdfReader(input_pdf, strict=False)
    writer = PdfWriter()
    writer.clone_document_from_reader(reader)
    # candidate for multiprocessing/multi-threading?
    for page in writer.pages:
        process_page(page)
    filename = Path(input_pdf)
    output_pdf = Path(f'./'
                      f'{filename.stem}'
                      '_processed'
                      f'{filename.suffix}')
    with open(output_pdf, 'wb') as f:
        writer.write(f)
    writer.close()
    return f'{input_pdf} -> {output_pdf}'


def main():
    start_time = time()
    home = Path(os.path.expanduser("~"))
    input_path = home / 'PATH/TO/PDF/FILES'
    file_list = [entry.path for entry in os.scandir(input_path)
                 if entry.is_file()]
    # One worker process per file; file paths are picklable, page objects are not
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(process_pdf,
                                   file) for file in file_list]
        for future in concurrent.futures.as_completed(futures):
            print(f'{future.result()}')
    total_time = time() - start_time
    print(f'Elapsed time: {total_time:.2f}s')


if __name__ == '__main__':
    main()
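One possible workaround, as an untested sketch rather than a confirmed solution: since only the page objects are the problem, each worker process could be handed the file path plus a page range (both picklable), open the file itself, compress the images on its own pages, and write a temporary part file. The parts are then merged in the parent process. The names `chunk_ranges`, `compress_chunk`, `compress_pdf_parallel` and the `.partN.pdf` naming are made up for illustration. I am assuming here that images shared between pages in different chunks get duplicated, and that document-level data such as outlines or form fields is not carried over by `add_page`, so this may not suit every file, and I have not measured whether it is actually faster.

```python
import concurrent.futures
from pathlib import Path


def chunk_ranges(n_pages, n_chunks):
    """Split range(n_pages) into at most n_chunks contiguous (start, stop) pairs."""
    if n_pages <= 0:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    size, extra = divmod(n_pages, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        # The first `extra` chunks take one additional page each
        stop = start + size + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def compress_chunk(input_pdf, start, stop, part_path):
    """Worker: open the PDF by path, compress images on pages [start, stop)."""
    # Imported here so chunk_ranges stays usable without pypdf installed
    from pypdf import PdfReader, PdfWriter

    reader = PdfReader(input_pdf, strict=False)
    writer = PdfWriter()
    for i in range(start, stop):
        writer.add_page(reader.pages[i])
    for page in writer.pages:
        for img_obj in page.images:
            img_obj.replace(img_obj.image, quality=30)
    with open(part_path, 'wb') as f:
        writer.write(f)
    return part_path


def compress_pdf_parallel(input_pdf, output_pdf, workers=4):
    """Parent: split by page range, process the chunks in parallel, merge."""
    from pypdf import PdfReader, PdfWriter

    n_pages = len(PdfReader(input_pdf, strict=False).pages)
    ranges = chunk_ranges(n_pages, workers)
    parts = [f'{output_pdf}.part{i}.pdf' for i in range(len(ranges))]
    with concurrent.futures.ProcessPoolExecutor(max_workers=workers) as executor:
        futures = [executor.submit(compress_chunk, str(input_pdf), s, e, p)
                   for (s, e), p in zip(ranges, parts)]
        for future in futures:
            future.result()  # re-raises any exception from the worker
    merged = PdfWriter()
    for part in parts:  # parts are appended in page order
        merged.append(part)
    with open(output_pdf, 'wb') as f:
        merged.write(f)
    for part in parts:
        Path(part).unlink()
```

It would be called as `compress_pdf_parallel('example_094.pdf', 'example_094_processed.pdf')` from inside an `if __name__ == '__main__':` guard, the same way as `main()` above.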
-
Is there a way to process the pages of a PDF file in parallel (multiprocessing or multithreading)? As far as I know, the page object is not picklable. Is there any other way to use multiple CPU cores when processing a single file?