How to process pages in parallel? #2033

mnmtz · 2023-07-28T06:31:10Z

Is there a way to process pages of a PDF file in parallel (multiprocessing or -threading)? As far as I know, the page object is not picklable. Is there any other solution to utilize multiple CPU cores when processing the file?
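A quick way to test the picklability concern is to call `pickle.dumps` on the object, since that is what `multiprocessing` does when it sends arguments to a worker. The sketch below uses only the standard library; the helper name `is_picklable` and the file name `example.pdf` are made up for illustration. It also shows the usual workaround: send plain data such as a path and a page index, and let each worker open the file itself.

```python
import pickle
import threading


def is_picklable(obj):
    """Return True if pickle.dumps(obj) succeeds, False otherwise."""
    try:
        pickle.dumps(obj)
    except (pickle.PicklingError, TypeError, AttributeError):
        return False
    return True


# A (path, page_index) task is plain data and can be sent to a worker process.
# "example.pdf" is a hypothetical file name; nothing is opened here.
task = ("example.pdf", 3)
print(is_picklable(task))              # True

# Objects wrapping OS-level resources (here a lock) cannot be pickled.
print(is_picklable(threading.Lock()))  # False
```

Running `is_picklable(page)` on a real page object would show whether it can be passed to a worker directly in the installed pypdf version.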

MartinThoma · 2023-07-28T20:36:25Z

Maintainer

Why do you want to do that?

mnmtz · 2023-07-30T07:54:03Z

Author

I just want to speed up the process. I can handle multiple files in parallel, but with a large file with lots of images, I don't see a way to parallelise the workload within this file.

1 reply

MartinThoma Jul 30, 2023
Maintainer

What do you do with the files? Do you have an example file?

mnmtz · 2023-07-30T08:25:51Z

Author

I replace the images with compressed versions of the images to shrink the size of the PDF file.
Because of confidentiality I can't send you a working example, but any multi-page, multi-image PDF file can be used:

example_094.pdf
example_081.pdf

I would like to process pages or at least images in parallel to speed up processing time.

1 reply

MartinThoma Jul 30, 2023
Maintainer

Interesting application! I haven't thought about that so far. Thanks for sharing the use case 🙏

pypdf doesn't provide any multiprocessing capabilities out of the box. You can handle different pages / images in parallel yourself, but that's not something pypdf needs to provide.

pypdf should allow constant-time access to pages, meaning I want pypdf to just read the trailer / metadata, but only parse the pages once they are accessed. I'm currently not sure if we do/have that.

Do you have an example we can use where just iterating over the pages takes a long time, and single-page index access does as well? That would be an indicator that we need to refactor something.
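One way to collect the numbers asked for here is a small timing script. This is only a sketch: `time_call` and `profile_page_access` are names made up for this example, and each measurement uses a fresh `PdfReader` on the assumption that a reader may cache pages after the first access. pypdf is imported inside the function so the timing helper can be used on its own.

```python
from time import perf_counter


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = perf_counter()
        func(*args)
        best = min(best, perf_counter() - start)
    return best


def profile_page_access(pdf_path):
    """Time opening a PDF, accessing its last page, and iterating all pages."""
    from pypdf import PdfReader  # imported lazily; only needed in this function

    def open_only():
        PdfReader(pdf_path)

    def open_and_last_page():
        reader = PdfReader(pdf_path)
        return reader.pages[len(reader.pages) - 1]

    def open_and_all_pages():
        reader = PdfReader(pdf_path)
        return [page for page in reader.pages]

    # repeat=1 because every call already builds its own reader.
    return {
        "open_only": time_call(open_only, repeat=1),
        "open_and_last_page": time_call(open_and_last_page, repeat=1),
        "open_and_all_pages": time_call(open_and_all_pages, repeat=1),
    }
```

If `open_and_last_page` is close to `open_and_all_pages` on a large file, that would suggest single-page access is not cheap for that file.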

mnmtz · 2023-07-30T15:35:04Z

Author

Here is a minimal non-working example for the parallelization attempt:

from pypdf import PdfReader, PdfWriter
import os
from time import time
from pathlib import Path
import concurrent.futures


def process_image(img_obj):
    # Re-encode the image at a lower quality and swap it into the PDF.
    img_obj.replace(img_obj.image, quality=30)


def process_page(page):

    # candidate for multiprocessing/multi-threading?
    for img_obj in page.images:
        process_image(img_obj)


def process_pdf(input_pdf):

    reader = PdfReader(input_pdf, strict=False)
    writer = PdfWriter()

    writer.clone_document_from_reader(reader)

    # candidate for multiprocessing/multi-threading?
    for page in writer.pages:
        process_page(page)

    # Write the result to the current directory as "<stem>_processed<suffix>".
    filename = Path(input_pdf)
    output_pdf = Path(f'{filename.stem}_processed{filename.suffix}')

    with open(output_pdf, 'wb') as f:
        writer.write(f)

    writer.close()

    return f'{input_pdf} -> {output_pdf}'


def main():
    start_time = time()

    home = Path(os.path.expanduser("~"))

    input_path = home / 'PATH/TO/PDF/FILES'

    file_list = [entry.path for entry in os.scandir(input_path)
                 if entry.is_file()]

    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(process_pdf,
                                   file) for file in file_list]

        for future in concurrent.futures.as_completed(futures):
            print(future.result())

    total_time = time() - start_time
    print(f'Elapsed time: {total_time:.2f}s')


if __name__ == '__main__':
    main()
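A possible way to parallelise within a single file, sketched under some assumptions: instead of passing page objects to workers, pass the file path and a page range. Each worker copies its range into its own `PdfWriter`, recompresses the images, and writes a partial PDF; the parent then merges the parts in order. The function names and the `n_workers` and `quality` parameters are invented for this sketch. It assumes `PdfWriter.append` accepts a `(start, stop)` tuple for `pages`, and it is untested on real files. Document-level features such as outlines, and resources shared between pages, may not survive the split cleanly, and shared images could be duplicated across parts.

```python
import concurrent.futures
import tempfile
from pathlib import Path


def split_ranges(n_pages, n_chunks):
    """Split range(n_pages) into at most n_chunks contiguous (start, stop) pairs."""
    if n_pages <= 0:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    base, extra = divmod(n_pages, n_chunks)
    ranges, start = [], 0
    for i in range(n_chunks):
        stop = start + base + (1 if i < extra else 0)
        ranges.append((start, stop))
        start = stop
    return ranges


def compress_page_range(input_pdf, start, stop, out_path, quality=30):
    """Worker: copy pages [start, stop) into a new PDF and recompress their images."""
    from pypdf import PdfWriter  # imported lazily, inside the worker process

    writer = PdfWriter()
    writer.append(input_pdf, pages=(start, stop))
    for page in writer.pages:
        for img_obj in page.images:
            img_obj.replace(img_obj.image, quality=quality)
    with open(out_path, "wb") as f:
        writer.write(f)
    return out_path


def process_pdf_by_page_ranges(input_pdf, output_pdf, n_workers=4):
    """Compress one PDF using several processes, one page range per task."""
    from pypdf import PdfReader, PdfWriter

    n_pages = len(PdfReader(input_pdf, strict=False).pages)
    ranges = split_ranges(n_pages, n_workers)

    with tempfile.TemporaryDirectory() as tmp:
        parts = [str(Path(tmp) / f"part_{i:04d}.pdf") for i in range(len(ranges))]
        with concurrent.futures.ProcessPoolExecutor(max_workers=n_workers) as pool:
            futures = [
                pool.submit(compress_page_range, str(input_pdf), start, stop, part)
                for (start, stop), part in zip(ranges, parts)
            ]
            for future in futures:
                future.result()  # re-raises any exception from the worker

        # Merge the partial files in page order while they still exist on disk.
        merged = PdfWriter()
        for part in parts:
            merged.append(part)
        with open(output_pdf, "wb") as f:
            merged.write(f)
    return output_pdf
```

As in the example above, `process_pdf_by_page_ranges` should be called from under an `if __name__ == '__main__':` guard. Whether this is faster than the single-process loop depends on how much time goes into image re-encoding compared with parsing the file once per worker, so it is worth measuring both.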
