ContentStream._readInlineImage is really slow on large inline images #330

Closed
sekrause opened this issue Feb 17, 2017 · 0 comments · Fixed by #740
Labels
nf-performance Non-functional change: Performance

sekrause commented Feb 17, 2017

I recently had a PDF that took hours to be processed by PyPDF2. The reason is that this PDF had multiple large inline images (up to 15 MB uncompressed) and ContentStream._readInlineImage is really inefficient:

  • The last while-loop only reads one byte at a time.
  • In each iteration this single byte is added to data. Since data is immutable, a complete copy has to be created in memory.

So when an inline image is several MB in size, a multi-MB data object has to be copied in memory millions of times, and the total work grows quadratically with the image size. This takes ages.
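To illustrate the two points above, here is a minimal sketch. It is not the actual PyPDF2 code and not the eventual fix. The function names `read_slow` and `read_fast` are hypothetical, and the end-of-image detection is deliberately simplified to "stop at the first `EI`", which a real parser cannot rely on because binary image data may contain those two bytes. The first function mimics the byte-at-a-time pattern with immutable `bytes`; the second reads in chunks into a mutable `bytearray` and only searches the newly added region.

```python
import io

END_MARKER = b"EI"  # simplified end-of-image marker (assumption for this sketch)


def read_slow(stream):
    """Read one byte at a time and append it to an immutable bytes object.

    Each `data += tok` may copy everything read so far, which is the
    pattern described in this issue.
    """
    data = b""
    while True:
        tok = stream.read(1)
        if not tok:
            return data  # end of stream, no marker found
        data += tok
        if data.endswith(END_MARKER):
            return data[:-len(END_MARKER)]


def read_fast(stream, chunk_size=8192):
    """Read in chunks into a bytearray and search only the new region."""
    buf = bytearray()
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            return bytes(buf)  # end of stream, no marker found
        # Step back a little so a marker split across two chunks is still found.
        search_from = max(0, len(buf) - (len(END_MARKER) - 1))
        buf += chunk
        pos = buf.find(END_MARKER, search_from)
        if pos != -1:
            # Rewind the stream to just after the marker, so the caller
            # can continue parsing from there (requires a seekable stream).
            stream.seek(pos + len(END_MARKER) - len(buf), io.SEEK_CUR)
            return bytes(buf[:pos])


if __name__ == "__main__":
    sample = io.BytesIO(b"abcdefgEI rest")
    print(read_fast(sample, chunk_size=8))
    print(sample.read())
```

Both functions leave the stream positioned right after the marker, so they are interchangeable from the caller's point of view. Only the amount of copying differs.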

You can easily create such a PDF with Pillow and reportlab with a large PNG like this one:

from PIL import Image
from reportlab.lib.pagesizes import A4
from reportlab.pdfgen.canvas import Canvas

logo = Image.open('inline-image.png')
canvas = Canvas('inline-image', pagesize=A4)
canvas.drawInlineImage(logo, 10, 10)
canvas.showPage()
canvas.save()

Then try to load the inline image:

import sys

from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.pdf import ContentStream

with open(sys.argv[1], 'rb') as f:
    pdf = PdfFileReader(f, strict=False)
    for page in pdf.pages:
        contentstream = ContentStream(page.getContents(), pdf)
        for operands, command in contentstream.operations:
            if command == b'INLINE IMAGE':
                data = operands['data']
                print(len(data))

I will soon prepare a pull request that fixes this issue.

@MartinThoma MartinThoma added the nf-performance Non-functional change: Performance label Apr 9, 2022
MartinThoma pushed a commit that referenced this issue Apr 15, 2022
Closes #329 - potential infinite loop (SEC)
Closes #330 - performance issue of ContentStream._readInlineImage (PERF)