Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Page.to_image() leaks file descriptors #1089

Closed
dhdaines opened this issue Feb 5, 2024 · 7 comments
Closed

Page.to_image() leaks file descriptors #1089

dhdaines opened this issue Feb 5, 2024 · 7 comments
Labels

Comments

@dhdaines
Copy link
Contributor

dhdaines commented Feb 5, 2024

Describe the bug

When extracting images from a PDF file with a large number of pages, we eventually run out of file descriptors on Mac. This probably leaks file descriptors everywhere but Linux has less strict ulimits. It may also be related to #1072 as Windows has silly restrictions on opening the same file multiple times.

This is because pypdfium2 is holding onto a file descriptor somewhere. The exception thrown by the code below:

Traceback (most recent call last):
  File "/Users/dhd/work/vdsa/alexi/download/pdfbug.py", line 5, in <module>
    img = page.to_image()
          ^^^^^^^^^^^^^^^
  File "/Users/dhd/work/vdsa/pdfplumber/pdfplumber/page.py", line 585, in to_image
    return PageImage(
           ^^^^^^^^^^
  File "/Users/dhd/work/vdsa/pdfplumber/pdfplumber/display.py", line 88, in __init__
    self.original = get_page_image(
                    ^^^^^^^^^^^^^^^
  File "/Users/dhd/work/vdsa/pdfplumber/pdfplumber/display.py", line 56, in get_page_image
    pdfium_page = pypdfium2.PdfDocument(
                  ^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/dhd/work/vdsa/venv/lib/python3.12/site-packages/pypdfium2/_helpers/document.py", line 62, in __init__
    input = input.expanduser().resolve()
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/pathlib.py", line 1240, in resolve
    s = self._flavour.realpath(self, strict=strict)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "<frozen posixpath>", line 436, in realpath
  File "<frozen posixpath>", line 423, in abspath
OSError: [Errno 24] Too many open files

Code to reproduce the problem

import pdfplumber
with pdfplumber.open("20231213-Codification-administrative-Rgl-1314-2021-Z.pdf") as pdf:
    for page in pdf.pages:
        img = page.to_image()

PDF file

https://ville.sainte-adele.qc.ca/upload/documents/20231213-Codification-administrative-Rgl-1314-2021-Z.pdf (485 pages, enough to hit the ulimit on a Mac!)

Environment

  • pdfplumber from develop branch (commit 07d9997)
  • MacOS 12.7.3
  • Python 3.12.1
@dhdaines dhdaines added the bug label Feb 5, 2024
@dhdaines
Copy link
Contributor Author

dhdaines commented Feb 5, 2024

In theory there is a parameter autoclose=True that should make pypdfium2 close the file (why this isn't the default, only Google knows...) but it doesn't really seem to work. Calling close explicitly on the PdfDocument does work, though. PR coming shortly.

@jsvine
Copy link
Owner

jsvine commented Feb 10, 2024

Fixed by @dhdaines in #1090

@jsvine jsvine closed this as completed Feb 10, 2024
@mara004
Copy link

mara004 commented Mar 19, 2024

In theory there is a parameter autoclose=True that should make pypdfium2 close the file (why this isn't the default, only Google knows...) but it doesn't really seem to work. Calling close explicitly on the PdfDocument does work, though.

Let me shine some light on this:

  • autoclose=True is not default because we don't know whether the caller might want to continue using the FD after having closed the PdfDocument.1
  • The reason why you perceived it "doesn't really work" is basically the same as when you open a file descriptor without explicitly closing it (e.g. via a with or try/finally statement) -- then closing is deferred to garbage collection, which does not have to happen immediately after the object falls unreferenced. There can be an arbitrary delay until the finalizer is actually called. (I'm not aware of a way to prioritize garbage, but feel free to give me a pointer if you find some Python C API to do this.)
  • So autoclose does not relieve you from the task to add an explicit close call; it just merges FD closing into PdfDocument closing.

Maybe we should just remove the parameter with the next major version to avoid confusion? WDYT?

Footnotes

  1. I believe it's a convention that an FD be managed by the caller that opened it, and that receivers should not close unless asked for.

@dhdaines
Copy link
Contributor Author

  • The reason why you perceived it "doesn't really work" is basically the same as when you open a file descriptor without explicitly closing it (e.g. via a with or try/finally statement) -- then closing is deferred to garbage collection, which does not have to happen immediately after the object falls unreferenced. There can be an arbitrary delay until the finalizer is actually called. (I'm not aware of a way to prioritize garbage, but feel free to give me a pointer if you find some Python C API for this.)
  • So autoclose does not relieve you from the task to add an explicit close call; it just merges FD closing into PdfDocument closing.

Maybe we should just remove the parameter with the next major version to avoid confusion? WDYT?

Thanks for clarifying this - I feel silly for not remembering the bit about garbage collection, that's the reason why context managers exist after all, sorry about that!

So yes, autoclose should ideally be replaced with __enter__ and __exit__ methods on PdfDocument, I think.

@mara004
Copy link

mara004 commented Mar 19, 2024

No problem.

So yes, autoclose should ideally be replaced with enter and exit methods on PdfDocument, I think.

I guess you're right we should provide context manager functionality.
Though that doesn't conflict with autoclose -- in fact, I believe it could help merge two with nesting layers into one.
... but maybe the current autoclose needs a better name.

@dhdaines
Copy link
Contributor Author

Yes, context manager functionality would be great! pdfplumber will still have to explicitly call close, obviously (it really isn't your bug).

@mara004
Copy link

mara004 commented Mar 19, 2024

I believe it could help merge two with nesting layers into one.

Wait, I just remembered there are 2-in-1 with-blocks, of course:

>>> import pypdfium2 as pdfium
>>>
>>> class PdfWithCtx (pdfium.PdfDocument):
...     def __enter__(self):
...             return self
...     def __exit__(self, *_):
...             self.close()
... 
>>> with open(".../doc.pdf", "rb") as fh, PdfWithCtx(fh) as pdf:
...     print(pdf[0].get_textpage().get_text_bounded())

So I guess autoclose is more or less obsolete.
Thinking about it a second time, python should also close the buffer on garbage collection on its own, so supposedly there's no real advantage in binding it into our finalizer.

I think I might remove the parameter with the next major version (if that ever happens 😅).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants