Memory leaks occur when saving each page of a PDF as an image #1430

Closed
JoeanAmier opened this issue Nov 30, 2021 · 7 comments

@JoeanAmier

Describe the bug (mandatory)

Memory leaks occur when saving each page of a PDF as an image. I wrote this operation as a function; each run of the loop increases memory usage, and the memory does not seem to be released.

To Reproduce (mandatory)

import fitz

def pdf():
    doc = fitz.open('xxx.pdf')
    for i in range(doc.page_count):
        # render each page at 20% scale, without an alpha channel
        img = doc[i].get_pixmap(matrix=fitz.Matrix(0.2, 0.2), alpha=False)
        img.save("%s.png" % i)  # one output file per page
    doc.close()

for _ in range(5):
    pdf()  # each execution adds a certain amount of memory
If you comment out the loop that renders and saves the images, there is no memory problem:

def pdf():
    doc = fitz.open('xxx.pdf')
    doc.close()

Your configuration (mandatory)

3.9.7 (default, Sep 16 2021, 16:59:28) [MSC v.1916 64 bit (AMD64)]
win32

PyMuPDF 1.19.1: Python bindings for the MuPDF 1.19.0 library.
Version date: 2021-10-23 00:00:10.
Built for Python 3.9 on win32 (64-bit).

@JorjMcKie
Collaborator

I have tested this on Windows and Linux and cannot confirm a difference between (a) creating a pixmap alone and (b) also saving the pixmap.
Dealing with pixmaps as such does consume some memory, which is inevitable because of MuPDF's internal caching of images.
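
For illustration, you can watch that cache fill up while rendering. A minimal sketch, assuming the store_size and store_maxsize properties of fitz.TOOLS (they report the current and maximum store size in bytes):

import fitz

doc = fitz.open("xxx.pdf")
for page in doc:
    pix = page.get_pixmap()
    # watch MuPDF's store grow as pages are rendered
    print("store: %d of %d bytes" % (fitz.TOOLS.store_size, fitz.TOOLS.store_maxsize))
doc.close()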

@JoeanAmier
Author

Is there no way to release the cache? Or some other way to save the images? As it stands, processing multiple large PDF documents can easily use up all available memory.

@JorjMcKie
Collaborator

Is there no way to release the cache? Or some other way to save the images? As it stands, processing multiple large PDF documents can easily use up all available memory.

Yes, there is: execute fitz.TOOLS.store_shrink(100). The parameter is the percentage of the store to empty.
When processing the old Adobe manual (1,310 pages) five times in a row, the maximum memory usage was 80 MB above the level at program start.
When executing that instruction after each page, the difference went down to 70 MB.
The benefit was larger with other files.
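
Applied to the loop from the report, that would look like this (a sketch based on the code above):

import fitz

def pdf():
    doc = fitz.open("xxx.pdf")
    for i in range(doc.page_count):
        img = doc[i].get_pixmap(matrix=fitz.Matrix(0.2, 0.2), alpha=False)
        img.save("%s.png" % i)
        fitz.TOOLS.store_shrink(100)  # empty MuPDF's store after each page
    doc.close()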

You can also try the MuPDF CLI tool like this: mutool draw -o p-%d.png -L ... file.pdf.
It accepts the argument -L to request a low-memory execution.
This does the same as the instruction above, plus it uses a parameter to suppress caching.
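
If you want to drive that from Python, a sketch using subprocess (assuming mutool is on the PATH; further draw options omitted):

import subprocess

# render every page of file.pdf to p-1.png, p-2.png, ... in low-memory mode
subprocess.run(["mutool", "draw", "-o", "p-%d.png", "-L", "file.pdf"], check=True)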

Caching suppression is not yet available in PyMuPDF, but I will make sure to include it in the next version.

All this will of course have an adverse effect on performance.

Other considerations to alleviate this problem include using Python multiprocessing as explained in the documentation.
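
A minimal sketch of the multiprocessing idea: each worker opens the file itself and renders one slice of pages, so MuPDF's cache lives and dies with the short-lived worker processes. The file name and slice size here are assumptions for illustration:

import fitz
from multiprocessing import Pool

FILE = "xxx.pdf"  # assumed input file

def render_range(args):
    # each worker opens its own copy; its MuPDF cache is freed when it exits
    start, stop = args
    doc = fitz.open(FILE)
    for i in range(start, stop):
        pix = doc[i].get_pixmap(matrix=fitz.Matrix(0.2, 0.2), alpha=False)
        pix.save("%s.png" % i)
        pix = None
    doc.close()

if __name__ == "__main__":
    doc = fitz.open(FILE)
    pc = doc.page_count
    doc.close()
    step = 50  # pages per worker, an arbitrary choice
    ranges = [(i, min(i + step, pc)) for i in range(0, pc, step)]
    with Pool() as pool:
        pool.map(render_range, ranges)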

@JorjMcKie
Collaborator

If you run mutool draw -o p-%d.png ... adobe.pdf, you will probably observe a similar maximum memory usage of 80 MB. With the -L option, this goes down to at most 50 MB.
I expect to achieve this in PyMuPDF too, by suppressing the caching ...

JorjMcKie added the enhancement label and removed the bug label on Dec 1, 2021
@JorjMcKie
Collaborator

In any case, once you are done with a pixmap (i.e. after saving it), set it to None to force its storage to be freed.

@JorjMcKie
Collaborator

JorjMcKie commented Dec 2, 2021

I found the following logic best for keeping intermediate memory under control while also delivering acceptable speed.
Tested with the Adobe manual; the maximum memory usage at any time was below 50 MB:

# process the file in segments / intervals
import fitz

fname = "adobe.pdf"
doc = fitz.open(fname)
interval = 50  # pages per segment
pc = doc.page_count
pno = 0  # first page of the current segment

while pno < pc:
    limit = min(pc, pno + interval)
    for page in doc.pages(pno, limit, 1):
        pix = page.get_pixmap()
        pix = None  # <== important! drop the reference to free the pixmap

    if limit >= pc:
        break
    pno += interval
    doc.close()  # release the file and its resources
    fitz.TOOLS.store_shrink(100)  # empty MuPDF's cache
    doc = fitz.open(fname)  # reopen the document for the next segment

doc.close()

@JoeanAmier
Author

The solution you described has improved the memory problem. Thank you very much.
