
Memory leak in page.getPixmap() #130

Closed
thefloe1 opened this issue Jan 29, 2018 · 22 comments
Comments

@thefloe1

I'm using PyMuPDF to loop through all pages in a PDF and render each one:

doc = fitz.open(filename)
for num, page in enumerate(doc):
    pix = page.getPixmap(m)    # m: a fitz.Matrix controlling zoom/rotation
    data = pix.getPNGData()
    # doing some magic here

Using a memory profiler, I see that getPixmap allocates 20 MB of data on every loop iteration which is not freed. I tried different things like del pix or gc.collect(), but the memory usage increases with every run. As I'm handling quite a few pages, my script runs out of memory...
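One way to quantify growth like this (an editor's sketch for illustration, not part of the original report) is to sample the process's resident set size after each page and look at the per-page deltas; `read_rss_kb` below is Linux-only (it reads /proc), while the delta helper is portable:

```python
def read_rss_kb():
    """Resident set size in kB, read from /proc/self/status (Linux only)."""
    with open("/proc/self/status") as f:
        for line in f:
            if line.startswith("VmRSS:"):
                return int(line.split()[1])
    return 0

def growth_per_step(samples):
    """Deltas between consecutive memory samples. A steady positive trend
    over many pages suggests a leak rather than one-time caching."""
    return [b - a for a, b in zip(samples, samples[1:])]
```

Calling `read_rss_kb()` once per loop iteration and feeding the samples to `growth_per_step` distinguishes a constant ~20 MB-per-page climb (a leak) from growth that flattens out (a cache filling up).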

getPixmap also shows the following warning/error:

error: Unable to read ICC workflow
warning: Attempt to read Output Intent failed
@JorjMcKie
Collaborator

I am traveling abroad, so I am limited in how much I can dig into this. I'll be back Thursday next week.

Anyway, this sounds like a problem that I thought was fixed. Please provide me with your system parameters: OS, bitness, Python version and PyMuPDF version.
I will at least be able to look into the source.

Are you sure that the memory accumulation is caused by pixmap creation - and not by the getPNGData method?
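To attribute the growth, the two steps can be separated into two passes (an illustrative sketch, not code from this thread; `doc` is any iterable of pages exposing the `getPixmap`/`getPNGData` names used in this PyMuPDF 1.12-era discussion):

```python
def render_only(doc):
    """Pass 1: create pixmaps only. Returns the number of pages rendered."""
    n = 0
    for page in doc:
        pix = page.getPixmap()
        pix = None  # drop the reference each iteration
        n += 1
    return n

def render_and_encode(doc):
    """Pass 2: create pixmaps and also encode them to PNG."""
    n = 0
    for page in doc:
        pix = page.getPixmap()
        data = pix.getPNGData()
        pix = data = None
        n += 1
    return n
```

Running each pass under the same memory profiler shows whether the accumulation already appears with pixmap creation alone, or only once PNG encoding is added.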

@thefloe1
Author

Hi,
It's definitely on the getPixmap.

OS is Win 10, 64bit,
Python 2.7.12 (32bit),
PyMuPDF 1.12.2: Python bindings for the MuPDF 1.12.0 library

I see the same memory behavior when running the demo pdf viewer (PDFdisplay.py).

@JorjMcKie
Collaborator

ok, thanks for the info. That's pixmap creation then.
As I said, give me some time / opportunity to look into that ...

@JorjMcKie
Collaborator

I have found the cause for the issue, and I am testing the fix.
I will upload it next week, probably Thursday.

@thefloe1
Author

thefloe1 commented Feb 4, 2018

Thanks a lot

@JorjMcKie
Collaborator

Just uploaded a fix into branch 1.12.2.
Please check whether your issue is resolved.

@thefloe1
Author

Memory consumption now stays below 400 MB and does not gradually increase. I guess some data still stays in memory, as the process still uses 312 MB when it finishes.

@JorjMcKie
Collaborator

What is your OS and PyMuPDF version time stamp?
I am on Win 10 x64 and used the following session (Python 3.6.4 x64).

  • I created pixmaps for all 1310 pages of the Adobe manual.
  • After opening, memory usage was 40 MB (screenshot "start").
  • At the end, this had increased to 94 MB (screenshot "stop-1").
  • Deleting the leftover page and pixmap objects from the loop dropped this to 92.5 MB (screenshot "stop-2").
  • After closing the document, memory usage went down to 26.4 MB, below the 40 MB measured after open (screenshot "stop-3").

So from my end, I find your 400 MB confusing. I will now go ahead and also create PNG images from each pixmap to see what that impact is ...

@JorjMcKie
Collaborator

Hmm, no big change either:

  • I am going up to 97 MB now (screenshot "stop-1").

@JorjMcKie
Collaborator

Maybe I should be using an example with larger pages than this one? Let me try more documents ...

@JorjMcKie
Collaborator

With larger page sizes (100 A4 pages, complex graphics) I am also staying around 100 MB.
It is still hard to explain why there should be any increase at all after the first pixmap ...

@JorjMcKie
Collaborator

JorjMcKie commented Feb 12, 2018

I did some MuPDF store memory analysis: Obviously, their strategy is to maintain stuff in memory as much as possible for performance reasons.
For example, any font or image once encountered is held in memory. This explains why memory usage goes up when more pages are processed - even though a pixmap is properly deleted.

There is an upper limit for the so-called global context (256 MB), which is specified in fitz_wrap.c. The storable memory is part of this global context.

Low-level functions to exert some control also exist, such as evicting a percentage of the storable memory. To date, they are not implemented in PyMuPDF.

@JorjMcKie
Collaborator

To complete my investigation, I rendered the file mentioned above (100 A4 pages, complex graphics ...) with MuPDF's mutool draw utility:
The memory usage behavior was exactly the same as in the interactive session I sent you previously (slowly increasing to a peak usage of 97 MB, ...).

The mentioned utility has an additional feature: one can empty the storable memory cache after each page has been processed. This keeps memory usage low (below 20 MB in this case) - of course at the cost of processing speed (first measurements point to 60% longer runtime).

@JorjMcKie
Collaborator

Since yesterday I've been trying a few things to make sure that we no longer have a memory leak in PyMuPDF.
To this end, I implemented a new class Tools for testing purposes (not uploaded to the repo). It has methods store_size() and empty_store() among others.
I then used the following two scripts for a number of PDFs:

Script 1 - no store reset:

import time, fitz, sys

fname = sys.argv[1]
doc = fitz.open(fname)

t0 = time.clock()
for page in doc:
    pix = page.getPixmap()

t1 = time.clock()
print("time without storage reset %g" % (t1 - t0))

Script 2 - store reset before each page:

import time, fitz, sys

fname = sys.argv[1]
doc = fitz.open(fname)
tools = fitz.Tools()

t0 = time.clock()
for page in doc:
    tools.empty_store()    # evict MuPDF's cached objects first
    pix = page.getPixmap()

t1 = time.clock()
print("time with storage reset %g" % (t1 - t0))

The second script empties the store before each new pixmap. This keeps memory usage down to the minimum required by the opened document itself, let's say in the ballpark of 10 MB. After closing / deleting the document, memory usage dropped again close to the start value.

The second script needed considerably more runtime: my science magazine (the 100-pager) took 20% longer, and the Adobe manual more than 100%.
Certainly, method Tools.store_size() (very fast - just a check of a field) could be used to empty the store only when it exceeds a threshold, thus reducing the runtime impact.
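A sketch of that threshold idea (an editor's illustration; the loop uses the Tools API as it later shipped in PyMuPDF, i.e. fitz.TOOLS.store_size and fitz.TOOLS.store_shrink, so treat those names as assumptions relative to the experimental class described in this thread):

```python
LIMIT = 100 << 20  # hypothetical threshold: 100 MB of cached objects

def should_shrink(store_size, limit=LIMIT):
    """Empty the store only when its size tops the limit, so most pages
    keep the speed benefit of MuPDF's caching."""
    return store_size > limit

def render_with_cap(path, limit=LIMIT):
    """Render every page while capping MuPDF's store at roughly `limit`."""
    import fitz  # PyMuPDF
    doc = fitz.open(path)
    for page in doc:
        if should_shrink(fitz.TOOLS.store_size, limit):
            fitz.TOOLS.store_shrink(100)  # evict all evictable cached objects
        pix = page.getPixmap()
    doc.close()
```

Compared with emptying the store on every page, the check is nearly free, and the eviction cost is only paid on the few pages where the cache has actually grown past the limit.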

I would argue, however, that such a threshold already exists: the aforementioned 256 MB built into PyMuPDF. If you wanted, you could change that value in file fitz_wrap.c, in the statement gctx = fz_new_context(NULL, NULL, FZ_STORE_DEFAULT); (close to the end of the file), replacing FZ_STORE_DEFAULT with a value of your liking.

Please let me know your reaction.

@wave-DmP

Hi, I'm getting the same warning and error for a 50 page pdf when I run page.getText() on v1.13.20...

@JorjMcKie
Collaborator

JorjMcKie commented Dec 21, 2018

Hi,
this is a fairly long and old comment chain ... so I don't know what exactly you are referring to.
Just did a page.getText() for all 1310 pages of the Adobe manual without complaints ...
Best you send me the script and data to reconstruct your case.

@JorjMcKie
Collaborator

if you mean the icc related messages, you can just ignore them.
If you upgrade to v1.14, those messages will also be suppressed.

If you however are getting memory leaks (the actual topic of this issue!), then I am indeed alarmed.

@wave-DmP

I've tested the script, and I'm getting the ICC warning for each page that getText() is called on. I annotate each page subsequently, and that doesn't give any warnings. The warnings don't appear for a 2-page document, though. Sending data is not possible, unfortunately.

Since there is no reduction in performance or function, I suspect it may be just the warnings.
Is it possible to manually suppress the ICC warnings from the script?

@JorjMcKie
Collaborator

Yes, these are warnings.
Please upgrade to PyMuPDF v1.14.x -- this version captures messages like these, coming from the underlying C-library MuPDF.

I was a bit frightened at first, but I now understand you picked the wrong issue - your observation has nothing to do with memory leaks obviously ...

@wave-DmP

fair enough, it was the first and only search result for the warning text and pymupdf :)

I'll upgrade and see if the issue is solved, thx!

@wave-DmP

confirmed, upgrade to 1.14.x solved the warnings, thx!

@buptyyf

buptyyf commented Dec 5, 2022

(quoting JorjMcKie's measurement session above)

Must I call doc.close() when I am done with a PDF doc? I see that memory is not reclaimed after I use fitz.open().
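For completeness, a sketch of explicit cleanup (an editor's addition, assuming a current PyMuPDF): Document objects support the context-manager protocol, so close() runs automatically when the block exits. Note that, per the discussion above, MuPDF's store may still cache data up to its built-in limit even after a document is closed.

```python
def count_pages(path):
    """Open a PDF, read its page count, and guarantee close() runs."""
    import fitz  # PyMuPDF
    with fitz.open(path) as doc:  # close() is called when the block exits
        return doc.page_count     # attribute name in current PyMuPDF
```

Without close() (or the `with` block), the document's resources are only released when the Python object itself is garbage-collected, which can look like memory not being recycled.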
