'_io.BytesIO' object has no attribute 'name' #124
Comments
Fix #124 opening PDF with bytes stream
Hi, I just want to report that PR #179 breaks the
So I pinpointed the issue. It is related to the resolution and the number of pages.
The This will let GS run for approximately a minute at full load on one core, then return and start eating all my memory with very high CPU usage. This does not happen with the old implementation, where the test function above exits after approximately 5 seconds on the same machine.
The difference comes in
The PR jsvine#179 leads to a CPU load and memory leak. The problem is documented here jsvine#124
Thank you for flagging this, @ubmarco! I'm not terribly familiar with Wand's internals, so I may have to do some additional research. But, in the meantime, what do you think of this short-term solution?:
Hi @jsvine, yes, that would be an option. We're working on a fork anyway because we needed some further adaptations, so I just reverted the PR on our side. No need to rush. However, this problem might affect others, and a CPU/memory leak does not obviously correlate with the binary reading of the PDF. It's definitely worth investigating further.
Thanks again, @ubmarco. I checked and noticed I was seeing the same memory problems. (I hadn't noticed it in the test suite because the tested PDFs are intentionally small/short.) In v0.5.20, just now released, Still, the CPU/memory leak when using bytes is not ideal. If anyone has suggestions on how to resolve that in pdfplumber (or whether it requires changes to ImageMagick or Wand), I'd be very interested to hear them. Thanks in advance.
Thanks for that quick fix @jsvine, I will try it out. Defaulting to Wand/ImageMagick file reading is a good option for me. And generally thanks for being so responsive.
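For callers who only hold the PDF in memory, one way to still go through file-based reading is to spool the bytes to a temporary file first. The following is a minimal sketch using only the standard library; `as_named_pdf` is a hypothetical helper name, not part of pdfplumber or Wand, and whether it avoids the CPU/memory behaviour described above has not been measured here.

```python
import io
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def as_named_pdf(stream):
    """Hypothetical helper: yield a filesystem path that holds the stream's bytes."""
    name = getattr(stream, "name", None)
    if isinstance(name, str) and os.path.exists(name):
        # The stream is already backed by a real file, so reuse its path.
        yield name
        return
    # In-memory stream (e.g. io.BytesIO): copy its contents to a temp file.
    stream.seek(0)
    tmp = tempfile.NamedTemporaryFile(suffix=".pdf", delete=False)
    try:
        with tmp:
            tmp.write(stream.read())
        yield tmp.name
    finally:
        # Remove the temporary copy once the caller is done with it.
        os.remove(tmp.name)


# Placeholder bytes only; a real PDF would be needed for actual rendering.
with as_named_pdf(io.BytesIO(b"%PDF-1.4 placeholder")) as path:
    print(os.path.getsize(path))
```

The path yielded inside the `with` block could then be handed to whatever expects a filename.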
Thank you for the very clear bug reports!
Closing this issue on the realization that the core bug here has been fixed and that the CPU / memory leak issue is being tracked in #193
The `to_image()` method does not seem to work if the `pdfplumber.PDF` object was created using a `BytesIO` stream. The rest of the functionality seems unaffected.

The problem seems to arise in the call to `wand.image.Image()` in the `get_page_image()` function in `display.py`. This image function can take file objects via the `file` argument, explained here, but `get_page_image()` only ever uses the `filename` parameter. Line 42 of the `PageImage` class also looks for the name of the stream, but BytesIO objects do not have a name. Extracting characters, rectangles, etc. can still be done with these BytesIO objects.

The MWE:
Gives the error:
Not sure how best to fix this issue.
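One possible direction, sketched here without touching pdfplumber itself: pick the keyword argument for `wand.image.Image()` based on whether the stream has a usable name. `wand_source_kwargs` is a hypothetical helper invented for illustration; the only assumption about Wand is that `Image()` accepts the `filename` and `file` keywords mentioned above.

```python
import io


def wand_source_kwargs(stream):
    """Hypothetical helper: choose keyword arguments for wand.image.Image()."""
    name = getattr(stream, "name", None)
    if isinstance(name, str):
        # Regular file objects opened from a path expose that path as .name.
        return {"filename": name}
    # io.BytesIO has no .name, so pass the stream itself, rewound to the start.
    stream.seek(0)
    return {"file": stream}


buf = io.BytesIO(b"%PDF-1.4 placeholder bytes")
print(hasattr(buf, "name"))             # BytesIO objects carry no name attribute
print(sorted(wand_source_kwargs(buf)))  # which keyword would be used
```

The result would be unpacked into the call, for example `wand.image.Image(**wand_source_kwargs(stream), resolution=...)`, with the remaining arguments left as they are today.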