-
Notifications
You must be signed in to change notification settings - Fork 932
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Removing Pycryptodome Dependency: Massive bloat (extra 260 MB) #429
Comments
Hi @corneliusroemer, thanks for this issue! I'm focussing on the pycryptodome stuff in this response. If you experience the install size of pdfminer.six as a problem, then you should create a separate issue for that. On the pypi page of pycryptodome it is explicitly mentioned that it is a self-contained package:
Do you know why you need to install these c libraries then? Impact The SHA256 and AES algorithms are only used when they are successfully imported. And the ARC4 has a backup implementation in arcfour.py. So it looks like pdfminer.six already supports running without pycryptodome. SHA256 can be replaced by Things we could do
|
He has to install the compiler toolchains and dependent C libs because he's using Alpine linux. Most linux distributions use the GNU version of the standard C library (glibc). Alpine linux uses musl. So any pre-compiled C code (pycryptodome C extensions in a wheel package) that links against glibc won't work there. The result is that pycryptodome's C extensions must be compiled according to the rules of its setup.py. I'm assuming that's what is using the tools he mentioned he had to install. All that may be the case, but most python libraries with C extensions are going to have this problem. It's just the pain of using Alpine. Some however have precompiled packages specifically for Alpine (like cryptography). |
Proposal:
This removes any dependency on pycryptome and adds a dependency on cryptography. |
@pietermarsman That seems like the appropriate approach if you don't want to support AES encrypted pdfs by default. I would suggest moving the arc4 routines to use cryptography as well, and to remove the arcfour.py module. This would be faster and more consistent. I can do this in my associated PR if you want. W.r.t arcfour; Normally there would be a security concern of using your self-implemented version of the encryption routine but you're only decrypting. There are still edge cases like an attacker that has access to the original encrypted document and attackable side-channel access to the machine doing decryption. Prima facia it doesn't seem like a big deal, but my advice would be along the lines of "better safe than sorry" with encryption. |
Some clarification surrounding my claim of 260MB bloat when installing pdfminer: |
Feature request
Trying to replace PyPDF2 with pdfminer.six for PDF text extraction in an Alpine Linux Python base container (python:alpine3.7) I hit a significant roadblock because of pdfminer's dependency on
pycryptodome
.Pycryptodome installation requires gcc and a whole load of other dependencies that aren't present in the light base container (90 MB). It took me a while to figure out how to circumvent this problem, which is another reason to try to get rid of the dependency on pycryptodome (https://stackoverflow.com/a/35713086/7483211).
Installation of
apk add gcc g++ make libffi-dev openssl-dev
(requirements for pycryptodome) bloats the image from a mere 90MB to 270 MB.Then, installing pdfminer.six on top bloats to 351MB. So just for pdfminer.six to be installed almost quadruples storage space.
PyPDF2 didn't really any noticeable extra storage.
I'm not sure what pycryptodome is being used for, if it's not mission-critical, it may be helpful to add it as an optional module, to be installed when actually used, rather than having it ship by default. I guess most people never need it.
I'm also not sure why the pdfminer.six installation is so big itself (seems to be around 80MB in my case). Maybe this can also be optimised further.
The text was updated successfully, but these errors were encountered: