Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Removing Pycryptodome Dependency: Massive bloat (extra 260 MB) #429

Closed
corneliusroemer opened this issue May 18, 2020 · 5 comments · Fixed by #456
Closed

Removing Pycryptodome Dependency: Massive bloat (extra 260 MB) #429

corneliusroemer opened this issue May 18, 2020 · 5 comments · Fixed by #456
Assignees

Comments

@corneliusroemer
Copy link

Feature request

Trying to replace PyPDF2 with pdfminer.six for PDF text extraction in an Alpine Linux Python base container (python:alpine3.7) I hit a significant roadblock because of pdfminer's dependency on pycryptodome.

Pycryptodome installation requires gcc and a whole load of other dependencies that aren't present in the light base container (90 MB). It took me a while to figure out how to circumvent this problem, which is another reason to try to get rid of the dependency on pycryptodome (https://stackoverflow.com/a/35713086/7483211).

Installation of apk add gcc g++ make libffi-dev openssl-dev (requirements for pycryptodome) bloats the image from a mere 90MB to 270 MB.

Then, installing pdfminer.six on top bloats to 351MB. So just for pdfminer.six to be installed almost quadruples storage space.

PyPDF2 didn't really any noticeable extra storage.

I'm not sure what pycryptodome is being used for, if it's not mission-critical, it may be helpful to add it as an optional module, to be installed when actually used, rather than having it ship by default. I guess most people never need it.

I'm also not sure why the pdfminer.six installation is so big itself (seems to be around 80MB in my case). Maybe this can also be optimised further.

@pietermarsman
Copy link
Member

Hi @corneliusroemer, thanks for this issue!

I'm focussing on the pycryptodome stuff in this response. If you experience the install size of pdfminer.six as a problem, then you should create a separate issue for that.

On the pypi page of pycryptodome it is explicitly mentioned that it is a self-contained package:

PyCryptodome is a self-contained Python package of low-level cryptographic primitives.

Do you know why you need to install these c libraries then?

Impact
As far as I can tell the Crypto package (that is installed if you install pycryptodome) is only used in pdfdocument.py. It imports 3 different algorithms: SHA256, ARC4 and AES. These are used for decryption of the document (e.g. when it is locked with a password).

The SHA256 and AES algorithms are only used when they are successfully imported. And the ARC4 has a backup implementation in arcfour.py. So it looks like pdfminer.six already supports running without pycryptodome.

SHA256 can be replaced by hashlib.sha256. AES can be replaced with cryptography.fernet.Fernet or with rijndael.py (in pdfminer.six but never used).

Things we could do

  1. Not install pycryptodome by default and raise an error that explains the situation when it must be used.
    • Easy since pdfminer.six already asumes that the package is optional
  2. Replace pycryptodome with other (pure python?) implementations of the encryption algorithms.
    • We have test pdf's with aes and arc4 encryption so it's easy to test.
    • I prefer to using the standard most commonly used encryption package for python. Not sure if that is pycryptodome or cryptography.

@lithiumFlower
Copy link
Contributor

lithiumFlower commented Jul 10, 2020

Do you know why you need to install these c libraries then?

He has to install the compiler toolchains and dependent C libs because he's using Alpine linux. Most linux distributions use the GNU version of the standard C library (glibc). Alpine linux uses musl. So any pre-compiled C code (pycryptodome C extensions in a wheel package) that links against glibc won't work there.

The result is that pycryptodome's C extensions must be compiled according to the rules of its setup.py. I'm assuming that's what is using the tools he mentioned he had to install.

All that may be the case, but most python libraries with C extensions are going to have this problem. It's just the pain of using Alpine. Some however have precompiled packages specifically for Alpine (like cryptography).

@pietermarsman
Copy link
Member

Proposal:

  • don't install any cryptography package by default
  • use extras_require in setup.py to specify encryption packages
  • use cryptography package for aes, hashlib for sha256, use arc4 from arcfour.py

This removes any dependency on pycryptome and adds a dependency on cryptography.

@lithiumFlower
Copy link
Contributor

lithiumFlower commented Jul 12, 2020

@pietermarsman That seems like the appropriate approach if you don't want to support AES encrypted pdfs by default.

I would suggest moving the arc4 routines to use cryptography as well, and to remove the arcfour.py module. This would be faster and more consistent. I can do this in my associated PR if you want.

W.r.t arcfour; Normally there would be a security concern of using your self-implemented version of the encryption routine but you're only decrypting. There are still edge cases like an attacker that has access to the original encrypted document and attackable side-channel access to the machine doing decryption. Prima facia it doesn't seem like a big deal, but my advice would be along the lines of "better safe than sorry" with encryption.

@corneliusroemer
Copy link
Author

Some clarification surrounding my claim of 260MB bloat when installing pdfminer:
#456 (comment)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants