Performance issue with preparedHTML due to bs4 encoding detection #452

Open
digidigital opened this issue Feb 16, 2025 · 2 comments

digidigital commented Feb 16, 2025

Bug Metadata

  • Version of extract_msg: 0.53.1
  • Your python version: Python 3.10.12
  • How did you launch extract_msg?
    • [X] I used the extract_msg package

Describe the bug
If you convert msg files that have many large images embedded in the HTML body, the performance of extract-msg degrades when you use the prepared HTML option.

I was able to track this down (I think) to the HTML being parsed multiple times with bs4 after the images are injected.
Since bs4 is not told the HTML charset, it tries to detect it on every call.
With the injected images the document gets really large, and the character detection appears to be a byte-by-byte process that takes minutes.

msg.getSaveHtmlBody(prepared=True)

It seems bs4 is first called in injectHtmlHeader -> self.htmlBodyPrepared.
This step is fast because bs4 only parses the (small) HTML; the images are injected as base64-encoded strings after the encoding has been detected. From this point on, the HTML is much larger than before.

validateHTML parses the HTML again (now the large one)

  • If the validation fails the HTML is parsed a third time!

After validation (or correction of the HTML), getSaveHtmlBody parses the whole HTML a third (or fourth) time if you are in "prepared" mode.

Possible Solution (?)
A possible solution could be to add self.original_encoding = None to the MessageBase init and pass from_encoding=self.original_encoding to every bs4 call (including the one in validateHTML).

self.original_encoding could then be set to the encoding detected in the first call to bs4, when self.htmlBodyPrepared is evaluated.

That way the detection

  • only runs once
  • on the small HTML
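
A minimal sketch of the idea, not the actual extract_msg code: `HtmlParserCache` is a hypothetical helper that stands in for the attribute on MessageBase, and the sample documents are made up. It relies only on the `from_encoding` argument and the `original_encoding` attribute of BeautifulSoup.

```python
from bs4 import BeautifulSoup


class HtmlParserCache:
    """Hypothetical helper (not part of extract_msg) that remembers the
    encoding bs4 detected on the first parse and reuses it afterwards."""

    def __init__(self):
        self.original_encoding = None

    def parse(self, html: bytes) -> BeautifulSoup:
        # from_encoding=None lets bs4 detect the encoding itself; a cached
        # value is offered to bs4 first on every later call.
        soup = BeautifulSoup(html, 'html.parser',
                             from_encoding=self.original_encoding)
        if self.original_encoding is None:
            self.original_encoding = soup.original_encoding
        return soup


# Made-up sample data: a small body, then the same body with a large
# base64-style payload injected, as happens in "prepared" mode.
small = (b'<html><head><meta charset="utf-8"></head>'
         b'<body><p>caf\xc3\xa9</p></body></html>')
payload = b'<img src="data:image/png;base64,' + b'A' * 50_000 + b'">'
large = small.replace(b'</body>', payload + b'</body>')

cache = HtmlParserCache()
cache.parse(small)          # detection happens here, on the small HTML
soup = cache.parse(large)   # the cached encoding is passed to bs4
```

Whether this removes the slowdown in practice would still need to be measured against a real msg file.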

TheElementalOfDestruction (Collaborator) commented

Sorry it took so long to get back to you. I'm working on this right now; I'm just looking for a way to fetch the encoding from the BeautifulSoup parser.
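
For reference, a small sketch of one way to read the detected encoding: BeautifulSoup exposes it as the `original_encoding` attribute of the soup object. The sample markup is made up, and this is not necessarily how the eventual fix does it.

```python
from bs4 import BeautifulSoup

# Made-up sample document with a declared charset.
text = ('<html><head><meta charset="utf-8"></head>'
        '<body><p>Gr\u00fc\u00dfe</p></body></html>')
raw = text.encode('utf-8')

# Parsing bytes triggers encoding detection; the result is stored on the soup.
soup_from_bytes = BeautifulSoup(raw, 'html.parser')
detected = soup_from_bytes.original_encoding

# Parsing an already-decoded str skips detection, so the attribute is None.
soup_from_str = BeautifulSoup(text, 'html.parser')
not_detected = soup_from_str.original_encoding
```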

TheElementalOfDestruction added a commit that referenced this issue Mar 13, 2025
TheElementalOfDestruction (Collaborator) commented

I've pushed some code to next-release that should fix the issue. Please let me know whether it resolves the performance problems for you; if it does, I'll go ahead and release the next version.
