Performance issue with preparedHTML due to bs4 encoding detection #452

digidigital · 2025-02-16T17:01:33Z

Bug Metadata

Version of extract_msg: 0.53.1
Your python version: Python 3.10.12
How did you launch extract_msg?
- [x] I used the extract_msg package

Describe the bug
If you convert msg files that have many large images embedded in the HTML body, the performance of extract-msg degrades when you use the prepared HTML option.

I was able to track this down (I guess) to the HTML being parsed multiple times with bs4 after the images are injected.
Since bs4 does not know the HTML charset, it tries to detect it on every call.
With the injected images the document gets really large, and the character detection seems to be a byte-by-byte process that takes minutes.

msg.getSaveHtmlBody(prepared=True)

It seems the first bs4 call happens in injectHtmlHeader -> self.htmlBodyPrepared.
This step is fast because bs4 only parses the (small) HTML; the images are injected as base64-encoded strings after the encoding has been detected. From then on the HTML is far larger than before.

validateHTML parses the HTML again (now the large one).

If the validation fails, the HTML is parsed a third time!

After validation (or correction of the HTML), getSaveHtml parses the whole HTML a third (or fourth) time if you are in "prepared" mode.
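To check whether encoding detection really is the slow part, the parse can be timed with and without `from_encoding`. This is only a rough sketch and not a benchmark: the `time_parse` helper and the synthetic markup are made up for illustration, and how large the gap is (if any) depends on the content and on which detection library bs4 finds installed.

```python
import base64
import time
from typing import Optional

from bs4 import BeautifulSoup


def time_parse(markup: bytes, encoding: Optional[str] = None) -> float:
    """Return the seconds one bs4 parse takes; encoding=None lets bs4 detect it."""
    start = time.perf_counter()
    BeautifulSoup(markup, "html.parser", from_encoding=encoding)
    return time.perf_counter() - start


# Stand-in for an HTML body without a charset declaration and one big inline image.
image = base64.b64encode(bytes(range(256)) * 1024)  # roughly 350 kB of base64
markup = (
    b'<html><body><p>Hello</p><img src="data:image/png;base64,'
    + image
    + b'"></body></html>'
)

detect_seconds = time_parse(markup)          # bs4 has to work out the encoding
known_seconds = time_parse(markup, "utf-8")  # encoding supplied up front
print(f"detect: {detect_seconds:.3f}s  known: {known_seconds:.3f}s")
```

With a real message, the bytes of the prepared HTML body would take the place of `markup`.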

Possible Solution (?)
A possible solution could be to add self.original_encoding = None to the MessageBase init and pass from_encoding=self.original_encoding to every bs4 call (including the one in validateHTML).

self.original_encoding could then be set to the encoding detected in the first bs4 call, which happens when self.htmlBodyPrepared is called.

That way the detection

- only runs once
- on the small HTML
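A minimal sketch of that idea, outside of extract_msg: `HtmlEncodingCache` is a hypothetical helper and not part of the library, and the sample markup is invented. It relies only on two documented bs4 features, the `from_encoding` argument and the `original_encoding` attribute of the parsed soup.

```python
import base64
from typing import Optional

from bs4 import BeautifulSoup


class HtmlEncodingCache:
    """Hypothetical helper: remembers the encoding bs4 detected on the first parse."""

    def __init__(self) -> None:
        self.original_encoding: Optional[str] = None

    def parse(self, markup: bytes) -> BeautifulSoup:
        # On the first call from_encoding is None, so bs4 detects the encoding.
        # On later calls the cached encoding is handed to bs4 up front.
        soup = BeautifulSoup(
            markup, "html.parser", from_encoding=self.original_encoding
        )
        if self.original_encoding is None:
            self.original_encoding = soup.original_encoding
        return soup


cache = HtmlEncodingCache()

# First parse: small HTML, images not injected yet.
small_html = (
    '<html><head><meta charset="utf-8"></head>'
    '<body><p>Grüße</p><img src="cid:image001"></body></html>'
).encode("utf-8")
small_soup = cache.parse(small_html)

# Later parses: same document, now with a large base64 image injected.
payload = base64.b64encode(bytes(range(256)) * 1024)
large_html = small_html.replace(
    b"cid:image001", b"data:image/png;base64," + payload
)
large_soup = cache.parse(large_html)

print(cache.original_encoding)
```

Inside extract_msg the cached value would live on the message object instead of a separate helper, as suggested above.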

TheElementalOfDestruction · 2025-03-13T14:52:55Z

Sorry it took so long to get back to you. I'm working on this right now; I'm just looking for a way to fetch the encoding from the BeautifulSoup parser.

TheElementalOfDestruction · 2025-03-13T15:22:44Z

I've pushed some code to next-release that should fix the issue. Please let me know if this fixes the performance issues for you; if it does, I'll go ahead and release the next version.

TheElementalOfDestruction added a commit that referenced this issue Mar 13, 2025

Fix a few bugs and started working on #452

f1752de

TheElementalOfDestruction added a commit that referenced this issue Mar 13, 2025

Add fix for #452

42991e5

TheElementalOfDestruction mentioned this issue Mar 14, 2025

Version 0.53.2 #455

Merged

This was referenced Mar 17, 2025

Bump extract-msg from 0.52.0 to 0.53.2 ropable/prs#214

Closed

Bump extract-msg from 0.52.0 to 0.53.2 dbca-wa/prs#457

Closed

dependabot bot mentioned this issue Mar 24, 2025

Bump extract-msg from 0.52.0 to 0.54.0 ropable/prs#216

Closed

dmarkow mentioned this issue Mar 24, 2025

Very slow conversion to HTML of some files #459

Closed


dependabot bot mentioned this issue Mar 24, 2025

Bump extract-msg from 0.52.0 to 0.54.0 dbca-wa/prs#458

Closed
