Performance issue with preparedHTML due to bs4 encoding detection #452
Comments
TheElementalOfDestruction added a commit that referenced this issue (Mar 13, 2025)
Sorry it took so long to get back to you. I'm working on this right now; I'm just looking for a way to fetch the encoding from the BeautifulSoup parser.
TheElementalOfDestruction added a commit that referenced this issue (Mar 13, 2025)
I've pushed some code to next-release that should fix the issue. Please let me know if this fixes the performance issues for you, and if it does I'll go ahead and release the next version.
This was referenced Mar 17, 2025
Bug Metadata
Describe the bug
If you convert msg files that have many large images embedded in the HTML body, the performance of extract-msg degrades when you use the prepared HTML option.
I was able to track this down (I think) to the HTML being parsed multiple times with bs4 after the images are injected.
Since bs4 is not given the HTML charset, it tries to detect it on every call.
With the injected images the file gets really large, and the character detection seems to be a byte-by-byte process that takes minutes.
msg.getSaveHtmlBody(prepared=True)
It seems the first call to bs4 happens in injectHtmlHeader -> self.htmlBodyPrepared.
This step is fast, since bs4 only parses the (small) HTML, and the images are injected as base64-encoded strings after the encoding has been detected. From this point on the HTML is much larger than before.
validateHTML then parses the HTML again (now the large version).
After validation (or correction of the HTML), getSaveHtml parses the whole HTML a third (or fourth) time if you are in "prepared" mode.
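The encoding behaviour described above can be observed directly in bs4. The following is only a minimal sketch with a made-up sample document, not code from extract-msg: it reads the detected encoding from the soup's `original_encoding` attribute, passes it back through `from_encoding` on a later parse, and shows that `str` input carries no detected encoding at all.

```python
from bs4 import BeautifulSoup

# Made-up sample document, used for illustration only.
raw = b'<html><head><meta charset="utf-8"></head><body><p>caf\xc3\xa9</p></body></html>'

# Bytes input without from_encoding: bs4 has to work out the encoding itself.
detected = BeautifulSoup(raw, "html.parser")
encoding = detected.original_encoding

# A later parse can be told the encoding instead of rediscovering it.
reparsed = BeautifulSoup(raw, "html.parser", from_encoding=encoding)

# str input is already decoded, so original_encoding stays None.
from_text = BeautifulSoup(raw.decode(encoding), "html.parser")

print(encoding, reparsed.p.get_text(), from_text.original_encoding)
```

Whether the repeated detection is the actual bottleneck in extract-msg would still need to be confirmed by profiling the large document.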
Possible Solution (?)
A possible solution could be to add self.original_encoding = None to the MessageBase init and pass from_encoding=self.original_encoding to every bs4 call (including the one in validateHTML).
self.original_encoding could then be set to the encoding that is detected in the first call to bs4, when self.htmlBodyPrepared is called.
That way the detection
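A rough sketch of this idea, assuming nothing about the real extract-msg internals: `PreparedHtmlSketch`, `parse` and `inject_image` are hypothetical names standing in for MessageBase and its HTML helpers. Only `BeautifulSoup(..., from_encoding=...)`, `original_encoding`, `new_tag` and `encode` are actual bs4 calls.

```python
from bs4 import BeautifulSoup


class PreparedHtmlSketch:
    """Hypothetical stand-in for MessageBase; not the real extract-msg class."""

    def __init__(self, html: bytes):
        self.html = html
        # Unknown until the first (small, cheap) parse has run.
        self.original_encoding = None

    def parse(self, markup: bytes) -> BeautifulSoup:
        # from_encoding=None lets bs4 detect; afterwards the cached value is passed in.
        soup = BeautifulSoup(markup, "html.parser", from_encoding=self.original_encoding)
        if self.original_encoding is None:
            self.original_encoding = soup.original_encoding
        return soup

    def inject_image(self, data_uri: str) -> None:
        # First parse happens on the small document and records the encoding.
        soup = self.parse(self.html)
        soup.body.append(soup.new_tag("img", src=data_uri))
        # Serialize back to bytes in the same encoding (utf-8 as a fallback).
        self.html = soup.encode(self.original_encoding or "utf-8")


small = b'<html><head><meta charset="utf-8"></head><body><p>Hello</p></body></html>'
body = PreparedHtmlSketch(small)
body.inject_image("data:image/png;base64," + "A" * 200_000)

# Later passes (validation, saving) parse the large document with the cached encoding.
validated = body.parse(body.html)
print(body.original_encoding, len(body.html))
```

The design choice here is to route every parse through one helper, so no call site can forget to pass the cached encoding.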