Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

mb_detect_encoding is very slow #90

Open
1 of 3 tasks
schlessera opened this issue Mar 4, 2021 · 3 comments
Open
1 of 3 tasks

mb_detect_encoding is very slow #90

schlessera opened this issue Mar 4, 2021 · 3 comments
Assignees
Labels

Comments

@schlessera
Copy link
Collaborator

schlessera commented Mar 4, 2021

Profiling the current test suite showed that the use of mb_detect_encoding() made up more than 40% of the execution time of the entire test suite.

We should:

  • ensure we only use it if really necessary to decrease the amount of times it runs
  • document that it is slow and that documents should therefore use a proper charset meta tag to begin with
  • make sure a charset is set in integration like the WP plugin to avoid its use altogether in those cases
@schlessera
Copy link
Collaborator Author

It seems like this will become even more of a problem with PHP 8.1+, as the detection mechanism has changed to be more precise.

Re:

ensure we only use it if really necessary to decrease the amount of times it runs

@ediamin Can you please document the exact scenarios here where mb_detect_encoding is being used, and then see whether its usage can be reduced in some way?

@schlessera schlessera assigned schlessera and ediamin and unassigned schlessera Oct 19, 2021
@ediamin
Copy link
Collaborator

ediamin commented Nov 2, 2021

Currently we are using mb_detect_encoding only in DocumentEncoding document filter. It'll use to detect the encoding of the document if:

  1. There is no charset meta tag present in head and we do not set any charset when we use the Document class. For example,
$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = '';
$document = Document::fromHtml($source, $charset);
  1. We set auto as the charset either in source or Document param,
$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = 'auto';
// or
$source = '<!DOCTYPE html><html><head><meta charset="auto"></head><body>hello world</body></html>';
$charset = '';

Additionally if we do not set charset as utf-8 which is required by the AMP, the source will convert to utf-8 from the provided encoding. So for the optimized performance, we should always use the UTF-8 charset in the source or Document param, or at least provide a valid charset other than the auto.

@schlessera
Copy link
Collaborator Author

This is something we should keep in mind and actually flag as an issue in the PXE analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants