mb_detect_encoding is very slow #90

schlessera · 2021-03-04T18:14:53Z

Profiling the current test suite showed that the use of mb_detect_encoding() made up more than 40% of the execution time of the entire test suite.

We should:

- ensure we only use it when really necessary, to reduce the number of times it runs
- document that it is slow and that documents should therefore use a proper charset meta tag to begin with
- make sure a charset is set in integrations like the WP plugin to avoid its use altogether in those cases
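
As a rough illustration of the first point, one way to run detection less often is to validate against UTF-8 first and only fall back to detection when that fails. This is a minimal sketch, not the library's code: the function name `detectEncodingCheaply` and the fallback candidate list are assumptions made up for this example, and whether the single-encoding check is actually faster should be confirmed by profiling.

```php
<?php
// Hypothetical helper (not part of the library): only call
// mb_detect_encoding() when a simple UTF-8 validity check fails.
function detectEncodingCheaply(string $source): string
{
    // Checking validity against one known encoding avoids running
    // the multi-encoding detection heuristics at all.
    if (mb_check_encoding($source, 'UTF-8')) {
        return 'UTF-8';
    }

    // Fallback: run detection over a short, explicit candidate list
    // (assumed here; adjust to the encodings you expect) in strict mode.
    $detected = mb_detect_encoding($source, ['ISO-8859-1', 'Windows-1252'], true);

    // mb_detect_encoding() returns false when nothing matches.
    return $detected !== false ? $detected : 'UTF-8';
}
```

With this shape, documents that are already valid UTF-8 never reach `mb_detect_encoding()`.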


schlessera · 2021-10-19T15:14:55Z

It seems like this will become even more of a problem with PHP 8.1+, as the detection mechanism has changed to be more precise.

Re:

> ensure we only use it if really necessary to decrease the amount of times it runs

@ediamin Can you please document the exact scenarios here where mb_detect_encoding is being used, and then see whether its usage can be reduced in some way?

ediamin · 2021-11-02T10:16:23Z

Currently we use mb_detect_encoding only in the DocumentEncoding document filter. It is used to detect the encoding of the document in the following cases:

There is no charset meta tag present in the head and we do not pass any charset when using the Document class. For example:

$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = '';
$document = Document::fromHtml($source, $charset);

We set auto as the charset, either in the source or in the Document param:

$source = '<!DOCTYPE html><html><head></head><body>hello world</body></html>';
$charset = 'auto';
// or
$source = '<!DOCTYPE html><html><head><meta charset="auto"></head><body>hello world</body></html>';
$charset = '';

Additionally, if the charset is not set to utf-8, which AMP requires, the source will be converted to utf-8 from the provided encoding. So for optimal performance, we should always use the UTF-8 charset in the source or the Document param, or at least provide a valid charset other than auto.
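
To make the conditions above easier to check against real input, here is a small sketch of a predicate that reports whether detection would be triggered. It is not the actual DocumentEncoding code: `needsEncodingDetection` is a hypothetical name, the regular expression is a simplification of real meta tag parsing, and it assumes that a non-empty charset param takes precedence over the meta tag.

```php
<?php
// Hypothetical predicate (not the DocumentEncoding filter itself):
// returns true when the cases described above would lead to detection.
function needsEncodingDetection(string $source, string $charset = ''): bool
{
    // Assumption: a non-empty charset param wins over the meta tag.
    if ($charset !== '') {
        return strtolower($charset) === 'auto';
    }

    // Simplified lookup of a charset declared in a meta tag.
    $pattern = '/<meta\s[^>]*charset\s*=\s*["\']?([^"\'\s\/>]+)/i';
    if (preg_match($pattern, $source, $matches) !== 1) {
        // No charset in the source and none passed in.
        return true;
    }

    // A meta charset of "auto" also requests detection.
    return strtolower($matches[1]) === 'auto';
}
```

For the first example above (no meta tag, empty `$charset`) this returns `true`; passing `'utf-8'` as `$charset` makes it return `false`.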

schlessera · 2021-11-02T14:21:41Z

This is something we should keep in mind and actually flag as an issue in the PXE analysis.

schlessera added Performance DOM labels Mar 4, 2021

schlessera added the Good First Issue Good for newcomers label Apr 28, 2021

06romix mentioned this issue May 6, 2021

Add comment about mb_detect_encoding performance #186

Merged

schlessera assigned schlessera and ediamin and unassigned schlessera Oct 19, 2021
