
Remove dependence on mbstring extension #221

Open
wants to merge 12 commits into base: main

Conversation

JKingweb
Contributor

This patch removes reliance on the mbstring extension in three places:

  1. The unicodeTrim function used mbstring to generate the UTF-8 encoding of a U+00A0 NO-BREAK SPACE character. This has been replaced with a sequence of byte escapes. The trimming has also been corrected so that it no longer strips no-break spaces from the middle of the string (see the sketch after this list).
  2. The nestedMfPropertyNamesFromClass function used mbstring to remove a prefix from the start of a string. All prefixes to remove are ASCII, however, making the use of mbstring unnecessary. The calls to mbstring functions have simply been replaced with their standard PHP equivalents.
  3. The unicodeToHtmlEntities function uses mbstring to work around decoding problems in the DOMDocument class, according to commit d1c70ad. The commit in question does not elaborate on what these decoding problems are, nor is this functionality tested in the test suite. It is unclear whether these problems even exist in modern PHP versions. I have elected to make the conversion optional, using mbstring if available.
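
A minimal sketch of the byte-escape approach from point 1 (the exact function body is illustrative; the patch itself may differ):

<?php
// Hypothetical sketch: trim ASCII whitespace and U+00A0 from the ends only.
// "\xC2\xA0" is the UTF-8 byte sequence for NO-BREAK SPACE, written as byte
// escapes which PCRE interprets itself, so no mbstring call is needed.
function unicodeTrim($str) {
    return preg_replace('/^(?:\s|\xC2\xA0)+|(?:\s|\xC2\xA0)+$/', '', $str);
}
echo unicodeTrim("\xC2\xA0 foo\xC2\xA0bar "); // ends trimmed; interior no-break space kept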

The classname prefixes are always ASCII, so mb_ functions are not needed
It's unclear what problems the function was originally meant to solve;
it is equally unclear whether the problems still exist
@JKingweb
Contributor Author

Note this would address issue #136.

@Zegnat
Member

Zegnat commented Sep 12, 2020

To address 3: yes, I believe this is still a problem even in the latest PHP stable. This is because it is not a PHP problem but a libxml problem. By today's standards, using libxml for HTML parsing really is not cutting it. But it is the only somewhat-globally available parser PHP has ever had.

There is a little more information about it here on StackOverflow, which also mentions the solution php-mf2 is using.

(This is just one of many things you need to think about when doing HTML parsing in PHP.)

I feel like we really need to add tests where we parse non-UTF-8 HTML input, to see whether the removal of the mb_ functions breaks anything.

@JKingweb
Contributor Author

Thanks for clarifying. It sounds to me like the answer is to handle character encoding properly: detect the encoding either from the HTTP header or via a pre-scan, then either prepend an XML declaration or modify any existing one.

I did write an implementation of the HTML charset pre-scan algorithm last year for my own HTML parser. It shouldn't be that hard to backport to PHP 5.6, but it would add a lot of code (about 500 lines, with comments).
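
A sketch of that shape, assuming the detected name is already in hand ($detected stands in for the header or pre-scan result; note that a later comment in this thread finds libxml's handling of these declarations buggy before 2.9.11, so this is illustrative only):

<?php
// Prepending an XML declaration steers libxml's decoder to the detected
// encoding before DOMDocument parses the markup.
$html = "<p>caf\xC3\xA9</p>"; // hypothetical UTF-8 input
$detected = 'UTF-8';
$doc = new DOMDocument;
@$doc->loadHTML("<?xml encoding='$detected'?>".$html);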

@Zegnat
Member

Zegnat commented Sep 13, 2020

That might work. This also may already be handled if you use the userland HTML5 parser we recommend. I just think we do not have any tests for it.

At some point we will also have to weigh whether it is worth putting character-encoding checking in our code (and having to maintain it) or whether it is better to make something like the Masterminds HTML5 parser a mandatory dependency.

I'd love to hear the input from the other maintainers here. But my instinct is to err on the side of "if it ain't broke, don't fix it". It would be really nice if we could get rid of mb_ function dependencies, but without the full test coverage of why they are in place, I am a little scared of merging something like this PR as it stands 😓

Let's keep this PR open to keep discussing what our alternatives are, though!

@JKingweb
Contributor Author

In an effort to better understand the problem, I ran DOMDocument::loadHTML() through html5lib's encoding-detection test suite (comprising 82 tests), inferring the encoding by reading out some data I inserted into the test input. It failed 76 of the 82 tests, though because of how the tests are written the failure rate is perhaps overstated. In any case, one can assert the following:

  1. It does not attempt heuristic detection before using its fallback encoding in the absence of a charset declaration (which is not, it should be noted, a requirement)
  2. Its fallback encoding is indeed ISO 8859-1, not Windows-1252 as it should be
  3. It does not support the modern <meta charset="..."> charset declaration (this is the source of most failures)
  4. Parsing fails in the presence of malformed characters rather than generating replacement characters

Given all that (especially the third point), I am forced to agree that the patch under discussion would be insufficient. However, the current arrangement isn't great either. It has its own set of problems:

  1. At least on my system, mb_detect_order() returns ["ASCII", "UTF-8"], so while it will detect UTF-8, it won't detect anything else, returning false for most encodings (or for some, "UTF-8" incorrectly).
  2. Passing false as the third parameter of mb_convert_encoding() issues a warning, so if mb_detect_encoding() does fail, mf2 will currently issue said warning
  3. If the document is not in fact UTF-8, non-ASCII characters will be dropped. For example, feeding it the byte 0xA5 (yen sign in Windows-1252 encoding) yields the nonsense character reference &#2013266085;, which DOMDocument ignores after issuing a warning (which mf2 suppresses)
  4. Since mb_convert_encoding() is called unconditionally, mf2 may be clobbering perfectly good documents with charset declarations DOMDocument actually understands

I suggest, therefore, that in the interim commit 6fb321e essentially be reverted, and that a detection order (only UTF-8?) be specified for consistency, along with strict mode.
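
A sketch of what that interim state could look like ('page.html' is a hypothetical input):

<?php
// Pin the detection list and enable strict mode so mb_detect_encoding()
// behaves the same on every system instead of depending on the local
// mb_detect_order() setting; the guard keeps mf2 from ever passing false
// to mb_convert_encoding() (problem 2 above).
$html = file_get_contents('page.html');
$encoding = mb_detect_encoding($html, array('UTF-8'), true);
if ($encoding !== false) {
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', $encoding);
}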

@JKingweb
Contributor Author

JKingweb commented Apr 7, 2021

I've now taken a different tack on fixing this: looking for <meta charset="..."> after parsing the document, and re-parsing with a byte order mark added if necessary. It necessitated altering tests which assumed heuristic detection, but otherwise it seems to work well. I can add more rigorous tests for the charset detection itself if this is deemed a suitable solution in general.
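
In outline the approach looks something like this (names and the label list are illustrative, not the PR's actual code):

<?php
// Parse once; if a <meta charset> declares UTF-8, re-parse with a UTF-8
// BOM prepended, which DOMDocument does recognize.
$html = file_get_contents('page.html'); // hypothetical input
$doc = new DOMDocument;
@$doc->loadHTML($html);
foreach ($doc->getElementsByTagName('meta') as $meta) {
    $label = strtolower(trim($meta->getAttribute('charset')));
    if (in_array($label, array('utf-8', 'utf8', 'unicode-1-1-utf-8'), true)) {
        @$doc->loadHTML("\xEF\xBB\xBF".$html); // re-parse with BOM
        break;
    }
}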

@JKingweb
Contributor Author

Any comment on this, @Zegnat?

@Zegnat
Member

Zegnat commented Nov 7, 2021

I think I am still with my previous comment and would like someone else to weigh in besides me:

I'd love to hear the input from the other maintainers here. But my instinct is to err on the side of "if it ain't broke, don't fix it". It would be really nice if we could get rid of mb_ function dependencies, but without the full test coverage of why they are in place, I am a little scared of merging something like this PR as it stands 😓

I still do not know why we were forcing everything through mb_convert_encoding in the first place. But if I am reading your PR correctly, that is dropped in favour of either adding a BOM or leaving the input completely untouched? Is there anything we are breaking by leaving the input untouched, e.g. when the input comes in some non-Latin encoding? Something that was previously fixed by running it through the conversion function?

Having a hard time evaluating the effects of this PR.

As far as the implementation itself, I think it is some really nice work! The addition of scanning the <meta> tag is close to how browsers would handle it. I wonder if it is also worth keeping watch for an encoding in Content-Type HTTP headers, if the user is letting the mf2 parser do the fetching.

Those different inputs are also mentioned by the HTML encoding sniffing algorithm. And this is also where I feel we might be heading out on thin ice. I would not want to have to maintain encoding sniffing in the mf2 parser library. Or if it does end up in here, it should have very strict unit tests on its own. (Maybe there is an existing implementation we can pull in instead?)

I am of many minds when it comes to this PR ...

@JKingweb
Contributor Author

I still do not know why we were forcing everything through mb_convert_encoding in the first place.

The intent was evidently to bypass PHP DOM's character encoding detection, which does not support the <meta charset="..."> declaration commonly used in modern HTML documents; without the workaround, UTF-8 documents (now overwhelmingly common) would be garbled. It did this by converting all non-ASCII characters to HTML character references, based on the encoding heuristically detected by mb_detect_encoding. This left only ASCII characters, so whatever encoding PHP DOM does or does not detect is irrelevant.
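
For illustration, the conversion step reduces to something like this:

<?php
// Every non-ASCII character becomes an HTML character reference, so the
// bytes handed to DOMDocument are pure ASCII and libxml's detection is moot.
$html  = "caf\xC3\xA9"; // UTF-8 input
$ascii = mb_convert_encoding($html, 'HTML-ENTITIES', mb_detect_encoding($html));
echo $ascii; // caf&#233;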

But if I am reading your PR correctly, that is dropped in favour of either adding a BOM or leaving the input completely untouched?

Correct. If <meta charset="..."> is found in the document and the charset is some valid label for UTF-8, a UTF-8 BOM is added, which PHP DOM does recognize, and the document is re-parsed correctly. If no such meta element is found, the document is left as-is, on the assumption that it has an old-style encoding declaration and is fine.

Is there anything we are breaking by leaving the input untouched. e.g. when the input is coming in some non-latin encoding? Something that was previously fixed by running it through the conversion function?

In fact, if the input is not UTF-8, the current way is almost certainly breaking things, as mb_detect_encoding by default only answers with either UTF-8 or ISO 8859-1; documents in other encodings (like Shift_JIS, which is still somewhat common in Japan) will parse out with lots of garbage. As-is, this PR would break UTF-8 documents with no encoding declaration (or BOM) whatsoever.

As far as the implementation itself, I think it is some really nice work! The addition of scanning the <meta> tag is close to how browsers would handle it. I wonder if it is also worth keeping watch for an encoding in Content-Type HTTP headers, if the user is letting the mf2 parser do the fetching.

Thanks! Yes, checking the HTTP header is definitely a good idea, but it's not trivial to actually do correctly. I wrote an implementation of a header parser myself, but using it would bump php-mf2's platform requirement to PHP 7.1.

Those different inputs are also mentioned by the HTML encoding sniffing algorithm. And this is also where I feel we might be heading out on thin ice. I would not want to have to maintain encoding sniffing in the mf2 parser library. Or if it does end up in here, it should have very strict unit tests on its own. (Maybe there is an existing implementation we can pull in instead?)

There is! I wrote one! Again, though, it requires PHP 7.1, and it's a full parser, which is pretty slow. You can use the character-detection functions manually, though. The encoding-detection portion has 202 tests with full coverage (the parser as a whole has over 18,000 tests).

I wrote this patch long before my parser was finished, and I know that you have good reasons to continue supporting ancient PHP, otherwise I probably wouldn't have gone to the trouble of finding a good-enough alternate solution. :)

@barnabywalters
Collaborator

barnabywalters commented Feb 25, 2022

Hi @JKingweb, I finally took the time to look at this PR in more detail, getting slightly distracted by checking out your HTML5 parser along the way — looks like very nice work!

Some background about the heavy-handed usage of mbstring: php-mf2 was my first ever open source project, started in my mid teenage years. From what I recall, converting the entire document to ASCII HTML entities was the best solution to DOMDocument’s poor UTF-8 support which I could find at the time — it didn’t occur to me to try adding a UTF-8 BOM (a solution I’ve regrettably become very familiar with since then, due to having to create Unicode CSV files that work with SPSS on Windows — but that’s another story). It worked well enough that I never thought to look for new solutions.

I must admit that I’m still a little confused about the state of DOMDocument’s character encoding detection support and the apparent need to implement our own meta charset detection, given that it does in fact appear to support <meta charset="utf-8"> and will even complain about invalid charsets.

This seems to contradict your assertion that “PHP DOM's character encoding detection [...] does not support <meta charset>” — please let me know if I’m missing something obvious here!

Assuming I’m not missing anything, my conclusion would be the following:

  • PHP DOMDocument handles documents containing a UTF-8 BOM and/or <meta charset> just fine by itself.
  • UTF-8 difficulties arise in any of the many cases where a string is encoded as UTF-8 but has neither BOM nor <meta charset>, such as:
    • Document fragments parsed for testing
    • Documents where the content encoding is only communicated via the charset parameter of the Content-Type header, which is currently either unavailable to php-mf2, or ignored.
    • Documents which lack any indication of encoding and are assumed to be UTF-8

The first case is easy enough to solve internally, as we can simply add the UTF-8 BOM to any HTML fragments used in testing. The others are harder. I have a few ideas about potential solutions, but would like to hear back from you and the other maintainers before trying to come up with anything in more detail.
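
The first case reduces to a one-liner, as the thread has already established that DOMDocument honours a UTF-8 BOM (a sketch):

<?php
// Prefix a UTF-8 test fragment with a BOM so it decodes correctly despite
// having neither a BOM of its own nor a <meta charset> declaration.
$fragment = "<p>caf\xC3\xA9</p>";
$doc = new DOMDocument;
@$doc->loadHTML("\xEF\xBB\xBF".$fragment);
echo $doc->getElementsByTagName('p')->item(0)->textContent; // café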

@JKingweb
Contributor Author

JKingweb commented Apr 4, 2023

I seem to have missed your reply, @barnabywalters, which is quite unfortunate. Apologies about that.

I must admit that I’m still a little confused about the state of DOMDocument’s character encoding detection support and the apparent need to implement our own meta charset detection, given that it does in fact appear to support <meta charset="utf-8"> and will even complain about invalid charsets.

I too am confused. I had run a battery of tests and had confirmed the situation at the time, but maybe it was simply a matter of my system being misconfigured, as I cannot reproduce it now either. I had been running PHP on Windows, which is a somewhat unusual arrangement to begin with, and maybe I still had an ancient libxml hanging around which PHP decided to use rather than its own newer version. In any case, it seems not to be a problem after all.

The first case is easy enough to solve internally, as we can simply add the UTF-8 BOM to any HTML fragments used in testing. The others are harder. I have a few ideas about potential solutions, but would like to hear back from you and the other maintainers before trying to come up with anything in more detail.

The HTTP header case is relatively simple: add a <meta charset=whatever> to the top of the document. The only complication is that this should be done after any DOCTYPE, in case the libxml parser has any quirks-mode handling that might be relevant. Of course you also have to parse the Content-Type header field, and the specification is somewhat exacting about this, but as we're only interested in charset parameters, and any valid value there is ASCII, there are a few shortcuts which can be taken.
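
A rough sketch of that shortcut (injectCharset is a hypothetical helper, and the regexes are simplified relative to the header grammar):

<?php
// Pull the charset out of a Content-Type value and inject a <meta charset>
// immediately after any DOCTYPE, before handing the document to DOMDocument.
function injectCharset($html, $contentType) {
    if (!preg_match('/;\s*charset\s*=\s*"?([A-Za-z0-9._-]+)"?/i', $contentType, $m)) {
        return $html; // no charset parameter; leave the document alone
    }
    $meta = "<meta charset='".$m[1]."'>";
    if (preg_match('/^\s*<!DOCTYPE[^>]*>/i', $html, $d)) {
        return $d[0].$meta.substr($html, strlen($d[0]));
    }
    return $meta.$html;
}

echo injectCharset("<!DOCTYPE html><p>caf\xE9</p>", "text/html;charset=iso8859-1");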

The last case is trickier, though probably vanishingly rare. Nevertheless, one could do the equivalent of mb_detect_encoding for UTF-8 by using preg_match to look for non-ASCII bytes which don't match the UTF-8 byte pattern, and, if the number of matches is low, conclude that the document is UTF-8.
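
A low-tech variant of that check falls out of PCRE itself (a sketch; it is stricter than the low-match-count tolerance described above, which would need an explicit byte pattern):

<?php
// preg_match() with the "u" modifier fails outright on input that is not
// well-formed UTF-8, so an empty pattern doubles as a validity test.
function looksLikeUtf8($bytes) {
    return preg_match('//u', $bytes) === 1;
}
var_dump(looksLikeUtf8("caf\xC3\xA9")); // bool(true): valid UTF-8
var_dump(looksLikeUtf8("caf\xE9"));     // bool(false): bare 0xE9 is ISO 8859-1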

@JKingweb
Contributor Author

JKingweb commented Apr 5, 2023

I ran some fresh tests on what is recognized by libxml in PHP versions across time as an encoding declaration, using this test program:

<?php
$uni = "\xC3\xA9";              // U+00E9 LATIN SMALL LETTER E WITH ACUTE in UTF-8
$iso = "\xE9";                  // U+00E9 LATIN SMALL LETTER E WITH ACUTE in ISO 8859-1
$bom = "\xEF\xBB\xBF";          // UTF-8 byte order mark
$mojibake = "\xC3\x83\xC2\xA9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE in UTF-8 decoded as ISO 8859-1 and re-encoded into UTF-8

function mkUtf16be($s) {
    return preg_replace("/[\x{01}-\x{7F}]/s", "\x00$0", $s);
}
function mkUtf16le($s) {
    return preg_replace("/[\x{01}-\x{7F}]/s", "$0\x00", $s);
}

// sample documents with various forms of encoding detection
$input = array(
    array($iso, $uni),
    array($uni, $mojibake),
    array($bom.$uni, $uni),
    array("<meta charset='utf-8'>$uni", $uni),
    array("<meta charset='iso8859-1'>$iso", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=utf-8'>$uni", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=iso8859-1'>$iso", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=utf-8'><meta charset='iso8859-1'>$uni", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=iso8859-1'><meta charset='utf-8'>$iso", $uni),
    array("<?xml version='1.0' encoding='UTF-8'?>$iso", $uni),
    array("<?xml version='1.0' encoding='iso8859-1'?>$iso", $uni),
    array("<?xml version='1.0' encoding='iso8859-1'?><meta charset='iso8859-1'>$iso", $uni),
    array("<?xml version='1.0' encoding='iso8859-1'?><meta http-equiv='content-type' content='text/html;charset=iso8859-1'>$iso", $uni),
    array("$bom<meta http-equiv='content-type' content='text/html;charset=iso8859-1'>$uni", $mojibake),
    array(mkUtf16be("<meta charset='utf-16be'>\x00$iso"), ""),
    array(mkUtf16le("<meta charset='utf-16le'>$iso\x00"), ""),
    array(mkUtf16be("<meta charset='utf-16'>$iso\x00"), ""),
    array(mkUtf16be("\xFE\xFF<meta charset='utf-16be'>\x00$iso"), $uni),
    array(mkUtf16le("\xFF\xFE<meta charset='utf-16le'>$iso\x00"), $uni),
    array(mkUtf16be("\xFE\xFF\x00$iso"), $uni),
    array(mkUtf16le("\xFF\xFE$iso\x00"), $uni),
    array(mkUtf16be("\xFE\xFF<meta charset='utf-8'>\x00$iso"), $uni),
    array(mkUtf16le("\xFF\xFE<meta charset='utf-8'>$iso\x00"), $uni),
);
echo "PHP ".PHP_VERSION." libxml ".LIBXML_DOTTED_VERSION."\n";
foreach ($input as $k => $test) {
    list($in, $exp) = $test;
    $k = str_pad((string) ++$k, strlen((string) sizeof($input)), " ", STR_PAD_LEFT);
    $d = new DOMDocument;
    @$d->loadHTML($in);
    if (($exp === "" && !$d->documentElement) || ($d->documentElement && $d->documentElement->textContent === $exp)) {
        echo "$k. PASS\n";
    } elseif ($d->documentElement) {
        echo "$k. FAIL ".bin2hex((string) $d->documentElement->textContent)."\n";
    } else {
        echo "$k. FAIL\n";
    }
}

I used Windows versions of PHP as these are known to be bundled with contemporaneous versions of libxml. My findings were thus:

  • <meta charset="..."> became supported with libxml 2.8.0; the first version of PHP where this works is PHP 5.5, bundled with libxml 2.9.1
  • <?xml version="1.0" encoding="..."?> erroneously signals UTF-8 encoding prior to libxml 2.9.11, regardless of which encoding is actually listed in the declaration. This was obviously a bug. PHP 8.1 was the first version to ship with the fix as it bundled libxml 2.9.12
  • A UTF-8 BOM will be ignored if a valid meta element is in the document
  • A buggy XML declaration will also be ignored if a valid meta element is present
  • New-style meta and old-style meta have equal weight
  • Handling of UTF-16 BOMs is buggy prior to PHP 8.0, evidently due to a PHP bug rather than a libxml bug
  • The only completely reliable way to signal encoding is an old-style meta tag, but a new-style one is sufficient for the new target of PHP 5.6

This is the result of extensive testing against PHP 5.6 through
PHP 8.2 on Linux and Windows to determine how encoding detection
actually works in DOMDocument
As UTF-8 is now heuristically detected, no explicit encoding declaration
is required
Neither is likely to appear in a real HTML document; x-user-defined is
not actually supported properly, but it's unlikely to be used for HTML
@JKingweb
Contributor Author

JKingweb commented Apr 8, 2023

And now, a third attempt. The patch as it now stands is much more complex, but it has the following advantages:

  1. Encoding from Content-Type can be wired up easily (modifying the API to do so should probably be a separate patch, however)
  2. Encoding detection is otherwise done as a browser would do it
  3. If no encoding information is included at all, UTF-8 encoding is still inferred from the byte pattern of the input string, just as mb_detect_encoding would do
  4. We avoid the situation where mb_detect_encoding detects something other than UTF-8 or ISO 8859-1 as ISO 8859-1, thereby corrupting the text
  5. The current approach is based on extensive testing of what PHP 5.6 through 8.2 does on both Linux and Windows: in addition to the test script in my comment of three days ago, another test script was used to figure out which encoding labels DOMDocument understands, rather than just blindly using the standard names
  6. The logic for parsing Content-Type headers is a slight modification of existing code which had been rigorously tested

Though it is more than 500 added lines, much of that is either static data or detailed comments. All existing tests pass without modification, and if there is interest in merging this patch I will of course write as many tests as needed to cover the new logic.

All new lines are covered, and an effort was made to cover as many
permutations of PCRE matches as practical