Remove dependence on mbstring extension #221
base: main
The classname prefixes are always ASCII, so mb_ functions are not needed
It's unclear what problems the function was originally meant to solve; it is equally unclear whether those problems still exist
Note this would address issue #136. |
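To illustrate the first point, here is a minimal sketch of stripping an ASCII prefix with byte-oriented functions only. The helper name `stripAsciiPrefix` is hypothetical and is not php-mf2's actual code; it only shows why `mb_` functions are unnecessary when the prefix itself is pure ASCII.

```php
<?php
// Hypothetical helper (not the library's real implementation): removes a
// known ASCII prefix such as "p-" or "u-" from a class name.
function stripAsciiPrefix($class, $prefix)
{
    $len = strlen($prefix);
    // strncmp() compares raw bytes. That is safe here because ASCII bytes
    // never occur inside a multi-byte UTF-8 sequence, so a byte match on an
    // ASCII prefix is always a character match as well.
    if (strncmp($class, $prefix, $len) === 0) {
        // The cast guards against substr() returning false on old PHP
        // versions when the class name consists of the prefix alone.
        return (string) substr($class, $len);
    }
    return $class;
}
```

Because the cut always falls immediately after an ASCII byte, the remainder of the string is left intact even if it were to contain multi-byte characters.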
To address 3: yes, I believe this is still a problem even in the latest PHP stable. This is because it is not a PHP problem but a libxml problem. By today's standard, using libxml for HTML parsing really is not cutting it. But it is the only somewhat-globally available parser PHP has ever had. There is a little more information about it here on StackOverflow which also mentions the solution php-mf2 is using. (This is just one of many things you need to think about when doing HTML parsing in PHP.) I feel like we really need to add tests where we parse non-UTF-8 HTML input, to see whether the removal of the |
Thanks for clarifying. Sounds to me the answer is to handle character encoding properly, detecting encoding either from the HTTP header or pre-scan, then either prepending an XML declaration or modifying any existing one. I did write an implementation of the HTML charset pre-scan algorithm last year for my own HTML parser. It shouldn't be that hard to backport to PHP 5.6, but it would add a lot of code (about 500 lines, with comments). |
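As a rough sketch of the "prepend an XML declaration or modify any existing one" step, assuming the encoding label has already been determined elsewhere: the function name `declareEncoding` and the label whitelist are illustrative assumptions, and whether a given libxml version honors the declaration is exactly what would need testing.

```php
<?php
// Hypothetical helper: make the document start with an XML declaration
// naming $charset, replacing a leading declaration if one is present.
function declareEncoding($html, $charset)
{
    // Accept only a conservative set of characters in the label so that
    // nothing unexpected is injected into the markup (an assumption of
    // this sketch, not a rule taken from any specification).
    if (!preg_match('/^[A-Za-z0-9._:-]+$/', $charset)) {
        return $html;
    }
    $decl = '<?xml version="1.0" encoding="' . $charset . '"?>';
    $existing = '/^<\?xml\b[^>]*>/';
    if (preg_match($existing, $html)) {
        // Swap the existing leading declaration for the new one.
        return preg_replace($existing, $decl, $html, 1);
    }
    return $decl . $html;
}
```

The result would then be passed to `DOMDocument::loadHTML()` in place of the raw input.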
That might work. This also may already be handled if you use the userland HTML5 parser we recommend. I just think we do not have any tests for it. At some point we will have to do the value calculation as well whether it is worth putting character encoding checking in our code (and having to maintain that) or if it is better to make something like the Masterminds HTML5 parser mandatory as a dependency. I'd love to hear the input from the other maintainers here. But my instinct is to err on the side of "if it ain't broke, don't fix it". It would be really nice if we could get rid of Let's keep this PR open to keep discussing what our alternatives are though! |
In an effort to better understand the problem, I ran
Given all that (especially the third point) I am forced to agree that the patch under discussion would be insufficient. However, the current arrangement isn't great, either. It has its own set of problems:
I suggest, therefore, that in the interim commit 6fb321e essentially be reverted, and that a detection order (only UTF-8?) be specified for consistency, as well as using strict mode. |
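A minimal sketch of what "a detection order (only UTF-8?)" with strict mode could look like. The function name `isUtf8` is hypothetical, and the PCRE fallback for systems without mbstring is an addition of this sketch rather than something proposed in the thread.

```php
<?php
// Hypothetical helper: reports whether $bytes is valid UTF-8.
function isUtf8($bytes)
{
    if (function_exists('mb_detect_encoding')) {
        // An explicit single-entry detection order plus strict mode, so the
        // result does not depend on the mbstring.detect_order ini setting.
        return mb_detect_encoding($bytes, array('UTF-8'), true) === 'UTF-8';
    }
    // Fallback without mbstring: the /u modifier makes PCRE validate the
    // subject as UTF-8, and preg_match() fails on invalid input.
    return preg_match('//u', $bytes) === 1;
}
```

With only UTF-8 in the list, the check answers "is this UTF-8 or not" and never guesses among legacy encodings.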
I've now taken a different tack to fixing this, looking for |
Any comment on this, @Zegnat? |
I think I am still with my previous comment and would like someone else to weigh in besides me:
I still do not know why we were forcing everything through a Having a hard time evaluating the effects of this PR. As far as the implementation itself, I think it is some really nice work! The addition of scanning the Those different inputs are also mentioned by the HTML encoding sniffing algorithm. And this is also where I feel we might be heading out on thin ice. I would not want to have to maintain encoding sniffing in the mf2 parser library. Or if it does end up in here, it should have very strict unit tests on its own. (Maybe there is an existing implementation we can pull in instead?) I am of many minds when it comes to this PR ... |
The intent was obviously to bypass PHP DOM's character encoding detection, which does not support
Correct. If <
In fact if the input is not UTF-8 the current way is almost certainly breaking things, as
Thanks! Yes, checking the HTTP header is definitely a good idea, but it's not trivial to actually do correctly. I wrote an implementation of a header parser myself, but using it would bump php-mf2's platform requirement to PHP 7.1.
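To give a sense of the simplest possible approach, and of why it falls short of a correct header parser, here is a deliberately naive sketch that runs on old PHP versions. The function name `charsetFromContentType` is hypothetical; it ignores quoted-pair escapes, duplicate parameters, and other details a real parser has to handle.

```php
<?php
// Hypothetical, simplified helper: extracts the charset parameter from a
// Content-Type header value, or returns null when none is found.
function charsetFromContentType($header)
{
    // The parameter name is case-insensitive; the value may be a bare
    // token or a double-quoted string.
    $pattern = '/;\s*charset\s*=\s*(?:"([^"]*)"|([^\s;"]+))/i';
    if (preg_match($pattern, $header, $m)) {
        // Group 2 holds a bare token, group 1 a quoted value.
        $value = (isset($m[2]) && $m[2] !== '') ? $m[2] : $m[1];
        return strtolower(trim($value));
    }
    return null;
}
```

The returned label would still need to be checked against the encodings the parser actually supports before being trusted.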
There is! I wrote one! Again, though, it requires PHP 7.1, and it's a full parser, which is pretty slow. You can use the character detection functions manually, though. The encoding detection portion has 202 tests with full coverage (the parser as whole has over 18,000 tests). I wrote this patch long before my parser was finished, and I know that you have good reasons to continue supporting ancient PHP, otherwise I probably wouldn't have gone to the trouble of finding a good-enough alternate solution. :) |
Hi @JKingweb, I finally took the time to look at this PR in more detail, getting slightly distracted by checking out your HTML5 parser along the way, which looks like very nice work! Some background about the heavy-handed usage of mbstring: php-mf2 was my first ever open source project, started in my mid teenage years. From what I recall, converting the entire document to ASCII HTML entities was the best solution to DOMDocument’s poor UTF-8 support which I could find at the time; it didn’t occur to me to try adding a UTF-8 BOM (a solution with which I’ve regrettably become very familiar since then, due to having to create Unicode CSV files which work with SPSS on Windows, but that’s another story). It worked well enough that I never thought to look for new solutions. I must admit that I’m still a little confused about the state of DOMDocument’s character encoding detection support and the apparent need to implement our own meta charset detection, given that it does in fact appear to support
This seems to contradict your assertion that “PHP DOM's character encoding detection, [•••] does not support ” — please let me know if I’m missing something obvious here! Assuming I’m not missing anything, my conclusion would be the following:
The first case is easy enough to solve internally, as we can simply add the UTF-8 BOM to any HTML fragments used in testing. The others are harder. I have a few ideas about potential solutions, but would like to hear back from you and the other maintainers before trying to come up with anything in more detail. |
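Adding the BOM to test fragments could be as small as the following sketch. The helper name `withUtf8Bom` is hypothetical and assumes the fragment is already UTF-8.

```php
<?php
// Hypothetical test helper: prefixes a UTF-8 HTML fragment with a byte
// order mark so the parser has an explicit encoding signal.
function withUtf8Bom($html)
{
    $bom = "\xEF\xBB\xBF"; // U+FEFF encoded as UTF-8
    // Leave the input alone if it already starts with a BOM.
    if (strncmp($html, $bom, 3) === 0) {
        return $html;
    }
    return $bom . $html;
}
```

A test would then call something like `$doc->loadHTML(withUtf8Bom($fragment))` rather than passing the bare fragment.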
I seem to have missed your reply, @barnabywalters, which is quite unfortunate. Apologies about that.
I too am confused. I had run a battery of tests and had confirmed the situation at the time, but maybe it was simply a matter of my system being misconfigured, as I cannot reproduce it now, either. I had been running PHP in Windows, which is a somewhat unusual arrangement to begin with, and maybe I still had an ancient libxml hanging around which PHP decided to use rather than its own newer version. In any case, it seems not to be a problem after all.
The HTTP header case is relatively simple: add a The last case is trickier, though probably vanishingly rare. Nevertheless, one could do the equivalent of |
I ran some fresh tests on what is recognized by libxml in PHP versions across time as an encoding declaration, using this test program:
<?php
$uni = "\xC3\xA9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE in UTF-8
$iso = "\xE9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE in ISO 8859-1
$bom = "\xEF\xBB\xBF"; // UTF-8 byte order mark
$mojibake = "\xC3\x83\xC2\xA9"; // U+00E9 LATIN SMALL LETTER E WITH ACUTE in UTF-8 decoded as ISO 8859-1 and re-encoded into UTF-8
function mkUtf16be($s) {
    return preg_replace("/[\x{01}-\x{7F}]/s", "\x00$0", $s);
}
function mkUtf16le($s) {
    return preg_replace("/[\x{01}-\x{7F}]/s", "$0\x00", $s);
}
// sample documents with various forms of encoding detection
$input = array(
    array($iso, $uni),
    array($uni, $mojibake),
    array($bom.$uni, $uni),
    array("<meta charset='utf-8'>$uni", $uni),
    array("<meta charset='iso8859-1'>$iso", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=utf-8'>$uni", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=iso8859-1'>$iso", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=utf-8'><meta charset='iso8859-1'>$uni", $uni),
    array("<meta http-equiv='content-type' content='text/html;charset=iso8859-1'><meta charset='utf-8'>$iso", $uni),
    array("<?xml version='1.0' encoding='UTF-8'?>$iso", $uni),
    array("<?xml version='1.0' encoding='iso8859-1'?>$iso", $uni),
    array("<?xml version='1.0' encoding='iso8859-1'?><meta charset='iso8859-1'>$iso", $uni),
    array("<?xml version='1.0' encoding='iso8859-1'?><meta http-equiv='content-type' content='text/html;charset=iso8859-1'>$iso", $uni),
    array("$bom<meta http-equiv='content-type' content='text/html;charset=iso8859-1'>$uni", $mojibake),
    array(mkUtf16be("<meta charset='utf-16be'>\x00$iso"), ""),
    array(mkUtf16le("<meta charset='utf-16le'>$iso\x00"), ""),
    array(mkUtf16be("<meta charset='utf-16'>$iso\x00"), ""),
    array(mkUtf16be("\xFE\xFF<meta charset='utf-16be'>\x00$iso"), $uni),
    array(mkUtf16le("\xFF\xFE<meta charset='utf-16le'>$iso\x00"), $uni),
    array(mkUtf16be("\xFE\xFF\x00$iso"), $uni),
    array(mkUtf16le("\xFF\xFE$iso\x00"), $uni),
    array(mkUtf16be("\xFE\xFF<meta charset='utf-8'>\x00$iso"), $uni),
    array(mkUtf16le("\xFF\xFE<meta charset='utf-8'>$iso\x00"), $uni),
);
echo "PHP ".PHP_VERSION." libxml ".LIBXML_DOTTED_VERSION."\n";
foreach ($input as $k => $test) {
    list($in, $exp) = $test;
    $k = str_pad((string) ++$k, strlen((string) sizeof($input)), " ", STR_PAD_LEFT);
    $d = new DOMDocument;
    @$d->loadHTML($in);
    if (($exp === "" && !$d->documentElement) || $d->documentElement->textContent === $exp) {
        echo "$k. PASS\n";
    } elseif ($d->documentElement) {
        echo "$k. FAIL ".bin2hex((string) $d->documentElement->textContent)."\n";
    } else {
        echo "$k. FAIL\n";
    }
}
I used Windows versions of PHP as these are known to be bundled with contemporaneous versions of libxml. My findings were thus:
|
This is the result of extensive testing against PHP 5.6 through PHP 8.2 in Linux and Windows to determine how encoding detection actually works in DOMDocument
As UTF-8 is now heuristically detected, no explicit encoding declaration is required
Neither is likely to appear in a real HTML document; x-user-defined is not actually supported properly, but it's unlikely to be used for HTML
And now, a third attempt. The patch as it now stands is much more complex, but it has the following advantages:
Though it is more than 500 added lines, much of that is either static data or detailed comments. All existing tests pass without modification, and if there is interest in merging this patch I will of course write as many tests as needed to cover the new logic. |
All new lines are covered, and an effort was made to cover as many permutations of PCRE matches as practical
This patch removes reliance on the mbstring extension in three places:
- The unicodeTrim function used mbstring to generate a UTF-8 encoding of a U+00A0 NO-BREAK SPACE character. This has been replaced by a sequence of byte escapes instead. The trimming has also been corrected not to substitute no-break spaces in the middle of the string.
- The nestedMfPropertyNamesFromClass function used mbstring to remove a prefix from the start of a string. All prefixes to remove are ASCII, however, making the use of mbstring unnecessary. The calls to mbstring functions have simply been replaced with their standard PHP equivalents.
- The unicodeToHtmlEntities function uses mbstring to work around decoding problems in the DOMDocument class, according to commit d1c70ad. The commit in question does not elaborate on what these decoding problems are, nor is this functionality tested in the test suite. It is unclear if these problems even exist in modern PHP versions. I have elected to make the conversion optional, using mbstring if available.
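The byte-escape approach described for unicodeTrim can be sketched as follows. This is an illustration under assumptions, not the patch's actual code: the function name `trimWithNbsp` and the exact set of ASCII whitespace bytes (taken from the defaults of PHP's `trim()`) are choices made for this example.

```php
<?php
// Hypothetical sketch: trims ASCII whitespace and U+00A0 NO-BREAK SPACE
// from both ends of a UTF-8 string without using mbstring.
function trimWithNbsp($str)
{
    // \xC2\xA0 is the UTF-8 byte sequence for U+00A0. Since 0xC2 is a lead
    // byte and never a continuation byte, this pair cannot match part of
    // another character in valid UTF-8.
    $ws = '(?:[ \t\n\r\x00\x0B]|\xC2\xA0)+';
    // Anchoring at \A and \z restricts the match to the ends of the string,
    // so no-break spaces in the middle are left untouched.
    return preg_replace('/\A' . $ws . '|' . $ws . '\z/', '', $str);
}
```

Anchoring to the ends is the point of the correction described in the first item: only leading and trailing runs are removed.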