Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conditionally remove html[xmlns], convert html[xml:lang] to html[lang] #4194

Merged
merged 6 commits into from
Jan 30, 2020
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 30 additions & 0 deletions src/Dom/Document.php
Original file line number Diff line number Diff line change
Expand Up @@ -340,6 +340,8 @@ public function loadHTML( $source, $options = 0 ) {
libxml_use_internal_errors( $libxml_previous_state );

if ( $success ) {
$this->normalize_html_attributes();

// Remove http-equiv charset again.
$meta = $this->head->firstChild;
if (
Expand Down Expand Up @@ -1021,6 +1023,34 @@ private function restore_doctype_node( $html ) {
return preg_replace( self::HTML_RESTORE_DOCTYPE_PATTERN, '\1!\3\4>', $html, 1 );
}

/**
* Normalizes HTML attributes to be HTML5 compatible.
*
* Conditionally removes html[xmlns], and converts html[xml:lang] to html[lang].
*/
private function normalize_html_attributes() {
$html = $this->getElementsByTagName( 'html' )->item( 0 );
westonruter marked this conversation as resolved.
Show resolved Hide resolved
if ( ! $html->hasAttributes() ) {
return;
}

$xmlns = $html->attributes->getNamedItem( 'xmlns' );
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'd probably be better to do something like $xmlns = $html->getAttribute( 'xmlns' ).

But for some reason, $html->getAttribute( 'xmlns' ) returned "", even when $html->attributes->getNamedItem( 'xmlns' ) returned a DOMAttr.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I'm not sure. I can see that $html->attributes->getNamedItem( 'xmlns' ) works across all PHP versions. I don't have a better option.

$xmlns_value_to_strip = 'http://www.w3.org/1999/xhtml';
if ( $xmlns && $xmlns_value_to_strip === $xmlns->nodeValue ) {
$html->removeAttributeNode( $xmlns );
}
westonruter marked this conversation as resolved.
Show resolved Hide resolved

$xml_lang = $html->attributes->getNamedItem( 'xml:lang' );
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered hasAttributeNS or getAttributeNS?

It seems these don't work either. I would have thought at least that this would work:

$dom->documentElement->getAttributeNodeNS( 'http://www.w3.org/XML/1998/namespace', 'lang' );

But it doesn't. Perhaps it's due to loadHTML and that the document is not XML in the first place.

Using $dom->documentElement->attributes->getNamedItem( 'xml:lang' ) works in all PHP versions 5+. So seems like a good one to use.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice, thanks for looking at this.

if ( $xml_lang ) {
$lang_node = $html->attributes->getNamedItem( 'lang' );
if ( ( ! $lang_node || ! $lang_node->nodeValue ) && $xml_lang->nodeValue ) {
// Move the html[xml:lang] value to html[lang].
$html->setAttribute( 'lang', $xml_lang->nodeValue );
kienstra marked this conversation as resolved.
Show resolved Hide resolved
}
$html->removeAttributeNode( $xml_lang );
}
}

/**
* Deduplicate a given tag.
*
Expand Down
25 changes: 25 additions & 0 deletions tests/php/test-class-amp-dom-document.php
Original file line number Diff line number Diff line change
Expand Up @@ -88,6 +88,31 @@ public function data_dom_document() {
'<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd"><html amp lang="en">' . $head . '<body class="some-class"><p>Text</p></body></html>',
'<!DOCTYPE html><html amp lang="en">' . $head . '<body class="some-class"><p>Text</p></body></html>',
],
'html_with_xmlns_and_xml_lang' => [
'utf-8',
'<!DOCTYPE html><html xmlns="http://www.w3.org/1999/xhtml" xml:lang="es">' . $head . '<body></body></html>',
'<!DOCTYPE html><html lang="es">' . $head . '<body></body></html>',
],
'html_with_xmlns_value_that_should_remain' => [
'utf-8',
'<!DOCTYPE html><html xmlns="http://www.w3.org/TR/html4/">' . $head . '<body></body></html>',
'<!DOCTYPE html><html xmlns="http://www.w3.org/TR/html4/">' . $head . '<body></body></html>',
],
'html_with_lang_and_xml_lang' => [
'utf-8',
'<!DOCTYPE html><html lang="es" xml:lang="fr">' . $head . '<body></body></html>',
'<!DOCTYPE html><html lang="es">' . $head . '<body></body></html>',
],
'html_with_empty_xml_lang' => [
'utf-8',
'<!DOCTYPE html><html xml:lang="">' . $head . '<body></body></html>',
'<!DOCTYPE html><html>' . $head . '<body></body></html>',
],
'html_with_empty_lang' => [
'utf-8',
'<!DOCTYPE html><html lang="" xml:lang="es">' . $head . '<body></body></html>',
'<!DOCTYPE html><html lang="es">' . $head . '<body></body></html>',
],
'slashes_on_closing_tags' => [
'utf-8',
'<!DOCTYPE html><html amp lang="en"><head><meta charset="utf-8" /></head><body class="some-class"><p>Text</p></body></html>',
Expand Down