This repository has been archived by the owner on Jan 29, 2020. It is now read-only.

Zend\Dom\Query and special UTF-8 characters #10

Open
GeeH opened this issue Jun 28, 2016 · 1 comment

Comments

@GeeH
Contributor

GeeH commented Jun 28, 2016

This issue has been moved from the zendframework repository as part of the bug migration program as outlined here - http://framework.zend.com/blog/2016-04-11-issue-closures.html


Original Issue: https://api.github.com/repos/zendframework/zendframework/issues/7618
User: @mtrippodi
Created On: 2015-08-26T13:51:12Z
Updated At: 2015-11-06T22:17:32Z
Body

use Zend\Dom\Query;
use Zend\Debug\Debug;

$html = '<div><h1>ßüöä</h1></div>';
$dom = new Query($html);
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue);

...will result in something like:

�üöä

$html = '<div><h1>ßüöä</h1></div>';
$dom = new Query(utf8_decode($html));
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue);

... will solve the problem and result in correct rendering.

For convenience I extended Zend\Dom\Query:

<?php

namespace MyNamespace\Dom;

use Zend\Dom\Query as ZF2Query;

class Query extends ZF2Query
{

    /**
     * Set document to query. If is UTF-8: decode.
     *
     * @param  string $document
     * @param  null|string $encoding Document encoding
     * @return Query
     */
    public function setDocument($document, $encoding = null)
    {
        if (0 === strlen($document)) {
            return $this;
        }

        $_encoding = empty($encoding) ? $this->getEncoding() : $encoding;
        if ($_encoding == 'UTF-8') {
            $document = utf8_decode($document);
        }

        return parent::setDocument($document, $encoding);
    }
}

Now I wonder whether this could be implemented in Zend\Dom\Query itself. Or am I missing something, and there's a better solution?
Thanks
m.


Comment

User: @mtrippodi
Created On: 2015-08-26T18:15:20Z
Updated At: 2015-08-26T19:17:05Z
Body
OK, forget my first "solution". It's bad because e.g. ...

$html = '<div><h1>€</h1></div>';
$dom = new Query(utf8_decode($html));
$nodes = $dom->execute('h1');
Debug::dump($nodes->current()->nodeValue); 

...will result in:

?

This is because all utf8_decode() does is convert a string from UTF-8 to ISO-8859-1. That is of course not good, because UTF-8 can represent many more characters than ISO-8859-1; any character outside ISO-8859-1, such as €, comes out as ?. See this comment at PHP Man.
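To make the loss concrete, here is a minimal sketch that round-trips two strings through ISO-8859-1. It uses mb_convert_encoding() rather than utf8_decode() (which is deprecated as of PHP 8.2), so it assumes the mbstring extension is loaded and that the script file itself is saved as UTF-8. roundTripLatin1() is a hypothetical helper name, not part of Zend\Dom.

```php
<?php
// Hypothetical helper: convert UTF-8 to ISO-8859-1 and back again.
// Characters that ISO-8859-1 cannot represent are replaced during the
// first conversion, so they cannot be recovered by the second one.
function roundTripLatin1(string $utf8): string
{
    $latin1 = mb_convert_encoding($utf8, 'ISO-8859-1', 'UTF-8');

    return mb_convert_encoding($latin1, 'UTF-8', 'ISO-8859-1');
}

$umlauts = 'ßüöä'; // all four characters exist in ISO-8859-1
$euro    = '€';    // U+20AC has no ISO-8859-1 code point

var_dump(roundTripLatin1($umlauts) === $umlauts); // bool(true)
var_dump(roundTripLatin1($euro) === $euro);       // bool(false)
```

So decoding to ISO-8859-1 only looks like a fix as long as the input stays within that character set.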

The real problem is that DOMDocument::loadHTML() by default always treats the source string as ISO-8859-1-encoded. Unfortunately, you can only change this behavior by specifying the encoding in the HTML head at the beginning of the source string. This comment at PHP Man still seems to apply, even though it is 10 years old and UTF-8 is so common nowadays!
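The effect can be checked without Zend\Dom at all. The sketch below calls DOMDocument::loadHTML() directly, once on the bare string and once with an encoding hint prepended. It assumes the dom extension is available and that the script is saved as UTF-8; h1Text() is a hypothetical helper. The result of the first call depends on the libxml2 build, so no expected output is given for it.

```php
<?php
// Hypothetical helper: parse an HTML string and return the text of the first <h1>.
function h1Text(string $html): string
{
    $doc = new DOMDocument();

    // Collect libxml warnings internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $doc->getElementsByTagName('h1')->item(0)->nodeValue;
}

$html = '<div><h1>ßüöä €</h1></div>';

// Without a hint the parser has to guess the encoding.
var_dump(h1Text($html) === 'ßüöä €');

// With an encoding hint at the very start of the string.
var_dump(h1Text('<?xml encoding="UTF-8">' . $html) === 'ßüöä €');
```

Unlike decoding to ISO-8859-1, the hint does not convert the input at all, so characters such as € are not lost.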

So, based on this comment I again extended Zend\Dom\Query as follows:

<?php

namespace MyNamespace\Dom;

use Zend\Dom\Query as ZF2Query;

class Query extends ZF2Query
{

    /**
     * Set document to query
     *
     * @param  string $document
     * @param  null|string $encoding Document encoding
     * @return Query
     */
    public function setDocument($document, $encoding = null)
    {
        if (0 === strlen($document)) {
            return $this;
        }

        $prepend = '';
        $_encoding = empty($encoding) ? $this->getEncoding() : $encoding;
        if (!empty($_encoding) && strtolower($_encoding) != 'iso-8859-1') {
            $prepend = sprintf('<?xml encoding="%s">', $_encoding);
        }

        // breaking XML declaration to make syntax highlighting work
        if ('<' . '?xml' == substr(trim($document), 0, 5)) {
            if (preg_match('/<html[^>]*xmlns="([^"]+)"[^>]*>/i', $document, $matches)) {
                $this->xpathNamespaces[] = $matches[1];
                return $this->setDocumentXhtml($prepend . $document, $encoding);
            }
            return $this->setDocumentXml($document, $encoding);
        }
        if (strstr($document, 'DTD XHTML')) {
            return $this->setDocumentXhtml($prepend . $document, $encoding);
        }
        return $this->setDocumentHtml($prepend . $document, $encoding);
    }
}

Still, two questions remain:

  • Is this the best solution?
  • Should a solution be implemented in Zend\Dom\Query?

Comment

User: @croensch
Created On: 2015-08-28T14:15:05Z
Updated At: 2015-08-28T14:15:05Z
Body
AFAIK, if no header is present, the passed encoding is used; if a header is present, the passed encoding is ignored. So if your documents are always in ISO-8859-1, why not just try setDocument() as it is?


@weierophinney
Member

This repository has been closed and moved to laminas/laminas-dom; a new issue has been opened at laminas/laminas-dom#4.
