Skip to content
This repository has been archived by the owner on Jan 29, 2020. It is now read-only.

Uri implementation does not support the utf8 uri's #155

Closed
zstate opened this issue Mar 11, 2016 · 9 comments
Closed

Uri implementation does not support the utf8 uri's #155

zstate opened this issue Mar 11, 2016 · 9 comments

Comments

@zstate
Copy link

zstate commented Mar 11, 2016

Uri implementation does not support utf8 uri's because of PHP parse_url function (see https://secure.php.net/manual/en/function.parse-url.php#114817)

@zf2timo
Copy link

zf2timo commented Mar 11, 2016

Can you create a Unit Test which demonstrates the issue?

@zstate
Copy link
Author

zstate commented Mar 11, 2016

class UriTest extends PHPUnit_Framework_TestCase
{
    /**
     * This test fails, but it should not
     */
    public function testUtf8Uri()
    {
        $uri = new \Zend\Diactoros\Uri('http://ουτοπία.δπθ.gr/');

        $this->assertEquals('ουτοπία.δπθ.gr' , $uri->getHost());
    }
}

@zf2timo
Copy link

zf2timo commented Mar 12, 2016

I added your provided Test locally, and the Test did not failed.
Does it fail on your local System?

And please also created a PR with this Unit Test, so we can see, the result on Travis.

@zstate
Copy link
Author

zstate commented Mar 14, 2016

Yeah looks like it is my environment which is Mac OS X EL Capitan 10.11.2 (15C50) with PHP 5.5.30 (cli) (built: Oct 23 2015 17:21:45).

When I run the tests I'm getting this output:

There was 1 failure:

  1. ZendTest\Diactoros\UriTest::testUtf8Uri
    Failed asserting that two strings are equal.
    --- Expected
    +++ Actual
    @@ @@
    -'ουτοπία.δπθ.gr'
    +'ο?_?_ο?_ία.δ?_θ.gr'

Not sure what's going on and what I need to configure to make it work. Running tests on CentOS works fine and the travis build successful.

@zf2timo
Copy link

zf2timo commented Mar 14, 2016

I think this issue can be closed for now.
As seen in PR #159 the Travis Build is successful.

@Pudge601
Copy link

I've run into this issue myself when trying to run the unit tests on my machine, and it appears to be due to locale settings, specifically the LC_CTYPE (docs) locale setting.

This locale setting is automatically determined based on the LANG environment variable, which in my case was set to en_GB.UTF-8, and the output of the unit test with this environment was;

1) ZendTest\Diactoros\UriTest::testUtf8Uri
Failed asserting that two strings are identical.
--- Expected
+++ Actual
@@ @@
-'ουτοπία.δπθ.gr'
+'��_�_��_��.��_�.gr'

I'm really just posting this here for future reference in case anybody else hits this issue, I don't think there's anything that can be changed in this library since the problem happens in parse_url.

However, this behaviour does lead to fatal errors for Uri::filterPath and Uri::filterQueryOrFragment, due to not handling the cases where preg_replace_callback can return null on error (these should ideally be resolved to some sort of exception rather than trying to return null from a method which is has a string return type).

@michalbundyra
Copy link
Member

@Pudge601 what if we check:

$url === unparse_url(parse_url($url))

Do you think it's gonna help? Are you saying that "invalid characters" are returned from parse_url?
(of course there is no function unparse_url in PHP, but we can add private one in the class, and use it).

@Pudge601
Copy link

It wouldn't be possible to write a unparse_url function; once the URL has gone through parse_url, there is information loss where the output is messed up.

Quite simply, parse_url is not multibyte safe, and was never designed to be, which seems to be why this bug from 2010 has still not been addressed. As mentioned in the comments on that bug, a nicer behaviour would be for parse_url to "treat all extended characters ... as opaque characters and copy them as-is without modification", but I don't see that happening any time soon.

I had intended for my comment to just be informative for anyone else who found this issue rather than suggesting that anything needed to be done in this library, but I guess if this library is purporting to support UTF-8, then it could be considered broken at the moment in the sense that it doesn't work in differing locales.

There are a couple of options I can think of to try and support UTF8 URIs in this library;

  • Use a wrapper for parse_url which URL encodes/decodes extended characters (e.g. in this comment from the manual page)
  • Manually parse the URL in PHP without parse_url

Or alternatively this library could simply reject URI's with extended characters which aren't properly encoded, as they are technically invalid according to RFC 3986.

Whichever way, it might be worth adding something to the Travis config to run the tests under different locale environments to show how this is reproduced in the unit tests (though this might be a pain, since the available values for the locale are platform dependent).

@weierophinney
Copy link
Member

@Pudge601 Could you open a new issue and/or PR, using the text from your previous comment as background, please?

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants