Incorrect normalization of character sequence "%EF%BD%9E" #182

Closed
jaimeiniesta opened this issue Dec 23, 2014 · 4 comments

@jaimeiniesta

Hello, we've found an issue in MetaInspector when trying to normalize a Japanese URL. Here's how to reproduce it on Ruby 2.1.2 and Addressable 2.3.6:

require 'open-uri'
require 'addressable/uri'

url            = 'http://ja.wikipedia.org/wiki/Template:%EF%BD%9E'
normalized_url = Addressable::URI.parse(url).normalize.to_s

puts open(url).status            #=> 200
puts open(normalized_url).status #=> 404

This URL is being normalized to "http://ja.wikipedia.org/wiki/Template:~", which looks like the original URL but is not the same:

[Image: screenshot 2014-12-23 at 22 24 26]

This example screenshot is what you get when opening the URL in a browser (Chrome in my case):

Template:~ - Wikipedia

I'm not sure what it should be normalized to, but it looks like this character sequence should remain untouched, instead of being converted to "~".

@jaimeiniesta
Author

Related:

#160

@sporkmonger
Owner

I believe that Wikipedia is the one that's wrong here. URI normalization requires Unicode normalization form NFKC, which tries to eliminate look-alike characters like this one as an explicit goal. I'm sure you're aware of the danger of phishing attacks and that's the primary concern involved in the choice of which normalization form to use.
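For anyone following along, the NFKC folding described above can be checked with Ruby's standard library alone. This is only a minimal sketch, not Addressable's internal code path: it assumes Ruby 2.2 or later (where `String#unicode_normalize` is available, so it will not run on the 2.1.2 from the report), and the variable names are just for illustration.

```ruby
require 'uri'

# Decode the percent-encoded bytes from the issue's URL into a UTF-8 string.
decoded = URI.decode_www_form_component('%EF%BD%9E')

# NFKC folds compatibility characters into their plain equivalents.
# String#unicode_normalize needs Ruby 2.2 or later.
normalized = decoded.unicode_normalize(:nfkc)

puts format('U+%04X', decoded.ord)     # U+FF5E (FULLWIDTH TILDE)
puts format('U+%04X', normalized.ord)  # U+007E (TILDE)
puts decoded == normalized             # false
```

The two strings render almost identically but are different code points, which is why the normalized URL points at a different Wikipedia page.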

When this gets normalized to '~', what downstream effect is that having for you and what are you trying to achieve with the normalization call? Sometimes path normalization is not appropriate, in which case you can use Addressable to normalize on a component-by-component basis.
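To illustrate the component-by-component idea, here is a rough sketch that normalizes only the scheme and host and leaves the path bytes alone. It deliberately uses the standard library's `URI` rather than Addressable, and `normalize_except_path` is a hypothetical helper name, not part of either library. With Addressable itself, the per-component readers such as `normalized_scheme` and `normalized_host` would be the place to start.

```ruby
require 'uri'

# Hypothetical helper (not part of Addressable or the stdlib): normalize
# only the scheme and host, and keep the path exactly as it was given.
def normalize_except_path(url)
  uri = URI.parse(url)
  uri.scheme = uri.scheme.downcase
  uri.host   = uri.host.downcase
  uri.to_s
end

puts normalize_except_path('HTTP://JA.Wikipedia.org/wiki/Template:%EF%BD%9E')
# prints http://ja.wikipedia.org/wiki/Template:%EF%BD%9E
```

The percent-encoded sequence survives untouched, so the request still reaches the page that the original URL named.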

@jaimeiniesta
Copy link
Author

Thanks for the clarification @sporkmonger -- maybe you can have a look at this @hokaccha

@hokaccha

hokaccha commented Feb 5, 2015

Thanks for the clarification!
