output xml didn't follow spec in non ASCII character URI #346

Open
lisbethw1130 opened this issue Apr 20, 2020 · 4 comments


@lisbethw1130

lisbethw1130 commented Apr 20, 2020

As the sitemap spec says, the XML itself must be entity-escaped, which the gem already does.
But the URL should first be encoded according to the RFC 3986 standard for URIs or the RFC 3987 standard for IRIs, and only then XML entity-escaped. The sitemap generator doesn't seem to follow RFC 3986 at the moment.

add 'linkTestEntityEscape&<> and RFC3986ü中文' 
# output: <loc>https://website.test/linkTestEntityEscape&amp;&lt;&gt; and RFC3986ü中文</loc>
# should be: <loc>https://website.test/linkTestEntityEscape%26%3C%3E%20and%20RFC3986%C3%BC%E4%B8%AD%E6%96%87</loc>

add 'ü中文?aaa=bbb'
# output: <loc>https://website.test/ü中文?aaa=bbb</loc>
# should be: <loc>https://website.test/%C3%BC%E4%B8%AD%E6%96%87?aaa=bbb</loc>
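
To make the intended order concrete, here is a minimal sketch in plain Ruby: percent-encode each path segment first, then XML entity-escape the whole URL. The helper names (`percent_encode`, `xml_escape`, `loc_for`) are hypothetical and are not part of the gem's API, and the sketch assumes the caller already knows which part is the path and which is the (already encoded) query string.

```ruby
# Entities listed by the sitemap protocol for XML escaping.
XML_ENTITIES = {
  '&' => '&amp;', "'" => '&apos;', '"' => '&quot;',
  '>' => '&gt;', '<' => '&lt;'
}.freeze

# Percent-encode every character outside the RFC 3986 "unreserved" set,
# byte by byte, using the character's UTF-8 representation.
def percent_encode(text)
  text.encode('UTF-8').each_char.map do |char|
    if char.match?(/[A-Za-z0-9\-._~]/)
      char
    else
      char.bytes.map { |byte| format('%%%02X', byte) }.join
    end
  end.join
end

# XML entity escape, applied last.
def xml_escape(text)
  text.gsub(/[&'"<>]/, XML_ENTITIES)
end

# Hypothetical helper: the path is given unescaped, the query (if any)
# is assumed to be percent-encoded by the caller already.
def loc_for(host, path, query = nil)
  encoded_path = path.split('/').map { |segment| percent_encode(segment) }.join('/')
  url = "#{host}/#{encoded_path}"
  url += "?#{query}" if query
  "<loc>#{xml_escape(url)}</loc>"
end

puts loc_for('https://website.test', 'linkTestEntityEscape&<> and RFC3986ü中文')
# prints <loc>https://website.test/linkTestEntityEscape%26%3C%3E%20and%20RFC3986%C3%BC%E4%B8%AD%E6%96%87</loc>
puts loc_for('https://website.test', 'ü中文', 'aaa=bbb')
# prints <loc>https://website.test/%C3%BC%E4%B8%AD%E6%96%87?aaa=bbb</loc>
```

Escaping in this order means an `&` that separates query parameters still ends up as `&amp;` in the XML, while an `&` inside a path segment becomes `%26` and needs no further escaping.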

Can someone check whether my conclusion is right? I'm a junior programmer and not sure about it.

If everything is OK, I'll send a PR for this issue later.

Best Regards,
Lisbeth

@lisbethw1130
Author

Does anyone have any ideas?

@kjvarga
Owner

kjvarga commented May 26, 2020

Hi @lisbethw1130, I think you're right. When I wrote this gem years ago, it wasn't internationalized to handle UTF-8, which wasn't as prevalent then as it is today. It would be great if you could add that functionality, with tests :)

@lisbethw1130
Author

Here are some obstacles I ran into and am working through:

  1. URL escaping can't be done in the sitemap generator, so I wrote the tips in the README.
    E.g., we can't accurately split the query part from the path part of an unescaped URI:

https://example.com/dd?dd=?aa=vv can be https://example.com/dd%3Fdd=?aa=vv or https://example.com/dd?dd=%3Faa=vv

  2. Ruby doesn't escape single quotes the way the XML spec describes, so I opened an issue to find out what the real problem is.
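
The ambiguity in the first obstacle can be illustrated with Ruby's standard `URI` library. This is only a sketch: `path_and_query` is a hypothetical helper, not something from the gem. The two escaped candidates parse into different path/query pairs, and nothing in the unescaped string says which one the author meant.

```ruby
require 'uri'

# Hypothetical helper: return the path and query that URI.parse sees.
def path_and_query(url)
  uri = URI.parse(url)
  [uri.path, uri.query]
end

# The two possible intentions behind "https://example.com/dd?dd=?aa=vv":
p path_and_query('https://example.com/dd%3Fdd=?aa=vv')
# path is "/dd%3Fdd=", query is "aa=vv"
p path_and_query('https://example.com/dd?dd=%3Faa=vv')
# path is "/dd", query is "dd=%3Faa=vv"

# Parsing the unescaped string simply splits at the first "?",
# which silently picks the second reading.
p path_and_query('https://example.com/dd?dd=?aa=vv')
# path is "/dd", query is "dd=?aa=vv"
```

Because only the caller knows which reading is intended, the percent-encoding has to happen before the URL is handed to the generator.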

Any ideas are welcome ;)

@olleolleolle

Awesome that the change was released in Ruby, @lisbethw1130! 🚀
