
urllib3.util.parse_url() handling of URLs that contain Unicode in hostname #799

Closed
LeXofLeviafan opened this issue Dec 1, 2024 · 3 comments · Fixed by #800

@LeXofLeviafan (Collaborator) commented Dec 1, 2024

test_bukuDb.py includes a test bookmark for URL 'http://www.zażółćgęśląjaźń.pl/'. To my understanding, this means such URLs are intended to be supported by Buku.

However, when parsing it with urllib3.util.parse_url(), the .netloc of the produced URL is encoded with punycode: 'www.xn--zaglja-cxa0mpa5p6q5a80a6ota.pl'. This might be considered "correct" behaviour in the technical sense (since the RFC limits allowed hostname characters to ASCII), but not for the purpose Buku uses it for, which is extracting the relevant substring. (By comparison, Bukuserver uses urllib.parse.urlparse(), which produces the expected value 'www.zażółćgęśląjaźń.pl'.)
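For illustration, a minimal stdlib-only sketch of the difference (the urllib3 result is shown only in a comment, since reproducing it requires urllib3 plus the idna package):

```python
from urllib.parse import urlparse

url = 'http://www.zażółćgęśląjaźń.pl/'

# urllib.parse.urlparse() keeps the hostname as-is:
print(urlparse(url).netloc)  # www.zażółćgęśląjaźń.pl

# By contrast, urllib3.util.parse_url(url).netloc (with the idna
# package installed) yields the punycode-encoded form:
#   www.xn--zaglja-cxa0mpa5p6q5a80a6ota.pl
```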

Additionally, using parse_url() to parse URLs with Unicode hostnames implicitly requires the idna library to be installed (urllib3 v2+ includes this information in the cause exception, while v1.26.x just raises a generic "failed to parse" error when invoked with such a URL).

For these reasons, I strongly suggest that URL parsing in Buku be switched from urllib3.util.parse_url() back to urllib.parse.urlparse().

@jarun (Owner) commented Dec 2, 2024

That would be a lot of work though. Do you have the time?
And it may also invite unforeseen issues.

@LeXofLeviafan (Collaborator, Author) commented Dec 2, 2024

I mean, it's only actually used in 4 places…

The only major difference I foresee here is that the "bad URL" check would become much less likely to fail due to a parsing error (which, as far as I can tell, only happens in urllib when parsing malformed IPv6 URLs). And of course it wouldn't be raising an exception defined by urllib3.
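A quick sketch of that failure mode (assuming CPython's urllib.parse, which raises ValueError for malformed IPv6 bracket syntax in the netloc):

```python
from urllib.parse import urlparse

# urlparse accepts almost any string; practically the only parse error
# it raises is for an unmatched IPv6 bracket in the authority part:
try:
    urlparse('http://[::1/')  # '[' without a closing ']'
except ValueError as e:
    print(e)  # Invalid IPv6 URL
```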

…I've actually discovered this while attempting to add netloc to the list of supported sort fields (and subsequently trying to figure out why netloc is obtained differently in different source files). I don't believe that sorting by punycode-encoded netloc would produce the same results as when using the non-encoded one (and displaying such an encoded netloc to the user – in UI or console output – would certainly be treated as a bug) 😅

The alternative would be to use both functions in the same source file (one to obtain netloc, the other for everything else).

@jarun (Owner) commented Dec 3, 2024

I would rather have 1 version only. Keep the changes as minimal as possible.

LeXofLeviafan added a commit to LeXofLeviafan/buku that referenced this issue Dec 8, 2024
@jarun jarun closed this as completed in #800 Dec 8, 2024