URLParser and WaybackURLKeyMaker fail on URLs with IPv6 address hostname #100

Merged
merged 1 commit into from
Nov 27, 2024

sebastian-nagel
Contributor

URLParser fails to parse URLs/URIs whose host is an IPv6 address. Consequently, WaybackURLKeyMaker fails to make the SURT key:

2024-11-26 11:07:55,243 ERROR o.c.u.WarcCdxWriter [pool-6-thread-1] Failed to make SURT for https://[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]/robots.txt: java.net.URISyntaxException: bad port 1f18:200d:fb00:2b74:867c:ab0c:150a]: https://[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]/robots.txt
        at org.archive.url.URLParser.parse(URLParser.java:257)
        at org.archive.url.WaybackURLKeyMaker.makeKey(WaybackURLKeyMaker.java:60)
        at org.commoncrawl.util.WarcCdxWriter.writeCdxLine(WarcCdxWriter.java:141)
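
The "bad port" text in the message is consistent with the authority being cut at its first colon, although the log alone does not show how URLParser actually splits it. A minimal sketch of that failure mode, where `naive_split_host_port` is a hypothetical helper written for illustration and not part of any library:

```python
def naive_split_host_port(authority):
    """Hypothetical naive splitter: cut the authority at the FIRST colon.

    This is an assumption about the failure mode, not URLParser's code.
    """
    host, sep, port = authority.partition(":")
    return host, (port if sep else None)


# The IPv6 literal from the log line above, without scheme and path.
host, port = naive_split_host_port("[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]")
print(host)
print(port)
```

With this splitting rule, everything after the first colon inside the brackets ends up in the port slot, which matches the string reported as the bad port.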

This PR fixes the parser failure. The enclosing [ and ] are stripped from IPv6 hosts so that the keys stay compatible with SURT keys generated by the Python surt module:

>>> from surt import surt
>>> surt("https://34.203.211.192/robots.txt")
'192,211,203,34)/robots.txt'
>>> surt("https://[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]/robots.txt")
'2600:1f18:200d:fb00:2b74:867c:ab0c:150a)/robots.txt'
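
One way to split host and port so that a bracketed IPv6 literal survives, and the brackets are dropped as in the surt output above, can be sketched as follows. This is an illustrative sketch only: `split_host_port` is a hypothetical name, it does not reproduce the Java change in this PR, and it ignores userinfo and port validation.

```python
def split_host_port(authority):
    """Split 'host[:port]' into (host, port), bracket-aware.

    Hypothetical helper for illustration; brackets around an IPv6
    literal are stripped from the returned host.
    """
    if authority.startswith("["):
        end = authority.find("]")
        if end == -1:
            raise ValueError("unterminated IPv6 literal: " + authority)
        host = authority[1:end]      # text between '[' and ']'
        rest = authority[end + 1:]   # either '' or ':port'
        if rest and not rest.startswith(":"):
            raise ValueError("unexpected text after ']': " + authority)
        return host, (rest[1:] or None)
    # Hostnames and IPv4 addresses contain at most one colon here.
    host, _, port = authority.partition(":")
    return host, (port or None)


print(split_host_port("[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]"))
print(split_host_port("[2600:1f18:200d:fb00:2b74:867c:ab0c:150a]:8443"))
print(split_host_port("34.203.211.192"))
```

Looking for the closing bracket before looking for a port separator means the colons inside the address are never mistaken for the start of a port.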

ato merged commit d589dd9 into iipc:master on Nov 27, 2024
5 checks passed
Member

ato commented Nov 27, 2024

Thanks. Released as 1.1.11.
