1,2,3-octet/hexadecimal hostnames detected as IPv4 addresses #290

Closed
elliotwutingfeng opened this issue May 20, 2023 · 1 comment · Fixed by #292

Contributor

elliotwutingfeng commented May 20, 2023

The following inputs are recognized as IPv4 addresses due to the use of socket.inet_aton().

1.1.1 -> domain parsed as 1.1.1
1.1 -> domain parsed as 1.1
1 -> domain parsed as 1 (output is still correct nonetheless)

The above is legacy behavior from UNIX's inet_aton for classful networks, a network addressing architecture made obsolete in 1993.
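A minimal sketch of the short-form behavior, assuming a platform whose C library follows the traditional inet_aton rules (glibc does; other platforms may differ). With fewer than four parts, the last part is widened to fill the remaining bytes of the address:

```python
import socket

# With fewer than four parts, the final part is read as a wider integer
# that fills all remaining bytes of the 32-bit address.
for text in ("1.1.1", "1.1", "1"):
    packed = socket.inet_aton(text)
    print(text, "->", socket.inet_ntoa(packed))
```

On such a platform, 1.1.1 expands to 1.1.0.1, 1.1 to 1.0.0.1, and 1 to 0.0.0.1, which is why none of these inputs raise an error.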

01.01.01.01 -> domain parsed as 01.01.01.01
01.01.01 -> domain parsed as 01.01.01
01.01 -> domain parsed as 01.01
01 -> domain parsed as 01 (output is still correct nonetheless)

0x1.0x1.0x1.0x1 -> domain parsed as 0x1.0x1.0x1.0x1
0x1.0x1.0x1 -> domain parsed as 0x1.0x1.0x1
0x1.0x1 -> domain parsed as 0x1.0x1
0x1 -> domain parsed as 0x1 (output is still correct nonetheless)
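The zero-padded and 0x-prefixed inputs are accepted for a related reason: inet_aton reads a leading 0 as octal and a leading 0x as hexadecimal. A small sketch, again assuming glibc-style behavior (the third input is an extra illustration, not taken from the report):

```python
import socket

# A leading "0" selects octal and a leading "0x" selects hexadecimal,
# so these strings are all converted instead of being rejected.
for text in ("01.01.01.01", "0x1.0x1.0x1.0x1", "010.0x10.1.1"):
    print(text, "->", socket.inet_ntoa(socket.inet_aton(text)))
```

Under that assumption the first two inputs both resolve to 1.1.1.1, and 010.0x10.1.1 resolves to 8.16.1.1.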

Given that tldextract's regex-based ipv4() function only recognizes IPv4 addresses with 4 decimal octets without zero padding, this is probably a bug.
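For comparison, here is a sketch of what a strict pattern of that kind could look like. The pattern and the helper name is_strict_ipv4 are hypothetical and are not copied from tldextract's source:

```python
import re

# Hypothetical pattern, not tldextract's actual regex: one octet is 0-255
# with no leading zeros, and exactly four octets are required.
OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9]?[0-9])"
STRICT_IPV4_RE = re.compile(rf"{OCTET}(?:\.{OCTET}){{3}}")


def is_strict_ipv4(text: str) -> bool:
    """Return True only for four decimal octets without zero padding."""
    return STRICT_IPV4_RE.fullmatch(text) is not None


for candidate in ("1.1.1.1", "1.1.1", "01.01.01.01", "0x1.0x1.0x1.0x1"):
    print(candidate, "->", is_strict_ipv4(candidate))
```

Only the first candidate matches, which illustrates the mismatch with the inet_aton-based check.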

It can be fixed by using socket.inet_pton() in looks_like_ip() instead of socket.inet_aton(). However, socket.inet_pton() is only available on Unix, Unix-like, and Windows systems, and some of those systems may not provide it.
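A rough sketch of the inet_pton approach. The helper name is_ipv4_pton is hypothetical and this is not the actual looks_like_ip() implementation. On platforms where the function might be missing, hasattr(socket, "inet_pton") can be checked first:

```python
import socket


def is_ipv4_pton(text: str) -> bool:
    """Hypothetical helper: dotted-quad check via socket.inet_pton."""
    try:
        socket.inet_pton(socket.AF_INET, text)
    except (OSError, ValueError):
        # OSError: the string is not a valid IPv4 address.
        # ValueError: the string cannot be passed to C (embedded null).
        return False
    return True


for candidate in ("1.1.1.1", "1.1.1", "0x1.0x1.0x1.0x1"):
    print(candidate, "->", is_ipv4_pton(candidate))
```

Unlike inet_aton, this rejects the short and hexadecimal forms; on glibc it also rejects zero-padded octets.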

A more portable fix would be using ipaddress.IPv4Address, though it is much slower.
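A sketch of the ipaddress-based alternative, with a hypothetical helper name. Note that the handling of leading zeros in ipaddress.IPv4Address has changed between Python releases, so this sketch makes no claim about inputs like 01.01.01.01:

```python
import ipaddress


def is_ipv4_ipaddress(text: str) -> bool:
    """Hypothetical helper: IPv4 check using only the ipaddress module."""
    try:
        ipaddress.IPv4Address(text)
    except ipaddress.AddressValueError:
        return False
    return True


for candidate in ("1.1.1.1", "1.1.1", "0x1.0x1.0x1.0x1"):
    print(candidate, "->", is_ipv4_ipaddress(candidate))
```

This is pure Python, so it works wherever the standard library does, at the cost of the extra overhead mentioned above.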

If suffix_index == len(labels) == 4, are there any edge cases not covered by IP_RE?

elliotwutingfeng changed the title from "1,2,3-octet hostnames detected as IPv4 addresses" to "1,2,3-octet/hexadecimal hostnames detected as IPv4 addresses" on May 21, 2023
@john-kurkowski
Owner

Thank you for the thorough report.

> It can be fixed by using socket.inet_pton() in looks_like_ip() instead of socket.inet_aton(). However, socket.inet_pton() is only available on Unix, Unix-like, and Windows systems, and some of those systems may not provide it.
>
> A more portable fix would be using ipaddress.IPv4Address, though it is much slower.

Maybe try socket.inet_pton, and if it's unavailable for the system, fall back to ipaddress.IPv4Address?

john-kurkowski added a commit that referenced this issue May 26, 2023
…th unicode dots. (#292)

- IPv4 addresses with unicode dots are now recognized. Closes #287
- IPv4 addresses must have 4 decimal octets. Closes #290

---------

Co-authored-by: John Kurkowski <john.kurkowski@gmail.com>