Registrant not consistent for a few TLDs #21

Closed
baderdean opened this issue Aug 30, 2023 · 20 comments

Comments

@baderdean

baderdean commented Aug 30, 2023

The registrant is sometimes not the real one. Let's take Google as an example. Here is the list of their domains: https://www.google.com/supported_domains

>>> import requests
>>> import whoisdomain
>>> r = requests.get("https://www.google.com/supported_domains")
>>> # drop the first character (the leading dot) of each listed domain
>>> domains = [domain[1:] for domain in r.text.splitlines()]
>>> registrants = {}  # registrant value -> list of domains returning it
>>> for domain in domains:
...     try:
...         w = whoisdomain.query(domain)
...     except Exception as e:
...         print(e)
...         continue
...     if hasattr(w, "registrant"):
...         registrants.setdefault(w.registrant, []).append(domain)

Here are the results:

{None: ['google.com.ag',
        'google.as',
        'google.bg',
        'google.com.bo',
        'google.cd',
        'google.fi',
        'google.ge',
        'google.gg',
        'google.hr',
        'google.im',
        'google.je',
        'google.kg',
        'google.kz',
        'google.com.ly',
        'google.co.ma',
        'google.mg',
        'google.com.mx',
        'google.com.ng',
        'google.nu',
        'google.ro',
        'google.se',
        'google.sn',
        'google.sm',
        'google.td',
        'google.com.tw',
        'google.co.ug',
        'google.ws',
        'google.rs'],
 '': ['google.ae',
      'google.com.ar',
      'google.com.au',
      'google.bf',
      'google.com.br',
      'google.de',
      'google.dk',
      'google.fm',
      'google.gl',
      'google.co.id',
      'google.ie',
      'google.co.il',
      'google.co.jp',
      'google.la',
      'google.lt',
      'google.lu',
      'google.lv',
      'google.mu',
      'google.nl',
      'google.no',
      'google.com.om',
      'google.pl',
      'google.pt',
      'google.com.qa',
      'google.ru',
      'google.tm',
      'google.com.ua',
      'google.co.uk',
      'google.co.za'],
 'ADMIN-LEO': ['google.co.ls'],
 'CN_10': ['google.co.cr'],
 'CON000020360': ['google.co.ve'],
 'G830057': ['google.si'],
 'GDA-ITFARM': ['google.co.tz'],
 'GI7803022-NICAT': ['google.at'],
 'GL210-IS': ['google.is'],
 'GOOGLE INC': ['google.bj'],
 'GOOGLE LLC (SGNIC-ORG1624232)': ['google.com.sg'],
 'Google Canada Corporation': ['google.ca'],
 'Google Inc.': ['google.co.zm'],
 'Google Ireland Holdings Unlimited Company': ['google.fr', 'google.it'],
 'Google Korea, LLC': ['google.co.kr'],
 'Google LLC': ['google.com',
                'google.com',
                'google.am',
                'google.bi',
                'google.by',
                'google.ci',
                'google.cl',
                'google.cm',
                'google.com.co',
                'google.dm',
                'google.ee',
                'google.com.gi',
                'google.co.in',
                'google.co.ke',
                'google.com.lb',
                'google.me',
                'google.mn',
                'google.co.mz',
                'google.com.na',
                'google.com.pe',
                'google.com.pr',
                'google.rw',
                'google.com.sa',
                'google.sc',
                'google.sh',
                'google.com.sl',
                'google.so',
                'google.st',
                'google.tn',
                'google.com.tr',
                'google.com.vc',
                'google.cat'],
 'Google LLC (กูเกิล แอลแอลซี)': ['google.co.th'],
 'HONG KONG INTERNET HOLDING LIMITED': ['google.com.hk'],
 'MM1171195': ['google.cz'],
 'Not shown, please visit www.dnsbelgium.be for webbased whois.': ['google.be'],
 'Techno Bros. IT Solution Pty. Ltd.': ['google.com.et'],
 'UNET-R11': ['google.mk'],
 'mmr-170347': ['google.sk'],
 '北京谷翔信息技术有限公司': ['google.cn']}

Examples of wrong values:

  • 'MM1171195': ['google.cz'],
  • 'UNET-R11': ['google.mk'],
  • 'mmr-170347': ['google.sk'],
  • 'ADMIN-LEO': ['google.co.ls'],
  • 'CN_10': ['google.co.cr'],
  • 'CON000020360': ['google.co.ve'],
  • 'G830057': ['google.si'],
  • 'GDA-ITFARM': ['google.co.tz'],
  • 'GI7803022-NICAT': ['google.at'],
  • 'GL210-IS': ['google.is'],
  • '' and None for a few of them

For most of them, it could easily be fixed by using the registrant's organization, as in the examples below:

❯ whois google.cz
%  (c) 2006-2021 CZ.NIC, z.s.p.o.
% 
% Intended use of supplied data and information
% 
% Data contained in the domain name register, as well as information
% supplied through public information services of CZ.NIC association,
% are appointed only for purposes connected with Internet network
% administration and operation, or for the purpose of legal or other
% similar proceedings, in process as regards a matter connected
% particularly with holding and using a concrete domain name.
% 
% Full text available at:
% http://www.nic.cz/page/306/intended-use-of-supplied-data-and-information/
% 
% See also a search service at http://www.nic.cz/whois/
% 
% 
% Whoisd Server Version: 3.12.2
% Timestamp: Wed Aug 30 13:06:30 2023

domain:       google.cz
registrant:   MM1171195
admin-c:      MM1171195
nsset:        MM1543911
registrar:    REG-MARKMONITOR
registered:   21.07.2000 15:21:00
changed:      23.04.2018 20:24:01
expire:       22.07.2024

contact:      MM1171195
org:          Google LLC
name:         Domain Administrator
address:      1600 Amphitheatre Parkway
address:      Mountain View
address:      94043
address:      CA
address:      US
registrar:    REG-MARKMONITOR
created:      02.03.2018 18:52:05
changed:      15.05.2018 21:32:00

nsset:        MM1543911
nserver:      ns2.google.com 
nserver:      ns4.google.com 
nserver:      ns3.google.com 
nserver:      ns1.google.com 
tech-c:       MM193020
registrar:    REG-MARKMONITOR
created:      18.05.2011 23:27:16

contact:      MM193020
org:          MarkMonitor Inc.
name:         Domain Provisioning
address:      2150 S Bonito Way
address:      Suite 150
address:      Meridian
address:      83642
address:      ID
address:      US
registrar:    REG-MARKMONITOR
created:      03.02.2011 18:24:34
changed:      29.06.2021 23:29:20

or

❯ whois google.sk
Domain:                       google.sk
Created:                      2003-07-24
Valid Until:                  2024-07-24
Updated:                      2023-06-22
Domain Status:                clientTransferProhibited, clientUpdateProhibited, clientDeleteProhibited
Nameserver:                   ns1.google.com
Nameserver:                   ns2.google.com
Nameserver:                   ns3.google.com
Nameserver:                   ns4.google.com

Domain registrant:            mmr-170347
Name:                         Domain Administrator
Organization:                 Google Ireland Holdings Unlimited Company
Organization ID:              369511
Phone:                        +353.14361000
Email:                        dns-admin@google.com
Street:                       70 Sir John Rogerson's Quay
City:                         Dublin
Postal Code:                  2
Country Code:                 IE
Authorised Registrar:         MARK-0292
Created:                      2019-06-07
Updated:                      2019-06-07

Registrar:                    MARK-0292
Name:                         MarkMonitor International Limited
Organization:                 MarkMonitor International Limited
Organization ID:              4847541
Phone:                        +1.2083895740
Email:                        registry.admin@markmonitor.com
Street:                       12 New Fetter Lane
City:                         London
Postal Code:                  EC4A 1JP
Country Code:                 UK
Created:                      2018-06-27
Updated:                      2023-08-03

Administrative Contact:       mmr-170347
Name:                         Domain Administrator
Organization:                 Google Ireland Holdings Unlimited Company
Organization ID:              369511
Phone:                        +353.14361000
Email:                        dns-admin@google.com
Street:                       70 Sir John Rogerson's Quay
City:                         Dublin
Postal Code:                  2
Country Code:                 IE
Created:                      2019-06-07
Updated:                      2019-06-07

Technical Contact:            mmr-170347
Name:                         Domain Administrator
Organization:                 Google Ireland Holdings Unlimited Company
Organization ID:              369511
Phone:                        +353.14361000
Email:                        dns-admin@google.com
Street:                       70 Sir John Rogerson's Quay
City:                         Dublin
Postal Code:                  2
Country Code:                 IE
Created:                      2019-06-07
Updated:                      2019-06-07
@mboot-github
Owner

mboot-github commented Aug 31, 2023 via email

@baderdean
Author

baderdean commented Aug 31, 2023

IMHO, the registrant should be "Google LLC" rather than "MM1171195".

@mboot-github
Owner

mboot-github commented Aug 31, 2023 via email

@baderdean
Author

I can handle this PR. It will not be so hard if I use the same technique as I did for .fr.
Regarding GDPR, it's the organization's name, not a person's name, so it does not apply.
And even if it does apply, the data processor here is the user, not the tool itself.
The only issue a user may have with this tool, and it already exists, is the persistent file storage cache mechanism, yet it's unrelated to this specific issue.

@mboot-github
Owner

mboot-github commented Aug 31, 2023 via email

@mboot-github
Owner

mboot-github commented Sep 1, 2023 via email

@baderdean
Author

baderdean commented Sep 1, 2023

It's not stable, but it's much better than it was. We can make it even better when there is more than one registrant: compare each registrant with the registrar (case-insensitively, via .lower()) and pick the first registrant that differs from the registrar.
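
A minimal sketch of that selection rule, assuming the parser has already collected a list of registrant candidates and a registrar string. The function name `pick_registrant` and the fallback to the first non-empty candidate are assumptions made for illustration, not part of whoisdomain:

```python
from typing import List, Optional


def pick_registrant(candidates: List[str], registrar: Optional[str]) -> Optional[str]:
    """Return the first non-empty candidate that differs from the registrar."""
    registrar_key = (registrar or "").strip().lower()
    for candidate in candidates:
        name = candidate.strip()
        # Case-insensitive comparison, as suggested with .lower()
        if name and name.lower() != registrar_key:
            return name
    # Assumption: if every candidate equals the registrar, fall back to
    # the first non-empty candidate rather than returning nothing.
    for candidate in candidates:
        if candidate.strip():
            return candidate.strip()
    return None


print(pick_registrant(["MarkMonitor Inc.", "Google LLC"], "markmonitor inc."))
```

Whether to fall back or return None when all candidates match the registrar is a design choice; the fallback keeps the previous behaviour for TLDs that only expose one value.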

@mboot-github
Owner

mboot-github commented Sep 1, 2023 via email

@mboot-github
Owner

ok, new code is now available in tld_regexpr that converts all existing re strings to functions.
see the def R(reString)

the function is called by whoisParser.py as func(textString) and should return a [str]

this means it is now easy to add more complicated parsing than one single regex
we can now make a func like:

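import re
from typing import Callable, List

def contextExtract(beginRe: str, endRe: str, findRe: str) -> Callable[[str], List[str]]:
    def f(textString: str) -> List[str]:
        # search for findRe only in the section between beginRe and endRe
        section = re.search(beginRe + "(.*?)" + endRe, textString, re.DOTALL)
        return re.findall(findRe, section.group(1)) if section else []

    return f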

to limit the search to a particular section of the whois cli result

the road is open for much more targeted searches,
even based on information from other parts of the whois response

@mboot-github
Owner

i updated the "sk" tld to use the new contextual extract
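
For illustration only, here is roughly what a contextual extract could look like on the google.sk response quoted earlier in this thread. This is a sketch, not the code in tld_regexpr: the helper name `find_in_section`, the choice of a blank line as the section end, and the regexes are all assumptions.

```python
import re
from typing import List

# Abridged from the google.sk output quoted earlier in this thread.
SK_SAMPLE = """\
Domain:                       google.sk
Nameserver:                   ns1.google.com

Domain registrant:            mmr-170347
Name:                         Domain Administrator
Organization:                 Google Ireland Holdings Unlimited Company
Country Code:                 IE

Registrar:                    MARK-0292
Name:                         MarkMonitor International Limited
Organization:                 MarkMonitor International Limited
"""


def find_in_section(text: str, begin_re: str, end_re: str, find_re: str) -> List[str]:
    """Hypothetical helper: search find_re only between begin_re and the next end_re."""
    section = re.search(begin_re + r"(.*?)(?:" + end_re + r"|\Z)", text, re.DOTALL)
    if section is None:
        return []
    return re.findall(find_re, section.group(1))


registrant_org = find_in_section(
    SK_SAMPLE,
    begin_re=r"Domain registrant:",
    end_re=r"\n\s*\n",  # assumption: a blank line closes the registrant block
    find_re=r"Organization:[ \t]*(.+)",
)
print(registrant_org)
```

Restricting the search to the registrant block is what keeps the registrar's "Organization:" line from being picked up.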

@mboot-github
Owner

added findFromToAndLookForWithFindFirst: a contextual search based on a previous findFirst,
used in the "fr" tld,
example: google.fr,
{} is used to add the findFirst result to fromStr
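
The idea of a search keyed on a previous findFirst can be sketched as follows. This is not the actual findFromToAndLookForWithFindFirst implementation; the helper name `find_with_first`, its parameters and the regexes are hypothetical. The sample is the google.cz response quoted earlier in this thread (abridged), since that is the handle-based output available here.

```python
import re
from typing import List

# Abridged from the google.cz output quoted earlier in this thread.
CZ_SAMPLE = """\
domain:       google.cz
registrant:   MM1171195
registrar:    REG-MARKMONITOR

contact:      MM1171195
org:          Google LLC
name:         Domain Administrator

contact:      MM193020
org:          MarkMonitor Inc.
name:         Domain Provisioning
"""


def find_with_first(
    text: str, first_re: str, from_tpl: str, to_re: str, look_for_re: str
) -> List[str]:
    """Hypothetical helper: use the first_re match to build the section start."""
    first = re.search(first_re, text)
    if first is None:
        return []
    # "{}" in from_tpl is replaced by the (escaped) value captured by first_re.
    from_re = from_tpl.format(re.escape(first.group(1)))
    section = re.search(from_re + r"(.*?)(?:" + to_re + r"|\Z)", text, re.DOTALL)
    if section is None:
        return []
    return re.findall(look_for_re, section.group(1))


org = find_with_first(
    CZ_SAMPLE,
    first_re=r"registrant:\s*(\S+)",  # step 1: find the registrant handle
    from_tpl=r"contact:\s*{}",        # step 2: jump to that handle's contact block
    to_re=r"\n\s*\n",                 # assumption: a blank line ends the block
    look_for_re=r"org:\s*(.+)",       # step 3: read the organization inside it
)
print(org)
```

The handle is resolved to the matching contact block, so the MarkMonitor contact further down is never searched.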

@mboot-github
Owner

how would you want to handle

  • whois pik.bzh -h whois.nic.bzh

@baderdean
Author

baderdean commented Sep 4, 2023

OK, we could do it progressively:

  1. registrant is by default the organization, falling back to the name (when redacted for privacy, return the value anyway; yet we should also keep a set of the different "redacted for privacy" values and have an option to skip values in this case)
  2. we'll add a tree-like structure as in this approach https://www.netmeister.org/blog/whois.html or the whoisit one (IMHO, we should have both: a flat, easy one and an automatically nested one in an advanced mode)
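
Step 1 could be sketched like this. The function name `choose_registrant`, the `skip_redacted` option and the entries in `REDACTED_VALUES` are assumptions made for illustration; a real list of privacy placeholders would have to be collected from actual registry responses.

```python
from typing import Iterable, Optional

# Assumed examples of privacy placeholders; a real list would need curating.
REDACTED_VALUES = frozenset({
    "redacted for privacy",
    "redacted",
    "not disclosed",
})


def choose_registrant(
    organization: Optional[str],
    name: Optional[str],
    skip_redacted: bool = False,
    redacted_values: Iterable[str] = REDACTED_VALUES,
) -> Optional[str]:
    """Prefer the organization, then the name; optionally skip redacted values."""
    redacted = {value.lower() for value in redacted_values}
    for candidate in (organization, name):
        if candidate is None:
            continue
        value = candidate.strip()
        if not value:
            continue
        if skip_redacted and value.lower() in redacted:
            continue
        return value
    return None


print(choose_registrant("Google LLC", "Domain Administrator"))
print(choose_registrant("REDACTED FOR PRIVACY", "Domain Administrator", skip_redacted=True))
```

By default the redacted text is returned unchanged, which matches "return the value anyway"; the option only changes behaviour for callers who ask for it.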

FYI, I've benchmarked whoisdomain, asyncwhois and whoisit against this registrant name issue and for performance. It appears that whoisdomain is the fastest and a close second in terms of quality. The test has to be reproduced on other machines because of network/caching issues.

❯ ./whoisdomain-benchmark.py 
{'asyncwhois': {'count': 49,
                'duration': 285.84409061399947,
                'end': 6710.12183199,
                'percentage': 26,
                'start': 6424.277741376},
 'whoisdomain': {'count': 44,
                 'duration': 195.54051797400007,
                 'end': 6396.365353012,
                 'percentage': 24,
                 'start': 6200.824835038},
 'whoisit': {'count': 6,
             'duration': 27.91238160300054,
             'end': 6424.27773957,
             'percentage': 3,
             'start': 6396.365357967}}

@mboot-github
Owner

added the proper parsing for google from your test case #21 (comment). some actually have no organization or name (google.si, google.co.tz), some have no data or no registrant. a similar test for meta would be nice if that is possible

@baderdean
Author

baderdean commented Sep 6, 2023

What are meta words? Could you describe the test case you want a little bit? I could write it.
Thanks for your quick PR, I thought I would do it.

I think that in the long term, an approach similar to JSWhois is the most interesting: https://github.com/jschauma/jswhois/blob/main/src/jswhois.go / https://www.netmeister.org/blog/whois.html i.e.:
(0. write pythonic code (adopt PEP8, black, ruff))

  1. define a common nomenclature, flat AND tree based (for multiple entities) loosely inspired from RDAP but much simpler
  2. categorize TLD by their NIC Whois format type
  3. do some custom parsing capabilities for some TLDs that have a derived version of format
  4. add RDAP support to this library for some TLD like .be who dropped Whois
  5. add unit tests for every TLD - at least 3 different would be better
    (Nice to have: 6. detect "privacy" comments)

@mboot-github
Owner

JSwhois is certainly interesting to look at (item 2 could be derived from work at jswhois)

  1. currently adding rdap would mean introducing an additional dependency, and parsing rdap is not that easy as they go on endlessly in lists ;-)
  2. the option: withRedacted will include text that would normally be redacted

i finished all registrars i could find for the google test: see

cd /tests/ 
make google

@mboot-github
Owner

meta is all domains owned by facebook

see: whois meta.com

@baderdean
Author

OK, for meta.com it was a bit hard, so I tried with various facebook domains.

It gave me this result (quite good, actually) on an old release, so it should be even better now:

✦ ❯ ./whoisdomain-benchmark.py facebook 
Whois domain with parsers: ['whoisdomain', 'whoisit', 'asyncwhois', 'pythonwhoisalt'] on organization FACEBOOK for domains ['facebook.cn', 'facebook.com', 'facebook.fr', 'facebook.de', 'facebook.co.uk', 'facebook.net', 'facebook.gr', 'facebook.nl', 'facebook.se', 'facebook.fi', 'facebook.ae', 'facebook.cm', 'facebook.co', 'facebook.it', 'facebook.es', 'facebook.za', 'facebook.ca', 'facebook.pl', 'facebook.su', 'facebook.ru', 'facebook.tw', 'facebook.jp', 'facebook.au', 'facebook.nz', 'facebook.ar', 'facebook.mx', 'facebook.is', 'facebook.io']
{'asyncwhois': {'count': 3, 'duration': 9.292810824001208, 'percentage': '11%'},
 'pythonwhoisalt': {'count': 2,
                    'duration': 11.09157811201294,
                    'percentage': '7%'},
 'whoisdomain': {'count': 6,
                 'duration': 12.752617797988933,
                 'percentage': '21%'},
 'whoisit': {'count': 2, 'duration': 5.797948399995221, 'percentage': '7%'}}

@rl-devops
Contributor

thanks, (.za has no whois server, nz has no registrar in the response)
i added a test for facebook that we can expand on

i will see if i can make a minimal rdap client we could use
and add some rdap info to the ZZ dict in tld_regexpr.py

currently adding public suffix list info (if library 'tld' is included on the running platform)
and i'm thinking of replacing all if self.pc.verbose: print stderr with: logging.debug(msg)
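
A small sketch of that replacement, using only the standard logging module. The logger name and the toy `parse` function are hypothetical stand-ins, not code from the project:

```python
import logging
import sys

logger = logging.getLogger("whois_sketch")  # hypothetical logger name


def parse(text: str) -> int:
    """Toy stand-in for a parsing step that reports what it is doing."""
    # Before: if verbose: print(f"parsing {len(text)} chars", file=sys.stderr)
    # After: always emit a debug record and let the logging level decide.
    logger.debug("parsing %d chars", len(text))
    return len(text)


if __name__ == "__main__":
    # The caller opts in to verbose output once, instead of passing a flag around.
    logging.basicConfig(stream=sys.stderr, level=logging.DEBUG)
    parse("domain: example.com")
```

With this shape the library itself never prints; whether debug messages appear is decided by whoever configures logging.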

@mboot-github
Owner

shall we close this one for now? we can add new issues for future work on rdap and on a json response with grouping
