Clean list of domains to visit by removing invalid entries (non-websites, etc.) #21
Comments
I tried out PyFunceble in #23. Only 45 of the top 2k sites are not Another issue we have to deal with that might not be covered by #18 (comment) is all the content-hosting subdomains and URL shorteners that are present in the Majestic list (e.g. goo.gl, t.co, wp.com, 1.bp.blogspot.com). It's really a shame that Alexa is out of date, because it seemed much more skewed towards sites people actually visit.
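One way to handle the content-hosting subdomains and URL shorteners mentioned above is a small suffix filter. This is only a sketch: the function names and the `NON_SITE_SUFFIXES` tuple are hypothetical, and the entries are just the examples named in the comment (with `blogspot.com` assumed as the parent of `1.bp.blogspot.com`), not a vetted blocklist.

```python
# Hypothetical blocklist built from the examples in the comment above.
# A real list would need to be curated by hand.
NON_SITE_SUFFIXES = ("goo.gl", "t.co", "wp.com", "blogspot.com")


def is_content_host_or_shortener(domain, suffixes=NON_SITE_SUFFIXES):
    """Return True if domain equals a listed suffix or is a subdomain of one."""
    domain = domain.lower().rstrip(".")
    # Match on a label boundary so "reddit.co" is not mistaken for "t.co".
    return any(domain == s or domain.endswith("." + s) for s in suffixes)


def drop_non_sites(domains, suffixes=NON_SITE_SUFFIXES):
    """Keep only the domains that do not match the blocklist, preserving order."""
    return [d for d in domains if not is_content_host_or_shortener(d, suffixes)]
```

Matching on the `"." + suffix` boundary rather than a plain `endswith` is the main design choice here; it avoids dropping unrelated domains that merely end with the same characters.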
Have a nice day/night. Cheers,
fwiw, my personal experience is that I have to do at least one additional pass on my INACTIVE hosts before I consider them INACTIVE. I have a local pfSense box running Unbound as my DNS resolver, with Quad9 set as the upstream resolver. I don't know if it's that PyFunceble tries to process the list very fast, but I almost think the INACTIVE list should be automatically processed at least one more time before a host is finally marked INACTIVE.
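The extra pass described above could be sketched roughly as follows. This does not use PyFunceble itself: `confirm_inactive` and its `is_active` parameter are hypothetical, and `is_active` stands in for whatever availability check is actually used (any callable that takes a domain and returns a bool).

```python
import time


def confirm_inactive(domains, is_active, extra_passes=1, delay_seconds=0.0):
    """Return the domains that fail the first check and every extra pass.

    is_active is a hypothetical callable: domain -> bool.
    """
    # First pass: anything that looks inactive becomes a suspect.
    suspects = [d for d in domains if not is_active(d)]
    for _ in range(extra_passes):
        if not suspects:
            break
        # Optional pause, in case the failures came from checking too fast.
        time.sleep(delay_seconds)
        # Re-check only the suspects; those that respond now are cleared.
        suspects = [d for d in suspects if not is_active(d)]
    return suspects
```

Only the suspects are re-checked, so the extra pass costs little when most of the list is reachable.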
@jawz101 Interesting behavior. Do you have some data for me? Also, did you try to pass your DNS server IP directly to PyFunceble? Cheers,
This should have been resolved by #23. At this point though we switched to the Tranco list in #45, which seems to mostly obviate the need to validate domains.
We could make a little script using https://funilrys.github.io/PyFunceble/ that will clean our list of domains to visit before we visit them. This should speed up the crawl and reduce our error rate, as most failures are caused by unreachable websites: #18 (comment)
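As a rough illustration of such a cleaning script, here is a minimal sketch. It is an assumption-heavy stand-in: it uses a plain DNS lookup from the standard library instead of PyFunceble, it assumes the input is one domain per line, and the names `resolves` and `clean_domain_list` are hypothetical.

```python
import socket


def resolves(domain):
    """Stand-in availability check: True if the name resolves via DNS.

    This is NOT PyFunceble's check, only a simple stdlib approximation.
    """
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        return False


def clean_domain_list(lines, is_reachable=resolves):
    """Normalize, de-duplicate, and keep only the domains that pass the check."""
    seen = set()
    cleaned = []
    for line in lines:
        domain = line.strip().lower()
        # Skip blanks, comment lines, and domains already seen.
        if not domain or domain.startswith("#") or domain in seen:
            continue
        seen.add(domain)
        if is_reachable(domain):
            cleaned.append(domain)
    return cleaned
```

The check is passed in as a parameter so it can be swapped for a PyFunceble-based one, or for a fake during testing, without touching the filtering logic.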