Clean list of domains to visit by removing invalid entries (non-websites, etc.) #21

Closed
ghostwords opened this issue Jul 25, 2018 · 5 comments
Labels
enhancement New feature or request

Comments

@ghostwords
Member

We could make a little script using https://funilrys.github.io/PyFunceble/ that will clean our list of domains to visit before we visit them. This should speed up the crawl and reduce our error rate, as most failures are caused by unreachable websites: #18 (comment)
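A minimal sketch of what such a cleaning script could look like. It does not use PyFunceble; it swaps in a plain DNS lookup from the Python standard library as a stand-in reachability check, and the names `resolves` and `clean_domains` are hypothetical. The check is passed in as a parameter so a PyFunceble-based test could replace it.

```python
import socket
from typing import Callable, Iterable, List


def resolves(domain: str) -> bool:
    """Return True if the system resolver finds at least one address."""
    try:
        socket.getaddrinfo(domain, None)
        return True
    except (socket.gaierror, UnicodeError):
        # gaierror: lookup failed; UnicodeError: the name is not a valid hostname.
        return False


def clean_domains(
    domains: Iterable[str],
    is_reachable: Callable[[str], bool] = resolves,
) -> List[str]:
    """Keep non-empty, de-duplicated domains that pass the reachability check."""
    seen = set()
    kept = []
    for raw in domains:
        domain = raw.strip().lower()
        if not domain or domain in seen:
            continue
        seen.add(domain)
        if is_reachable(domain):
            kept.append(domain)
    return kept
```

A DNS lookup only shows that a name resolves, not that a website answers on it, so this is a coarser filter than what the issue asks for.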

@bcyphers
Contributor

bcyphers commented Aug 31, 2018

I tried out PyFunceble in #23. Only 45 of the top 2k sites are not "ACTIVE", so the tool is helpful but not perfect. It's unclear how our scan is generating so many DNS NOT FOUND errors that PyFunceble seems to ignore.

Another issue we have to deal with that might not be covered by #18 (comment) is all the content-hosting subdomains and URL shorteners that are present in the Majestic list (e.g. goo.gl, t.co, wp.com, 1.bp.blogspot.com). It's really a shame that Alexa is out of date, because it seemed much more skewed towards sites people actually visit.
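One way to handle those entries, sketched under assumptions: a small hand-maintained set of non-site domains, matched against each list entry and all of its parent domains. The set below is hypothetical and seeded only from the examples in this comment; listing `bp.blogspot.com` rather than `1.bp.blogspot.com` is an assumption that the numbered hosts are all content servers.

```python
from typing import Iterable, List

# Hypothetical, hand-maintained set seeded from the examples in this thread.
NON_SITE_DOMAINS = {"goo.gl", "t.co", "wp.com", "bp.blogspot.com"}


def is_non_site(domain: str) -> bool:
    """True if the domain equals, or is a subdomain of, a listed entry."""
    labels = domain.strip().lower().rstrip(".").split(".")
    # Check every suffix, e.g. "1.bp.blogspot.com", "bp.blogspot.com", ...
    return any(
        ".".join(labels[i:]) in NON_SITE_DOMAINS for i in range(len(labels))
    )


def drop_non_sites(domains: Iterable[str]) -> List[str]:
    """Return the input domains, in order, minus shorteners and content hosts."""
    return [d for d in domains if not is_non_site(d)]
```

Matching on whole labels means a name like `scott.co` is not mistaken for `t.co`.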

@funilrys
Contributor

funilrys commented Sep 26, 2018

Hi there,
PyFunceble developer here!

Thanks for using PyFunceble!

Just wanted to say that I'm ready to take all remarks on PyFunceble. So if you find something that is not correct, or you have a question, please let me know. If you prefer to contact me personally, please do! You can contact me by email or Keybase.

I'd be interested to see how you run your scan to find DNS NOT FOUND, as it might help me improve PyFunceble and solve that problem. Is it a script or something else?

About subdomains, let me redirect you to the SPECIAL source documentation.
Indeed, we already cover blogspot domains but I'll be happy to improve my tool for other content-hosting providers.

Have a nice day/night.

Cheers,
Nissar

@jawz101

jawz101 commented Aug 23, 2019

fwiw - my personal experience is that I have to do at least one additional pass on my INACTIVE hosts before I consider them INACTIVE. I have a local pfSense box running Unbound as my DNS resolver with Quad9 set as the upstream resolver.

I don't know whether PyFunceble simply tries to process the list too fast, but I think the INACTIVE list should be automatically rechecked at least one more time before a domain is finally marked as INACTIVE.
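The extra pass described above could be sketched roughly as follows. This is not how PyFunceble works internally; `split_with_recheck` and its parameters are hypothetical, and `is_active` stands for whatever check is in use (a PyFunceble run, a DNS lookup, and so on).

```python
import time
from typing import Callable, Iterable, List, Tuple


def split_with_recheck(
    domains: Iterable[str],
    is_active: Callable[[str], bool],
    extra_passes: int = 1,
    delay_seconds: float = 0.0,
) -> Tuple[List[str], List[str]]:
    """Return (active, inactive), retrying inactive domains before giving up."""
    active: List[str] = []
    pending = list(domains)
    for attempt in range(1 + extra_passes):
        if attempt and delay_seconds:
            # Optional pause so a rate-limited resolver can recover.
            time.sleep(delay_seconds)
        still_inactive = []
        for domain in pending:
            if is_active(domain):
                active.append(domain)
            else:
                still_inactive.append(domain)
        pending = still_inactive
        if not pending:
            break
    return active, pending
```

Only domains that fail every pass end up in the second list, so a transient resolver failure on the first pass does not mark a domain as INACTIVE.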

@funilrys
Contributor

@jawz101 Interesting behavior. Do you have some data for me?

Also, did you try passing your DNS server IP directly to PyFunceble?

Cheers,
Nissar

@ghostwords
Member Author

This should have been resolved by #23. At this point though we switched to the Tranco list in #45, which seems to mostly obviate the need to validate domains.
