Improve host-level PageRanks #52

Open
1 of 3 tasks
sylvinus opened this issue Jul 31, 2016 · 1 comment

sylvinus commented Jul 31, 2016

As explained in our blog post, our host-level PageRank is still experimental and remains highly susceptible to spam.

Here is a list of our current ideas to improve it, feel free to contribute yours!

  • Don't follow rel=nofollow links
  • Better weights on the edges (treat links between subdomains differently? give less weight to links in the boilerplate and/or at the end of the page? give more weight depending on the number of distinct pages linking to the domain?)
  • Try to group domains belonging to the same owner (by IP address/DNS info? See Import DNS metadata #15)
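
To make the first two ideas concrete, here is a minimal sketch of a weighted host-level PageRank by power iteration. It is not the project's implementation: the `host_pagerank` function, the `(src_host, dst_host, weight, nofollow)` tuple layout, and the default damping factor are all assumptions made for illustration. Nofollow links and self-links are dropped, and each remaining edge passes on rank in proportion to its weight.

```python
from collections import defaultdict


def host_pagerank(links, damping=0.85, iterations=50):
    """Weighted PageRank over hosts (hypothetical helper, not the project's code).

    links: iterable of (src_host, dst_host, weight, nofollow) tuples.
    """
    out_weights = defaultdict(dict)
    hosts = set()
    for src, dst, weight, nofollow in links:
        hosts.update((src, dst))
        # Skip rel=nofollow links, self-links, and non-positive weights.
        if nofollow or src == dst or weight <= 0:
            continue
        out_weights[src][dst] = out_weights[src].get(dst, 0.0) + weight

    n = len(hosts)
    if n == 0:
        return {}

    rank = {h: 1.0 / n for h in hosts}
    for _ in range(iterations):
        # Hosts without usable outlinks spread their rank uniformly.
        dangling = sum(rank[h] for h in hosts if h not in out_weights)
        base = (1.0 - damping) / n + damping * dangling / n
        new_rank = {h: base for h in hosts}
        for src, targets in out_weights.items():
            total = sum(targets.values())
            for dst, w in targets.items():
                # Each edge passes on rank in proportion to its weight.
                new_rank[dst] += damping * rank[src] * w / total
        rank = new_rank
    return rank
```

The weight is where the open questions in the list would plug in: a link found in boilerplate could get a smaller weight, and a link between subdomains of the same domain could be down-weighted or skipped entirely.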

Going to URL-level PageRanks would obviously help a lot, but it is out of scope for this issue.

sylvinus commented

Sebastian from Common Crawl just did a very interesting first pass on spam in the dumps:
https://gist.github.com/sebastian-nagel/beb244bf1f7092a06a1479335a5e268b

This script detects a few webspam clusters based on domain-name and PageRank similarity.
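
As a rough illustration of that idea (not a reproduction of the linked gist, whose exact method is not described here), the sketch below groups hosts that share both a name pattern and a near-identical PageRank value. The `name_pattern` and `suspicious_clusters` helpers, the digit-collapsing rule, and the `min_size` and `precision` thresholds are all assumptions.

```python
import re
from collections import defaultdict


def name_pattern(host):
    # Collapse digit runs so "shop12.example" and "shop7.example" share a pattern.
    return re.sub(r"\d+", "#", host.lower())


def suspicious_clusters(ranks, min_size=3, precision=6):
    """Group hosts with the same name pattern and the same rounded PageRank.

    ranks: dict mapping host name to PageRank value.
    Returns a list of sorted host lists, one per cluster of at least min_size hosts.
    """
    groups = defaultdict(list)
    for host, rank in ranks.items():
        groups[(name_pattern(host), round(rank, precision))].append(host)
    return [sorted(hosts) for hosts in groups.values() if len(hosts) >= min_size]
```

Many hosts with near-identical names and identical scores are unlikely to arise organically, which is why such groups are worth a manual look. A real detector would need a fuzzier notion of name similarity than collapsing digits.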

sylvinus added a commit that referenced this issue Aug 25, 2016
…to "dataproviders" to avoid confusion with document sources, and other smaller refactors