Add first document-level quality signals #28

Open
sylvinus opened this issue Mar 6, 2016 · 3 comments

sylvinus commented Mar 6, 2016

We will need a model that evaluates many features of a document and produces a document quality score.

Before doing any machine learning, it would be great to explore the first few features/signals we could include.
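
As a rough illustration of what "signals in, score out" could look like before any machine learning is involved, here is a minimal sketch of a hand-weighted baseline. Everything in it is an assumption made for illustration: the `Signal` type, the `quality_score` name, the convention that each signal returns a value in [0, 1] where 1 means "looks low quality", and the weights themselves.

```python
from typing import Callable, Dict, Tuple

# Hypothetical convention: a signal maps raw HTML to a value in [0, 1],
# where 1.0 means "this looks low quality".
Signal = Callable[[str], float]


def quality_score(html: str, signals: Dict[str, Tuple[Signal, float]]) -> float:
    """Combine weighted signals into one score in [0, 1]; higher is better."""
    total_weight = sum(weight for _, weight in signals.values())
    if total_weight <= 0:
        return 1.0  # no signals configured: nothing counts against the page

    penalty = 0.0
    for fn, weight in signals.values():
        value = min(1.0, max(0.0, fn(html)))  # clamp misbehaving signals
        penalty += weight * value
    return 1.0 - penalty / total_weight
```

A learned model would eventually replace the fixed weights, but a baseline like this makes it easy to try out individual signals one at a time.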

Here is a first list of ideas; please add your own!

  • Vocabulary (we should look at email spam filters)
  • Broken HTML (this is rather broad!)
  • Use of tags like <blink> :-)
  • Usage of known JavaScript trackers/libraries (could be good or bad)
  • Specific services like domain parking
  • Usage of ALL CAPS text?
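
A few of the ideas above (tags like `<blink>`, known tracker scripts, ALL CAPS text) can be prototyped with the Python standard library alone. The sketch below is only an illustration: the `DEPRECATED_TAGS` and `TRACKER_HINTS` lists, the `SignalParser` class and the `extract_signals` function are hypothetical names and values, not part of this project.

```python
import re
from html.parser import HTMLParser

# Assumptions for illustration only; real lists would be much longer.
DEPRECATED_TAGS = {"blink", "marquee"}
TRACKER_HINTS = ("google-analytics.com", "googletagmanager.com")


class SignalParser(HTMLParser):
    """Counts a few simple signals while collecting visible text."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.deprecated_tags = 0
        self.tracker_scripts = 0
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in DEPRECATED_TAGS:
            self.deprecated_tags += 1
        if tag == "script":
            src = dict(attrs).get("src") or ""
            if any(hint in src for hint in TRACKER_HINTS):
                self.tracker_scripts += 1
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.text_parts.append(data)


def all_caps_ratio(text: str) -> float:
    """Share of alphabetic words (2+ letters) written entirely in upper case."""
    words = re.findall(r"[A-Za-z]{2,}", text)
    if not words:
        return 0.0
    return sum(1 for w in words if w.isupper()) / len(words)


def extract_signals(html: str) -> dict:
    parser = SignalParser()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.text_parts)
    return {
        "deprecated_tags": parser.deprecated_tags,
        "tracker_scripts": parser.tracker_scripts,
        "all_caps_ratio": all_caps_ratio(text),
    }
```

The word pattern only matches ASCII letters, so the ALL CAPS ratio is meaningless for non-Latin scripts. Vocabulary and domain-parking detection are left out here because they need external data.
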
@sylvinus

#34 may give us ideas for other signals.

IvRRimum commented Jun 1, 2016

404 pages?

indolering commented Sep 7, 2016

  • Responsive design
  • Accessible design
  • JS modals
  • Number of redirects required to get to page
  • Page load time
  • Total number of requests to load page
  • URL appearing in sitemap file
  • Parse URL, subdomains, and query parameters into arrays?
  • Domain
  • TLD (normalized using public suffix list)
  • Authority ranking of outbound links
  • Broken outbound links
  • Use of HTTPS
  • Use of microdata formatting?
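
The URL-based items in this list (parsing the URL into arrays, use of HTTPS) can be sketched with `urllib.parse` from the Python standard library. This is only an illustration: `url_features` and its output keys are hypothetical names. The "TLD" here is naively the last host label; proper normalization against the public suffix list (for example, treating `co.uk` as one suffix) needs the list's data and is not attempted.

```python
from urllib.parse import urlsplit, parse_qsl


def url_features(url: str) -> dict:
    """Split a URL into simple, array-like features."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    labels = host.split(".") if host else []
    return {
        "https": parts.scheme == "https",
        # Host labels from left to right, e.g. ["blog", "example", "com"].
        "host_labels": labels,
        # Naive: last label only, NOT public-suffix-list aware.
        "naive_tld": labels[-1] if labels else "",
        "path_segments": [s for s in parts.path.split("/") if s],
        # List of (name, value) pairs; blank values are kept.
        "query_params": parse_qsl(parts.query, keep_blank_values=True),
    }
```

Signals such as redirect count, page load time and request count would have to be recorded by the crawler at fetch time, so they are not shown here.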
