Classification #22

Open
jogli5er opened this issue May 31, 2018 · 0 comments

Features:

  - binary: Set of words (vectorized)
  - binary + weighting: binary vector multiplied with weights
  - frequency: Bag of words (vectorized)
  - frequency + weight: some function, e.g. log_2(freq_in_body) + 10*log_2(freq_in_header)
  - possible weighting schemes:
      - Word is in title ( tags)
      - Word is contained in body
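
A minimal sketch of the four feature variants listed above, for a single document. Everything named here is an assumption for illustration: the function `build_features`, the parameter `title_weight`, and the choice to treat "header" as the title tokens. The sketch also uses log2(1 + freq) instead of the plain log_2(freq) from the formula, because log_2 is undefined for words that do not occur and is zero for words that occur once.

```python
import math
from collections import Counter


def build_features(title_tokens, body_tokens, vocabulary, title_weight=10.0):
    """Return the four candidate feature vectors for one document.

    `vocabulary` fixes the order of the vector components.
    `title_weight` is a hypothetical knob mirroring the factor 10 in the issue.
    """
    title_freq = Counter(title_tokens)
    body_freq = Counter(body_tokens)

    binary, binary_weighted, frequency, frequency_weighted = [], [], [], []
    for word in vocabulary:
        in_title = title_freq[word] > 0
        present = 1 if (in_title or body_freq[word] > 0) else 0

        # binary: set of words, 1 if the word occurs anywhere
        binary.append(present)

        # binary + weighting: presence multiplied by a position weight
        # (assumption: title occurrences get title_weight, body-only gets 1.0)
        weight = title_weight if in_title else 1.0
        binary_weighted.append(present * weight)

        # frequency: bag of words, total number of occurrences
        frequency.append(title_freq[word] + body_freq[word])

        # frequency + weight: log2(1 + f_body) + title_weight * log2(1 + f_title)
        frequency_weighted.append(
            math.log2(1 + body_freq[word])
            + title_weight * math.log2(1 + title_freq[word])
        )

    return {
        "binary": binary,
        "binary_weighted": binary_weighted,
        "frequency": frequency,
        "frequency_weighted": frequency_weighted,
    }


features = build_features(
    title_tokens=["cat"],
    body_tokens=["cat", "dog", "dog"],
    vocabulary=["cat", "dog", "fish"],
)
```

Keeping all four vectors in one dictionary makes it easy to train on each variant in turn and compare the results.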

Approaches:

  1. Unsupervised: cluster the pages, then name the clusters manually by looking at the centre and extreme points of each one. We can also vary the number of clusters we ask for, and see what is found when the number of clusters is not limited.
  2. Supervised: we label 100 URLs manually ourselves, then have about 1'000 - 5'000 labelled externally by hand. After that, we can at least train on this labelled set and try to predict the labels of the remaining URLs.
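
The unsupervised variant could be sketched roughly as follows. The issue does not name a clustering algorithm, so plain k-means is an assumption here, as are the helper names `kmeans` and `centre_and_extreme`. The start centroids are simply the first `k` points to keep the example deterministic; a real run would more likely use random or smarter initialisation.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    return [sum(column) / len(points) for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Plain k-means; returns (centroids, cluster index per point)."""
    # Assumption: deterministic start from the first k points.
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignment


def centre_and_extreme(points, centroids, assignment):
    """For each cluster, pick the point nearest to and farthest from the centroid.

    These are the indices one would inspect by hand to name the cluster.
    """
    summary = {}
    for c, centroid in enumerate(centroids):
        members = [i for i, a in enumerate(assignment) if a == c]
        if not members:
            continue
        ordered = sorted(members, key=lambda i: distance(points[i], centroid))
        summary[c] = {"centre": ordered[0], "extreme": ordered[-1]}
    return summary


points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [14, 14]]
centroids, assignment = kmeans(points, k=2)
summary = centre_and_extreme(points, centroids, assignment)
```

Running `kmeans` for several values of `k` and comparing the centre and extreme points of each result is one way to "play around" with the number of clusters.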

Process:

  1. Detect language
  2. Remove stop words
  3. Depending on the language may use stemming or other reduction schemes
  4. Create sets and bags of words (weighted), on which one should learn
  5. Randomly select URLs to be manually labelled (for supervised only)
  6. Run analysis on the dataset
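
Steps 1 to 5 above can be sketched as a small pipeline. All names are hypothetical, and every component is a deliberately crude stand-in: the stop-word lists are tiny samples, language detection just counts stop-word hits, and `strip_suffix` only approximates stemming for English. A real implementation would swap in proper libraries for each step.

```python
import random
from collections import Counter

# Tiny illustrative stop-word lists (assumption); real lists are much longer.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "a", "to", "in"},
    "de": {"der", "die", "das", "und", "ist", "ein", "zu"},
}


def detect_language(tokens):
    """Step 1: guess the language as the one whose stop words occur most."""
    return max(
        sorted(STOP_WORDS),
        key=lambda lang: sum(t in STOP_WORDS[lang] for t in tokens),
    )


def strip_suffix(token, lang):
    """Step 3: crude stand-in for stemming (English only in this sketch)."""
    if lang == "en":
        for suffix in ("ing", "ed", "s"):
            if token.endswith(suffix) and len(token) > len(suffix) + 2:
                return token[: -len(suffix)]
    return token


def preprocess(text):
    """Steps 1-4: detect language, drop stop words, reduce, build set and bag."""
    tokens = text.lower().split()  # whitespace split; punctuation is not handled
    lang = detect_language(tokens)
    kept = [strip_suffix(t, lang) for t in tokens if t not in STOP_WORDS[lang]]
    return {"language": lang, "set": set(kept), "bag": Counter(kept)}


def sample_for_labelling(urls, n, seed=0):
    """Step 5: reproducibly pick up to n URLs for manual labelling."""
    rng = random.Random(seed)
    return rng.sample(urls, min(n, len(urls)))


doc = preprocess("the cats and the dog is barking in the garden")
urls = ["http://a.example", "http://b.example", "http://c.example"]
to_label = sample_for_labelling(urls, 2)
```

Fixing the seed in `sample_for_labelling` means the same URLs are selected on every run, which keeps the manually labelled set stable while the rest of the pipeline changes.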
@jogli5er added the enhancement (New feature or request) label May 31, 2018
@jogli5er self-assigned this May 31, 2018