Classification #22

jogli5er · 2018-05-31T03:45:13Z

Features:
binary: Set of words (vectorized)
binary + weighting: binary vector multiplied with weights
frequency: Bag of words (vectorized)
frequency + weight: some function, e.g. log_2(freq_in_body) + 10*log_2(freq_in_header)
possible weighting schemes:
Word is in title ( tags)
Word is contained in body
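
A minimal sketch of the feature variants listed above, using only the standard library. The function names and the `header_weight` parameter are illustrative assumptions, not existing code. The weighted variant uses log_2(1 + freq) rather than log_2(freq), which is also an assumption, so that a word missing from the body or header contributes 0 instead of being undefined.

```python
import math
from collections import Counter


def binary_features(tokens, vocabulary):
    """Set of words: 1 if the word occurs at least once, else 0."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def weighted_binary_features(tokens, vocabulary, weights):
    """Binary vector multiplied element-wise with per-word weights."""
    return [b * w for b, w in zip(binary_features(tokens, vocabulary), weights)]


def frequency_features(tokens, vocabulary):
    """Bag of words: raw occurrence count per vocabulary word."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]


def weighted_frequency_features(body_tokens, header_tokens, vocabulary,
                                header_weight=10.0):
    """log2(1 + body count) + header_weight * log2(1 + header count).

    Assumption: the +1 inside the logarithm is added here so that a
    count of zero maps to 0 instead of log2(0).
    """
    body = Counter(body_tokens)
    header = Counter(header_tokens)
    return [
        math.log2(1 + body[word]) + header_weight * math.log2(1 + header[word])
        for word in vocabulary
    ]


# Hypothetical example data
vocabulary = ["market", "forum", "bitcoin"]
body = ["market", "bitcoin", "bitcoin", "bitcoin"]
header = ["market"]
print(binary_features(body, vocabulary))
print(frequency_features(body, vocabulary))
print(weighted_frequency_features(body, header, vocabulary))
```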

Unsupervised: clustering. We name the clusters manually afterwards by looking at the centre and the extreme points of each cluster. Further, we can vary the number of clusters we ask for, and also see what is found if we do not limit the number of clusters.
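
To illustrate the unsupervised idea, here is a rough sketch that assumes plain k-means on the feature vectors (the issue does not fix a clustering algorithm). It returns, for each cluster, the point closest to the centre and the point farthest from it, which are the ones we would inspect to name the cluster. All function names are hypothetical, and the initialisation (first k points) is deliberately naive.

```python
import math


def distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(column) / n for column in zip(*points)]


def kmeans(points, k, iterations=20):
    """Plain k-means; the first k points serve as initial centroids."""
    centroids = [list(p) for p in points[:k]]
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = mean(members)
    return centroids, assignment


def representatives(points, centroids, assignment, cluster):
    """Indices of the most central and the most extreme point of a cluster."""
    members = [i for i, a in enumerate(assignment) if a == cluster]
    ranked = sorted(members, key=lambda i: distance(points[i], centroids[cluster]))
    return ranked[0], ranked[-1]


# Hypothetical 2-dimensional feature vectors
points = [[0, 0], [10, 10], [0, 1], [10, 11], [1, 0], [13, 10]]
centroids, assignment = kmeans(points, k=2)
centre, extreme = representatives(points, centroids, assignment, cluster=1)
```

Running `kmeans` with several values of `k` and comparing the representatives is one way to "play around" with the number of clusters.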
Supervised: We label 100 URLs manually ourselves, then have about 1,000 to 5,000 labelled by hand externally, and let the rest be labelled externally. After that, we can at least train on this labelled set and try to predict the labels of the remaining URLs.
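
As a sketch of the supervised path, the following assumes a multinomial Naive Bayes classifier with add-one smoothing on bags of words. The choice of classifier is an assumption made here for illustration; the issue does not prescribe one. The category names `market` and `forum` and the training data are made up.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labelled_docs):
    """labelled_docs: list of (tokens, label) pairs from the labelled set."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocabulary = set()
    for tokens, label in labelled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_counts, word_counts, vocabulary


def predict(model, tokens):
    """Return the label with the highest log-probability for the tokens."""
    class_counts, word_counts, vocabulary = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for token in tokens:
            if token in vocabulary:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][token] + 1)
                    / (total_words + len(vocabulary))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical manually labelled pages
labelled = [
    (["bitcoin", "market", "buy"], "market"),
    (["forum", "post", "reply"], "forum"),
    (["market", "sell", "card"], "market"),
    (["forum", "thread", "post"], "forum"),
]
model = train_naive_bayes(labelled)
print(predict(model, ["buy", "market"]))
print(predict(model, ["post", "thread"]))
```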

Process:

Detect language
Remove stop words
Depending on the language, apply stemming or another reduction scheme
Create the (weighted) sets and bags of words to learn from
Randomly select URLs to be manually labelled (for supervised only)
Run analysis on the dataset
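
The process above could be sketched roughly as follows. Everything here is a simplified stand-in and an assumption: the stop-word lists are tiny, language detection is a naive stop-word overlap heuristic, and `naive_stem` is a crude suffix stripper rather than a real stemmer. A real run would swap in proper tools for each step.

```python
import random
import re
from collections import Counter

# Tiny illustrative stop-word lists; real lists would be much longer.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "a", "in"},
    "de": {"der", "die", "das", "und", "ist", "zu", "ein"},
}


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-zäöüß]+", text.lower())


def detect_language(tokens):
    """Guess the language with the most stop-word hits (naive heuristic)."""
    return max(
        STOP_WORDS,
        key=lambda lang: sum(t in STOP_WORDS[lang] for t in tokens),
    )


def naive_stem(word):
    """Strip a few English suffixes; a stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Return (language, set of words, bag of words) for one page."""
    tokens = tokenize(text)
    lang = detect_language(tokens)
    kept = [t for t in tokens if t not in STOP_WORDS[lang]]
    if lang == "en":  # only the English toy stemmer exists in this sketch
        kept = [naive_stem(t) for t in kept]
    return lang, set(kept), Counter(kept)


def sample_for_labelling(urls, n, seed=0):
    """Randomly pick up to n URLs for manual labelling (supervised only)."""
    return random.Random(seed).sample(urls, min(n, len(urls)))


lang, word_set, word_bag = preprocess("The market is selling the cards and the cards")
```

The resulting set and bag can then be vectorized against a shared vocabulary before running the analysis.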

jogli5er added the enhancement (New feature or request) label May 31, 2018

jogli5er self-assigned this May 31, 2018
