WIP: Classification based on subject and preview text #8257
Conversation
I ran the meta estimator 100 times on the same data set, extracted all parameters of all individual estimators, and did some number crunching with Python and plotting with gnuplot. Parameters of the best configuration for my personal mails (considering variance and f1 score):
The data shows a significant improvement in f1 score with acceptable variance; the lowest recorded f1 score is still higher than the vanilla f1 score.

(Table: f1 scores for the personal and work accounts, old vs. new, comparing the vanilla extractor and vanilla estimator (GaussianNB) against the new extractor and new estimator (KNN), plus the variance across 100 experiments.)
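The evaluation described above (100 runs on the same data set, tracking the f1 score and its variance) can be sketched roughly like this. This is an illustrative Python/scikit-learn stand-in, not the PR's actual Rubix ML (PHP) code; the synthetic data and the classifier choice are assumptions:

```python
import statistics

from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the mail feature vectors (assumption).
X, y = make_classification(n_samples=400, random_state=0)

scores = []
for seed in range(100):  # 100 experiments on the same data set
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=seed)
    model = KNeighborsClassifier().fit(X_tr, y_tr)  # "new" estimator (KNN)
    scores.append(f1_score(y_te, model.predict(X_te)))

print("mean f1:", statistics.mean(scores))
print("variance:", statistics.variance(scores))
print("worst run:", min(scores))
```

Comparing the worst run (`min(scores)`) against the vanilla baseline is what backs a claim like "the lowest recorded f1 score is still higher than the vanilla f1 score".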
Force-pushed from 4aba737 to 905ea6d.
Current finding: the input data looks skewed. The dimensional reduction produces tiny numbers; normalization brings them back into a reasonable range, but the values have almost no variance across the feature vectors of all messages. Word count vectorization works: there are lots of 0s and occasional 1s, since it is rare for a word to occur more than once in a single subject. Dimensional reduction will still make sense. Applying https://docs.rubixml.com/2.0/transformers/tf-idf-transformer.html after the WCV would be worth a try.
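The observation above can be reproduced in a few lines. This is a Python/scikit-learn sketch of the same idea, not the PR's Rubix ML (PHP) pipeline, and the subject lines are made up:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Made-up subject lines (assumption); real input would be mail subjects.
subjects = [
    "Your invoice for March",
    "Team meeting moved to Friday",
    "Invoice reminder: second invoice attached",
]

# Word count vectorization: mostly 0s and occasional 1s, because a word
# rarely occurs more than once in a single short subject.
counts = CountVectorizer().fit_transform(subjects)
print(counts.toarray())

# TF-IDF applied after the WCV reweights terms by how rare they are
# across messages, which adds back variance that plain counts lack.
tfidf = TfidfTransformer().fit_transform(counts)
print(tfidf.toarray().round(2))
```

With the default settings, `TfidfTransformer` also L2-normalizes each feature vector, which addresses the range problem mentioned above.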
Force-pushed from f8eb1bf to a14e720.
Signed-off-by: Christoph Wurst <christoph@winzerhof-wurst.at>
Force-pushed from a14e720 to 909d31f.
Supersedes #7918