
WIP: Classification based on subject and preview text #8257

Closed
wants to merge 26 commits

Conversation

@st3iny st3iny commented Mar 21, 2023

Supersedes #7918


st3iny commented Mar 21, 2023

I ran the meta estimator 100 times on the same data set, extracted the parameters of every individual estimator, and did some number crunching with Python and plotting with gnuplot.

Parameters of the best configuration for my personal mails (considering variance and F1 score):

k = 15, weighted = true, kernel = Manhattan

The data shows a significant improvement in the F1 score with acceptable variance: even the lowest recorded F1 score is higher than the vanilla F1 score.
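For readers unfamiliar with the configuration above, a weighted k-nearest-neighbors classifier with a Manhattan kernel can be sketched in a few lines of Python. This is an illustration of the algorithm only, not the Rubix ML implementation the PR actually uses (which is PHP):

```python
from collections import Counter

def manhattan(a, b):
    # L1 ("Manhattan") distance between two feature vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, labels, query, k=15):
    """Weighted k-NN vote: closer neighbours count more (1/distance)."""
    dists = sorted(
        (manhattan(x, query), label) for x, label in zip(train, labels)
    )
    votes = Counter()
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + 1e-9)  # inverse-distance weighting
    return votes.most_common(1)[0][0]

train = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["other", "other", "important", "important"]
print(knn_predict(train, labels, [5, 6], k=3))  # → important
```

The inverse-distance weighting is what "weighted = true" refers to: a single far-away neighbour cannot outvote two nearby ones.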

Vanilla extractor and vanilla estimator (GaussianNB)

$ occ mail:account:train -vvv 4
[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 141 are important
[debug] data set split into 280 (i: 112) training and 70 (i: 29) validation sets with 4 dimensions
[debug] classification report: {"recall":0.3448275862068966,"precision":1,"f1Score":0.5128205128205129}
[debug] classifier validated: recall(important)=0.3448275862069, precision(important)=1 f1(important)=0.51282051282051
[debug] classifier 71 persisted
42MB of memory used

New extractor and new estimator (KNN)

$ occ mail:account:train -vvv --new 4
[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 141 are important
[debug] data set split into 280 (i: 112) training and 70 (i: 29) validation sets with 14 dimensions
[debug] classification report: {"recall":1,"precision":0.7631578947368421,"f1Score":0.8656716417910448}
[debug] classifier validated: recall(important)=1, precision(important)=0.76315789473684 f1(important)=0.86567164179104
[debug] classifier 73 persisted
82MB of memory used
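The F1 scores in these logs are the harmonic mean of precision and recall, so the reported values can be reproduced directly from the other two numbers:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# values from the two runs above
vanilla = f1_score(1.0, 0.3448275862068966)   # ≈ 0.5128 (GaussianNB)
knn = f1_score(0.7631578947368421, 1.0)       # ≈ 0.8657 (KNN)
```

Note the trade-off visible in the logs: the vanilla run has perfect precision but poor recall, while the KNN run trades a little precision for perfect recall and a much higher F1.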

Variance across 100 experiments

[Plot: F1-score variance across 100 experiments]


ChristophWurst commented Mar 23, 2023

Personal account

old

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 174 are important
[debug] data set split into 280 (i: 121) training and 70 (i: 53) validation sets with 4 dimensions
[debug] classification report: {"recall":0.49056603773584906,"precision":0.96296296296296291,"f1Score":0.65000000000000002}
[debug] classifier validated: recall(important)=0.49056603773585, precision(important)=0.96296296296296 f1(important)=0.65
[debug] classifier 3255 persisted

new

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 174 are important
[debug] data set split into 280 (i: 121) training and 70 (i: 53) validation sets with 14 dimensions
[debug] classification report: {"recall":0.81132075471698117,"precision":0.87755102040816324,"f1Score":0.84313725490196079}
[debug] classifier validated: recall(important)=0.81132075471698, precision(important)=0.87755102040816 f1(important)=0.84313725490196
[debug] classifier 3256 persisted

Work account

old

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 187 are important
[debug] data set split into 280 (i: 154) training and 70 (i: 33) validation sets with 4 dimensions
[debug] classification report: {"recall":1,"precision":1,"f1Score":1}
[debug] classifier validated: recall(important)=1, precision(important)=1 f1(important)=1
[debug] classifier 3258 persisted

new

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 187 are important
[debug] data set split into 280 (i: 154) training and 70 (i: 33) validation sets with 14 dimensions
[debug] classification report: {"recall":0.87878787878787878,"precision":1,"f1Score":0.93548387096774188}
[debug] classifier validated: recall(important)=0.87878787878788, precision(important)=1 f1(important)=0.93548387096774
[debug] classifier 3259 persisted

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-IV branch from 4aba737 to 905ea6d Compare March 24, 2023 13:33
@ChristophWurst

Performance analysis

[Two screenshots from 2023-03-28: performance profiling results]


ChristophWurst commented Mar 29, 2023

The current finding is that the input data looks skewed. The dimensionality reduction produces tiny numbers. Normalization brings them back into a reasonable range, but the values then have almost no variance across the feature vectors of all messages.

[Screenshot from 2023-03-28: feature vector values after normalization]
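One simple way to quantify the "almost no variance" observation is to compute the per-feature variance across all feature vectors: columns that are nearly constant carry almost no signal for the classifier. A sketch with toy numbers (not the actual PR data):

```python
from statistics import pvariance

def feature_variances(vectors):
    """Population variance of each feature column across all vectors."""
    return [pvariance(col) for col in zip(*vectors)]

# toy example: after normalization the columns barely differ between messages
normalized = [[0.501, 0.499], [0.502, 0.498], [0.500, 0.500]]
print(feature_variances(normalized))  # every column is near-constant
```

If every entry of that list is close to zero, the normalized features cannot separate important from unimportant messages, which matches the finding above.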

Word-count vectorization works: there are lots of 0s and occasional 1s, since it is rare for a word to occur more than once in a single subject. Dimensionality reduction still makes sense. Applying https://docs.rubixml.com/2.0/transformers/tf-idf-transformer.html after the WCV would be worth a try.

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-IV branch from f8eb1bf to a14e720 Compare May 15, 2023 14:38
st3iny and others added 21 commits May 17, 2023 12:20
Signed-off-by: Christoph Wurst <christoph@winzerhof-wurst.at>
@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-IV branch from a14e720 to 909d31f Compare May 17, 2023 10:36
@ChristophWurst ChristophWurst mentioned this pull request Nov 29, 2023
@st3iny st3iny deleted the enh/noid/classification-based-on-subject-IV branch December 19, 2024 07:56