
WIP: Classification based on subject and preview text #8257

Closed
wants to merge 26 commits

Conversation

@st3iny st3iny commented Mar 21, 2023

Supersedes #7918


st3iny commented Mar 21, 2023

I ran the meta estimator 100 times on the same data set, extracted the parameters of every individual estimator, and did some number crunching with Python and plotting with gnuplot.

Parameters of the best configuration for my personal mails (considering variance and F1 score):

k = 15, weighted = true, kernel = Manhattan

The data shows a significant improvement in the F1 score with acceptable variance: even the lowest recorded F1 score is higher than the vanilla F1 score.
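For readers unfamiliar with the configuration above, a weighted k-nearest-neighbors classifier with a Manhattan kernel can be sketched in a few lines of Python. This is an illustration of the algorithm only, not the Rubix ML implementation the PR actually uses (which is PHP):

```python
from collections import Counter

def manhattan(a, b):
    # L1 ("Manhattan") distance between two feature vectors
    return sum(abs(x - y) for x, y in zip(a, b))

def knn_predict(train, labels, query, k=15):
    """Weighted k-NN vote: closer neighbours count more (1/distance)."""
    dists = sorted(
        (manhattan(x, query), label) for x, label in zip(train, labels)
    )
    votes = Counter()
    for d, label in dists[:k]:
        votes[label] += 1.0 / (d + 1e-9)  # inverse-distance weighting
    return votes.most_common(1)[0][0]

train = [[0, 0], [0, 1], [5, 5], [6, 5]]
labels = ["other", "other", "important", "important"]
print(knn_predict(train, labels, [5, 6], k=3))  # → important
```

The inverse-distance weighting is what "weighted = true" refers to: a single far-away neighbour cannot outvote two nearby ones.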

Vanilla extractor and vanilla estimator (GaussianNB)

$ occ mail:account:train -vvv 4
[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 141 are important
[debug] data set split into 280 (i: 112) training and 70 (i: 29) validation sets with 4 dimensions
[debug] classification report: {"recall":0.3448275862068966,"precision":1,"f1Score":0.5128205128205129}
[debug] classifier validated: recall(important)=0.3448275862069, precision(important)=1 f1(important)=0.51282051282051
[debug] classifier 71 persisted
42MB of memory used

New extractor and new estimator (KNN)

$ occ mail:account:train -vvv --new 4
[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 141 are important
[debug] data set split into 280 (i: 112) training and 70 (i: 29) validation sets with 14 dimensions
[debug] classification report: {"recall":1,"precision":0.7631578947368421,"f1Score":0.8656716417910448}
[debug] classifier validated: recall(important)=1, precision(important)=0.76315789473684 f1(important)=0.86567164179104
[debug] classifier 73 persisted
82MB of memory used
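The F1 scores in these logs are the harmonic mean of precision and recall, so the reported values can be reproduced directly from the other two numbers:

```python
def f1_score(precision, recall):
    # harmonic mean of precision and recall
    return 2 * precision * recall / (precision + recall)

# values from the two runs above
vanilla = f1_score(1.0, 0.3448275862068966)   # ≈ 0.5128 (GaussianNB)
knn = f1_score(0.7631578947368421, 1.0)       # ≈ 0.8657 (KNN)
```

Note the trade-off visible in the logs: the vanilla run has perfect precision but poor recall, while the KNN run trades a little precision for perfect recall and a much higher F1.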

Variance across 100 experiments

[Plot: F1-score variance across 100 experiments]


ChristophWurst commented Mar 23, 2023

Personal account

old

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 174 are important
[debug] data set split into 280 (i: 121) training and 70 (i: 53) validation sets with 4 dimensions
[debug] classification report: {"recall":0.49056603773584906,"precision":0.96296296296296291,"f1Score":0.65000000000000002}
[debug] classifier validated: recall(important)=0.49056603773585, precision(important)=0.96296296296296 f1(important)=0.65
[debug] classifier 3255 persisted

new

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 174 are important
[debug] data set split into 280 (i: 121) training and 70 (i: 53) validation sets with 14 dimensions
[debug] classification report: {"recall":0.81132075471698117,"precision":0.87755102040816324,"f1Score":0.84313725490196079}
[debug] classifier validated: recall(important)=0.81132075471698, precision(important)=0.87755102040816 f1(important)=0.84313725490196
[debug] classifier 3256 persisted

Work account

old

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 187 are important
[debug] data set split into 280 (i: 154) training and 70 (i: 33) validation sets with 4 dimensions
[debug] classification report: {"recall":1,"precision":1,"f1Score":1}
[debug] classifier validated: recall(important)=1, precision(important)=1 f1(important)=1
[debug] classifier 3258 persisted

new

[debug] found 1 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 350 messages of which 187 are important
[debug] data set split into 280 (i: 154) training and 70 (i: 33) validation sets with 14 dimensions
[debug] classification report: {"recall":0.87878787878787878,"precision":1,"f1Score":0.93548387096774188}
[debug] classifier validated: recall(important)=0.87878787878788, precision(important)=1 f1(important)=0.93548387096774
[debug] classifier 3259 persisted

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-IV branch from 4aba737 to 905ea6d Compare March 24, 2023 13:33
@ChristophWurst

Performance analysis

[Two screenshots from 2023-03-28: performance profiling results]


ChristophWurst commented Mar 29, 2023

The current finding is that the input data looks skewed. The dimensionality reduction produces tiny numbers. Normalization brings them back into a reasonable range, but the values then have almost no variance across the feature vectors of all messages.

[Screenshot from 2023-03-28: feature vector values after normalization]
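One simple way to quantify the "almost no variance" observation is to compute the per-feature variance across all feature vectors: columns that are nearly constant carry almost no signal for the classifier. A sketch with toy numbers (not the actual PR data):

```python
from statistics import pvariance

def feature_variances(vectors):
    """Population variance of each feature column across all vectors."""
    return [pvariance(col) for col in zip(*vectors)]

# toy example: after normalization the columns barely differ between messages
normalized = [[0.501, 0.499], [0.502, 0.498], [0.500, 0.500]]
print(feature_variances(normalized))  # every column is near-constant
```

If every entry of that list is close to zero, the normalized features cannot separate important from unimportant messages, which matches the finding above.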

Word-count vectorization works: there are lots of 0s and occasional 1s, since it is rare for a word to occur more than once in a single subject. Dimensionality reduction still makes sense. Applying https://docs.rubixml.com/2.0/transformers/tf-idf-transformer.html after the WCV would be worth a try.

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-IV branch from f8eb1bf to a14e720 Compare May 15, 2023 14:38
st3iny and others added 21 commits May 17, 2023 12:20
Signed-off-by: Christoph Wurst <christoph@winzerhof-wurst.at>
@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-IV branch from a14e720 to 909d31f Compare May 17, 2023 10:36
@ChristophWurst ChristophWurst mentioned this pull request Nov 29, 2023
@st3iny st3iny deleted the enh/noid/classification-based-on-subject-IV branch December 19, 2024 07:56