Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: classify emails by importance based on subjects #10277

Merged
merged 1 commit into from
Dec 19, 2024

Conversation

st3iny
Copy link
Member

@st3iny st3iny commented Oct 17, 2024

Partly addresses #3968

Closes #8257

Part 5: The great rebasing ...

Improvements

  • Classify emails by importance using a new and improved classification pipeline
  • Persist classifier models (and transformer pipelines) in memory cache only
  • Drop the oc_mail_classifiers table

How to tests?

  1. Have some important mails in your inbox.
  2. Make sure to have the classification enabled: Account settings modal -> "Data collection consent" ✅ and "Mark as important" ✅
  3. Train a model: occ mail:account:train -vvv <account-id> (extract account id with occ mail:account:export <user-id>)
  4. Receive some mails and observe that some are marked as important and some aren't.

@st3iny

This comment was marked as resolved.

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-V branch from ebbed72 to 21d45eb Compare October 22, 2024 11:53
@st3iny st3iny added 3. to review feature:priority inbox Features and bugs related to the "priority inbox" feature and removed 2. developing labels Oct 22, 2024
@st3iny st3iny marked this pull request as ready for review October 22, 2024 17:47
Copy link
Member

@ChristophWurst ChristophWurst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First impression: looks good 😎

lib/Command/PreprocessAccount.php Outdated Show resolved Hide resolved
lib/Db/StatisticsDao.php Outdated Show resolved Hide resolved
lib/Db/StatisticsDao.php Outdated Show resolved Hide resolved
@ChristophWurst
Copy link
Member

ChristophWurst commented Nov 13, 2024

I'll observe my account for a while. So far the stats are (suspiciously) excellent:

[debug] found 36 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 300 messages of which 42 are important
WCF vocab size: 500
[debug] data set split into 240 (i: 33) training and 60 (i: 9) validation sets with 504 dimensions
[debug] classification report: {"recall":1,"precision":1,"f1Score":1}
[debug] classifier validated: recall(important)=1, precision(important)=1 f1(important)=1

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-V branch from a7ea9c0 to f4fc5bc Compare November 15, 2024 06:39
@ChristophWurst
Copy link
Member

image

I'm still not sure how I can judge the quality of the new classification. Is it overfitting?

@kesselb
Copy link
Contributor

kesselb commented Nov 15, 2024

[debug] found 7 incoming mailbox(es)
[debug] found 1 outgoing mailbox(es)
[debug] found 300 messages of which 241 are important
[debug] data set split into 240 (i: 195) training and 60 (i: 46) validation sets with 504 dimensions
[debug] classification report: {"recall":1,"precision":0.8518518518518519,"f1Score":0.92}
[debug] classifier validated: recall(important)=1, precision(important)=0.85185185185185 f1(important)=0.92
[debug] Classifier for account 1 persisted

I will keep an eye on the classification ;)

However, I haven't cleaned my inbox in a while and fear the countless undeleted "update xyz announcements" or "calendar notifications" might influence the model 🙈

@kesselb
Copy link
Contributor

kesselb commented Nov 15, 2024

  1. Would adding an option to exclude or include specific mailboxes from the training improve the results? For instance, my Archive folder holds about 160 important messages, while my "Test Data" folder has around 5,000 emails that I keep for testing purposes. These test emails aren’t relevant at all, but I move them there instead of deleting them to maintain a large mailbox for tests.

  2. While reviewing the code, I noticed that starring a message in Thunderbird sets the flag_flagged attribute. Should we consider this attribute when identifying important messages?

$importantMessages = array_filter($messages, static function (Message $message) {
return ($message->getFlagImportant() === true);
});

@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-V branch from f4fc5bc to 9371b00 Compare December 18, 2024 18:01
Signed-off-by: Richard Steinmetz <richard@steinmetz.cloud>
@st3iny st3iny force-pushed the enh/noid/classification-based-on-subject-V branch from 9371b00 to 1907ebc Compare December 18, 2024 18:12
@st3iny
Copy link
Member Author

st3iny commented Dec 18, 2024

  1. Would adding an option to exclude or include specific mailboxes from the training improve the results? For instance, my Archive folder holds about 160 important messages, while my "Test Data" folder has around 5,000 emails that I keep for testing purposes. These test emails aren’t relevant at all, but I move them there instead of deleting them to maintain a large mailbox for tests.

This might be a good follow-up. Currently, all incoming mailboxes, that have no special use (trash etc.) are considered. I think the reasoning goes that it should also work for users who archive a lot of emails because their inboxes would be (almost) empty most of the time.

  1. While reviewing the code, I noticed that starring a message in Thunderbird sets the flag_flagged attribute. Should we consider this attribute when identifying important messages?

Currently, only the pseudo-standardized important tag is considered. In Thunderbird it is a bit akward to set it. Right click an email -> Tag -> Click on the first one (by default: red, "Important").

The starred flag is actually a standardized IMAP keyword (\\Flagged) so the logic is a bit different. We only consider the important tag for now.

@ChristophWurst ChristophWurst merged commit 6008afb into main Dec 19, 2024
33 of 35 checks passed
@ChristophWurst ChristophWurst deleted the enh/noid/classification-based-on-subject-V branch December 19, 2024 07:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3. to review enhancement feature:priority inbox Features and bugs related to the "priority inbox" feature
Projects
Status: ☑️ Done
Development

Successfully merging this pull request may close these issues.

3 participants