PredText

424 Final project

Collaborators: Cao Qinxiang (CQ), Samuel Gichohi (SG), Nicole Loncke (NL)

Sunday April 27

Accomplished:

established outline of EM model
basic formatting of emails

Tasks:

get database working with JSON dumps (SG + NL)
- return all words with given prefix
- return bag of words in each email (w/ frequency count)
- return bag of all correspondents
- return all messages between (person1, person2)
- {word: frequency} dictionary for Google dataset
begin coding EM for training our model (CQ)
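
Two of the database queries listed above (words with a given prefix, and a bag of words with frequency counts) can be sketched in memory as follows. This is only an illustration: the function names `words_with_prefix` and `bag_of_words` and the sample email are hypothetical, and the real project reads from the database rather than from a Python list.

```python
from collections import Counter


def words_with_prefix(words, prefix):
    """Return all distinct words that start with the given prefix, sorted."""
    return sorted(w for w in set(words) if w.startswith(prefix))


def bag_of_words(email_words):
    """Return a {word: frequency} dictionary for one email."""
    return dict(Counter(email_words))


# Hypothetical sample email, already split into lowercase words.
email = ["please", "send", "the", "plan", "and", "the", "price"]
print(words_with_prefix(email, "pl"))  # ['plan', 'please']
print(bag_of_words(email)["the"])      # 2
```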

NEXT MEETING: 4/30, 7PM Sherrerd Hall!

Wednesday April 30

Interface required by Qinxiang:

getMsg(p1, p2): p1 and p2 are strings giving the two persons' names. Returns a list of lists of strings, where each inner list represents one email and each string is one word. All words should be lowercase.

getSender(): returns a list of strings. Each string is a person's name.

getReceiver(sender): returns a list of strings. Each string is the name of a person to whom sender has sent emails.

getWordList(): returns a list of strings. Each string is a word that appears in the Enron data. All words should be lowercase.

getGoogleData(word_chain): returns an integer. If word_chain is "A,B,C" and the output is 10, then "A B C" appears 10 times in the Google data. The input is guaranteed to appear in the Enron data.
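
To make the return types concrete, here is a toy in-memory stand-in for this interface. It is a sketch under assumptions, not the project's database layer: the class `InMemoryCorpus` and its `add_email` and `add_ngram` helpers are hypothetical, `getMsg(p1, p2)` is assumed to mean emails sent from p1 to p2, and `getGoogleData` is assumed to return 0 for an unknown chain.

```python
from collections import defaultdict


class InMemoryCorpus:
    """Toy stand-in for the database layer (hypothetical, for illustration)."""

    def __init__(self):
        # (sender, receiver) -> list of emails, each a list of lowercase words
        self._messages = defaultdict(list)
        # "a,b,c" -> number of occurrences in the n-gram data
        self._ngram_counts = {}

    def add_email(self, sender, receiver, text):
        self._messages[(sender, receiver)].append(text.lower().split())

    def add_ngram(self, word_chain, count):
        self._ngram_counts[word_chain] = count

    def getMsg(self, p1, p2):
        # Assumption: emails sent from p1 to p2.
        return list(self._messages.get((p1, p2), []))

    def getSender(self):
        return sorted({s for s, _ in self._messages})

    def getReceiver(self, sender):
        return sorted({r for s, r in self._messages if s == sender})

    def getWordList(self):
        return sorted({w for emails in self._messages.values()
                       for email in emails for w in email})

    def getGoogleData(self, word_chain):
        # Assumption: an unknown chain counts as 0 occurrences.
        return self._ngram_counts.get(word_chain, 0)


corpus = InMemoryCorpus()
corpus.add_email("alice", "bob", "See you Monday")
corpus.add_email("alice", "carol", "Monday works")
corpus.add_ngram("see,you", 10)
print(corpus.getMsg("alice", "bob"))    # [['see', 'you', 'monday']]
print(corpus.getReceiver("alice"))      # ['bob', 'carol']
print(corpus.getGoogleData("see,you"))  # 10
```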

Friday May 9

Accomplished:

Established testing module (SG)
Tested database functions (SG)
Tested EM Algorithm (CQ)

Tasks:

write script that puts emails from one sender into the database (NL)
- inputs: name of one of the Enron correspondents
- walks the mail directory and gets all their sent items
- parses the message body into the tuple format
- puts the comma-separated body string into database using EmailAgent.insert_email()
finish parsing email message body (NL)
- handle forwarded messages!
- remove punctuation and whitespace but don't ignore any words
Write unit tests (ALL!)
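
The body-parsing step (strip punctuation and whitespace, keep every word, emit a comma-separated string) might look roughly like the sketch below. The name `parse_body_sketch` is hypothetical and this is not the project's `utils.parse_message()`; it also does not attempt to handle forwarded messages.

```python
import string


def parse_body_sketch(body):
    """Lowercase the text, drop punctuation, and return comma-separated words."""
    # Map each punctuation character to a space so "a,b" becomes two words.
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    words = body.lower().translate(table).split()
    return ",".join(words)


print(parse_body_sketch("Hi Ken,\n  Let's meet at 3pm -- thanks!"))
# hi,ken,let,s,meet,at,3pm,thanks
```

Replacing punctuation with spaces keeps words on either side of a comma or dash apart, at the cost of splitting contractions such as "let's" into two tokens.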

NEXT MEETING: 5/10, SPELMAN!

Saturday May 10

Accomplished:

Added tests for utils.parse_message().
Edited utils.walkdir() to filter by keyword.
Wrote insert_by_sender() in main.py.
Tested EM Algorithm on some samples!
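
For reference, a directory walk that filters by keyword could be written along these lines. This is a guess at the general shape, not the actual `utils.walkdir()`: the name `walk_files_sketch` is hypothetical, and the filter is assumed to be a substring match on each file's path relative to the root.

```python
import os
import tempfile


def walk_files_sketch(root, keyword=None):
    """Yield file paths under root whose relative path contains keyword."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            rel = os.path.relpath(path, root)
            if keyword is None or keyword in rel:
                yield path


# Demo on a throwaway directory tree with two mail folders.
with tempfile.TemporaryDirectory() as root:
    for folder in ("sent_items", "inbox"):
        os.makedirs(os.path.join(root, folder))
        with open(os.path.join(root, folder, "1.txt"), "w") as f:
            f.write("hello")
    sent = [os.path.relpath(p, root)
            for p in walk_files_sketch(root, keyword="sent")]
    everything = list(walk_files_sketch(root))
    print(sent)
```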

Tasks:

Test main.insert_by_sender().

NEXT MEETING: 5/11, SPELMAN!

Sunday May 11

Accomplished:

Cleaned up the Enron email directory and pushed some sample emails to the repo.
Computed the probabilities of some emails and got realistic results.
Issue: Fetching data from the Google ngrams API may take a long time, because the dataset is large and we only care about ngrams that appear in the Enron dataset. Solution: use BigQuery to make the requests.
Issue: Online text prediction---can we suggest words as the user is typing? Solution: sure.
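
One simple way to suggest words while the user is typing is to filter candidate next words by the typed prefix and rank them by how often they follow the previous word. The sketch below uses raw bigram counts only; it does not use the EM model, and the function `suggest` and the sample counts are hypothetical.

```python
def suggest(prefix, prev_word, bigram_counts, k=3):
    """Return up to k completions of prefix, most frequent after prev_word first."""
    candidates = [
        (count, word)
        for (first, word), count in bigram_counts.items()
        if first == prev_word and word.startswith(prefix)
    ]
    # Sort by descending count, breaking ties alphabetically.
    candidates.sort(key=lambda cw: (-cw[0], cw[1]))
    return [word for _count, word in candidates[:k]]


# Hypothetical bigram counts: (previous word, next word) -> frequency.
counts = {
    ("thank", "you"): 50,
    ("thank", "your"): 5,
    ("thank", "god"): 8,
    ("see", "you"): 30,
}
print(suggest("yo", "thank", counts))  # ['you', 'your']
print(suggest("", "thank", counts))    # ['you', 'god', 'your']
```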

Miscellaneous

Here are just some musings worth documenting in preparation for the writeup. Feel free to add your thoughts.

Improvements/Future Work:

Better handling of punctuation. Exclamation marks, question marks, and periods can be very expressive and telling of relationship dynamics.
Spelling correction? Sometimes our algorithm will get stumped when it finds an unknown prefix. If we had more time we could compute word distance in order to try to match the user input with words that we know from the Google dataset.
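
As one possible reading of "word distance", the sketch below uses Levenshtein edit distance to pick the closest known word for an unrecognized input. The choice of edit distance, the function names, and the tiny vocabulary are assumptions made for illustration; nothing like this exists in the project yet.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def closest_word(typed, vocabulary):
    """Return the vocabulary word with the smallest edit distance to typed."""
    return min(vocabulary, key=lambda w: (edit_distance(typed, w), w))


# Hypothetical vocabulary standing in for the known word list.
vocab = ["meeting", "message", "market"]
print(closest_word("meetng", vocab))  # meeting
```

Scanning the whole vocabulary is linear in its size, so a real word list would likely need pruning, for example by first letter or word length.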

