~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ FILES / DIRECTORIES ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
- data directory
Contains the raw data; for now, just test.csv and train.csv (from Kaggle).
- corpora directory
Contains the text files used by the LanguageModel class. Each file holds
individual sentences, one per line, in plain text. There are four files:
{clean, insult}_corpus_{train, test}.txt
- LanguageModel.py
Contains a thin wrapper class that makes it easy to build new NLTK language
models. It hides much of the complexity of the NLTK library, and will
probably be extended once we move beyond simple unigrams.
- naiveBayesBaseline.py
Contains the baseline Naive Bayes classifier.
Several optimizations are layered on top of the baseline implementation,
and each can be switched off with a flag. See RESULTS.txt for details.
- RESULTS.txt
Contains the printouts from individual runs. It also records the different
things I tried, how they worked, and how to modify naiveBayesBaseline.py to
reproduce each result.
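
To make the corpus layout and the unigram idea concrete, here is a minimal
sketch. It is not the code in LanguageModel.py: it uses only the Python
standard library instead of NLTK, and the names CORPUS_FILES and UnigramModel,
as well as the whitespace tokenization, are assumptions made for illustration.

```python
from collections import Counter
from itertools import product

# The four corpus file names, expanded from
# {clean, insult}_corpus_{train, test}.txt.
CORPUS_FILES = [
    f"{label}_corpus_{split}.txt"
    for label, split in product(("clean", "insult"), ("train", "test"))
]


class UnigramModel:
    """Hypothetical stand-in for LanguageModel: raw unigram counts, no smoothing."""

    def __init__(self, sentences):
        # Each sentence corresponds to one line of a corpus file.
        # Lowercasing and splitting on whitespace are assumptions of this sketch.
        self.counts = Counter(
            token for sentence in sentences for token in sentence.lower().split()
        )
        self.total = sum(self.counts.values())

    def prob(self, word):
        # Maximum-likelihood estimate; unseen words get probability 0.
        if self.total == 0:
            return 0.0
        return self.counts[word.lower()] / self.total

    @classmethod
    def from_file(cls, path):
        # Read one sentence per line, skipping blank lines.
        with open(path, encoding="utf-8") as handle:
            return cls([line.strip() for line in handle if line.strip()])
```

Under these assumptions, UnigramModel.from_file("corpora/insult_corpus_train.txt")
would build a model from one of the training corpora (the path is assumed from
the directory layout above).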
Ideas for Improvements:
- Kneser-Ney
- Spell check
- DONE Laplace Smoothing
- DONE Remove stopwords
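
The two finished items can be sketched roughly as follows. This is an
illustration under stated assumptions, not the code in naiveBayesBaseline.py:
the function names, the remove_stopwords and alpha parameters, the tiny
stopword list, and the equal class priors are all hypothetical.

```python
import math
from collections import Counter

# Tiny illustrative stopword list (assumption); the real script may use another.
STOPWORDS = {"a", "an", "the", "is", "are", "you", "to", "of"}


def tokenize(text, remove_stopwords=True):
    # Lowercase, split on whitespace, and optionally drop stopwords.
    tokens = text.lower().split()
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    return tokens


def train(sentences, remove_stopwords=True):
    # Count the unigrams of one class (clean or insult).
    counts = Counter()
    for sentence in sentences:
        counts.update(tokenize(sentence, remove_stopwords))
    return counts


def log_likelihood(tokens, counts, vocab_size, alpha=1.0):
    # Laplace (add-alpha) smoothing:
    #   P(w | class) = (count(w) + alpha) / (N + alpha * V)
    # so a word unseen in this class no longer zeroes out the product.
    denom = sum(counts.values()) + alpha * vocab_size
    return sum(math.log((counts[t] + alpha) / denom) for t in tokens)


def classify(text, clean_counts, insult_counts, remove_stopwords=True, alpha=1.0):
    # Equal class priors are assumed, so only the likelihoods are compared.
    tokens = tokenize(text, remove_stopwords)
    vocab_size = len(set(clean_counts) | set(insult_counts))
    clean = log_likelihood(tokens, clean_counts, vocab_size, alpha)
    insult = log_likelihood(tokens, insult_counts, vocab_size, alpha)
    return "insult" if insult > clean else "clean"
```

Passing remove_stopwords=False mirrors the idea of switching an optimization
off with a flag. alpha must stay positive, since alpha=0 would take the log of
zero for any word unseen in a class.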