Gibberish looks like a real word but has no meaning at all. Example - hhduaihd
- numpy
- seaborn
`pip install -r requirements.txt`
- Define accepted characters as [a-z ].
- Create a method to tokenize text at the character level.
- Create a method to generate n-grams (see the sketch below).
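A minimal sketch of these two helpers in plain Python; the names `tokenize` and `ngrams` are illustrative, not necessarily the project's actual API:

```python
ACCEPTED = set("abcdefghijklmnopqrstuvwxyz ")  # the accepted characters [a-z ]

def tokenize(line):
    """Lowercase the line and keep only the accepted characters."""
    return [ch for ch in line.lower() if ch in ACCEPTED]

def ngrams(tokens, n=2):
    """Return successive n-grams; for n=2 these are character pairs."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `ngrams(tokenize("hello"))` returns `[('h', 'e'), ('e', 'l'), ('l', 'l'), ('l', 'o')]`.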
- Create a 27x27 matrix (26 letters plus the space) with 10 as the initial value.
- This matrix will tell us the probability of one character following another, i.e. of each character pair.
- It is initialized to 10 rather than 0 so that if a new word contains a pair we haven't seen, its probability is not zero (a simple form of additive smoothing).
- The heatmap of the probabilities will look like this. It is uniform at this point because the probability is the same for every pair; the sketch below shows the setup.
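A sketch of the matrix setup and the uniform initial heatmap, using the numpy and seaborn dependencies listed above (variable names are illustrative):

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

CHARS = "abcdefghijklmnopqrstuvwxyz "       # the 27 accepted characters
IDX = {c: i for i, c in enumerate(CHARS)}   # character -> row/column index

# 27x27 count matrix; every cell starts at 10 so unseen pairs never get zero probability
counts = np.full((27, 27), 10.0)

# The initial heatmap is uniform, since every pair has the same starting value
sns.heatmap(counts, xticklabels=list(CHARS), yticklabels=list(CHARS))
plt.show()
```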
- Take a large corpus and read every line.
- Tokenize each line and generate its n-grams.
- Increment the count in the probability matrix by 1 for each occurrence of a character pair.
- Now we need some way to normalize these probabilities. For that, I've divided every row by its sum and taken the log of it.
- After normalization, the heatmap of the probabilities looks like this (a sketch of the training loop follows).
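A sketch of the training and normalization steps, reusing the helpers and the `counts` matrix from the sketches above; `corpus.txt` is a placeholder for whatever corpus you train on:

```python
# Count every character pair in the corpus
with open("corpus.txt") as f:   # placeholder path, not the project's actual file
    for line in f:
        for a, b in ngrams(tokenize(line)):
            counts[IDX[a], IDX[b]] += 1

# Normalize each row by its sum, then take the log
log_probs = np.log(counts / counts.sum(axis=1, keepdims=True))
```

Each row of `log_probs` now holds the log of the conditional probabilities of the next character given the current one.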
- How do we decide whether a word is gibberish or not? We can try two approaches:
  - Multiply the probabilities of the n-grams in Markov-chain fashion.
  - Add the probabilities of the n-grams.

Adding the probabilities seems like the better option: if a low-probability n-gram occurs, multiplication will impact the whole score drastically, whereas addition will not cause that much impact.
- For prediction, we generate the n-grams of the input and add their probabilities, as in the sketch below.
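A minimal prediction sketch under the addition approach, reusing the names from the sketches above; the threshold is an assumed, illustrative value that would need tuning on known-good and known-gibberish words:

```python
def score(word):
    """Sum the log probabilities of the word's character pairs."""
    return sum(log_probs[IDX[a], IDX[b]] for a, b in ngrams(tokenize(word)))

THRESHOLD = -15.0   # assumed cut-off, not a value from this project
word = "hhduaihd"
print("gibberish" if score(word) < THRESHOLD else "not gibberish")
```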
Please refer to CONTRIBUTIONS.md