Language Model


The aim of this project is to compare and analyse the performance of different language models. Two statistical language models, Kneser-Ney and Witten-Bell, are evaluated alongside a neural model.


Language Models

  • Neural_Model/data/LM_1 contains the perplexities for the Kneser-Ney model.
  • Neural_Model/data/LM_2 contains the perplexities for the Witten-Bell model.
  • Neural_Model/data/LM_2 contains the perplexities for the Neural model.

Code

It is advisable to run the code on a GPU to save time.

  • Neural_Model/2019114005-Assignment_2.py contains the code for the Neural Model as a Python file.
  • Neural_Model/2019114005-Assignment_2.ipynb contains the code for the Neural Model as a Python notebook.
  • To find the perplexity of a particular sentence, assign the sentence to example_sentence on line 304 of the Python file, and do the same in the notebook.
  • Make sure you train the model before changing example_sentence.
  • To train the model, set the variable data on line 72 of the Python file to the relative path of the corpus, and do the same in the notebook.
  • Hyperparameters can be tuned by changing the corresponding variables (all named in capitals).

It takes approximately an hour for the model to train on the cleaned Brown Corpus.


Statistical Models

Finding Probability

Commands have to be run from inside the Neural_Model directory.

  • To find the probability of a sentence using method m on dataset n, run the command below; it prompts for the input sentence:
    • python3 language_model.py m path_to_n
    • For example, to check the probability of a sentence under the Witten-Bell model on the Health dataset, run:
      • python3 language_model.py w ./Corpus/Health_English.txt
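
For readers unfamiliar with the two smoothing methods, the following is a minimal bigram sketch of interpolated Witten-Bell and interpolated Kneser-Ney. It is not the code in language_model.py: the function names, the whitespace tokenisation, the `<s>`/`</s>` markers, the bigram order, and the fixed discount of 0.75 are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def collect_counts(sentences):
    """Gather the bigram statistics that both smoothing methods need."""
    bigrams = Counter()            # c(h, w): count of word w after history h
    history = Counter()            # c(h): how often h is followed by anything
    word = Counter()               # how often w appears as a predicted word
    followers = defaultdict(set)   # distinct words seen after h
    preceders = defaultdict(set)   # distinct histories seen before w
    for sentence in sentences:
        # Assumption: whitespace tokenisation with sentence boundary markers.
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        for h, w in zip(tokens, tokens[1:]):
            bigrams[(h, w)] += 1
            history[h] += 1
            word[w] += 1
            followers[h].add(w)
            preceders[w].add(h)
    return bigrams, history, word, followers, preceders


def witten_bell(h, w, counts):
    """Interpolated Witten-Bell: back-off weight grows with distinct followers of h."""
    bigrams, history, word, followers, _ = counts
    p_unigram = word[w] / sum(word.values())
    n = history[h]
    if n == 0:
        return p_unigram  # unseen history: fall back to the unigram estimate
    t = len(followers.get(h, ()))
    return (bigrams[(h, w)] + t * p_unigram) / (n + t)


def kneser_ney(h, w, counts, discount=0.75):
    """Interpolated Kneser-Ney with one absolute discount (0.75 is an assumed value)."""
    bigrams, history, _, followers, preceders = counts
    # Continuation probability: share of distinct bigram types that end in w.
    p_continuation = len(preceders.get(w, ())) / len(bigrams)
    n = history[h]
    if n == 0:
        return p_continuation
    t = len(followers.get(h, ()))
    discounted = max(bigrams[(h, w)] - discount, 0.0) / n
    return discounted + (discount * t / n) * p_continuation
```

Both functions return a proper distribution over the words that can follow a seen history, so the probability of a sentence is the product of these conditional probabilities over its bigrams.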

Finding Perplexities

  • To find the perplexity of a sentence using:
    • Kneser-Ney on Health Corpus uncomment line 424
    • Kneser-Ney on Tech Corpus uncomment line 425
    • Witten-Bell on Health Corpus uncomment line 423
    • Witten-Bell on Tech Corpus uncomment line 422
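
As a reminder of what the reported number means, perplexity is the exponential of the negative mean log-probability that the model assigns to each token of the sentence. The sketch below assumes the per-token conditional probabilities are already available as a list; the function name `perplexity` is hypothetical and is not taken from the project code.

```python
import math


def perplexity(token_probs):
    """Return exp of the negative mean log-probability of the tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    if any(p <= 0.0 for p in token_probs):
        # A zero-probability token makes the perplexity unbounded.
        return math.inf
    # Summing logs avoids the underflow that multiplying many small
    # probabilities would cause on long sentences.
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(-mean_log_prob)
```

A lower perplexity means the model found the sentence less surprising, which is why smoothing matters: without it, a single unseen n-gram would drive the value to infinity.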

Report

For a detailed analysis, please see the attached report (report.pdf).