Probing the dynamics of word embeddings and prompt completions using time-sliced training of language models


How to run our project:

First, to preprocess the headline dataset, run preprocess.py. This stores the preprocessed headline data in the /data_preprocessed/ directory.
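For reference, here is a minimal sketch of the kind of cleaning this step performs, assuming a CSV of dated headlines. The input path, column names, and cleaning rules below are illustrative assumptions, not the actual implementation in preprocess.py.

```python
# Hypothetical sketch of headline preprocessing; file names and columns are assumed.
import csv
import re
from pathlib import Path

RAW_PATH = Path("data/headlines.csv")                       # assumed raw input
OUT_PATH = Path("data_preprocessed/headlines_clean.csv")    # assumed output name

def clean(text: str) -> str:
    """Lowercase a headline and keep only letters and single spaces."""
    text = re.sub(r"[^a-z\s]", " ", text.lower())
    return re.sub(r"\s+", " ", text).strip()

OUT_PATH.parent.mkdir(exist_ok=True)
with RAW_PATH.open(newline="") as src, OUT_PATH.open("w", newline="") as dst:
    reader = csv.DictReader(src)
    writer = csv.writer(dst)
    writer.writerow(["year", "headline"])
    for row in reader:
        # "publish_date" and "headline_text" are assumed column names.
        writer.writerow([row["publish_date"][:4], clean(row["headline_text"])])
```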

Next, to run the embedding model on the entire dataset, run embedding_model_full.py; for the data split by year, run embedding_model_year_wise.py. Both scripts store the resulting embeddings in the /vectors/ directory.
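As a hedged illustration of how the saved vectors might be loaded back for later steps, assuming each year's matrix is an .npy file with a matching vocabulary file (the file names below are guesses about the layout, not guaranteed by the scripts):

```python
# Sketch of loading per-year embeddings from /vectors/; naming scheme is assumed.
import numpy as np
from pathlib import Path

VECTOR_DIR = Path("vectors")

def load_year(year: int):
    """Return (embedding_matrix, vocabulary list) for one year, if present."""
    emb = np.load(VECTOR_DIR / f"embeddings_{year}.npy")              # assumed name
    vocab = (VECTOR_DIR / f"vocab_{year}.txt").read_text().splitlines()  # assumed name
    return emb, vocab

emb_2005, vocab_2005 = load_year(2005)
print(emb_2005.shape, len(vocab_2005))
```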

Then run the procrustes.py script to apply the orthogonal Procrustes method to all of the embedding matrices, aligning them across years.
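A minimal sketch of the alignment idea, using scipy.linalg.orthogonal_procrustes on two years' matrices restricted to a shared vocabulary; procrustes.py may organize this differently:

```python
# Orthogonal Procrustes alignment of one year's embeddings onto another's.
import numpy as np
from scipy.linalg import orthogonal_procrustes

def align(base: np.ndarray, other: np.ndarray) -> np.ndarray:
    """Rotate `other` into the coordinate system of `base`.

    orthogonal_procrustes returns the orthogonal matrix R minimizing
    ||other @ R - base||_F, so other @ R is comparable to base row by row.
    """
    R, _ = orthogonal_procrustes(other, base)
    return other @ R

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 100))   # e.g. one year's embeddings on the shared vocab
other = rng.normal(size=(1000, 100))  # another year's embeddings on the same vocab
aligned = align(base, other)
```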

Use the word_analysis.py script to explore the aligned embedding matrices. It contains a method, plot_nearest_neighbors, that finds the nearest neighbors of a word within a given year and plots them with the t-SNE nonlinear dimensionality reduction algorithm. It also contains a method, plot_cos_similarities, that plots the cosine similarity of two words of interest over all the years in which both appear with high frequency.
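The cosine-similarity plot can be sketched roughly as follows, assuming a load_year helper like the one above that returns a (matrix, vocabulary) pair per year; this is an approximation of what plot_cos_similarities does, not its actual code:

```python
# Approximate sketch of plotting cosine similarity between two words over time.
import numpy as np
import matplotlib.pyplot as plt

def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def plot_cos_similarities(word_a, word_b, years, load_year):
    """Plot cos(word_a, word_b) for every year in which both words appear."""
    xs, ys = [], []
    for year in years:
        emb, vocab = load_year(year)   # aligned matrix and vocab for this year
        if word_a in vocab and word_b in vocab:
            xs.append(year)
            ys.append(cosine(emb[vocab.index(word_a)], emb[vocab.index(word_b)]))
    plt.plot(xs, ys, marker="o")
    plt.xlabel("year")
    plt.ylabel(f"cosine similarity ({word_a}, {word_b})")
    plt.show()
```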

Finally, to run our language models, use the DL_Final_Project_Language_Model notebook included in the repository. We ran the notebook from Google Drive so that we could use a GPU; to do the same, upload the data directories as well so the notebook can access them while training the models. Note that in the final cell of the notebook you can change the start prompt fed to the models, as well as the final index of the output of the np.argsort() calls, which controls the number of tokens generated (comments mark where to do this).
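To clarify what that argsort slice does, here is an illustrative sketch (not the notebook's code) of the pattern: sort the model's next-token probabilities and take the top candidates. The model, tokenize, and detokenize names are hypothetical stand-ins for whatever the notebook defines.

```python
# Illustrative top-k selection via np.argsort over next-token probabilities.
import numpy as np

def complete(model, tokenize, detokenize, prompt: str, top_k: int = 5) -> list:
    """Return the top_k most probable next tokens for a start prompt."""
    token_ids = tokenize(prompt)                          # e.g. a list of ints
    # Assumes a Keras-style model returning a per-position distribution.
    probs = model.predict(np.array([token_ids]))[0, -1]
    # argsort is ascending, so the last top_k indices are the most probable;
    # widening that slice is the "final index" the notebook lets you adjust.
    top_ids = np.argsort(probs)[-top_k:][::-1]
    return [detokenize([int(i)]) for i in top_ids]
```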

We hope you enjoy playing around with our models!
