Research code. No longer maintained (as of 2022)
The research paper associated with this repository can be found here.
The training input consists of transcribed child-directed speech from the CHILDES database. The text used to train the RNN is loaded into memory using the custom-built Python package AOCHILDES. After the minimally processed raw text is loaded, ordered by the age of the target child, a ByteLevel BPE tokenizer (introduced with GPT-2) is used to build the training vocabulary and to tokenize the text.
Note: The age-order effect occurs independently of tokenization choice; it has been found with whitespace tokenization, spacy tokenization, and ByteLevel BPE tokenization.
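For orientation, here is a minimal sketch of that pipeline. It is illustrative only: the transcript directory, file layout, and vocabulary size are assumptions, and the tokenizer shown is the HuggingFace tokenizers implementation of ByteLevel BPE rather than this repository's actual loading code.

```python
# Illustrative sketch of the data pipeline, not the actual AOCHILDES API.
# Assumptions: one text file per transcript, with age order reflected in
# the sorted file names; the vocabulary size is a placeholder.
from pathlib import Path

from tokenizers import ByteLevelBPETokenizer  # HuggingFace implementation of GPT-2's BPE

# load minimally processed transcripts, ordered by age of the target child
transcript_paths = sorted(Path('data/transcripts').glob('*.txt'))
texts = [p.read_text() for p in transcript_paths]

# build the training vocabulary with ByteLevel BPE
tokenizer = ByteLevelBPETokenizer()
tokenizer.train_from_iterator(texts, vocab_size=8192, min_frequency=2)

# tokenize each transcript into ids, preserving the age order
token_ids = [tokenizer.encode(t).ids for t in texts]
```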
The order in which the training data is presented to an RNN (either SRN or LSTM) facilitates learning semantic category distinctions between common nouns.
First, create a new virtual environment for Python 3.7. Then:
pip install git+https://github.com/phueb/EntropicStartTheory
To install the dependencies, execute the following in your virtual environment:
pip install -r requirements.txt
This will also install the following custom dependencies.
AOCHILDES: Used to retrieve the child-directed speech text data. It is available here.
The text files are prepared for training using the custom Python package Preppy. It is available here. Preppy performs no reordering of the input; it assumes that the lines in the text file are already in the order in which they should be presented to the model.
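Preppy's actual API is not reproduced here; the sketch below only illustrates the contract just described, namely that training windows are generated in corpus order, with no shuffling.

```python
from typing import Iterator, List

def ordered_windows(token_ids: List[int], context_size: int) -> Iterator[List[int]]:
    """Yield fixed-size training windows in corpus order.

    No shuffling is performed: the caller must supply token_ids in the
    order in which they should be presented to the model.
    """
    for start in range(len(token_ids) - context_size):
        # each window holds context_size inputs plus the next-token target
        yield token_ids[start:start + context_size + 1]
```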
Evaluation of semantic category knowledge requires the custom Python package CategoryEval. It is available here. CategoryEval computes how well the model's learned representations recapitulate a human-created gold category structure. By default, it returns balanced accuracy, but F1 and Cohen's Kappa can also be computed.
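CategoryEval's exact scoring procedure lives in its repository; as a point of reference, the sketch below shows one common way to compute balanced accuracy over pairwise same-category judgments (the pairwise framing is an assumption, not CategoryEval's documented behavior).

```python
import numpy as np

def balanced_accuracy(pred: np.ndarray, gold: np.ndarray) -> float:
    """Mean of sensitivity (recall on positives) and specificity (recall on negatives).

    pred and gold are boolean vectors over word pairs: True where a pair
    is predicted (or gold-labeled) to belong to the same category.
    """
    tp = np.sum(pred & gold)
    tn = np.sum(~pred & ~gold)
    fp = np.sum(pred & ~gold)
    fn = np.sum(~pred & gold)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2
```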
The code is designed to run on multiple machines at the UIUC Learning & Language Lab, using a custom job submission system called Ludwig. If you are a member of the UIUC Learning & Language Lab, you can use ludwig to run jobs in parallel on multiple machines. This is recommended if multiple replications need to be run, or if no access to GPUs is otherwise available.
If you are a member of the lab, you should first obtain access to the lab's file server. Next, find the locations of the source code folders for the custom dependencies. Assuming new packages are installed into a virtual environment at venv/lib/python3.7/site-packages/, you can submit jobs with ludwig:
ludwig -e data/ \
venv/lib/python3.7/site-packages/AOCHILDES/aochildes \
venv/lib/python3.7/site-packages/AOCHILDES/original_transcripts/ \
venv/lib/python3.7/site-packages/Preppy/preppy \
venv/lib/python3.7/site-packages/CategoryEval/categoryeval
Alternatively, the experiment can be run without access to the lab's file server:
ludwig --isolated
The core logic to train and evaluate a single model is contained in entropicstarttheory.jobs.main(). Note, however, that this function takes as input a dict, called param2val, that specifies a single set of hyper parameters.
You can:
- create this dict yourself (see entropicstarttheory.params.py), or
- use my command line tool ludwig to automatically pass this dict to entropicstarttheory.jobs.main() based on what is in entropicstarttheory.params.py.
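The real keys are defined in entropicstarttheory.params.py; the dict below is only a hypothetical illustration of the shape that param2val takes.

```python
# Hypothetical param2val; the actual keys are defined in entropicstarttheory.params.py.
param2val = {
    'flavor': 'lstm',            # 'srn' or 'lstm'
    'hidden_size': 512,
    'lr': 0.01,
    'num_epochs': 1,
    'shuffle_sentences': False,  # keep the age ordering intact
}

# then, for example:
# from entropicstarttheory import jobs
# jobs.main(param2val)
```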
For training multiple models or performing hyper parameter tuning, I recommend the latter option. It allows the researcher to specify all possible combinations of hyper parameters once in entropicstarttheory.params.py, and ludwig does the hard work of 1) creating one param2val for each combination, 2) creating jobs, and 3) passing a unique param2val to each job.
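A minimal re-implementation of step 1, assuming the search space is declared as a mapping from parameter names to lists of candidate values (the mapping below is hypothetical; the real one lives in entropicstarttheory.params.py):

```python
from itertools import product

# hypothetical search space; the real one is declared in entropicstarttheory.params.py
param2options = {
    'flavor': ['srn', 'lstm'],
    'hidden_size': [256, 512],
    'lr': [0.01, 0.001],
}

# one param2val per combination of hyper parameters (here, 2 * 2 * 2 = 8 dicts)
names = list(param2options)
param2vals = [dict(zip(names, combination))
              for combination in product(*(param2options[n] for n in names))]
```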
When using ludwig to train and evaluate 1 replication of each hyper parameter combination, type into your terminal:
ludwig --isolated
The flag --isolated simply means that the program will not try to communicate with the lab's file server, and will instead read from and write to your local machine only.
When done, look for a folder called runs in the root directory of your project. All files created during the job should be stored there.
To run 10 replications of each hyper parameter combination, type into your terminal:
ludwig --isolated -r 10
To plot a summary of the results:
python3 plot/plot_ba_summary.py
Note: Results are plotted for all models matching any hyper parameter combination specified in entropicstarttheory.params.py.
Initial work began in 2016, at the University of California, Riverside, under the supervision of Prof Jon Willits.
In an effort to simplify the code used in this repository, a major rewrite was undertaken in October 2019. The code was ported from tensorflow 1.12 to pytorch. The port resulted in the following changes:
- the custom RNN architecture was replaced by a more standard architecture. Specifically, prior to October 2019, embeddings were added directly to the hidden layer. In the standard RNN architecture, embeddings undergo an additional transformation step before being added to the hidden layer (see the sketch after this list). However, the key finding, that age-ordered training improves semantic category learning, was replicated in this new architecture.
- spacy tokenization was replaced by ByteLevel BPE tokenization, to remove UNK symbols from training. This reduced overall category learning, but provides a more realistic account of distributional learning. Importantly, the age-order effect still occurs with the new tokenization.
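A minimal PyTorch sketch of the two hidden-state update rules described in the first item above (names and dimensions are illustrative, not the repository's actual code):

```python
import torch

hidden_size = 64
e_t = torch.randn(1, hidden_size)             # embedding of the current token
h_prev = torch.zeros(1, hidden_size)          # previous hidden state
W_hh = torch.randn(hidden_size, hidden_size)  # recurrent weights
W_ih = torch.randn(hidden_size, hidden_size)  # input-to-hidden weights

# pre-October-2019 variant: the embedding is added directly to the hidden layer
h_old = torch.tanh(h_prev @ W_hh + e_t)

# standard (Elman) SRN: the embedding is first transformed by W_ih
h_new = torch.tanh(e_t @ W_ih + h_prev @ W_hh)
```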
Developed on Ubuntu 18.04 with Python 3.7