Task Adaptive Pre-Training with N-grams (TAPT-n)

This code is based on the T-DNA repo (https://github.com/shizhediao/T-DNA) with the following changes:

Corrections were made to get the code to run.
The ngrams script referenced in T-DNA was missing. To generate ngrams, use the t-dna-ngrams-*.ipynb notebooks in this repo.
- The original paper calls for the use of PMI (pointwise mutual information) to decide which ngrams to keep. Since all ngrams in our dataset have a negative PMI, we simply keep the ones with the highest frequency.
- The embeddings are generated using fastText and saved as a NumPy array (in the code this array is referred to as the "model", but it is not the actual fastText model .bin file). To train fastText, use the fasttext-train*.ipynb notebooks.
- If you need to generate ngrams, you will need to install spaCy; it is not included in the requirements.txt file.
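
The selection logic described above can be illustrated with a minimal sketch. It uses one common definition of PMI for adjacent word pairs, log(p(x, y) / (p(x) p(y))), which may differ from the exact variant used in T-DNA or in the notebooks, and it shows the frequency-based fallback alongside it. The function names and the toy corpus are hypothetical and are not taken from this repo.

```python
import math
from collections import Counter


def ngram_counts(sentences, n):
    """Count n-grams (as tuples of tokens) over already-tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def bigram_pmi(sentences):
    """Return PMI(x, y) = log(p(x, y) / (p(x) * p(y))) for each adjacent pair.

    Assumption: probabilities are plain relative frequencies over the corpus.
    """
    unigrams = ngram_counts(sentences, 1)
    bigrams = ngram_counts(sentences, 2)
    total_uni = sum(unigrams.values())
    total_bi = sum(bigrams.values())
    pmi = {}
    for (x, y), count in bigrams.items():
        p_xy = count / total_bi
        p_x = unigrams[(x,)] / total_uni
        p_y = unigrams[(y,)] / total_uni
        pmi[(x, y)] = math.log(p_xy / (p_x * p_y))
    return pmi


def top_ngrams_by_frequency(sentences, n, k):
    """Fallback selection: keep the k most frequent n-grams."""
    return [gram for gram, _ in ngram_counts(sentences, n).most_common(k)]


# Toy corpus (hypothetical), already split into tokens.
corpus = [
    ["task", "adaptive", "pre", "training"],
    ["task", "adaptive", "fine", "tuning"],
    ["domain", "adaptive", "pre", "training"],
]

pmi_scores = bigram_pmi(corpus)
frequent_bigrams = top_ngrams_by_frequency(corpus, n=2, k=3)
```

If no ngram clears the PMI threshold on a given dataset, ranking by raw count as in `top_ngrams_by_frequency` still yields a usable ngram vocabulary, which is the choice made here.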
The tokenizer class for xlm-roberta was added to the tokenization.py file.
- The source of the vocabulary was also updated to work with xlm
- The run-language-modeling-xlm.py file was modified accordingly
- All the other classes for xlm inherit from roberta without any changes 🎉
To perform TAPT (task adaptive pre-training), run the train-mlm-xlm.sh script.
- Training takes approximately 6 hours (3 epochs) on a p2 or G4 GPU in SageMaker
- The model output from this phase still needs to be fine-tuned on a downstream task such as sentence similarity or classification
Finally, you will need to download the relevant model files from huggingface and change the paths in the scripts accordingly.
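
Before launching the scripts, it can help to confirm that the downloaded model directory contains what the scripts expect. The sketch below is one way to do that. The file names in `REQUIRED_FILES` are an assumption about a typical xlm-roberta download and should be adjusted to the model you actually use. The helper name is hypothetical and is not part of this repo.

```python
from pathlib import Path

# Assumed file names for a locally downloaded xlm-roberta checkpoint.
# Adjust this tuple to match the files of the model you downloaded.
REQUIRED_FILES = ("config.json", "sentencepiece.bpe.model", "pytorch_model.bin")


def missing_model_files(model_dir, required=REQUIRED_FILES):
    """Return the names in `required` that are not regular files in `model_dir`."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]
```

Calling `missing_model_files("/path/to/model")` with the same path you set in the scripts returns an empty list when every expected file is present.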

T-DNA Citation:

@inproceedings{DXSJSZ2021,
    title = "Taming Pre-trained Language Models with N-gram Representations for Low-Resource Domain Adaptation",
    author = "Diao, Shizhe  and
      Xu, Ruijia  and
      Su, Hongjin  and
      Jiang, Yilei  and
      Song, Yan  and
      Zhang, Tong",
    booktitle = "Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)",
    month = aug,
    year = "2021",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2021.acl-long.259",
    doi = "10.18653/v1/2021.acl-long.259",
    pages = "3336--3349",
}

