This project contains all the code for Sam Mayers' summer internship project, Improving Coverage of Synthetically Generated QA Pairs.
The code is based on the Hugging Face library.
- Necessary packages can be found in the requirements.txt file.
- Python version 3.9
The training datasets are the DREAM dataset and the NarrativeQA dataset. This data is already preprocessed.
To reformat the original dataset yourself, use preprocess/format_dataset.py
To generate coverage scores for a formatted dataset, use get_coverage_scores.py
To normalize those coverage scores, use normalizecoverage.py
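As a rough illustration of the normalization step, the sketch below min-max scales a list of coverage scores into [0, 1]. This is a generic scheme for illustration only; the record layout and the exact method used by normalizecoverage.py may differ.

```python
def min_max_normalize(scores):
    """Scale coverage scores to [0, 1] (illustrative only)."""
    lo, hi = min(scores), max(scores)
    if hi == lo:  # all scores equal: avoid division by zero
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]

# Hypothetical record layout; the real formatted dataset may use different keys.
example = {"dialogue": "...", "coverage_scores": [0.2, 1.7, 0.9]}
example["coverage_scores"] = min_max_normalize(example["coverage_scores"])
```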
The idea is to train the model to generate QA pairs given a dataset of dialogues or other texts. To train vanilla BART, run:
python train_coverageloss.py -train_path PATH_TO_TRAIN_DATA1 PATH_TO_TRAIN_DATA2 -val_path PATH_TO_VAL_DATA1 PATH_TO_VAL_DATA2 -save_dir PATH_WHERE_TO_SAVE_TRAINED_MODEL -log_path logs/training_log
To train with a coverage loss instead (variance or entropy, respectively), run one of:
python train_coverageloss.py -train_path PATH_TO_TRAIN_DATA1 PATH_TO_TRAIN_DATA2 -val_path PATH_TO_VAL_DATA1 PATH_TO_VAL_DATA2 -save_dir PATH_WHERE_TO_SAVE_TRAINED_MODEL -loss var -log_path logs/training_log
python train_coverageloss.py -train_path PATH_TO_TRAIN_DATA1 PATH_TO_TRAIN_DATA2 -val_path PATH_TO_VAL_DATA1 PATH_TO_VAL_DATA2 -save_dir PATH_WHERE_TO_SAVE_TRAINED_MODEL -loss ent -log_path logs/training_log
Other parameters can be changed/included, such as:
- Batch size ( -bsz )
- Gradient accumulation ( -grad_accum )
- Epochs ( -epochs )
- Stop counter ( -stop_counter ) (the number of epochs to keep training while the validation loss is not improving, i.e. early-stopping patience)
- Learning rate ( -lr )
- Checkpoint ( -checkpoint ) (to load a saved model and continue training from that checkpoint)
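For reference, the core of vanilla BART fine-tuning with Hugging Face looks roughly like the sketch below. This is a minimal illustration, not the actual loop in train_coverageloss.py; the base model name, the "question <sep> answer" target format, and the example text are assumptions.

```python
import torch
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")  # assumed base model
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# Hypothetical example: the source is a dialogue, the target is a QA pair.
source = "A: Where are you going tonight? B: To the cinema with Sam."
target = "Where is B going tonight? <sep> To the cinema."

inputs = tokenizer(source, return_tensors="pt", truncation=True, max_length=1024)
labels = tokenizer(target, return_tensors="pt", truncation=True, max_length=128).input_ids

model.train()
outputs = model(**inputs, labels=labels)  # cross-entropy loss computed internally
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```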
Once a QAGen model is trained, a synthetic dataset can be generated for a dataset of dialogues or texts.
To generate a synthetic dataset, run:
python generate.py -test_path PATH_TO_TEXTS_OR_DIALOGUES -model_path PATH_TO_TRAINED_MODEL -generate PATH_TO_OUTPUT_FILE
Other parameters can be changed/included for the style of generated outputs, such as:
- Max length ( -max_length )
- Top k ( -top_k )
- Top p ( -top_p )
- Number of generated QA pairs per text/dialogue ( -num_qa )

More information on these decoding parameters can be found in the Hugging Face text generation documentation.
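As a point of reference, these decoding options map onto Hugging Face's generate() API roughly as in the sketch below (illustrative only; the example dialogue and parameter values are assumptions).

```python
from transformers import BartTokenizer, BartForConditionalGeneration

tokenizer = BartTokenizer.from_pretrained("PATH_TO_TRAINED_MODEL")
model = BartForConditionalGeneration.from_pretrained("PATH_TO_TRAINED_MODEL")

dialogue = "A: Where are you going tonight? B: To the cinema with Sam."
inputs = tokenizer(dialogue, return_tensors="pt", truncation=True, max_length=1024)

# Sample several candidate QA pairs per input with top-k / nucleus sampling.
outputs = model.generate(
    **inputs,
    do_sample=True,
    top_k=50,                # -top_k
    top_p=0.95,              # -top_p
    max_length=64,           # -max_length
    num_return_sequences=5,  # -num_qa
)
qa_pairs = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```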
Three types of automatic metrics can be run once a model is trained and a synthetic dataset has been generated.
1. Macaw
Macaw is a question answering system and can be used to indicate the answerability of generated QA pairs. From the synthetic dataset, each generated question is paired with its corresponding text/dialogue and given to Macaw. Macaw's predicted answer is then compared to the synthetically generated answer using exact match (EM), F1, and BARTScore.
To run:
python macaw_eval.py -data_path PATH_TO_SYNTHETIC_DATA -out_path PATH_FOR_OUTPUT_RESULTS
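EM and F1 here are the usual SQuAD-style token-overlap metrics. The sketch below shows one common way to compute them when comparing Macaw's prediction against the synthetic answer; the text normalization details are assumptions and may differ from macaw_eval.py.

```python
from collections import Counter

def normalize(text):
    # Lowercase and drop punctuation; real eval scripts often also strip articles.
    return "".join(c for c in text.lower() if c.isalnum() or c.isspace()).split()

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1_score(pred, gold):
    pred_tokens, gold_tokens = normalize(pred), normalize(gold)
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

print(exact_match("The cinema.", "the cinema"))  # 1.0
print(f1_score("to the cinema", "the cinema"))   # token-overlap F1
```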
2. Variance coverage
To measure the coverage of the synthetic dataset's generated QA pairs using variance coverage, run:
python coverage_eval.py -data_path PATH_TO_SYNTHETIC_DATA -ctype var -synthetic True
3. Entropy coverage
To measure the coverage of the synthetic dataset's generated QA pairs using entropy coverage, run:
python coverage_eval.py -data_path PATH_TO_SYNTHETIC_DATA -ctype ent -synthetic True
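Purely as a generic illustration of the two statistics (not the exact coverage definition used by coverage_eval.py), variance and Shannon entropy over a vector of per-segment coverage counts could be computed as follows; the counts themselves are hypothetical.

```python
import math

def variance(xs):
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)

def entropy(xs):
    # Normalize counts into a probability distribution, then take Shannon entropy.
    total = sum(xs)
    probs = [x / total for x in xs if x > 0]
    return -sum(p * math.log(p) for p in probs)

# Hypothetical: how often each dialogue turn is covered by the generated QA pairs.
coverage_counts = [3, 0, 1, 2, 0]
print(variance(coverage_counts), entropy(coverage_counts))
```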
The HTML file for the human-annotation UI used to compare two synthetic datasets can be found at human_eval/human_eval_UI.html. A script for formatting examples from two synthetic datasets for human evaluation can be found at human_eval/human_eval_examples.ipynb.