Start with UE-rewriting the data with a BERT masked language model. Only data.json.zip or data.json needs to be present.
Preprocess:
python preprocess.py
Rewrite:
python rewriter.py --data_dir part_cleaned_data
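The core idea of MLM rewriting can be sketched as follows: mask token positions and replace them with a masked language model's prediction, recording which positions changed (cf. the rewritten_ids.pt file used later). This is only an illustration with a hypothetical `rewrite_with_mlm` helper and a toy predictor standing in for BERT; the actual selection strategy in rewriter.py may differ.

```python
import random

def rewrite_with_mlm(tokens, predict_masked, p_mask=0.15, seed=0):
    """Sketch of MLM rewriting: randomly mask positions and fill each with
    the predictor's top token, keeping track of positions that changed.
    predict_masked(tokens, i) -> replacement token for position i."""
    rng = random.Random(seed)
    out = list(tokens)
    rewritten_ids = []  # positions whose token was replaced
    for i in range(len(out)):
        if rng.random() < p_mask:
            pred = predict_masked(out, i)
            if pred != out[i]:  # only keep genuinely new tokens
                out[i] = pred
                rewritten_ids.append(i)
    return out, rewritten_ids

# Toy predictor standing in for a real BERT fill-mask model.
toy = lambda toks, i: {"cat": "dog", "mat": "rug"}.get(toks[i], toks[i])
print(rewrite_with_mlm("the cat sat on the mat".split(), toy, p_mask=1.0))
# -> (['the', 'dog', 'sat', 'on', 'the', 'rug'], [1, 5])
```

In the real pipeline the predictor would be a Hugging Face fill-mask model such as bert-base-uncased, and the recorded positions would be saved (e.g. as rewritten_ids.pt).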
To generate hypotheses on rewritten inputs using various benchmark models:
python generate_batch_wise.py --data_dir unseen_from_bert-base-uncased_predicted_by_bert-base-uncased_rewritten_data.txt --model_name blenderbot_small-90M
python generate_batch_wise.py --data_dir unseen_from_bert-base-uncased_predicted_by_bert-base-uncased_rewritten_data.txt --model_name blenderbot-400M-distill
python generate_batch_wise.py --data_dir unseen_from_bert-base-uncased_predicted_by_bert-base-uncased_rewritten_data.txt --model_name blenderbot-1B-distill --eval_batch_size 32
python generate_batch_wise.py --data_dir unseen_from_bert-base-uncased_predicted_by_bert-base-uncased_rewritten_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name DialoGPT-small
python generate_batch_wise.py --data_dir unseen_from_bert-base-uncased_predicted_by_bert-base-uncased_rewritten_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name DialoGPT-medium
python generate_batch_wise.py --data_dir unseen_from_bert-base-uncased_predicted_by_bert-base-uncased_rewritten_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name DialoGPT-large
To generate hypotheses on original inputs using various benchmark models:
python generate_batch_wise.py --data_dir part_cleaned_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name blenderbot_small-90M
python generate_batch_wise.py --data_dir part_cleaned_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name blenderbot-400M-distill
python generate_batch_wise.py --data_dir part_cleaned_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name blenderbot-1B-distill --eval_batch_size 32
python generate_batch_wise.py --data_dir part_cleaned_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name DialoGPT-small
python generate_batch_wise.py --data_dir part_cleaned_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name DialoGPT-medium
python generate_batch_wise.py --data_dir part_cleaned_data.txt --rewritten_ids_dir rewritten_ids.pt --model_name DialoGPT-large
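Batch-wise generation (and the --eval_batch_size flag, e.g. 32 for the larger blenderbot-1B-distill) can be understood from this minimal sketch; `generate_batch_wise` here is a hypothetical helper, and the real script additionally handles tokenization, GPU placement, and writing the hypotheses to a .txt file.

```python
def generate_batch_wise(contexts, generate_fn, eval_batch_size=64):
    """Run a model's generation function over inputs in fixed-size batches.
    generate_fn(batch: list[str]) -> list[str] of hypotheses, one per input."""
    hypotheses = []
    for start in range(0, len(contexts), eval_batch_size):
        batch = contexts[start:start + eval_batch_size]
        hypotheses.extend(generate_fn(batch))
    return hypotheses

# A trivial "model" standing in for BlenderBot/DialoGPT generation.
hyps = generate_batch_wise(
    [f"context {i}" for i in range(5)],
    lambda batch: [c.upper() for c in batch],
    eval_batch_size=2,
)
```

Smaller batch sizes trade throughput for memory, which is why the 1B-parameter model is run with a reduced --eval_batch_size.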
To run the code 5 times on 10% of the data (debug mode):
python generate_batch_wise.py --model_name "DialoGPT-small" --debug True
Modify the following command to generate an output .txt for a given input .txt:
python generate_batch_wise.py --model_name "blenderbot_small-90M" --data_dir 'all_data.txt'
Evaluate the original data via BLEU:
python eval.py --debug True
Evaluate the rewritten data via BLEU:
python eval.py --hyp_dir 'blenderbot_small-90M_generate_rewrited.txt' --ref_dir <the rewritten data containing ##> --debug True
New evaluation (the bleu file is named bleuT; change the file name back to bleu when using it):
python metric_evaluate.py -metric [metric_name] -hyp [output_file] -ref [ground_truth]
Supported metric_name values: chrF, rouge, meteor
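To make the chrF option concrete, here is a minimal self-contained sketch of the metric (character n-gram F-beta score, whitespace ignored). The function name `chrf` is illustrative; metric_evaluate.py most likely wraps a library implementation rather than this code.

```python
from collections import Counter

def chrf(hyp, ref, max_n=6, beta=2.0):
    """Minimal chrF sketch: average character n-gram precision and recall
    over n = 1..max_n, combined into an F-beta score (beta=2 favors recall)."""
    def ngrams(s, n):
        s = s.replace(" ", "")  # chrF ignores whitespace
        return Counter(s[i:i + n] for i in range(len(s) - n + 1))
    precs, recs = [], []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        if sum(h.values()) == 0 or sum(r.values()) == 0:
            continue  # skip orders longer than either string
        overlap = sum((h & r).values())
        precs.append(overlap / sum(h.values()))
        recs.append(overlap / sum(r.values()))
    if not precs:
        return 0.0
    p, r = sum(precs) / len(precs), sum(recs) / len(recs)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```

Identical strings score 1.0 and disjoint strings score 0.0; production code should prefer a standard implementation (e.g. sacreBLEU's chrF) for comparable numbers.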
Train on original data:
python train.py
Train on rewritten data:
python train.py --data_dir_txt ../data/all_data_punc_rewritten.txt --eod_token '# #'
Generate with the checkpoint trained on original data:
python generate_batch_wise.py --data_dir all_data_punc.txt --model_ckpt pytorch_model.bin
Rewritten (using a different checkpoint, also named pytorch_model.bin):
python generate_batch_wise.py --data_dir all_data_punc_rewritten.txt --model_ckpt pytorch_model.bin