4 hard negative trained retriever
t1101675 committed Aug 13, 2023
1 parent 7020e63 commit 7c84016
Showing 8 changed files with 8 additions and 8 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -150,7 +150,7 @@ which will merge the two pairs into `train_lm_0.bin` and `train_lm_0.idx`.
```bash
python3 tools/process_retriever_train_data.py --save retriever_data --data-names TRAIN
```
-+ Train the retriever. The `train.jsonl` and `valid.jsonl` data should be put in `retriever_data/TRAIN/p1_en1_hn1_s42/merge`. The trained retriever can be downloaded from this [link](https://huggingface.co/t1101675/PICL/tree/main/results/retriever).
++ Train the retriever. The `train.jsonl` and `valid.jsonl` data should be put in `retriever_data/TRAIN/p1_en1_hn4_s42/merge`. The trained retriever can be downloaded from this [link](https://huggingface.co/t1101675/PICL/tree/main/results/retriever).
```bash
bash scripts/retriever/train.sh ${BASE_PATH}
```
2 changes: 1 addition & 1 deletion scripts/filter/filter.sh
@@ -18,7 +18,7 @@ CKPT_NAME="gpt2-large"
CKPT="${BASE_PATH}/results/${CKPT_NAME}/"
# data
RAW_DATA="100K_128"
-SEARCH_DATA="${RAW_DATA}/TRAIN_p1_en1_hn1_s42_lr5e-05-bs64-G1_4375.pt/L2"
+SEARCH_DATA="${RAW_DATA}/TRAIN_p1_en1_hn4_s42_lr5e-05-bs64-G1_4212.pt/L2"
DATA_DIR="${BASE_PATH}/pretrain_data/${RAW_DATA}/gpt2"
IDX_DATA_DIR="${BASE_PATH}/pretrain_data/retrieval_results/${SEARCH_DATA}"
# hp
2 changes: 1 addition & 1 deletion scripts/pretrain/pretrain_picl_gpt2_large.sh
@@ -18,7 +18,7 @@ CKPT_NAME="gpt2-large"
CKPT="${BASE_PATH}/results/${CKPT_NAME}/"
# data
CORPUS="100K_128"
-DATA_PREFIX="picl/${CORPUS}_TRAIN_p1_en1_hn1_s42_lr5e-05-bs64-G1_4375.pt_L2_filtered_0.0"
+DATA_PREFIX="picl/${CORPUS}_TRAIN_p1_en1_hn4_s42_lr5e-05-bs64-G1_4212.pt_L2_filtered_0.0"
LM_DATA_PREFIX="full_doc/"
DATA_DIR="${BASE_PATH}/pretrain_data/${CORPUS}/gpt2"
IDX_DATA_DIR="${BASE_PATH}/pretrain_data/${DATA_PREFIX}"
2 changes: 1 addition & 1 deletion scripts/pretrain/pretrain_picl_gpt2_xlarge.sh
@@ -18,7 +18,7 @@ CKPT_NAME="gpt2-xlarge"
CKPT="${BASE_PATH}/results/${CKPT_NAME}/"
# data
CORPUS="100K_128"
-DATA_PREFIX="picl/${CORPUS}_TRAIN_p1_en1_hn1_s42_lr5e-05-bs64-G1_4375.pt_L2_filtered_0.0"
+DATA_PREFIX="picl/${CORPUS}_TRAIN_p1_en1_hn4_s42_lr5e-05-bs64-G1_4212.pt_L2_filtered_0.0"
LM_DATA_PREFIX="full_doc/"
DATA_DIR="${BASE_PATH}/pretrain_data/${CORPUS}/gpt2"
IDX_DATA_DIR="${BASE_PATH}/pretrain_data/${DATA_PREFIX}"
2 changes: 1 addition & 1 deletion scripts/pretrain/pretrain_picl_gpt_neo.sh
@@ -18,7 +18,7 @@ CKPT_NAME="gpt-neo"
CKPT="${BASE_PATH}/checkpoints/${CKPT_NAME}/"
# data
CORPUS="100K_128"
-DATA_PREFIX="picl/${CORPUS}_TRAIN_p1_en1_hn1_s42_lr5e-05-bs64-G1_4375.pt_L2_filtered_0.0"
+DATA_PREFIX="picl/${CORPUS}_TRAIN_p1_en1_hn4_s42_lr5e-05-bs64-G1_4375.pt_L2_filtered_0.0"
LM_DATA_PREFIX="full_doc/"
DATA_DIR="${BASE_PATH}/pretrain_data/${CORPUS}/gpt-j"
IDX_DATA_DIR="${BASE_PATH}/pretrain_data/${DATA_PREFIX}"
2 changes: 1 addition & 1 deletion scripts/retriever/infer.sh
@@ -16,7 +16,7 @@ BATCH_SIZE=128
SEED=10
MAX_LEN=256
# runtime
-CKPT=${4-"TRAIN_p1_en1_hn1_s42/lr5e-05-bs64-G1/4375.pt"}
+CKPT=${4-"TRAIN_p1_en1_hn4_s42/lr5e-05-bs64-G1/4212.pt"}
LOAD_PATH="${WORKING_DIR}/results/retriever/${CKPT}"
SAVE_PATH="${WORKING_DIR}/pretrain_data/retrieval_results/"

2 changes: 1 addition & 1 deletion scripts/retriever/search.sh
@@ -4,7 +4,7 @@ WORKING_DIR=${1}

MODEL_DIR="${WORKING_DIR}/checkpoints/roberta-base/"
METRIC_TYPE="L2"
-DATA_NAME=${2-"100K_128/TRAIN_p1_en1_hn1_s42_lr5e-05-bs64-G1_4375.pt"}
+DATA_NAME=${2-"100K_128/TRAIN_p1_en1_hn4_s42_lr5e-05-bs64-G1_4212.pt"}
EMBED_DIR="${WORKING_DIR}/pretrain_data/retrieval_results/${DATA_NAME}"
SAVE_DIR="${WORKING_DIR}/pretrain_data/retrieval_results/"
MAX_NUM=-1
2 changes: 1 addition & 1 deletion scripts/tools/split_picl_train_valid.sh
@@ -1,5 +1,5 @@
BASE_PATH=${1}
-DATA_PREFIX=${2-"100K_128_TRAIN_p1_en1_hn1_s42_lr5e-05-bs64-G1_4375.pt_L2/filtered_0.0"}
+DATA_PREFIX=${2-"100K_128_TRAIN_p1_en1_hn4_s42_lr5e-05-bs64-G1_4212.pt_L2/filtered_0.0"}
DATA_NAME="filtered"

MAX_LENGTH=1024
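
The changed strings in this commit all appear to be built from the same retriever checkpoint path, with `/` replaced by `_`. The sketch below shows that composition for the `hn4` checkpoint; it is not part of the repository, which hardcodes each string, and the variable names `CKPT_TAG` and `METRIC` are hypothetical helpers introduced here for illustration.

```bash
#!/bin/sh
# Illustrative only: derive the path strings touched by this commit from one
# checkpoint path. CKPT_TAG and METRIC are assumed names, not repo variables.
RAW_DATA="100K_128"
METRIC="L2"
CKPT="TRAIN_p1_en1_hn4_s42/lr5e-05-bs64-G1/4212.pt"   # default in scripts/retriever/infer.sh

# Flatten the checkpoint path: every "/" becomes "_".
CKPT_TAG="$(printf '%s' "${CKPT}" | tr '/' '_')"

# scripts/retriever/search.sh
DATA_NAME="${RAW_DATA}/${CKPT_TAG}"
# scripts/filter/filter.sh
SEARCH_DATA="${RAW_DATA}/${CKPT_TAG}/${METRIC}"
# scripts/pretrain/pretrain_picl_gpt2_large.sh
DATA_PREFIX="picl/${RAW_DATA}_${CKPT_TAG}_${METRIC}_filtered_0.0"
# scripts/tools/split_picl_train_valid.sh
SPLIT_PREFIX="${RAW_DATA}_${CKPT_TAG}_${METRIC}/filtered_0.0"

echo "${DATA_NAME}"
echo "${SEARCH_DATA}"
echo "${DATA_PREFIX}"
echo "${SPLIT_PREFIX}"
```

Comparing the printed values against each script's default is a quick way to spot a file where the checkpoint step was not updated together with the `hn` setting.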