How to evaluate results after prediction? #3

Open
jiacheng-ye opened this issue Sep 3, 2022 · 8 comments

@jiacheng-ye

Hi Ohad,
Thanks for your awesome work! I have a couple of questions about using the code:

  1. How can I directly perform BM25 retrieval and few-shot inference on the validation set (the 26.0 shown in Table 3)?
  2. How can I evaluate the results given the predictions?
@jiacheng-ye
Author

I've figured out solutions to the above questions. With the default parameters in the codebase, I got 26.15 with BM25.
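For reference, my evaluation is essentially exact match over the inference output, something along these lines (a rough sketch, not the codebase's own script; the "prediction"/"answers" keys and the path are placeholders, and Break's LF-EM additionally normalizes the logical forms):

import json

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing strings.
    return " ".join(text.lower().split())

def exact_match(pred_file: str) -> float:
    # Expects a JSON list of records with "prediction" and "answers" string fields.
    with open(pred_file) as f:
        records = json.load(f)
    hits = sum(normalize(r["prediction"]) == normalize(r["answers"]) for r in records)
    return 100.0 * hits / len(records)

print(exact_match("path/to/validation_predictions.json"))  # placeholder path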

However, EPR performs even worse (22.9) after training the BERT-based retriever. I ran EPR with python run.py dataset=break dpr_epochs=120 gpus=1 partition=NLP. I'm not sure where it went wrong :(
Looking forward to your help, and thanks in advance.

@OhadRubin
Owner

OhadRubin commented Sep 5, 2022

Hey, this might be related to the fact that you are using a single GPU; the DPR setup benefits greatly from a large batch size.
The 31.9% LFEM result in the paper was obtained with 4 GPUs.
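For intuition, here is a toy sketch (not the actual DPR training code) of the in-batch-negative loss: every question in a batch is scored against every passage in the batch, so each example sees batch_size - 1 negatives, and the effective batch grows further if representations are gathered across GPUs.

import torch
import torch.nn.functional as F

def in_batch_nll(q_vecs: torch.Tensor, p_vecs: torch.Tensor) -> torch.Tensor:
    # q_vecs, p_vecs: [batch_size, dim]; the i-th passage is the positive for the i-th question.
    scores = q_vecs @ p_vecs.T                 # [batch, batch] similarity matrix
    targets = torch.arange(scores.size(0))     # positives sit on the diagonal
    return F.cross_entropy(scores, targets)

# With batch_size=120 each question competes against 119 in-batch negatives;
# a small per-GPU batch gives far fewer, which weakens the contrastive signal.
q, p = torch.randn(120, 768), torch.randn(120, 768)
print(in_batch_nll(q, p))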

@jiacheng-ye
Author

jiacheng-ye commented Sep 6, 2022

Hi,

Here is the full list of commands:

#!/bin/bash
#SBATCH --job-name=epr_mtop-null_v4
#SBATCH --output=outputs/epr_mtop-null_v4/out.txt
#SBATCH --error=outputs/epr_mtop-null_v4/out.txt
#SBATCH --partition=NLP
#SBATCH --time=12000
#SBATCH --quotatype=reserved
#SBATCH --gres=gpu:2
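
# Step 1: BM25 retrieval of candidate in-context examples for each training instance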
srun python find_bm25.py output_path=$PWD/data/bm25_mtop-null_a_train.json \
	 dataset_split=train setup_type=a task_name=mtop +ds_size=null L=50 \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
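
# Step 2: score the BM25 candidates to produce positives/hard negatives for retriever training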
srun accelerate launch --num_processes 2 --main_process_port 24821 \
	 scorer.py example_file=$PWD/data/bm25_mtop-null_a_train.json \
	 setup_type=qa \
	 output_file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
	 batch_size=8     +task_name=mtop +dataset_reader.ds_size=null \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
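
# Step 3: train the DPR biencoder (the EPR retriever) on the scored candidates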
srun python DPR/train_dense_encoder.py train_datasets=[epr_dataset] \
	 train=biencoder_local \
	 output_dir=$PWD/experiments/epr_mtop-null_a_train \
	 datasets.epr_dataset.file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
	 datasets.epr_dataset.setup_type=qa  datasets.epr_dataset.hard_neg=true \
	 datasets.epr_dataset.task_name=mtop     datasets.epr_dataset.top_k=5 \
	 +gradient_accumulation_steps=1 train.batch_size=120 \
	 train.num_train_epochs=30 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
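
# Step 4: encode the candidate pool with the trained encoder to build the dense index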
srun python DPR/generate_dense_embeddings.py \
	 model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
	 ctx_src=dpr_epr shard_id=0 num_shards=1 \
	 out_file=$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index \
	 ctx_sources.dpr_epr.setup_type=qa \
	 ctx_sources.dpr_epr.task_name=mtop +ctx_sources.dpr_epr.ds_size=null \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
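
# Step 5: retrieve prompts for the validation split from the dense index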
srun python DPR/dense_retriever.py \
	 model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
	 qa_dataset=qa_epr ctx_datatsets=[dpr_epr] \
	 datasets.qa_epr.dataset_split=validation \
	 encoded_ctx_files=["$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index_*"] \
	 out_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
	 ctx_sources.dpr_epr.setup_type=qa \
	 ctx_sources.dpr_epr.task_name=mtop datasets.qa_epr.task_name=mtop \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
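
# Step 6: run few-shot inference with the retrieved prompts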
srun accelerate launch --num_processes 2 --main_process_port 24821 \
	 inference.py \
	 prompt_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
	 task_name=mtop \
	 output_file=$PWD/data/validation_epr_mtop-null_a_train_prede.json \
	 batch_size=10 max_length=1950 \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4

On the mtop dataset, the number of training examples is 95,961. The training loss is around 0.07 after 30 epochs (average loss per batch: 0.071158).

Since I'm using A100 80GB GPUs, I only use two of them, which is sufficient for a batch size of 120.
Finally, I got 25.19 on break and 50.87 on mtop.
Any advice would be helpful 😂

@OhadRubin
Owner

I think dpr_epochs=120 is the correct hyperparameter; the contrastive learning objective improves greatly with more compute.
I think the default of dpr_epochs=30 was from when I needed to run a large number of experiments.
To recreate our results, I think 120 epochs are necessary.

@jiacheng-ye
Author

I got 49.17 after training for 120 epochs on mtop; it's still weird... 😂

@OhadRubin
Owner

I will run some tests of my own and try to make sense of this thing.
I'll keep you updated!

@jiacheng-ye
Author

Hi Ohad, do you have any updates? :)

@RobertMarton

Nice work! Does anyone know where to find the environment/requirements file for EPR?
