How to evaluate results after prediction? #3

Open
jiacheng-ye opened this issue Sep 3, 2022 · 8 comments

@jiacheng-ye

Hi Ohad,
Thanks for your awesome work! I have a couple of questions about using the code:

  1. How can I directly perform BM25 retrieval and few-shot inference on the validation set (the 26.0 shown in Table 3)?
  2. How can I evaluate the results given the predictions?
@jiacheng-ye
Author

I've figured out solutions to the above questions. With the default parameters in the codebase, I got 26.15 with BM25.
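For reference, my evaluation is essentially exact match over the inference output, something along these lines (a rough sketch, not the codebase's own script; the "prediction"/"answers" keys and the path are placeholders, and Break's LF-EM additionally normalizes the logical forms):

import json

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace before comparing strings.
    return " ".join(text.lower().split())

def exact_match(pred_file: str) -> float:
    # Expects a JSON list of records with "prediction" and "answers" string fields.
    with open(pred_file) as f:
        records = json.load(f)
    hits = sum(normalize(r["prediction"]) == normalize(r["answers"]) for r in records)
    return 100.0 * hits / len(records)

print(exact_match("path/to/validation_predictions.json"))  # placeholder path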

However, EPR performs even worse (22.9) after training the BERT-based retriever. I ran EPR with python run.py dataset=break dpr_epochs=120 gpus=1 partition=NLP. I'm not sure where it went wrong :(
Looking forward to your help, and thanks in advance.

@OhadRubin
Owner

OhadRubin commented Sep 5, 2022

Hey, this might be related to the fact that you are using a single GPU; the DPR setup benefits greatly from a large batch size.
The 31.9% LFEM result in the paper was obtained with 4 GPUs.
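For intuition, here is a toy sketch (not the actual DPR training code) of the in-batch-negative loss: every question in a batch is scored against every passage in the batch, so each example sees batch_size - 1 negatives, and the effective batch grows further if representations are gathered across GPUs.

import torch
import torch.nn.functional as F

def in_batch_nll(q_vecs: torch.Tensor, p_vecs: torch.Tensor) -> torch.Tensor:
    # q_vecs, p_vecs: [batch_size, dim]; the i-th passage is the positive for the i-th question.
    scores = q_vecs @ p_vecs.T                 # [batch, batch] similarity matrix
    targets = torch.arange(scores.size(0))     # positives sit on the diagonal
    return F.cross_entropy(scores, targets)

# With batch_size=120 each question competes against 119 in-batch negatives;
# a small per-GPU batch gives far fewer, which weakens the contrastive signal.
q, p = torch.randn(120, 768), torch.randn(120, 768)
print(in_batch_nll(q, p))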

@jiacheng-ye
Author

jiacheng-ye commented Sep 6, 2022

Hi,

Here is the full list of commands:

#!/bin/bash
#SBATCH --job-name=epr_mtop-null_v4
#SBATCH --output=outputs/epr_mtop-null_v4/out.txt
#SBATCH --error=outputs/epr_mtop-null_v4/out.txt
#SBATCH --partition=NLP
#SBATCH --time=12000
#SBATCH --quotatype=reserved
#SBATCH --gres=gpu:2
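
# Step 1: BM25 retrieval of candidate in-context examples for each training instance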
srun python find_bm25.py output_path=$PWD/data/bm25_mtop-null_a_train.json \
	 dataset_split=train setup_type=a task_name=mtop +ds_size=null L=50 \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
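
# Step 2: score the BM25 candidates to produce positives/hard negatives for retriever training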
srun accelerate launch --num_processes 2 --main_process_port 24821 \
	 scorer.py example_file=$PWD/data/bm25_mtop-null_a_train.json \
	 setup_type=qa \
	 output_file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
	 batch_size=8     +task_name=mtop +dataset_reader.ds_size=null \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
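
# Step 3: train the DPR biencoder (the EPR retriever) on the scored candidates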
srun python DPR/train_dense_encoder.py train_datasets=[epr_dataset] \
	 train=biencoder_local \
	 output_dir=$PWD/experiments/epr_mtop-null_a_train \
	 datasets.epr_dataset.file=$PWD/data/bm25_mtop-null_a_train_scoredqa.json \
	 datasets.epr_dataset.setup_type=qa  datasets.epr_dataset.hard_neg=true \
	 datasets.epr_dataset.task_name=mtop     datasets.epr_dataset.top_k=5 \
	 +gradient_accumulation_steps=1 train.batch_size=120 \
	 train.num_train_epochs=30 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
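
# Step 4: encode the candidate pool with the trained encoder to build the dense index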
srun python DPR/generate_dense_embeddings.py \
	 model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
	 ctx_src=dpr_epr shard_id=0 num_shards=1 \
	 out_file=$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index \
	 ctx_sources.dpr_epr.setup_type=qa \
	 ctx_sources.dpr_epr.task_name=mtop +ctx_sources.dpr_epr.ds_size=null \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
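
# Step 5: retrieve prompts for the validation split from the dense index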
srun python DPR/dense_retriever.py \
	 model_file=$PWD/experiments/epr_mtop-null_a_train/dpr_biencoder.29 \
	 qa_dataset=qa_epr ctx_datatsets=[dpr_epr] \
	 datasets.qa_epr.dataset_split=validation \
	 encoded_ctx_files=["$PWD/experiments/epr_mtop-null_a_train/dpr_enc_index_*"] \
	 out_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
	 ctx_sources.dpr_epr.setup_type=qa \
	 ctx_sources.dpr_epr.task_name=mtop datasets.qa_epr.task_name=mtop \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4
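
# Step 6: run few-shot inference with the retrieved prompts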
srun accelerate launch --num_processes 2 --main_process_port 24821 \
	 inference.py \
	 prompt_file=$PWD/data/validation_epr_mtop-null_a_train_prompts.json \
	 task_name=mtop \
	 output_file=$PWD/data/validation_epr_mtop-null_a_train_prede.json \
	 batch_size=10 max_length=1950 \
	 hydra.run.dir=$PWD/outputs/epr_mtop-null_v4

On the mtop dataset, the number of training examples is 95,961. The training loss is around 0.07 after 30 epochs (average loss per batch: 0.071158).

Since I'm using A100 80GB GPUs, I only use two of them, which is sufficient for a batch size of 120.
Finally, I got 25.19 on break and 50.87 on mtop.
Any advice would be helpful 😂

@OhadRubin
Owner

I think dpr_epochs=120 is the correct hyperparameter; the contrastive learning objective improves greatly with more compute.
I think the default of dpr_epochs=30 was from when I needed to run a large number of experiments.
To recreate our results, I think 120 epochs are necessary.

@jiacheng-ye
Author

I got 49.17 after training for 120 epochs on mtop; it's still weird... 😂

@OhadRubin
Owner

I will run some tests of my own and try to make sense of this thing.
I'll keep you updated!

@jiacheng-ye
Author

Hi Ohad, do you have any updates? :)

@RobertMarton

Nice work! Does anyone know where to find the environment/requirements file for EPR?
