
wikipedia corpus #5

Closed
hljjjmssyh opened this issue Nov 13, 2024 · 12 comments

@hljjjmssyh

Great job!
Could you share your Wikipedia corpus for retrieval?
I'm also curious about the size of the corpus and how the top-n recall metrics are calculated.

@CY-SCUT

CY-SCUT commented Nov 14, 2024

me too

@weizhepei
Owner

Sure! The retrieval corpus (Wikipedia) can be downloaded with this command: `wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz`.

The dataset used in our work is also available here. The recall metric measures the fraction of samples where the correct answer to the query is mentioned in the top-k retrieved documents.
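For illustration, a minimal sketch of how such a top-k recall could be computed (the field names `retrieved_docs` and `answers` are assumptions for this sketch, not the repo's actual schema):

```python
from typing import Dict, List

def recall_at_k(samples: List[Dict], k: int) -> float:
    """Fraction of samples whose gold answer appears (case-insensitively)
    in the text of the top-k retrieved passages."""
    hits = 0
    for sample in samples:
        # Passages are assumed to be sorted by retriever score.
        top_k_text = " ".join(doc.lower() for doc in sample["retrieved_docs"][:k])
        if any(ans.lower() in top_k_text for ans in sample["answers"]):
            hits += 1
    return hits / len(samples) if samples else 0.0
```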

Please let us know if you have any further questions!

@hljjjmssyh
Author

Thanks a lot. Another point of interest is why different datasets use different retrieval methods. In my opinion, BM25 and DPR can represent sparse retrieval and dense retrieval, respectively. What is the purpose of using other retrieval methods, such as Contriever and GTR?

@weizhepei
Owner

Thanks for bringing this up! We actually follow previous works to set up the retrieval process in each benchmark. For example, Self-RAG used Contriever for PopQA and TriviaQA, In-Context RALM used DPR for NQ, ALCE used GTR for ASQA, and FLARE used BM25 for 2WikiMultiHopQA. This provides diverse retrieval environments that help validate the flexibility and generalizability of our method.

Though our InstructRAG is agnostic to the choice of retrievers, I think it’s possible to further improve the RAG performance by enhancing the retrieval process with more advanced retrievers, which could help reduce noise in the retrieved documents.

@hljjjmssyh
Author

Thanks for your reply. I have another question regarding vanilla SFT. When I try to reproduce the results of vanilla SFT on the PopQA dataset, I get an accuracy of 44.3, which differs significantly from what is reported in the paper. Could you clarify whether there are any specific settings related to vanilla SFT that I might be missing? Additionally, I noticed that if the answer appears in the LLM's reply, it is considered a positive result. However, it seems that a longer response is more likely to achieve a higher score.

@weizhepei
Owner

That’s a bit unusual, and I’d suggest checking if there’s any misalignment in your training or evaluation process. For your reference, the training details for vanilla SFT are provided in Appendix B, and we did not apply any specific tricks during its training. This training script is configured for training InstructRAG-FT but can be straightforwardly adapted for training vanilla SFT. The only caveat is to ensure that your environment aligns with the configurations specified in our repository, as differences in library versions (e.g., transformers, PyTorch) can lead to non-trivial discrepancies. If you still encounter difficulties reproducing vanilla SFT, feel free to reach out, and we’ll be happy to assist!

Yes, your understanding of the evaluation metrics is correct. We actually discussed these limitations in both Section 3.4 and Section 5. While such pattern-matching based metrics are standard for question-answering tasks, they rely solely on lexical similarity and fail to capture semantic meaning. Moreover, these metrics can suffer from length bias, as longer responses tend to achieve higher accuracy. To address these shortcomings, we recommend validating the model with LLM-as-a-judge evaluation rather than pattern-matching based evaluation, which allows the judge to consider semantic equivalence and is expected to yield a fairer evaluation (see our Table 5).
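For reference, the pattern-matching accuracy being discussed is essentially a substring check; a minimal sketch (assuming plain lists of predictions and gold answers, not the repo's exact evaluation code):

```python
def substring_match_accuracy(predictions, gold_answers):
    """Pattern-matching accuracy: a response counts as correct if any gold
    answer string appears (case-insensitively) in it. Longer responses have
    more chances to contain a gold answer, which is the length bias noted above."""
    correct = sum(
        any(ans.lower() in pred.lower() for ans in answers)
        for pred, answers in zip(predictions, gold_answers)
    )
    return correct / len(predictions) if predictions else 0.0
```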

@hljjjmssyh
Author

Thank you for your prompt feedback. I followed the training details for vanilla SFT provided in Appendix B. However, I couldn't find how the LLM's output is constructed for training. The PopQA dataset provides multiple answers for a single question, so I would like to know the output format of the training data.

@kfchenhn

I downloaded your model from Hugging Face and ran `eval.sh` unchanged, but only achieved an accuracy of 47.1%.
[screenshot of eval output]

@weizhepei
Owner

> Thank you for your prompt feedback. I followed the training details for vanilla SFT provided in Appendix B. However, I couldn't find how the LLM's output is constructed for training. The PopQA dataset provides multiple answers for a single question, so I would like to know the output format of the training data.

@hljjjmssyh I think you can reuse our data preparation script and simply replace `sample['rationale']` with the answer in https://github.com/weizhepei/InstructRAG/blob/main/src/data_utils.py#L143. For samples with multiple answers, you can randomly choose one to format the data.
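A hedged sketch of that adaptation (the helper name and the `answers` field are illustrative; the actual schema in data_utils.py may differ):

```python
import random

def build_vanilla_sft_target(sample: dict) -> str:
    """Pick the training target for vanilla SFT: instead of the generated
    rationale (sample['rationale']), use one of the gold answers, chosen at
    random when the sample lists several acceptable answers."""
    answers = sample["answers"]
    if isinstance(answers, str):
        answers = [answers]
    return random.choice(answers)
```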

@weizhepei
Owner

> I downloaded your model from Hugging Face and ran `eval.sh` unchanged, but only achieved an accuracy of 47.1%.

@kfchenhn I just tested the model hosted on our HF repo, and it works well for me.

[screenshot of eval output]

You can follow setup.sh to configure the environment. Feel free to let us know or open a new issue if you need further assistance!

@kfchenhn

> @kfchenhn I just tested the model hosted on our HF repo, and it works well for me. You can follow [setup.sh](https://github.com/weizhepei/InstructRAG/blob/main/setup.sh) to configure the environment. Feel free to let us know or open a new issue if you need further assistance!

The code, model, and environment I used are exactly the same as those in your repo, but I still cannot achieve the same results as you. I suggest you check whether the online and offline content is consistent.

@weizhepei
Owner

Thanks for the suggestion. I've double-checked both the online model weights and the offline copies, and they are indeed consistent. I guess this difference might be due to a discrepancy in our hardware configurations?
