
wikipedia corpus #5

Closed
hljjjmssyh opened this issue Nov 13, 2024 · 12 comments

@hljjjmssyh

Great job!
Could you share your Wikipedia corpus for retrieval?
I'm also curious about the size of the corpus and how the top-n recall metrics are calculated.

@CY-SCUT

CY-SCUT commented Nov 14, 2024

me too

@weizhepei
Owner

Sure! The retrieval corpus (Wikipedia) can be downloaded with this command: `wget https://dl.fbaipublicfiles.com/dpr/wikipedia_split/psgs_w100.tsv.gz`.

The dataset used in our work is also available here. The recall metric measures the fraction of samples where the correct answer to the query is mentioned in the top-k retrieved documents.
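For illustration, a minimal sketch of how such a top-k recall could be computed (the field names `retrieved_docs` and `answers` are assumptions for this sketch, not the repo's actual schema):

```python
from typing import Dict, List

def recall_at_k(samples: List[Dict], k: int) -> float:
    """Fraction of samples whose gold answer appears (case-insensitively)
    in the text of the top-k retrieved passages."""
    hits = 0
    for sample in samples:
        # Passages are assumed to be sorted by retriever score.
        top_k_text = " ".join(doc.lower() for doc in sample["retrieved_docs"][:k])
        if any(ans.lower() in top_k_text for ans in sample["answers"]):
            hits += 1
    return hits / len(samples) if samples else 0.0
```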

Please let us know if you have any further questions!

@hljjjmssyh
Author

Thanks a lot. Another point of interest is why different datasets use different retrieval methods. In my opinion, BM25 and DPR can represent sparse retrieval and dense retrieval, respectively. What is the purpose of using other retrieval methods, such as Contriever and GTR?

@weizhepei
Owner

Thanks for bringing this up! We actually follow previous works to set up the retrieval process in each benchmark. For example, Self-RAG used Contriever for PopQA and TriviaQA, In-Context RALM used DPR for NQ, ALCE used GTR for ASQA, and FLARE used BM25 for 2WikiMultiHopQA. This provides diverse retrieval environments that help validate the flexibility and generalizability of our method.

Though our InstructRAG is agnostic to the choice of retrievers, I think it’s possible to further improve the RAG performance by enhancing the retrieval process with more advanced retrievers, which could help reduce noise in the retrieved documents.

@hljjjmssyh
Author

Thanks for your reply. I have another question regarding vanilla SFT. When I try to reproduce the results of vanilla SFT on the PopQA dataset, I get an accuracy of 44.3, which differs significantly from what is reported in the paper. Could you clarify whether there are any specific settings related to vanilla SFT that I might be missing? Additionally, I noticed that if the answer appears in the LLM's reply, it is considered a positive result. However, it seems that a longer response is more likely to achieve a higher score.

@weizhepei
Owner

That’s a bit unusual, and I’d suggest checking if there’s any misalignment in your training or evaluation process. For your reference, the training details for vanilla SFT are provided in Appendix B, and we did not apply any specific tricks during its training. This training script is configured for training InstructRAG-FT but can be straightforwardly adapted for training vanilla SFT. The only caveat is to ensure that your environment aligns with the configurations specified in our repository, as differences in library versions (e.g., transformers, PyTorch) can lead to non-trivial discrepancies. If you still encounter difficulties reproducing vanilla SFT, feel free to reach out, and we’ll be happy to assist!

Yes, your understanding of the evaluation metrics is correct. We actually discussed these limitations in both Section 3.4 and Section 5. While such pattern-matching based metrics are standard for question-answering tasks, they rely solely on lexical similarity and fail to capture semantic meaning. Moreover, these metrics can suffer from length bias, as longer responses tend to achieve higher accuracy. To address these shortcomings, we recommend validating the model with LLM-as-a-judge evaluation rather than pattern-matching based evaluation, which allows the judge to consider semantic equivalence and is expected to yield a fairer evaluation (see our Table 5).
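For reference, the pattern-matching accuracy being discussed is essentially a substring check; a minimal sketch (assuming plain lists of predictions and gold answers, not the repo's exact evaluation code):

```python
def substring_match_accuracy(predictions, gold_answers):
    """Pattern-matching accuracy: a response counts as correct if any gold
    answer string appears (case-insensitively) in it. Longer responses have
    more chances to contain a gold answer, which is the length bias noted above."""
    correct = sum(
        any(ans.lower() in pred.lower() for ans in answers)
        for pred, answers in zip(predictions, gold_answers)
    )
    return correct / len(predictions) if predictions else 0.0
```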

@hljjjmssyh
Author

Thank you for your prompt feedback. I followed the training details for vanilla SFT provided in Appendix B. However, I couldn't find how the LLM's output is constructed for training. The PopQA dataset provides multiple answers for a single question, so I would like to know the output format of the training data.

@kfchenhn

I downloaded your model from Hugging Face and ran `eval.sh` unchanged, but only achieved an accuracy of 47.1%.
[screenshot of eval output]

@weizhepei
Owner

> Thank you for your prompt feedback. I followed the training details for vanilla SFT provided in Appendix B. However, I couldn't find how the LLM's output is constructed for training. The PopQA dataset provides multiple answers for a single question, so I would like to know the output format of the training data.

@hljjjmssyh I think you can reuse our data preparation script and simply replace `sample['rationale']` with the answer in https://github.com/weizhepei/InstructRAG/blob/main/src/data_utils.py#L143. For samples with multiple answers, you can randomly choose one to format the data.
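A hedged sketch of that adaptation (the helper name and the `answers` field are illustrative; the actual schema in data_utils.py may differ):

```python
import random

def build_vanilla_sft_target(sample: dict) -> str:
    """Pick the training target for vanilla SFT: instead of the generated
    rationale (sample['rationale']), use one of the gold answers, chosen at
    random when the sample lists several acceptable answers."""
    answers = sample["answers"]
    if isinstance(answers, str):
        answers = [answers]
    return random.choice(answers)
```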

@weizhepei
Owner

> I downloaded your model from Hugging Face and ran `eval.sh` unchanged, but only achieved an accuracy of 47.1%.

@kfchenhn I just tested the model hosted on our HF repo, and it works well for me.

[screenshot of eval output]

You can follow setup.sh to configure the environment. Feel free to let us know or open a new issue if you need further assistance!

@kfchenhn

> @kfchenhn I just tested the model hosted on our HF repo, and it works well for me. You can follow [setup.sh](https://github.com/weizhepei/InstructRAG/blob/main/setup.sh) to configure the environment. Feel free to let us know or open a new issue if you need further assistance!

The code, model, and environment I used are exactly the same as those in your repo, but I still cannot achieve the same results as you. I suggest you check whether the online and offline content is consistent.

@weizhepei
Owner

Thanks for the suggestion. I've double-checked both the online model weights and the offline copies, and they are indeed consistent. I guess this difference might be due to a discrepancy in our hardware configurations?
