
Confusion about the blue4 of NarrativeQA #50

Open

yyTraveler opened this issue Jul 21, 2024 · 10 comments

@yyTraveler

No description provided.

yyTraveler changed the title from "About NarrativeQA" to "Confusion about the blue4 of NarrativeQA" on Jul 21, 2024
@yyTraveler
Author

I tried the experiment and followed the appendix methods, but the BLEU-4 is much higher than the metrics in the paper.

Using:

  • allenai/unifiedqa-v2-t5-3b-1363200
  • sentence-transformers/multi-qa-mpnet-base-cos-v1

Is there anything else special about how the BLEU-4 is calculated?

BLEU-1  BLEU-4  METEOR  ROUGE-L
0.21    0.10    0.17    0.31
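
For context, a minimal, self-contained sketch showing how much the smoothing function matters for sentence-level BLEU-4 on short answers (the metrics code posted below uses NLTK's method4 smoothing); the example strings here are made up:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "mayor", "of", "the", "town"]  # tokenized ground-truth answer (made up)
hypothesis = ["the", "town", "mayor"]              # tokenized model prediction (made up)

# Without smoothing, BLEU-4 collapses to effectively 0 because the short
# answer has no matching 4-grams (NLTK also prints a warning here).
raw = sentence_bleu([reference], hypothesis, weights=(0.25, 0.25, 0.25, 0.25))

# With method4 smoothing, as in the evaluation code below, the score is nonzero.
smoothed = sentence_bleu(
    [reference],
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method4,
)

print(f"BLEU-4 without smoothing: {raw:.4f}")
print(f"BLEU-4 with method4 smoothing: {smoothed:.4f}")
```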

@yyTraveler
Author

Metrics code here, adapted from AllenNLP:

import nltk

# Make sure the NLTK resources needed for tokenization (punkt) and METEOR (wordnet) are available.
for resource, path in [("punkt", "tokenizers/punkt"), ("wordnet", "corpora/wordnet")]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)

import copy

import rouge  # the py-rouge package
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

# ROUGE-L evaluator from py-rouge; inputs are truncated to 100 words and stemmed.
rouge_l_evaluator = rouge.Rouge(
    metrics=["rouge-l"],
    max_n=4,
    limit_length=True,
    length_limit=100,
    length_limit_type="words",
    apply_avg=True,
    apply_best=True,
    alpha=0.5,
    weight_factor=1.2,
    stemming=True,
)


def bleu_1(p, g):
    # Sentence-level BLEU-1; NLTK's sentence_bleu expects (references, hypothesis).
    smoothie = SmoothingFunction().method4
    return sentence_bleu(g, p, weights=(1, 0, 0, 0), smoothing_function=smoothie)


def bleu_4(p, g):
    # Sentence-level BLEU-4 with method4 smoothing.
    smoothie = SmoothingFunction().method4
    return sentence_bleu(g, p, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)


def meteor(p, g):
    # METEOR with naive whitespace tokenization (kept for reference; unused below).
    return meteor_score([x.split() for x in g], p.split())


def meteor_with_tokenize(p: str, g: str):
    # METEOR with NLTK word tokenization of both prediction and reference.
    pp = word_tokenize(p)
    gg = [word_tokenize(g)]
    return meteor_score(gg, pp)


def rouge_l(p, g):
    # ROUGE-L between a prediction string and a reference string.
    return rouge_l_evaluator.get_scores(p, g)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths, tokenize=False):
    # Score the prediction against every reference answer and keep the best score.
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        if tokenize:
            score = metric_fn(word_tokenize(prediction), [word_tokenize(ground_truth)])
        else:
            score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)

    first = scores_for_ground_truths[0]
    if isinstance(first, dict) and "rouge-l" in first:
        # ROUGE returns a nested dict; take the max of f / p / r over all references.
        max_score = copy.deepcopy(first)
        for key in ("f", "p", "r"):
            max_score["rouge-l"][key] = round(
                max(s["rouge-l"][key] for s in scores_for_ground_truths), 2
            )
        return max_score
    return round(max(scores_for_ground_truths), 2)


def get_metric_score(prediction, ground_truths):
    # Best score over all reference answers, per metric.
    bleu_1_score = metric_max_over_ground_truths(bleu_1, prediction, ground_truths, tokenize=True)
    bleu_4_score = metric_max_over_ground_truths(bleu_4, prediction, ground_truths, tokenize=True)
    # Use a name that does not shadow the imported meteor_score function.
    meteor_metric = metric_max_over_ground_truths(meteor_with_tokenize, prediction, ground_truths, tokenize=False)
    rouge_l_score = metric_max_over_ground_truths(rouge_l, prediction, ground_truths, tokenize=False)

    return (
        bleu_1_score,
        bleu_4_score,
        meteor_metric,
        rouge_l_score["rouge-l"]["f"],
        rouge_l_score["rouge-l"]["p"],
        rouge_l_score["rouge-l"]["r"],
    )
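
A minimal usage sketch for the functions above; the predictions and references here are made up, and the plain averaging over questions is an assumption about how per-question scores are aggregated, not necessarily how the paper reports its numbers:

```python
# Hypothetical data: each item pairs one model prediction with the list of
# reference answers for that question (NarrativeQA provides two per question).
examples = [
    ("He was the king of the castle.", ["The king.", "He was the king."]),
    ("She sailed to the island.", ["She travelled to the island by boat."]),
]

totals = [0.0] * 6  # BLEU-1, BLEU-4, METEOR, ROUGE-L f / p / r
for prediction, references in examples:
    scores = get_metric_score(prediction, references)
    totals = [t + s for t, s in zip(totals, scores)]

averages = [t / len(examples) for t in totals]
print(
    "BLEU-1: {:.2f}  BLEU-4: {:.2f}  METEOR: {:.2f}  ROUGE-L(F): {:.2f}".format(
        averages[0], averages[1], averages[2], averages[3]
    )
)
```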

@yyTraveler
Author

metrics... sorry for the typo :)

@kimoji919

Hello, I am confused about the experiment setting for NarrativeQA. Which should I use as the input for prediction: the summary or the full document?

@yyTraveler
Author

Hello, I am confused about the experiment setting for NarrativeQA. Which should I use as the input for prediction: the summary or the full document?

You should read the paper and code carefully. Summarization is used when building the tree, and QA happens in the LLM answering step.

@kimoji919

I think you may have misunderstood my meaning.
I would like to know how exactly this dataset is used in this paper: is the original data the full-text part of the dataset or the summary part?
I noticed that on page six, when introducing the dataset, the paper says the NarrativeQA dataset consists of question-answer pairs over the full texts of books and movie scripts.
Also, the paper segments the text into chunks of 100 tokens, while the summaries in the original dataset are roughly 600-900 tokens long, so I think it should be the full text rather than the summaries.
I am looking for a general way to use this dataset in the LLM era, and may not pay attention to some of the technical details of the paper itself, only to how the dataset is used. I understand this paper as a form of Structured Hierarchical Retrieval, where nodes are built from the full text during tree construction and then retrieved for QA.
So what you mean is that the full-text data is still used, but in this paper the raw data is processed into node-wise summaries and QA is then performed on the retrieved nodes, right?


@yyTraveler
Author

Yes, it's always the full text for this paper.

Quoting the experimental section of the original paper:

The NarrativeQA-Story task requires a comprehensive understanding of the entire narrative in order to accurately answer its questions, thus testing the model’s ability to comprehend longer texts in the literary domain.
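
For concreteness, here is a minimal sketch of where the full text and the summary live, assuming the Hugging Face `datasets` loader for `narrativeqa` (the field names below come from that loader, not from this repository's code):

```python
from datasets import load_dataset

# Load the NarrativeQA test split from the Hugging Face hub (this downloads the full stories).
data = load_dataset("narrativeqa", split="test")

example = data[0]
full_text = example["document"]["text"]           # full book / movie script (what the tree is built from)
summary = example["document"]["summary"]["text"]  # the ~600-900 token plot summary discussed above
question = example["question"]["text"]
answers = [a["text"] for a in example["answers"]]

print(len(full_text.split()), "words of full text;", len(summary.split()), "words of summary")
print("Q:", question, "| gold answers:", answers)
```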


@ET-yzk

This comment was marked as resolved.

@Ningyu-y

Ningyu-y commented Oct 21, 2024

Hello, I am confused about whether, when using the NarrativeQA dataset, a separate tree should be built for each item in the dataset, or one tree should be built for all the data in the dataset?


@Ningyu-y


Can you share a full script for how to evaluate this dataset? Thank you so much.
