
Confusion about the blue4 of NarrativeQA #50

Open

yyTraveler opened this issue Jul 21, 2024 · 10 comments

@yyTraveler

No description provided.

yyTraveler changed the title from "About NarrativeQA" to "Confusion about the blue4 of NarrativeQA" on Jul 21, 2024
@yyTraveler
Author

I tried the experiment and followed the appendix methods, but the BLEU-4 is much higher than the metrics in the paper.

Using:

  • allenai/unifiedqa-v2-t5-3b-1363200
  • sentence-transformers/multi-qa-mpnet-base-cos-v1

Is there anything else special about how the BLEU-4 is calculated?

BLEU-1  BLEU-4  METEOR  ROUGE-L
0.21    0.10    0.17    0.31
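
For context, a minimal, self-contained sketch showing how much the smoothing function matters for sentence-level BLEU-4 on short answers (the metrics code posted below uses NLTK's method4 smoothing); the example strings here are made up:

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

reference = ["the", "mayor", "of", "the", "town"]  # tokenized ground-truth answer (made up)
hypothesis = ["the", "town", "mayor"]              # tokenized model prediction (made up)

# Without smoothing, BLEU-4 collapses to effectively 0 because the short
# answer has no matching 4-grams (NLTK also prints a warning here).
raw = sentence_bleu([reference], hypothesis, weights=(0.25, 0.25, 0.25, 0.25))

# With method4 smoothing, as in the evaluation code below, the score is nonzero.
smoothed = sentence_bleu(
    [reference],
    hypothesis,
    weights=(0.25, 0.25, 0.25, 0.25),
    smoothing_function=SmoothingFunction().method4,
)

print(f"BLEU-4 without smoothing: {raw:.4f}")
print(f"BLEU-4 with method4 smoothing: {smoothed:.4f}")
```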

@yyTraveler
Author

Metrics code here, adapted from AllenNLP:

import nltk

# Make sure the NLTK resources needed for tokenization (punkt) and METEOR (wordnet) are available.
for resource, path in [("punkt", "tokenizers/punkt"), ("wordnet", "corpora/wordnet")]:
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(resource)

import copy

import rouge  # the py-rouge package
from nltk.tokenize import word_tokenize
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score

# ROUGE-L evaluator from py-rouge; inputs are truncated to 100 words and stemmed.
rouge_l_evaluator = rouge.Rouge(
    metrics=["rouge-l"],
    max_n=4,
    limit_length=True,
    length_limit=100,
    length_limit_type="words",
    apply_avg=True,
    apply_best=True,
    alpha=0.5,
    weight_factor=1.2,
    stemming=True,
)


def bleu_1(p, g):
    # Sentence-level BLEU-1; NLTK's sentence_bleu expects (references, hypothesis).
    smoothie = SmoothingFunction().method4
    return sentence_bleu(g, p, weights=(1, 0, 0, 0), smoothing_function=smoothie)


def bleu_4(p, g):
    # Sentence-level BLEU-4 with method4 smoothing.
    smoothie = SmoothingFunction().method4
    return sentence_bleu(g, p, weights=(0.25, 0.25, 0.25, 0.25), smoothing_function=smoothie)


def meteor(p, g):
    # METEOR with naive whitespace tokenization (kept for reference; unused below).
    return meteor_score([x.split() for x in g], p.split())


def meteor_with_tokenize(p: str, g: str):
    # METEOR with NLTK word tokenization of both prediction and reference.
    pp = word_tokenize(p)
    gg = [word_tokenize(g)]
    return meteor_score(gg, pp)


def rouge_l(p, g):
    # ROUGE-L between a prediction string and a reference string.
    return rouge_l_evaluator.get_scores(p, g)


def metric_max_over_ground_truths(metric_fn, prediction, ground_truths, tokenize=False):
    # Score the prediction against every reference answer and keep the best score.
    scores_for_ground_truths = []
    for ground_truth in ground_truths:
        if tokenize:
            score = metric_fn(word_tokenize(prediction), [word_tokenize(ground_truth)])
        else:
            score = metric_fn(prediction, ground_truth)
        scores_for_ground_truths.append(score)

    first = scores_for_ground_truths[0]
    if isinstance(first, dict) and "rouge-l" in first:
        # ROUGE returns a nested dict; take the max of f / p / r over all references.
        max_score = copy.deepcopy(first)
        for key in ("f", "p", "r"):
            max_score["rouge-l"][key] = round(
                max(s["rouge-l"][key] for s in scores_for_ground_truths), 2
            )
        return max_score
    return round(max(scores_for_ground_truths), 2)


def get_metric_score(prediction, ground_truths):
    # Best score over all reference answers, per metric.
    bleu_1_score = metric_max_over_ground_truths(bleu_1, prediction, ground_truths, tokenize=True)
    bleu_4_score = metric_max_over_ground_truths(bleu_4, prediction, ground_truths, tokenize=True)
    # Use a name that does not shadow the imported meteor_score function.
    meteor_metric = metric_max_over_ground_truths(meteor_with_tokenize, prediction, ground_truths, tokenize=False)
    rouge_l_score = metric_max_over_ground_truths(rouge_l, prediction, ground_truths, tokenize=False)

    return (
        bleu_1_score,
        bleu_4_score,
        meteor_metric,
        rouge_l_score["rouge-l"]["f"],
        rouge_l_score["rouge-l"]["p"],
        rouge_l_score["rouge-l"]["r"],
    )
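
A minimal usage sketch for the functions above; the predictions and references here are made up, and the plain averaging over questions is an assumption about how per-question scores are aggregated, not necessarily how the paper reports its numbers:

```python
# Hypothetical data: each item pairs one model prediction with the list of
# reference answers for that question (NarrativeQA provides two per question).
examples = [
    ("He was the king of the castle.", ["The king.", "He was the king."]),
    ("She sailed to the island.", ["She travelled to the island by boat."]),
]

totals = [0.0] * 6  # BLEU-1, BLEU-4, METEOR, ROUGE-L f / p / r
for prediction, references in examples:
    scores = get_metric_score(prediction, references)
    totals = [t + s for t, s in zip(totals, scores)]

averages = [t / len(examples) for t in totals]
print(
    "BLEU-1: {:.2f}  BLEU-4: {:.2f}  METEOR: {:.2f}  ROUGE-L(F): {:.2f}".format(
        averages[0], averages[1], averages[2], averages[3]
    )
)
```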

@yyTraveler
Author

metrics... sorry for the typo :)

@kimoji919

Hello, I am confused about the experiment setting for NarrativeQA. Which should I use as the input for prediction: the summary or the full document?

@yyTraveler
Author

Hello, I am confused about the experiment setting for NarrativeQA. Which should I use as the input for prediction: the summary or the full document?

You should read the paper and code carefully. Summarization is used when building the tree, and QA happens in the LLM answering step.

@kimoji919

I think you may have misunderstood my meaning.
I would like to know how exactly this dataset is used in this paper: is the original data the full-text part of the dataset or the summary part?
I noticed that on page six, when introducing the dataset, the paper says the NarrativeQA dataset consists of question-answer pairs over the full texts of books and movie scripts.
Also, the paper segments the text into chunks of 100 tokens, while the summaries in the original dataset are roughly 600-900 tokens long, so I think it should be the full text rather than the summaries.
I am looking for a general way to use this dataset in the LLM era, and may not pay attention to some of the technical details of the paper itself, only to how the dataset is used. I understand this paper as a form of Structured Hierarchical Retrieval, where nodes are built from the full text during tree construction and then retrieved for QA.
So what you mean is that the full-text data is still used, but in this paper the raw data is processed into node-wise summaries and QA is then performed on the retrieved nodes, right?


@yyTraveler
Author

Yes, it's always the full text for this paper.

Quoting the experimental section of the original paper:

The NarrativeQA-Story task requires a comprehensive understanding of the entire narrative in order to accurately answer its questions, thus testing the model’s ability to comprehend longer texts in the literary domain.
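
For concreteness, here is a minimal sketch of where the full text and the summary live, assuming the Hugging Face `datasets` loader for `narrativeqa` (the field names below come from that loader, not from this repository's code):

```python
from datasets import load_dataset

# Load the NarrativeQA test split from the Hugging Face hub (this downloads the full stories).
data = load_dataset("narrativeqa", split="test")

example = data[0]
full_text = example["document"]["text"]           # full book / movie script (what the tree is built from)
summary = example["document"]["summary"]["text"]  # the ~600-900 token plot summary discussed above
question = example["question"]["text"]
answers = [a["text"] for a in example["answers"]]

print(len(full_text.split()), "words of full text;", len(summary.split()), "words of summary")
print("Q:", question, "| gold answers:", answers)
```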


@ET-yzk

This comment was marked as resolved.

@Ningyu-y

Ningyu-y commented Oct 21, 2024

Hello, I am confused about whether, when using the NarrativeQA dataset, a separate tree should be built for each item in the dataset, or one tree should be built for all the data in the dataset?


@Ningyu-y


Can you share a full script for how to evaluate this dataset? Thank you so much.
