
The "anwser" for some examples in "qasper.jsonl" is strange #67

Open
Zcchill opened this issue Jul 9, 2024 · 6 comments

Comments

Zcchill commented Jul 9, 2024

I downloaded the data from the official URL and found that the "answers" field of several examples in "qasper.jsonl" is confusing. Here are a few examples:
{"pred": "No", "answers": ["Yes", "No"], "all_classes": null, "length": 2317, "input": "Does this method help in sentiment classification task improvement?", "_id": "bcfe56efad9715cc714ffd2e523eaa9ad796a453e7da77a6"}
{"pred": "unanswerable", "answers": ["Yes", "Unanswerable"], "all_classes": null, "length": 2284, "actual_length": 3533, "input": "Is jiant compatible with models in any programming language?", "_id": "e5d1d589ddb30f43547012f04b06ac2924a1f4fdcf56daab"}
{"pred": "BERTBase", "answers": ["BERTbase", "BERTbase"], "all_classes": null, "length": 3852, "actual_length": 5701, "input": "What BERT model do they test?", "_id": "2a51c07e65a9214ed2cd3c04303afa205e005f4e1ccb172a"}


Zcchill commented Jul 10, 2024

Another example: "_id": "d1aa1132439bd292965634095bf1c9943e062bb6645ff78c".
The query is "how many tags do they look at?"
The given answer seems to be sourced from "We employ two sources of e-book annotation data: (i) editor tags, and (ii) Amazon search terms. For editor tags, we collect data of 48,705 e-books from 13 publishers, namely Kunstmann, Delius-Klasnig, VUR, HJR, Diogenes, Campus, Kiwi, Beltz, Chbeck, Rowohlt, Droemer, Fischer and Neopubli."
But I think an answer of "30 tags", based on "As shown in Table TABREF3, we collect Amazon review keywords for 2,896 e-books (publishers: Kiwi, Rowohlt, Fischer, and Droemer), which leads to 33,663 distinct review keywords and on average 30 keyword assignments per e-book.", would be more appropriate.


bys0318 commented Jul 10, 2024

Thanks for your keen observation. We sampled the data directly from the Qasper test set, so we suggest asking the authors of Qasper.


Zcchill commented Jul 10, 2024

Besides, I would like to replicate the "GPT-3.5-Turbo-16k" results from the paper, but the results I get are not very close to those reported. I wonder what the possible reasons are, since there is no official code for the API-based method.
The results I get are as follows:
{
"2wikimqa": {
"0-4k": 57.09,
"4-8k": 42.82,
"8k+": 32.71
},
"hotpotqa": {
"0-4k": 68.44,
"4-8k": 57.25,
"8k+": 55.38
},
"multi_news": {
"0-4k": 28.57,
"4-8k": 23.34,
"8k+": 22.31
},
"qasper": {
"0-4k": 47.3,
"4-8k": 43.97,
"8k+": 28.35
},
"multifieldqa_en": {
"0-4k": 57.15,
"4-8k": 51.67,
"8k+": 57.52
},
"gov_report": {
"0-4k": 31.79,
"4-8k": 28.82,
"8k+": 27.34
}
}
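
For reference, the 0-4k / 4-8k / 8k+ numbers are presumably averages over length buckets; a minimal sketch of how such a breakdown can be computed from per-sample results (the per-sample "score" field is an assumption; "length" is the field shown in the examples above):

    # Sketch: average per-sample scores into LongBench-style length buckets.
    def bucket(length):
        if length < 4000:
            return "0-4k"
        if length < 8000:
            return "4-8k"
        return "8k+"

    def breakdown(samples):
        sums, counts = {}, {}
        for s in samples:            # each s: {"length": ..., "score": ...} (assumed)
            b = bucket(s["length"])
            sums[b] = sums.get(b, 0.0) + s["score"]
            counts[b] = counts.get(b, 0) + 1
        return {b: round(sums[b] / counts[b], 2) for b in sums}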
Experiment setting:

  1. I use the API provided by Azure OpenAI.
  2. The system prompt is empty: [{"role":"system","content":''}, {"role":"user","content":prompt}]
  3. Inference hyper-parameters:
    completion = client.chat.completions.create(
        model="gpt-35-turbo-16k",   # Azure deployment name
        messages=input,
        temperature=0.0,
        max_tokens=max_tokens,
        stop=stop_token,
    )
    response = completion.choices[0].message.content
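
The pieces that snippet leaves implicit might look roughly like this (the endpoint, key, API version, and the max_tokens/stop values are placeholders or assumptions, not values from the paper):

    from openai import AzureOpenAI

    # Placeholder Azure OpenAI resource values; substitute your own.
    client = AzureOpenAI(
        azure_endpoint="https://<your-resource>.openai.azure.com/",
        api_key="<your-api-key>",
        api_version="2024-02-01",
    )

    prompt = "..."                        # the LongBench prompt for one sample
    input = [
        {"role": "system", "content": ""},     # empty system prompt, as in item 2
        {"role": "user", "content": prompt},
    ]
    max_tokens = 128                      # per-dataset generation cap (assumed value)
    stop_token = None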


bys0318 commented Jul 11, 2024

This might be due to model iteration. We tested GPT-3.5-Turbo-16k in August 2023; I think it is a different version now.


Zcchill commented Jul 15, 2024

"You are given a scientific article and a question. Answer the question as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nArticle: {context}\n\n Answer the question based on the above article as concisely as you can, using a single phrase or sentence if possible. If the question cannot be answered based on the information in the article, write "unanswerable". If the question is a yes/no question, answer "yes", "no", or "unanswerable". Do not provide any explanation.\n\nQuestion: {input}\n\nAnswer:" The instruction for qasper tasks in dataset2prompt seems redundent, is this a mistake or a deliberate strategy to emphasize the task at both the beginning and the end of a long text (due to position bias)?


bys0318 commented Jul 15, 2024

You're right. We want to emphasize the task instruction, so we insert the instruction at both the start and the end of the input.
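
For illustration, filling such a template places the instruction on both sides of the article; a minimal sketch (the variable names and the shortened template are illustrative, not the actual LongBench code):

    # Illustrative, shortened version of the qasper template quoted above:
    # the same instruction text appears before and after the long article.
    qasper_template = (
        "You are given a scientific article and a question. ... Do not provide any explanation.\n\n"
        "Article: {context}\n\n"
        "Answer the question based on the above article ... Do not provide any explanation.\n\n"
        "Question: {input}\n\nAnswer:"
    )

    article_text = "..."                         # the full paper text
    question = "how many tags do they look at?"  # example query from this thread
    prompt = qasper_template.format(context=article_text, input=question)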
