https://sites.google.com/view/wsdm24-docqa
tricks:
- Use the SOLAR-10.7B-Instruct model as the backbone model
- hybrid training: use a well-trained model to generate (pseudo) answers for the eval dataset, add them to the original training set, then fine-tune a new model from scratch on the combined data
- noisy data filtering: improve the quality of the data
- model ensemble
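
The hybrid training step in the list above can be sketched roughly as follows. This is only an illustration of the pseudo-labeling idea, not the competition code: `build_hybrid_training_set`, the `generate_answer` callable, and the `pseudo` flag are hypothetical names, and the well-trained model is assumed to be wrapped as a plain function that maps a sample to an answer string.

```python
from typing import Callable, Dict, List

Sample = Dict[str, object]


def build_hybrid_training_set(
    train_samples: List[Sample],
    eval_samples: List[Sample],
    generate_answer: Callable[[Sample], str],
) -> List[Sample]:
    """Return the original training set plus pseudo-labeled eval samples.

    `generate_answer` stands in for a well-trained model (hypothetical
    interface): it takes one unlabeled sample and returns an answer string.
    """
    pseudo_labeled = []
    for sample in eval_samples:
        labeled = dict(sample)  # copy so the eval data is left untouched
        labeled["answer"] = generate_answer(sample)
        labeled["pseudo"] = True  # hypothetical flag marking pseudo answers
        pseudo_labeled.append(labeled)
    # The new model would then be fine-tuned from scratch on this combined set.
    return list(train_samples) + pseudo_labeled
```

Keeping a flag on the pseudo-labeled samples is a design choice of this sketch; it makes it easy to filter or down-weight them later if they turn out to be noisy.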
Compared with data in other scenarios, this dataset additionally includes history data.
{
"uuid": "xxxxx",
"history": [
{"question": xxx, "answer": xxx},
{"question": xxx, "answer": xxx},
...
],
"documents":
[
"Jun 17th through Fri the 21st, 2024 at the Seattle Convention Center, Vancouver Convention Center.", "Workshops within a “track” will take place in the same room (or be co-located), and workshop organizers will be asked to work closely with others in their track ...",
...
],
"question": "Where will CVPR 2024 happen?",
"answer": "CVPR 2024 will happen at the Seattle Convention Center, Vancouver.",
"keywords": # Will not be given.
[
"Vancouver", "CVPR 2024", "Seattle Convention Center"
]
}
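
To make the role of the extra history field concrete, here is a minimal sketch of flattening one sample into a single text prompt. The function name `build_prompt` and the prompt layout (documents first, then history turns, then the current question) are assumptions for illustration, not the format used by any particular solution. Each history turn is printed key by key, so the sketch does not depend on the exact field names inside a turn.

```python
from typing import Dict


def build_prompt(sample: Dict) -> str:
    """Flatten documents, history turns, and the current question into text.

    The layout is a hypothetical choice: one line per document, one line per
    field of each history turn, and the current question last.
    """
    lines = []
    for index, document in enumerate(sample.get("documents", []), start=1):
        lines.append(f"Document {index}: {document}")
    for turn in sample.get("history", []):
        # Print every field of the turn without assuming its key names.
        for key, value in turn.items():
            lines.append(f"{key}: {value}")
    lines.append(f"question: {sample['question']}")
    return "\n".join(lines)
```

For a sample with an empty history, the prompt reduces to the documents followed by the question.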