iesl/protoqa-data

ProtoQA Dataset

This repository contains the dataset for ProtoQA ("Family Feud"). See the paper for details on dataset creation.

Data Files:

Each line is a JSON dictionary, in which:

  • question contains the question (in original and a normalized form)
  • answers (where available) contains:
    • raw original answers provided by survey respondents (when available) with their counts
    • clusters which include the score for each cluster and the strings included in that cluster

For a full description of the data format, see DATAFORMAT.md.
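As a rough illustration of the line-per-dictionary layout described above, the following Python sketch parses JSON Lines input and reports which answer fields an instance carries. The top-level keys `question` and `answers`, and the `raw` and `clusters` entries under `answers`, are taken from the list above; their exact internal structure is an assumption left to DATAFORMAT.md, and the helper names `parse_jsonl` and `summarize` are hypothetical.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def parse_jsonl(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dictionary per non-empty line of JSON Lines input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(instance: Dict[str, Any]) -> Dict[str, Any]:
    """Report the question and which answer fields are present.

    Assumes only the key names mentioned in this README; the test
    questions file has no answers, so `answers` may be missing.
    """
    answers = instance.get("answers") or {}
    return {
        "question": instance.get("question"),
        "has_raw": "raw" in answers,
        "has_clusters": "clusters" in answers,
    }
```

Because `parse_jsonl` accepts any iterable of strings, an open file handle can be passed directly, for example `with open("data/dev/dev.crowdsourced.jsonl", encoding="utf-8") as f: rows = [summarize(x) for x in parse_jsonl(f)]`.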

File organization:

  • data/train/train.jsonl: 8781 instances for training or fine-tuning, scraped from Family Feud fan sites (see paper). Scraped data has answer clusters with sizes, but only a single string per cluster (corresponding to the original cluster name).
  • data/dev/dev.scraped.jsonl: 979 instances sampled from the same Family Feud data, for use in model validation and development.
  • data/dev/dev.crowdsourced.jsonl: 51 questions collected with exhaustive answer collection and manual clustering, matching the details of the eval test set (roughly 100 human answers per question).
  • data/test/test.questions.jsonl: 102 questions for evaluation. (Note that the test set contains questions only.)
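
After cloning, the file layout above can be sanity-checked with a short Python sketch. The paths and instance counts are copied from the list above; the helper names `count_instances` and `check_counts` are hypothetical, and the check assumes one instance per non-empty line.

```python
from pathlib import Path
from typing import Dict

# Paths and instance counts as listed in this README.
EXPECTED_COUNTS = {
    "data/train/train.jsonl": 8781,
    "data/dev/dev.scraped.jsonl": 979,
    "data/dev/dev.crowdsourced.jsonl": 51,
    "data/test/test.questions.jsonl": 102,
}


def count_instances(path: Path) -> int:
    """Count non-empty lines, assuming one JSON dictionary per line."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if line.strip())


def check_counts(repo_root: str = ".") -> Dict[str, bool]:
    """Map each listed file that exists to whether its count matches."""
    root = Path(repo_root)
    results = {}
    for relative_path, expected in EXPECTED_COUNTS.items():
        path = root / relative_path
        if path.exists():  # files that are absent are skipped
            results[relative_path] = count_instances(path) == expected
    return results
```

Calling `check_counts()` from the repository root returns a dictionary of file paths to booleans; a `False` entry would suggest a partial download or a file that differs from the description above.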

Notes:

This repository contains a data statement (based on Datasheets for Datasets (Gebru et al. 2020) and earlier NLP-specific work (Bender and Friedman 2018)) to provide transparency about data use and to encourage others to do the same. This is a preliminary version of the statement; please post issues in the repository or contact the authors if you have questions about the data details or suggestions about use of the dataset.
