Skip to content

ku-nlp/KUCI

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1 Commit
 
 
 
 
 
 
 
 

Repository files navigation

Kyoto University Commonsense Inference dataset (KUCI)

KUCI is a Japanese dataset for training/evaluating the linguistic capability to infer basic contingency (hereafter, commonsense contingency reasoning). This dataset comprises 104k multiple-choice questions that ask basic contingency. It is also characterized by its semi-automatic construction method: automatic extraction of pairs of basic event expressions that have contingent relation from a raw corpus, verification through crowdsourcing, and automatic generation of commonsense contingency reasoning problems from the verified pairs. Here is an example of the commonsense contingency reasoning problem:

電池の減りはやはり早いので、 (The battery drains so fast that)
  a. 実際の半導体製造装置は実現しません (actual semiconductor manufacturing equipment is not realized)
  b. 今回は期間限定でのお届けになります (it is a limited-time offer this time)
  c. 原子炉を手動停止する ({I} manually shut down a nuclear reactor)
  d. 充電用にUSBケーブル買います ({I} buy a USB cable for charging)
※ {} denotes a dropped pronoun.

The task is to choose the most appropriate choice as the continuation of a given context. In this case, d is a correct choice.

Definitions of Terms

cf. [2], [3]

  • contingency: the discourse relation between events established when one is likely to cause the other
  • core event: high-frequency predicate-argument structure (acquired from case frames)
  • base: pair of a context and a correct choice constituting each problem

Statistics

Train Dev Test
83,127 10,228 10,291

In addition, 862k pseudo-problems are available.

Data Format

The data format is JSON Lines.

{
  "id": 0,
  "context": "電池 の 減り は やはり 早い ので 、",
  "choice_a": "実際 の 半導体 製造 装置 は 実現 し ませ ん",
  "choice_b": "今回 は 期間 限定 で の お 届け に なり ます",
  "choice_c": "原子 炉 を 手動 停止 する",
  "choice_d": "充電 用 に USB ケーブル 買い ます",
  "label": "d",
  "agreement": 2,
  "core_event_pair": "減り/へりv,ガ,早い/はやい|ケーブル/けーぶる,ヲ,買う/かう"
}
Key Type Description
id int unique whole number for each problem (0-origin)
context str context (segmented into morphemes using Juman++ Version: 2.0.0-rc3)
choice_{a, b, c, d} str choice (〃)
label str letter corresponding to a correct choice (any of {a, b, c, d})
agreement int number of crowdworkers who agreed that the base has contingent relation (any of {2, 3, 4})
core_event_pair str pair of core events constituting the base

License

After internal discussion, we have determined to license this dataset under the Creative Commons Attribution-ShareAlike 4.0 International License (CC BY-SA 4.0). If you find any problem, please contact us by e-mail at "nl-resource at nlp.ist.i.kyoto-u.ac.jp" or "omura at nlp.ist.i.kyoto-u.ac.jp".
(" at " = @)

External Links

Reference/Citation

[1] (Omura et al., 2020)

@inproceedings{omura-etal-2020-method,
  title = "{A} {M}ethod for {B}uilding a {C}ommonsense {I}nference {D}ataset based on {B}asic {E}vents",
  author = "Omura, Kazumasa and
    Kawahara, Daisuke and
    Kurohashi, Sadao",
  booktitle = "Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)",
  month = nov,
  year = "2020",
  address = "Online",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2020.emnlp-main.192/",
  doi = "10.18653/v1/2020.emnlp-main.192",
  pages = "2450--2460",
}

[2] (Omura & Kurohashi, 2022)

@inproceedings{omura-kurohashi-2022-improving,
  title = "{I}mproving {C}ommonsense {C}ontingent {R}easoning by {P}seudo-data and its {A}pplication to the {R}elated {T}asks",
  author = "Omura, Kazumasa and
    Kurohashi, Sadao",
  booktitle = "Proceedings of the 29th International Conference on Computational Linguistics",
  month = oct,
  year = "2022",
  address = "Gyeongju, Republic of Korea",
  publisher = "International Committee on Computational Linguistics",
  url = "https://aclanthology.org/2022.coling-1.68/",
  pages = "812--823",
}

[3] (Omura et al., 2023)

@article{omura-etal-2023-building,
  title = "{B}uilding a {C}ommonsense {I}nference {D}ataset based on {B}asic {E}vents and its {A}pplication",
  author = "Omura, Kazumasa and
    Kawahara, Daisuke and
    Kurohashi, Sadao",
  journal = "Journal of Natural Language Processing",
  volume = "30",
  number = "4",
  year = "2023",
  doi = "10.5715/jnlp.30.1206",
  pages = "1206-1239",
  note = "(in Japanese)",
}

Releases

No releases published

Packages

No packages published