CMCQA is a large-scale conversational question-answering dataset for the Chinese medical domain. It was collected from ChunYu, a Chinese medical conversational question-answering website, and contains medical conversations from 45 departments, such as andrology, stomatology, and obstetrics and gynaecology. Specifically, CMCQA contains 1.3 million complete sessions, comprising 19.83 million statements (utterances) and 0.65 billion tokens, with a total size of 2.84 GB. We have open-sourced all of the data to promote research on conversational question answering in the medical field.
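To give a rough sense of scale, the headline figures above can be turned into per-session and per-statement averages. The snippet below is only a back-of-the-envelope sketch: it uses the rounded counts quoted in this README, and the function name `corpus_averages` is made up for illustration and is not part of any released tooling.

```python
# Rounded headline figures quoted in the description above.
SESSIONS = 1.3e6        # complete sessions
STATEMENTS = 19.83e6    # statements (utterances)
TOKENS = 0.65e9         # tokens


def corpus_averages(sessions, statements, tokens):
    """Return simple averages derived from the rounded corpus counts.

    Hypothetical helper for illustration only; the results are
    approximate because the inputs are rounded.
    """
    return {
        "statements_per_session": statements / sessions,
        "tokens_per_statement": tokens / statements,
        "tokens_per_session": tokens / sessions,
    }


if __name__ == "__main__":
    averages = corpus_averages(SESSIONS, STATEMENTS, TOKENS)
    for name, value in averages.items():
        print(f"{name}: {value:.1f}")
```

Under these rounded figures, a session averages about 15 statements and about 500 tokens, so a typical statement is roughly 33 tokens long.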
The dataset is available on Google Drive.
It can also be downloaded from Baidu Netdisk.
@inproceedings{xia-etal-2022-medconqa,
title = "{M}ed{C}on{QA}: Medical Conversational Question Answering System based on Knowledge Graphs",
author = "Xia, Fei and
Li, Bin and
Weng, Yixuan and
He, Shizhu and
Liu, Kang and
Sun, Bin and
Li, Shutao and
Zhao, Jun",
booktitle = "Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: System Demonstrations",
month = dec,
year = "2022",
address = "Abu Dhabi, UAE",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2022.emnlp-demos.15",
pages = "148--158",
abstract = "The medical conversational system can relieve doctors{'} burden and improve healthcare efficiency, especially during the COVID-19 pandemic. However, the existing medical dialogue systems have the problems of weak scalability, insufficient knowledge, and poor controllability. Thus, we propose a medical conversational question-answering (CQA) system based on the knowledge graph, namely MedConQA, which is designed as a pipeline framework to maintain high flexibility. Our system utilizes automated medical procedures, including medical triage, consultation, image-text drug recommendation, and record. Each module has been open-sourced as a tool, which can be used alone or in combination, with robust scalability. Besides, to conduct knowledge-grounded dialogues with users, we first construct a Chinese Medical Knowledge Graph (CMKG) and collect a large-scale Chinese Medical CQA (CMCQA) dataset, and we design a series of methods for reasoning more intellectually. Finally, we use several state-of-the-art (SOTA) techniques to keep the final generated response more controllable, which is further assured by hospital and professional evaluations. We have open-sourced related code, datasets, web pages, and tools, hoping to advance future research.",
}