Appearing in NeurIPS 2022 dataset and benchmark track
We are releasing under the CC-BY licence a new large-scale dataset for Automatic Symptom Detection (ASD) and Automatic Diagnosis (AD) systems in the medical domain.
The dataset contains patients synthesized using a proprietary medical knowledge base and a commercial rule-based ASD system. Patients in the dataset are characterized by their socio-demographic data, a pathology they are suffering from, a set of symptoms and antecedents related to this pathology, and a differential diagnosis. The symptoms and antecedents can be binary, categorical and multi-choice, with the potential of leading to more efficient and natural interactions between ASD/AD systems and patients.
To the best of our knowledge, this is the first large-scale dataset that includes the differential diagnosis, and non-binary symptoms and antecedents.
- DDXPlus: A New Dataset For Automatic Medical Diagnosis
- Our paper is available on arXiv.
- The dataset in French is hosted on figshare.
- This is the original version of DDXPlus that all results in our paper were obtained on.
- Starting from 9 May 2023, the dataset is also available in English for easier use. This version is hosted on figshare.
- The English version of DDXPlus contains the same data in the same format as the French version.
- Wherever possible, English names or non-semantic codes are used instead of French names.
- Using the English version should lead to the same performance as using the French version.
In what follows, we use the term evidence as a general term to refer to a symptom or an antecedent. The dataset contains the following files:
release_evidences.json
: a JSON file describing all possible evidences considered in the dataset.release_conditions.json
: a JSON file describing all pathologies considered in the dataset.release_train_patients.zip
: a CSV file containing the patients of the training set.release_validate_patients.zip
: a CSV file containing the patients of the validation set.release_test_patients.zip
: a CSV file containing the patients of the test set.
Each evidence in the release_evidences.json
file is described using the following entries:
name
: name of the evidence.- In the English version, this is replaced with a unique, non-semantic code starting with
E
.
- In the English version, this is replaced with a unique, non-semantic code starting with
code_question
: a code allowing to identify which evidences are related. Evidences having the samecode_question
form a group of related symptoms. The value of thecode_question
refers to the evidence that need to be simulated/activated for the other members of the group to be eventually simulated.question_fr
: the query, in French, associated to the evidence.question_en
: the query, in English, associated to the evidence.is_antecedent
: a flag indicating whether the evidence is an antecedent or a symptom.data_type
: the type of the evidence. We use "B" for binary, "C" for categorical, and "M" for multi-choice.default_value
: the default value of the evidence. If this value is used to characterize the evidence, then it is as if the evidence was not synthesized.possible-values
: the possible values for the evidence. Only valid for categorical and multi-choice evidences.- In the English version, every value is replaced with a unique, non-semantic code starting with
V
.
- In the English version, every value is replaced with a unique, non-semantic code starting with
value_meaning
: The meaning, in French and English, of each code that is part of thepossible-values
field. Only valid for categorical and multi-choice evidences.
English
{
"name": "E_130",
"code_question": "E_129",
"question_fr": "De quelle couleur sont les lésions?",
"question_en": "What color is the rash?",
"is_antecedent": false,
"default_value": "V_11",
"value_meaning": {
"V_11": {"fr": "NA", "en": "NA"},
"V_86": {"fr": "foncée", "en": "dark"},
"V_107": {"fr": "jaune", "en": "yellow"},
"V_138": {"fr": "pâle", "en": "pale"},
"V_156": {"fr": "rose", "en": "pink"},
"V_157": {"fr": "rouge", "en": "red"}
},
"possible-values": [
"V_11",
"V_86",
"V_107",
"V_138",
"V_156",
"V_157"
],
"data_type": "C"
}
French
{
"name": "lesions_peau_couleur",
"code_question": "lesions_peau",
"question_fr": "De quelle couleur sont les lésions?",
"question_en": "What color is the rash?",
"is_antecedent": false,
"default_value": "NA",
"value_meaning": {
"NA": {"fr": "NA", "en": "NA"},
"foncee": {"fr": "foncée", "en": "dark"},
"jaune": {"fr": "jaune", "en": "yellow"},
"pale": {"fr": "pâle", "en": "pale"},
"rose": {"fr": "rose", "en": "pink"},
"rouge": {"fr": "rouge","en": "red"}
},
"possible-values": [
"NA",
"foncee",
"jaune",
"pale",
"rose",
"rouge"
],
"data_type": "C"
}
The file release_conditions.json
contains information about the pathologies patients in the datasets may suffer from. Each pathology has the following attributes:
condition_name
: name of the pathology.- In the English version, the English name is used instead of the French name.
cond-name-fr
: name of the pathology in French.cond-name-eng
: name of the pathology in English.icd10-id
: ICD-10 code of the pathology.severity
: the severity associated with the pathology. The lower the more severe.symptoms
: data structure describing the set of symptoms characterizing the pathology. Each symptom is represented by its correspondingname
entry in therelease_evidences.json
file.antecedents
: data structure describing the set of antecedents characterizing the pathology. Each antecedent is represented by its correspondingname
entry in therelease_evidences.json
file.
English
{
"condition_name": "Myasthenia gravis",
"cond-name-fr": "Myasthénie grave",
"cond-name-eng": "Myasthenia gravis",
"icd10-id": "G70.0",
"symptoms": {
"E_65": {},
"E_63": {},
"E_52": {},
"E_172": {},
"E_84": {},
"E_66": {},
"E_90": {},
"E_38": {},
"E_176": {}
},
"antecedents": {
"E_28": {},
"E_204": {}
},
"severity": 3
}
French
{
"condition_name": "Myasthénie grave",
"cond-name-fr": "Myasthénie grave",
"cond-name-eng": "Myasthenia gravis",
"icd10-id": "G70.0",
"symptoms": {
"dysphagie": {},
"dysarthrie": {},
"diplopie": {},
"ptose": {},
"faiblesse_msmi": {},
"dyspn": {},
"fatigabilité_msk": {},
"claud_mâchoire": {},
"rds_paralys_gen": {}
},
"antecedents": {
"atcdfam_mg": {},
"trav1": {}
},
"severity": 3
}
Each patient in each of the 3 sets has the following attributes:
AGE
: the age of the synthesized patient.SEX
: the sex of the synthesized patient.PATHOLOGY
: name of the ground truth pathology (cfcondition_name
property in therelease_conditions.json
file) that the synthesized patient is suffering from.EVIDENCES
: list of evidences experienced by the patient. An evidence can either be binary, categorical or multi-choice. A categorical or multi-choice evidence is represented in the format[evidence-name]_@_[evidence-value]
where [evidence-name
] is the name of the evidence (name
entry in therelease_evidences.json
file) and [evidence-value
] is a value from thepossible-values
entry. Note that for a multi-choice evidence, it is possible to have several[evidence-name]_@_[evidence-value]
items in the evidence list, with each item being associated with a different evidence value. A binary evidence is represented as[evidence-name]
.INITIAL_EVIDENCE
: the evidence provided by the patient to kick-start an interaction with an ASD/AD system. This is useful during model evaluation for a fair comparison of ASD/AD systems as they will all begin an interaction with a given patient from the same starting point. The initial evidence is randomly selected from the evidence list mentioned above (i.e.,EVIDENCES
) and it is part of this list.DIFFERENTIAL_DIAGNOSIS
: The ground truth differential diagnosis for the patient. It is represented as a list of pairs of the form[[patho_1, proba_1], [patho_2, proba_2], ...]
wherepatho_i
is the pathology name (condition_name
entry in therelease_conditions.json
file) andproba_i
is its related probability.
English
{
"AGE": 18,
"DIFFERENTIAL_DIAGNOSIS": [["Bronchitis", 0.19171203430383882], ["Pneumonia", 0.17579340398940366], ["URTI", 0.1607809719801254], ["Bronchiectasis", 0.12429044460990353], ["Tuberculosis", 0.11367177304035844], ["Influenza", 0.11057936110639896], ["HIV (initial infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
"SEX": "M",
"PATHOLOGY": "URTI",
"EVIDENCES": ["E_48", "E_50", "E_53", "E_54_@_V_161", "E_54_@_V_183", "E_55_@_V_89", "E_55_@_V_108", "E_55_@_V_167", "E_56_@_4", "E_57_@_V_123", "E_58_@_3", "E_59_@_3", "E_77", "E_79", "E_91", "E_97", "E_201", "E_204_@_V_10", "E_222"],
"INITIAL_EVIDENCE": "E_91"
}
French
{
"AGE": 18,
"DIFFERENTIAL_DIAGNOSIS": [["Bronchite", 0.19171203430383882], ["Pneumonie", 0.17579340398940366],["IVRS ou virémie", 0.1607809719801254], ["Bronchiectasies", 0.12429044460990353], ["Tuberculose", 0.11367177304035844], ["Possible influenza ou syndrome virémique typique", 0.11057936110639896], ["VIH (Primo-infection)", 0.07333003867293564], ["Chagas", 0.04984197229703562]],
"SEX": "M",
"PATHOLOGY": "IVRS ou virémie",
"EVIDENCES": ["crowd", "diaph", "douleurxx", "douleurxx_carac_@_sensible", "douleurxx_carac_@_une_lourdeur_ou_serrement", "douleurxx_endroitducorps_@_front", "douleurxx_endroitducorps_@_joue_D_", "douleurxx_endroitducorps_@_tempe_G_", "douleurxx_intens_@_4", "douleurxx_irrad_@_nulle_part", "douleurxx_precis_@_3", "douleurxx_soudain_@_3", "expecto", "f17.210", "fievre", "gorge_dlr", "toux", "trav1_@_N", "z77.22"],
"INITIAL_EVIDENCE": "fievre"
}
Binary | Categorical | Multi-choice | Total | |
---|---|---|---|---|
Evidences | 208 | 10 | 5 | 223 |
Symptoms | 96 | 9 | 5 | 110 |
Antecedents | 112 | 1 | 0 | 113 |
Avg | Std dev | Min | 1st quartile | Median | 3rd quartile | Max | |
---|---|---|---|---|---|---|---|
Evidences | 13.56 | 5.06 | 1 | 10 | 13 | 17 | 36 |
Symptoms | 10.07 | 4.69 | 1 | 8 | 10 | 12 | 25 |
Antecedents | 3.49 | 2.23 | 0 | 2 | 3 | 5 | 12 |
Code for reproducing results in the paper can be found in code
.
In our paper, we reported results of two methods, a RL-based method AARLC and a supervised method BASD which is adapted from ASD. For instructions on how to run them, see here for AARLC and here for BASD.