This repository contains:
- the original PsyTAR dataset, as downloaded from Ask a Patient, on August 1st, 2019;
- a Python script to convert it to CSV and CoNLL format;
- the converted data.
The folder structure is the following:
- data/binary contains the annotations from the Sentence_Labeling sheet;
- data/all contains the annotations from the {ADR, WD, SSI, DI}_Identified sheets, in CoNLL format;
- data/conflated contains the same data as data/all, but all the entity types are conflated into a single type.
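As an illustration of what conflating entity types means, here is a minimal sketch. It is not the script shipped in this repository. It assumes BIO-style tags (e.g. B-ADR, I-WD, O) stored in the last whitespace-separated column of each CoNLL line, and the target label ENTITY and both function names are hypothetical.

```python
def conflate_tag(tag, target="ENTITY"):
    """Map a typed BIO tag such as B-ADR to B-<target>; leave O untouched."""
    # Assumption: tags look like "B-ADR" / "I-ADR" / "O".
    if tag == "O" or "-" not in tag:
        return tag
    prefix, _ = tag.split("-", 1)
    return f"{prefix}-{target}"


def conflate_lines(lines, target="ENTITY"):
    """Conflate the tag column (assumed to be the last one) of CoNLL lines."""
    out = []
    for line in lines:
        stripped = line.strip()
        if not stripped:
            # Blank lines separate sentences in CoNLL files; keep them.
            out.append("")
            continue
        columns = stripped.split()
        columns[-1] = conflate_tag(columns[-1], target)
        out.append(" ".join(columns))
    return out


if __name__ == "__main__":
    sample = ["nausea B-ADR", "and O", "withdrawal B-WD", "symptoms I-WD", ""]
    print("\n".join(conflate_lines(sample)))
```

If the files use a different tagging scheme or column layout, only conflate_tag and the column index need to change.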
The corpus is available as a whole in each full.txt file. For the sake of reproducibility,
I also provide training, development and test set splits, with an 80-10-20 ratio.
The code for generating the splits should be perfectly reproducible, i.e. if you run the
Python scripts, you should obtain the exact same splits you see in this repository.
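A split is reproducible in this sense when the shuffle is driven by a fixed random seed. The sketch below shows the general idea only; it is not the repository's actual code. The function name, the seed value and the default fractions are placeholders chosen for illustration.

```python
import random


def split_sentences(sentences, dev_frac=0.1, test_frac=0.2, seed=42):
    """Deterministically split a list of sentences into train/dev/test.

    The fractions and the seed are illustrative assumptions, not the
    values used to produce the splits in this repository.
    """
    items = list(sentences)
    # A dedicated Random instance with a fixed seed gives the same
    # shuffle on every run, independent of the global random state.
    rng = random.Random(seed)
    rng.shuffle(items)

    n_test = int(len(items) * test_frac)
    n_dev = int(len(items) * dev_frac)

    test = items[:n_test]
    dev = items[n_test:n_test + n_dev]
    train = items[n_test + n_dev:]
    return train, dev, test


if __name__ == "__main__":
    train, dev, test = split_sentences(range(100))
    print(len(train), len(dev), len(test))
```

Calling split_sentences twice on the same input with the same seed returns identical splits, which is the property described above.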
The PsyTAR dataset is under the CC BY 4.0 Data license.
Please cite the original paper if you use the corpus. If you use the splits provided here, please also provide a pointer to this repository.