This repository shows an overview of labeled datasets on Systematic Reviews. The datasets are open datasets. The labeled data can be used for text mining and machine learning purposes. This repository contains scripts to collect, preprocess and clean the systematic review datasets.
If you would like to help improve ASReview, please share your dataset with us! Using your dataset about which records you have included and excluded for your systematic review, we can do research, such as simulation studies, to improve our software. This will benefit everyone who wants to use the software. If you’re interested in our research to improve the software, you can find a short report on previous simulation studies here.
If you are willing to contribute to ASReview by making your dataset available, please make a Pull Request and add the information like in the table below.
The datasets are alphabetically ordered.
Reference | Topic | Sample Size | Inclusion | Link | License |
---|---|---|---|---|---|
Appenzeller-Herzog, 2020 | Wilson disease | 3453 | 0.75% | source | CC-BY Attribution 4.0 International |
Bannach-Brown et al., 2019 | Animal Model of Depression | 1993 | 14.0% | source | CC-BY Attribution 4.0 International |
Cohen et al., 2006 | ACEInhibitors | 2544 | 1.61% | source | NA |
Cohen et al., 2006 | ADHD | 851 | 2.35% | source | NA |
Cohen et al., 2006 | Antihistamines | 310 | 5.16% | source | NA |
Cohen et al., 2006 | Atypical Antipsychotics | 1120 | 13.04% | source | NA |
Cohen et al., 2006 | Beta Blockers | 2072 | 2.03% | source | NA |
Cohen et al., 2006 | Calcium Channel Blockers | 1218 | 8.21% | source | NA |
Cohen et al., 2006 | Estrogens | 368 | 21.74% | source | NA |
Cohen et al., 2006 | NSAIDS | 393 | 10.43% | source | NA |
Cohen et al., 2006 | Opiods | 1915 | 0.78% | source | NA |
Cohen et al., 2006 | Oral Hypoglycemics | 503 | 27.04% | source | NA |
Cohen et al., 2006 | Proton Pump Inhibitors | 1333 | 3.83% | source | NA |
Cohen et al., 2006 | Skeletal Muscle Relaxants | 1643 | 0.55% | source | NA |
Cohen et al., 2006 | Statins | 3465 | 2.45% | source | NA |
Cohen et al., 2006 | Triptans | 671 | 3.58% | source | NA |
Cohen et al., 2006 | Urinary Incontinence | 327 | 12.23% | source | NA |
Hall et al., 2012 | Software Fault Prediction | 8911 | 1.17% | source | CC-BY Attribution 4.0 International |
Kitchenham et al., 2010 | Software Engineering | 1704 | 2.58% | source | CC-BY Attribution 4.0 International |
Kwok et al., 2020 | Virus Metagenomics | 2481 | 4.84% | source | CC-BY Attribution 4.0 International |
Nagtegaal et al., 2019 | Nudging | 2008 | 5.03% | source | CC0 |
Radjenović et al., 2013 | Software Fault Prediction | 6000 | 0.80% | source | CC-BY Attribution 4.0 International |
Van de Schoot et al., 2018 | PTSD | 5783 | 0.66% | source | CC-BY Attribution 4.0 International |
van Dis et al., 2020 | Anxiety-Related Disorders | 10288 | 0.70% | source | NA |
Wahono, 2015 | Software Defect Detection | 7002 | 0.89% | source | CC-BY Attribution 4.0 International |
For publishing either your data and / or your AI-aided systematic review, we recommend using the Open Science frame (OSF). OSF is part of the Center for Open Science (COS), which aims at increasing openness, integrity, and reproducibility of research (OSF, 2020). How to share your data using OSF: A step-by-step guide.
Another platform to publish your data open access is provided by Zenodo. Zenodo is a platform which encourages scientists to share all materials (including data) that are necessary to understand the scholarly process (Zenodo, 2020).
When uploading your dataset to OSF or Zenodo, make sure to provide all relevant information about the dataset, by filling out all available fields. The data to be put on Zenodo or OSF can be documented as extensively as you would like (flowcharts, explanation of certain decisions, etc.). This can include a link to the systematic review itself, if it has been published elsewhere.
When sharing your dataset or a link to your already published systematic review, we recommend using a CC-BY or CC0 license for both Zenodo and OSF. By adding a Creative Commons license, everybody from individual creators to large institutions are given a standardized way to allow use of their creative work under copyright law (Creative Commons, 2020).
In short, the CC-BY license means that reusers are allowed to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use. The CC0 license releases data in the public domain, allowing reuse in any form without any conditions. This can be appropriate when sharing (meta)data only. With both OSF (see step-by-step guide) and Zenodo you can easily add the license to your project after creating a project in either platform.
The folder datasets/
has a subfolder for the different Systematic Reviews
datasets. Each of these subfolders is little project. They contain code and a
README.md
. The scripts in the different dataset folder create a subfolder
named output/
with the result of the data collection.
After reviewing in ASReview LAB, you can export your data, which will provide a file that is in the correct format to be uploaded to the repository. ASReview LAB accepts the file formats mentioned in the table below. More information on the format of the data to be put into ASReview LAB can be found in the datasets documentation.
.ris | .tsv | .csv | .xlsx | |
---|---|---|---|---|
Citation managers | ||||
Endnote | Supported | Not supported | ||
Mendeley | Supported | |||
Refworks | Supported | Not supported | ||
Zotero | Supported | Supported | ||
Search engines | ||||
CINHAL(EBSCO) | Not supported | Not supported | ||
Cochrane | Supported | Supported | ||
Embase | Supported | Supported | Supported | |
Eric (Ovid) | Not supported | Not supported | ||
Psychinfo | Not supported | Not supported | ||
(Ovid) | ||||
Pubmed | Not supported | Not supported | ||
Scopus | Supported | Supported | ||
Web of | Not supported | Not supported | ||
Science | ||||
Systematic Review Software | ||||
Abstrackr | Supported | Supported | ||
Covidence* | Supported | Supported | ||
Distiller | Not supported | Supported** | Supported** | |
EPPI-reviewer | Supported | Not supported | ||
Rayyan | Not supported | Supported | ||
Robotreviewer*** |
- Supported: The data can be exported from the software and imported in ASReview LAB using this extension.
- Not supported: The exported data can not be imported in ASReview LAB using this extension.
- (empty): The data cannot be exported from the software using this extension.
* When using Covidence it is possible to export articles in .ris formats for different citation managers,
such as Endnote, Mendeley, Refworks and Zotero. All of these are compatible with ASReview LAB.
** When exporting from Distiller set the sort references by
to Authors
. Then the data can be
imported in ASReview LAB.
*** Robotreviewer does not provide exports suitable for ASReview LAB, since it supports evidence synthesis.
If you would like to share your data without having used ASReview LAB for the screening of your records, or because you have done the screening manually, please make sure the datafile is in the right format. Two examples can be found at the bottom of the page.
RIS files are used by
digital libraries, like IEEE Xplore, Scopus and ScienceDirect. Citation
managers Mendeley and EndNote support the RIS format as well. For simulation, T1
and AB
are necessary tags, moreover
we use an additional RIS tag with the letters LI
(Label included).
Extensions .csv, .xlsx, and .xls. CSV files should be comma separated and UTF-8 encoded. For CSV files, the simulation software accepts a set of predetermined labels in line with the ones used in RIS files: "title" and "abstract". To indicate labelling decisions, one can use "included" or "label_included". The latter label called "included" is needed to indicate the final included publications in the simulations. This label should be filled with all 0’s and 1’s, where 0 means that the record is not included and 1 means included.
In general, the following column names are allowed, however except for the ones mentioned above, they will not be recognized within the simulation (based on https://pypi.org/project/RISparser/):
first_authors
secondary_authors
tertiary_authors
subsidiary_authors
abstract
author_address
accession_number
authors
custom1
custom2
custom3
custom4
custom5
custom6
custom7
custom8
caption
call_number
place_published
date
name_of_database
doi
database_provider
end_page
end_of_reference
edition
id
number
alternate_title1
alternate_title2
alternate_title3
journal_name
keywords
file_attachments1
file_attachments2
figure
language
label
note
type_of_work
notes
abstract
number_of_Volumes
original_publication
publisher
year
reviewed_item
research_notes
reprint_edition
version
issn
start_page
short_title
primary_title
secondary_title
tertiary_title
translated_author
title
translated_title
type_of_reference
unknown_tag
url
volume
publication_year
access_date
The custom tag is:
label_included
Two examples of authors who have published their systematic review data online:
- A systematic review on treatment for Wilson disease, in RIS format https://zenodo.org/record/3625931#.XvB_92ozblw
- Data from four systematic reviews on fault prediction in software engineering, in .csv format https://zenodo.org/record/1162952#.XvCCZmozblw.
Contact details can be found at the ASReview project page.