---
title: "Common sense approaches to sharing tabular data alongside publication"
author: Nicholas J. Tierney\* (1, 2, 3) & Karthik Ram\* (4)
output:
  bookdown::word_document2:
  bookdown::pdf_document2:
    keep_tex: true
    toc: no
    includes:
      in_header: preamble.tex
bibliography: references.bib
citation_package: 'natbib'
csl: nature-communications.csl
header-includes:
  - \setstretch{1.45}
  - \usepackage[utf8]{inputenc}
---
```{=tex}
\begin{singlespace}
1 = Monash University, Department of Econometrics and Business Statistics, Melbourne, Australia\\
2 = Australian Centre of Excellence for Mathematical and Statistical Frontiers (ACEMS)\\
3 = Telethon Kids Institute, Perth Children's Hospital, Perth, Australia\\
4 = Berkeley Institute for Data Science, University of California, Berkeley, USA\\
* = Both authors contributed equally to the work.\\
Correspondence to:\\
Nicholas Tierney (nicholas.tierney@gmail.com) and Karthik Ram (karthik.ram@berkeley.edu)
\end{singlespace}
```
<!-- Other titles under consideration: -->
<!-- Barriers to meaningful data sharing -->
<!-- - Meaningful data sharing -->
<!-- - Data sharing practices to realign incentives and enable positive/meaningful culture change\\ -->
<!-- - Enabling culture change for meaningful data sharing\\ -->
<!-- - Principles of sharing data\\ -->
<!-- - Elements of meaningful data sharing\\ -->
<!-- - The biggest barrier to widespread data sharing are incentives and infrastructure.\\ -->
<!-- - Some practical guidance on improving data reuse.\\ -->
# Abstract {.unnumbered}
```{=tex}
\begin{singlespace}
Numerous arguments strongly support the practice of open science, which offers several societal and individual benefits. For individual researchers, sharing research artifacts such as data can increase trust and transparency, improve the reproducibility of one's own work, and catalyze new collaborations. Despite a general appreciation of the benefits of data sharing, research data are often only available to the original investigators. For data that are shared, a lack of useful metadata and documentation makes them challenging to reuse. In this paper, we argue that a lack of incentives and infrastructure for making data useful is the biggest barrier to creating a culture of widespread data sharing. We compare data to code, examine computational environments in the context of their ability to facilitate the reproducibility of research, provide some practical guidance on how one can improve the chances of their data being reusable, and partially bridge the incentive gap. While previous papers have focused on describing ideal best practices for data and code, we focus on common sense ideas for sharing tabular data, for a target audience of academics working in data science-adjacent fields who are about to submit for publication.
\end{singlespace}
```
```{r setup, echo = FALSE}
knitr::opts_chunk$set(echo = FALSE,
                      out.width = "60%",
                      fig.align = "center")
suppressMessages(library(ggrepel))
suppressMessages(library(tibble))
suppressMessages(library(kableExtra))
suppressMessages(library(magick))
suppressMessages(library(ggplot2))
suppressMessages(library(cowplot))
suppressMessages(library(gridExtra))
```
# Introduction {#intro}
> "Data! data! data!" he cried impatiently. "I can't make bricks without clay." -- Sherlock Holmes ("The Adventure of the Copper Beeches" by Sir Arthur Conan Doyle)
The idea of open and transparent science dates back to the founding principles of The Royal Society of London in 1660. The movement, which has seen shifting support over the years, has enjoyed growing popularity over the past decade [@Nielsen2020; @Peng2011], primarily among early career researchers [@McKiernan2016]. A key motivation for practicing openness is the notion that sharing all research artifacts will allow others to reproduce and verify results, thereby improving trust in and verification of those findings. Jon Claerbout famously wrote in 1992 that reproducibility in computationally oriented fields had become so trivial that nonexperts could do the task. Claerbout's optimistic vision was a little ahead of its time, as the tools and technologies available in the early 1990s for literate programming presented too high a barrier for most researchers to adopt without incurring a significant loss in productivity [@Barnes2010].
Meaningful progress on computational reproducibility, the ability of an independent researcher to recreate figures, tables, and key findings, became more of a practical reality in the late 2000s as new tools and services came into existence. Scientists started sharing more code and software, partly due to the rapid increase in training opportunities made possible by efforts like The Carpentries, combined with the growing adoption of GitHub by scientists [@Ram2013; @Zagalsky2015]. An even bigger driver is likely the growing popularity of data science as a discipline distinct from statistics [@Donoho2017]. This rapid growth in data science has been catalyzed by a Cambrianesque explosion of open-source software tools. Programming languages such as Python, R, and Julia have helped scientists implement and share new methods for working with data. Each of these languages enjoys the support of thriving communities of researchers and software developers who contribute many of the building blocks that make them popular. Modern scientists now have access to many interactive computational frameworks (e.g., Jupyter, RMarkdown, Stencila, Observable) that can interface with kernels from dozens of programming languages. The barriers to verification are now low enough that reproducibility is sometimes a click away, free from the misery of software installation [@Jupyter2018].
Notebooks, however, are an incomplete solution for computational reproducibility. For a piece of computational research to be minimally reproducible, it requires three distinct elements: 1) code, 2) the computing environment, and 3) data. There has been tremendous progress on the first two, despite some limitations of existing tools and practices. While it took nearly two decades for Claerbout's vision to become reality, improvements in technology and motivations have played a strong role in the culture change. Sharing code wasn't easy at first [@Barnes2010], but became a social and normative part of academic culture. The next stage of this evolution, turning code into software, also became popular with increasing training opportunities [@Wilson2014], tooling, and recognition of code as a scholarly contribution [@Smith2018]. Getting researchers to share usable data has been more of a struggle.
```{r incentable-table, fig.cap = "Table of the mechanisms for behavior change, the incentives, and our assessment of where the elements of data, code, and computational environment rank in terms of completing these aspects. We note that data is often required, but the preceding steps are not, in contrast to code, which has no policy."}
knitr::include_graphics("figures/incentives_pyramid.png")
```
Nosek [@nosek_2021] describes the strategies for culture change, especially when the status quo is too entrenched. For change to occur, more than just motivation is needed: skills, tools, and incentives are critical. In a five-step process, Nosek describes how achieving lasting culture change requires us to show proof of concept that an activity is possible, improve the infrastructure to make it easy, and then increase awareness to make it normative. Once there is a critical mass, ways must be found to reward those who put it into practice. Once change has become normative and is widely practiced, it is possible to require it of everyone (Figure \@ref(fig:incentable-table)). This framework provides a useful lens through which to examine adoption of code, computational environments, and data for reproducibility. With code, we have slowly progressed through these stages. While few journals require that authors share code, the provision of venues within which code can be shared, along with the necessary training, has paved the way for incentives and credit. All of the necessary foundations to make code sharing required are in place. For computational environments, Docker makes it possible to share them in most common use cases [@Boettiger2015], but running them still requires expertise and time on the part of reviewers. More recent open services, like Binder [@Jupyter2018], make it possible to run notebooks atop Docker images, but the technology is still new and experiencing growing pains. There is still a long way to go before sharing computational environments in academia becomes mainstream.
Finally, for data, we have implemented these steps out of order. The FAIR data principles, introduced in 2016 with good intentions [@Wilkinson2016], describe the desired attributes for research data: that they be findable (good metadata), accessible (deposited somewhere persistent), interoperable (easily combined with other datasets), and reusable. We, as a community, started out by making data required before demonstrating how this would directly benefit those who share data. Researchers are asked to deposit their data in one of many institutional or commercial repositories. Many journals even require it as a condition for publication via policy. However, the cost of preparing data for reuse remains high, and career incentives to do so are almost nonexistent. Until we make data reuse easier and more rewarding, no level of policy is going to improve the state of data sharing.
While there has been a growing trend of sharing data alongside publications, the status quo of data sharing in science is not good. In 2016, Rowhani-Farid et al. [@Rowhani-Farid2016] showed that 7 of 157 (4.5%) sampled papers in the *British Medical Journal* shared data, and only 14% of articles claiming open sharing of data actually shared their data. Similarly, in 2018, Stodden et al. [@Stodden2018] showed that only 36% of papers published in the journal *Science* shared data, despite the journal's data sharing policy. Similar findings have been made for PLOS and BMC journals [@Colavizza2020]. There are numerous examples of similar problems in other fields, such as psychology [@vanpaemel2015] and biology [@schofield2009]. The reality is that while data sharing is increasing in many fields, such as ecology and genomics, the amount of data that can be reused is likely lower. To make shared data useful, we need to realign incentives for researchers.
Tools that can be used to prepare datasets and metadata (*the infrastructure*) and minimal sets of practices (*guidelines*) for reuse are sparse [@Arslan2019; @Roche2015]. Unlike code, datasets are far more diverse in terms of heterogeneity, size, and format. This makes them particularly challenging to standardize or easily "install" where the code is running. There have been some efforts to improve this process, such as gogetdata, the Dat Project, Frictionless Data [@Fowler_Barratt_Walsh_2018], and the formation of a private company, Quilt.
Depositing data with rich metadata is laborious, making it a pro forma activity outside of specialty repositories, like Dryad, that employ curators. Finally, the infrastructure to track and get credit for data does not yet exist. Datasets currently do not accrue citation credit since they are not tracked [@lowenberg_2019].
# Improving the state of reusable data sharing
We believe a key audience could be better served by the current literature on data sharing: researchers who have finished analyzing tabular data and are ready to submit their work for publication.
Many papers provide rules and guidelines for sharing data, such as providing metadata, citations, licenses, and lists of data repositories to use. These papers provide general advice that ranges from ethical considerations raised by sharing data [@Moles2013], to how to share many different data types (images, video, etc.) [@Hart2016], to tips such as "Reward Colleagues Who Share Their Data Properly" [@Goodman2014]. There is also specific advice, such as the importance of very clean and properly shaped data [@White2013], and how to use open-source software with changes tracked [@Boland2017]. We walk a fine line between providing highly specific advice (e.g., how to specify null values) and overly general suggestions (e.g., sharing data is good). Our goal is to provide practical advice for researchers who are familiar with modern data science tools and who already organize their work as research compendia.
As academic data scientists, we spend considerable time cleaning messy data and believe that preparing data for easy reuse can be a catalyst to improve the quality of data and pave the way for other incentives, such as community adoption, which will then create the culture change needed to credit data in a way similar to the citations used to credit publications. In this paper, we describe a set of common sense practices that researchers can adopt to make data more useful in the context of reproducibility. It should not be a huge burden to share data in a format that is readily usable by others, and, similarly, this burden should be reduced both by and for those who want to use another's data.
To make data more reusable, we need to make it easier to generate metadata, standardize good practices within and across communities, and make reusable data sharing a recognized, normalized, part of research. Here we present a set of common sense practices to facilitate this change. These include documentation (Section \@ref(readme)), licensing (Section \@ref(licenses)), metadata (Section \@ref(metadata)), raw and analysis-ready data (Section \@ref(rawdata)), code (Section \@ref(scripts)), avenues to share data (Section \@ref(avenues)) and citation (Section \@ref(cite)). We also discuss considerations for sharing in packages (Section \@ref(pkgs)) and sharing data of various sizes (Section \@ref(size)).
We limit the scope of this paper to small- to medium-sized tabular data, which covers the vast majority of use cases that researchers encounter day to day. Larger datasets with complex workflows (e.g., bioinformatics and medical imaging) face different kinds of challenges and accordingly deserve to be treated separately.
Assuming that the data raise no privacy concerns (e.g., human participants, locations of critically endangered species), and that the act of sharing does not put the authors at a competitive disadvantage (data can be embargoed for reasonable periods of time), sharing data will always have a net positive benefit.
Our advice is not meant to be exhaustive, and it is insufficient to fully bridge the gap on its own; instead, we hope to provide a starting point.
## Documentation (READMEs) {#readme}
Use a README file with your data project to communicate its purpose and contents. A README is often the first place users will look to learn more about a project. The README is a concise text file containing guidance on how to read and interpret the directory -- the who, what, when, where, and how (<https://betta.io/blog/2017/02/07/developer-experience-github-readmes/>). For example, a README in a data repository might explain which directories contain data and tidying scripts, and also why the data were collected. The README should be placed in the top level of the project, and there can be multiple README files in a project (e.g., in different directories). In the context of datasets, READMEs are particularly useful when there are no reliable standards. Although data repositories contain a large number of datasets associated with peer-reviewed papers, not all are usable (example: https://datadryad.org/stash/dataset/doi:10.5061/dryad.rn6k1). Some of the more recent submissions contain usage notes and a README, which considerably improve reuse potential (e.g., https://datadryad.org/stash/dataset/doi:10.5061/dryad.3xsj3txbz).
The date of use, associated paper, and versions of software should be recorded in the README, so users can, in the future, replicate similar computing environments or conditions. An even better approach is to clearly describe the computing environment (e.g., in a Dockerfile) and packages used (e.g., in a requirements.txt file).
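For example, a minimal sketch of recording software versions alongside the README, using only base R; the output file name is an illustrative choice, not a standard.

```{r readme-session-info, eval = FALSE, echo = TRUE}
# A minimal sketch: write the R version and loaded package versions to a
# plain-text file next to the README, so future users can reconstruct a
# similar computing environment. "session-info.txt" is only an example name.
writeLines(capture.output(sessionInfo()), "session-info.txt")
```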
## Licenses {#licenses}
Provide a license file with your data. A license for data establishes the rules for its use, and whether it can be modified or shared. Without a license, or with a nonstandard one, these rules are unclear, and can lead to issues with use, attribution, and citation. While there are a handful of licenses for data, we recommend using the CC0 license, which means that data owners waive all rights to the data, so it can be freely shared, copied, modified, and distributed, without permission or attribution, even for commercial purposes.
We recommend CC0 because it provides simple rules for attribution, warranty, and use, and is very common. If CC0-licensed data are used in a scientific publication, you still need to cite the location from which you obtained the data, as part of standard scientific practice. There might be some concern that CC0 makes it possible to conduct fraud, but any such fraud is likely to be found out eventually if the original data are archived properly.
For more information on the benefits of using CC0, and why CCBY is not always appropriate, see https://osc.universityofcalifornia.edu/2016/09/cc-by-and-data-not-always-a-good-fit/. Short and full overviews of CC0 can be found at https://creativecommons.org/.
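As an illustration, a short sketch of adding a CC0 license file from within an R project, assuming the usethis package is installed; the same effect can be achieved by copying the CC0 text into a file named LICENSE by hand.

```{r cc0-license, eval = FALSE, echo = TRUE}
# A sketch of adding a CC0 license to a project, assuming the usethis
# package. This writes a LICENSE.md file into the project root.
usethis::use_cc0_license()
```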
## Metadata {#metadata}
Include metadata with your data. There are three benefits to a dataset that includes metadata: (1) it provides context by describing variables to humans, (2) it avoids errors by describing and preserving variable types for machine readability, and (3) it makes data easier to find online by facilitating indexing [@google-data-search]. Three pieces of metadata should be considered: (1) **variable names**, (2) **variable descriptions**, and (3) **unit text**. These should be curated by a human, and the resulting metadata should be clearly labeled, either in a dedicated folder or in files named METADATA.
**Variable names** should be short, descriptive names with no spaces or special characters, such as "job_position" or "date." **Variable description** describes the measured variables, e.g., "university job position" or "date started," and should also define *variable codes* and *missing data*. **Unit text** defines the type of variable recorded (e.g., string, date, numeric, integer), in order to avoid serious errors of misrepresented data (e.g., a column of a gene sequence being interpreted as a date).
A simple metadata file that describes the variable names, variable description, and unit text is a great starting point, and is sometimes known as a codebook. Metadata can be encoded into more formal plain-text formats that follow a set standard, such as EML or schema.org JSON-LD. A deep understanding of these formats isn't required to use them. For more details, see [@Broman2017; @Ellis2017; @Arslan2019; @Leipzig2021].
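As a minimal sketch, a codebook can itself be a small table with one row per variable; the variable names, descriptions, and file paths below are hypothetical.

```{r metadata-codebook, eval = FALSE, echo = TRUE}
# A hypothetical codebook: one row per variable, recording the variable name,
# a human-readable description, and the unit/type of the recorded values.
codebook <- data.frame(
  variable_name = c("job_position", "date_started", "salary"),
  description   = c("University job position",
                    "Date the position started",
                    "Annual salary in Australian dollars"),
  unit_text     = c("string", "date (YYYY-MM-DD)", "numeric")
)
dir.create("METADATA", showWarnings = FALSE)
write.csv(codebook, file = "METADATA/codebook.csv", row.names = FALSE)
```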
## Raw data versus analysis-ready data {#rawdata}
Make both raw and analysis-ready data available. Raw data is the first format of data provided before any cleaning has taken place (e.g., data in some binary or proprietary format). It should be placed in its own folder. Data used in the analysis should also be provided, especially if it was computationally intense to derive. It should be in its own folder and, to maximize interoperability, recorded in a plain-text format (e.g., csv, tsv) and, if possible, in a binary format (e.g., .rds, .sav, .xlsx, Apache Parquet). The binary format allows for immediate reuse in the intended state, while the interoperability of plain-text versions remains stable over time. Ideally, data should be in "tidy data" form, where variables are in columns, observations in rows, and each cell has only one value [@Wickham2014].
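A small sketch of this idea, assuming a hypothetical analysis-ready table: the same tidy data frame is written both as plain text and in a binary format, while raw files would stay untouched in their own folder (e.g., `data-raw/`).

```{r raw-vs-analysis-ready, eval = FALSE, echo = TRUE}
# Hypothetical example: a small tidy table, written to data/ in both a
# plain-text format (stable and interoperable) and a binary format
# (preserves column types for immediate reuse).
survey <- data.frame(
  job_position = c("lecturer", "professor"),
  date_started = as.Date(c("2015-02-01", "2009-07-15")),
  salary       = c(95000, 180000)
)
dir.create("data", showWarnings = FALSE)
write.csv(survey, "data/survey.csv", row.names = FALSE)
saveRDS(survey, "data/survey.rds")
```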
## Data cleaning scripts {#scripts}
Share the code used to transform data from its raw format into analysis-ready form. This should be kept in the same place as the raw data. Ideally, this transformation should involve only scripted languages (e.g., R, Python, Matlab), but any unscripted data preparation steps should be recorded in a plain-text file. Authors should run the scripts in a clean environment to ensure that everything runs and that the code doesn't rely on other local artifacts, such as a folder layout that is not available to others.
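A short sketch of what such a cleaning script might look like, assuming a hypothetical raw spreadsheet and the readxl and janitor packages; the file and column names are illustrative only.

```{r cleaning-script, eval = FALSE, echo = TRUE}
# A hypothetical cleaning script (e.g., data-raw/clean-survey.R), run in a
# clean R session so it does not depend on local objects or folder layouts.
library(readxl)   # read the raw, proprietary-format spreadsheet
library(janitor)  # standardize column names

raw <- read_excel("data-raw/survey-raw.xlsx")

survey <- clean_names(raw)                        # "Job Position" -> "job_position"
survey <- survey[!is.na(survey$job_position), ]   # drop rows with no position recorded

write.csv(survey, "data/survey.csv", row.names = FALSE)
```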
For more complex analyses that involve workflow management systems (see https://github.com/pditommaso/awesome-pipeline for a comprehensive list), a description of how to run the workflow should be included in the documentation, preferably in the README file. Depending on whether the workflow is GUI-driven (e.g. Galaxy) or text-driven (e.g. Nextflow, Snakemake), the workflow itself can be checked into the version control repository. As these communities work closely and use Common Workflow Language as a shared representation, these files can be widely used to recreate the steps involved in producing the final results.
## Avenues to share data {#avenues}
Data should be deposited into a relevant domain repository that generates an accession number or a DOI (for example, GBIF for species occurrence data, or GenBank for genetic data). An exhaustive, searchable list is available at re3data (https://www.re3data.org/), but it offers little guidance to researchers on which repository to choose. Data that do not fit an appropriate domain repository can be submitted to the research data repositories **Zenodo** and **Dryad**, which are currently leading the way in best practices for software and data. In addition to minting DOIs, they also provide citation templates.
If using a research institute's library service to share data, ensure that a DOI is provided so it can be accurately identified and cited. Do not deposit data in local file storage that does not provide DOI or accession numbers.
Although it is fine to share data among collaborators using temporary cloud storage (like Dropbox or Google Drive) while iterating on a paper, once the paper is public the data need to be hosted somewhere persistent: cloud storage does not scale to a larger readership and is highly prone to disappearing (lack of permissions, deleted accounts, etc.). Providing backed-up data is particularly relevant in light of cases in which data are suspected of having been tampered with [@Schleunes2020; @Viglione2020]. Access to the raw data allows such issues to be identified; its absence from publications makes it difficult to verify the validity of the data.
Some journals publish data papers, formal publications that describe an archived dataset (e.g., *Ecology*, *Scientific Data*). This is a familiar avenue for researchers to receive credit and citation, and writing a data paper can be very useful. However, journals that publish data papers lack guidelines on how to structure data for effective reuse and sharing. We recommend partnering with best-practice data repositories that emphasize curation, like Dryad. This way the data will have long-term archiving as well as professional curation to ensure that the published data are usable and have enough metadata to be understood. Data journals should routinely partner with such best-practice repositories to best support the underlying data.
## Citation: How you want your data to be cited {#cite}
Depositing data in a trusted repository (Section \@ref(avenues)) comes with many benefits, such as preservation, indexing, discoverability, and, most importantly, citation with a persistent identifier. Authors should add a CITATION file when possible, but the real burden lies with publishers to ensure that these citations are marked up appropriately to allow for the building of open data metrics [@lowenberg_2021; @lowenberg_daniella_2021b].
When citing data, follow the preferred CITATION file in the first instance; otherwise, cite a data publication (for example, in *Scientific Data* or *Ecology*); and if neither of those applies, cite the repository record in which the data are deposited (e.g., a Zenodo record) in the references section of related papers.
## Data as part of open source packages {#pkgs}
Data can be distributed alongside code using a format such as a package in R, Python, or Julia. A package or module structure provides an advantage, as it can be "installed" and loaded just like code. An obvious benefit of this approach is that it allows data to be immediately used in a programming environment without the additional steps of downloading, preprocessing, specifying column types, and so on. But there are some limitations to this approach. First, packaging data inside a software package makes the data availability language-centric; users who work in other languages are unlikely to download a package and export data out of it. Second, data size can be a limitation for some repositories (CRAN, the R repository, for example, limits package sizes to 5 MB, although there are workarounds). Other fields, such as bioinformatics, have a process of sharing data through repositories like Bioconductor, which can accommodate larger data. Third, there is the cost of authors and data curators/custodians understanding the intricacies of package development in order to share data. Nevertheless, the frictionless approach to sharing data in software packages makes reuse substantially faster than retrieving and correctly parsing files from a repository. However, we recommend the software approach only as a complement to proper data curation.
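In R, for example, an already-prepared dataset can be bundled into a package with a single call; this is a sketch assuming an existing package skeleton and the usethis package, and reusing the hypothetical `survey` data frame from earlier.

```{r package-data, eval = FALSE, echo = TRUE}
# A sketch of bundling an analysis-ready dataset into an R package, assuming
# an existing package skeleton and the usethis package. "survey" is the
# hypothetical data frame from the earlier examples.
usethis::use_data(survey, overwrite = TRUE)

# Users of the package can then load the data directly:
# library(yourpackage)
# data(survey)
```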
Packaging data in software and archiving it in data repositories are not mutually exclusive activities. Archive the data in a long-term data repository first (Section \@ref(avenues)), then include it in a package. The act of packaging data in software can shave hours off the time required to prepare it for use, and the canonical citation can be baked into the package's citation file.
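A sketch of such a citation file (the contents of `inst/CITATION` in an R package) pointing users back to the archived copy of the data; the title, author, and DOI below are placeholders, not a real record.

```{r package-citation, eval = FALSE, echo = TRUE}
# Hypothetical contents of inst/CITATION: point users to the canonical,
# archived copy of the data (the DOI below is a placeholder).
citHeader("To cite this dataset, please use:")
bibentry(
  bibtype = "Misc",
  title   = "Example survey dataset",
  author  = "Jane Researcher",
  year    = "2021",
  doi     = "10.5281/zenodo.0000000",
  note    = "Archived on Zenodo"
)
```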
## Considerations for data size {#size}
How data are shared can change depending on their size. Defining data size will always be somewhat difficult, since storage capacity and norms constantly change. We consider data sizes in reference to a moderately priced laptop, where small data fit in memory (RAM), medium data do not fit in RAM, and big data do not fit on the hard disk. No matter the size, data should be archived with an accession number or DOI.
Small data have more options for how they can be shared -- you can bundle them with your analysis code (as in Section \@ref(rawdata)) or in software (as in Section \@ref(pkgs)). For medium data, a general-purpose data repository (e.g., Zenodo) would be ideal for archiving. Sharing medium data might require breaking it into separate chunks for reading (e.g., `teaching-1.csv`, `teaching-2.csv`, etc.), and scripts might need to be provided to read it. Large data will require liaising with the services from Section \@ref(avenues), or making special arrangements with a local research institute library, to ensure that there is some way to gain access and to maintain a DOI.
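For example, a small sketch of reading data that have been split into numbered chunks, using only base R; the folder and the `teaching-*.csv` file pattern follow the hypothetical example above.

```{r read-chunked-data, eval = FALSE, echo = TRUE}
# A sketch of reading medium-sized data shared as numbered chunks
# (teaching-1.csv, teaching-2.csv, ...) and binding them back together.
files <- list.files("data", pattern = "^teaching-\\d+\\.csv$", full.names = TRUE)
teaching <- do.call(rbind, lapply(files, read.csv))
```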
# Conclusion
Although data sharing is widely mandated by funders, journals, and academic societies, the amount of usable data remains appallingly low. The mere act of depositing data does not make it reusable or useful to a potential consumer. Although data citations do not currently impact promotion and tenure packets, there is some evidence that papers sharing usable data accrue more citations [@Piwowar2007]. Over time, the more useful we make the data, the more likely it becomes that incentive mechanisms will take hold.
The steps leading up to requiring data sharing are still vague and poorly incentivized. In comparison, although few journals require that code be submitted with publications, such a norm is more likely to be adopted because a foundation like the one we have described is already in place, built with carrots rather than sticks. We have taught researchers how to write modular code via popular training initiatives such as The Carpentries. Commercial services like GitHub have made code sharing fun and productive. For more advanced researchers, data science languages have made it easier to package and disseminate code as reusable software. If code sharing were made mandatory now, it is quite likely that more of it would be useful than the data currently being shared. Researchers will need to be familiar with existing tools and approaches for sharing data. There is a risk of reinventing the wheel, for example by creating new platforms for data sharing when services such as Zenodo and Dryad already exist, so researchers should not build new tools without first searching thoroughly for existing ones.
To make data sharing truly useful, we need to make the process easier, improve tools and training, and reward and incentivize researchers who engage in such practices. Only then can shared data truly be useful.
# Acknowledgements {.unnumbered}
We'd like to thank the following people for helpful discussions and contributions to the paper as we were preparing it for publication: Rich FitzJohn, Anna Krystalli, Daniella Lowenberg, and Miles McBain. We'd also like to thank those who provided feedback on the paper during presubmission: Adrian Barnett, Raymond B Huey, Christian Thurn, Christina Maimone, and GitHub user 'dschief001'. Finally, we thank the following organizations for their support: the ACEMS International Mobility Programme, the Helmsley Charitable Trust, the Gordon and Betty Moore Foundation, and the Alfred P. Sloan Foundation.
# Author Contributions {.unnumbered}
Both authors contributed equally to the work, making substantial contributions to the conception, design, drafting and revision of the work. The authors have approved the submitted version, and have agreed both to be personally accountable for the author's own contributions and to ensure that questions related to the accuracy or integrity of any part of the work, even ones in which the author was not personally involved, are appropriately investigated, resolved, and the resolution documented in the literature.
# Competing Interests {.unnumbered}
The authors declare no competing interests.
# References {.unnumbered}
\singlespacing