Students will work through activities highlighting the motivation for and value of literate programming, both as a concept and as implemented in `RMarkdown`. Through this, students will be introduced to the concepts of executable documentation and automation. Students will also learn about best practices for structuring spreadsheet-type data files, and the importance of documenting all changes one makes to data. Finally, students will be introduced to combining all these ideas to create automated, executable, and self-documenting data quality assurance and control reports.
At the beginning of the session, students should
- be familiar with `RStudio`: the `RStudio` layout, running `R` commands, and running `knitr`.
At the end of the session students will be able to
- Distinguish between a spreadsheet formatted properly for later analysis and one formatted improperly.
- Recognize and correct common data entry errors.
- Describe the concept of 'raw data', and its implications for reproducible and sound data management.
- Apply the concept of literate programming to produce executable documentation of data management and analysis.
- Recap of markdown, `RMarkdown`, `knitr`, and the virtues of literate programming from the demonstrations in the Intro lesson: slides
Objective: through hands-on interaction and modification, develop familiarity with `RMarkdown` and knitting the output.
Students `knit` and modify. Using `countryPick4.Rmd` as a template, students learn how to import data, filter to one country, make a plot, write it to file, and comment data choices. Then the activity will illustrate what happens when you `knit`:
- Preview/Knit HTML, note what sorts of outputs are left behind.
- Discuss input and output files.
- Which files can we delete and reproduce? Which files are inputs, outputs, converters of inputs to outputs?
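The steps the template walks through (import, filter to one country, plot, write to file, comment the data choices) can be sketched in base R roughly as follows. This is not the content of `countryPick4.Rmd`: the function name `plot_country`, the toy data frame, and its values are hypothetical placeholders, and the column names `country`, `year`, and `lifeExp` are assumed to match the Gapminder TSV.

```r
# Hypothetical stand-in data; the values are placeholders, NOT real Gapminder figures.
toy <- data.frame(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1952, 1957, 1962, 1952, 1957, 1962),
  lifeExp = c(50, 52, 54, 60, 61, 62)
)

# Import step: write the toy data as TSV, then read it back as a report would.
in_file <- tempfile(fileext = ".tsv")
write.table(toy, in_file, sep = "\t", quote = FALSE, row.names = FALSE)
dat <- read.delim(in_file, stringsAsFactors = FALSE)

# Filter to one country, plot life expectancy over time, and save the plot.
plot_country <- function(dat, country, out_file) {
  one <- dat[dat$country == country, ]
  # Data choice documented in place: fail loudly if the country is absent.
  stopifnot(nrow(one) > 0)
  pdf(out_file)        # open a file-based graphics device
  on.exit(dev.off())   # close the device even if plotting fails
  plot(one$year, one$lifeExp, type = "b",
       xlab = "Year", ylab = "Life expectancy", main = country)
  invisible(one)
}

out_file <- file.path(tempdir(), "lifeExp_A.pdf")
picked <- plot_country(dat, "A", out_file)
```

In an `RMarkdown` document this code would sit in a chunk. The PDF is then one of the outputs left behind by knitting, so deleting it and knitting again should reproduce it, while the TSV is an input that cannot be regenerated from the report.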
This section is meant for students to explore the power of writing reports in `R`.
Lesson: 01-programatic-modification
Students identify poor and good data formatting practices, and will learn the importance of documenting modifications. This will lead to making modifications in a self-documenting and executable way.
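A minimal sketch of what a self-documenting, executable modification could look like in R. The data frame `raw`, its column names, its values (fictional countries, made-up numbers), and the three corrections are all hypothetical and are not taken from the lesson's data files.

```r
# Hypothetical raw data with common entry errors (illustrative values only).
raw <- data.frame(
  country = c("Freedonia", "freedonia ", "Sylvania", "Sylvania"),
  year    = c("1952", "1957", "1952", "n/a"),
  pop     = c("1,000", "1200", "5000", "5500"),
  stringsAsFactors = FALSE
)

# Work on a copy so the raw data stay untouched.
clean <- raw

# Fix 1: trim stray whitespace and capitalize the first letter of country names.
clean$country <- trimws(clean$country)
clean$country <- paste0(toupper(substr(clean$country, 1, 1)),
                        substring(clean$country, 2))

# Fix 2: "n/a" was typed where the year is missing; record it as NA instead.
clean$year[clean$year == "n/a"] <- NA
clean$year <- as.integer(clean$year)

# Fix 3: remove thousands separators so population becomes numeric.
clean$pop <- as.numeric(gsub(",", "", clean$pop))

# Report how many cells differ from the raw data, as a simple audit trail.
n_changed <- sum(as.character(unlist(clean)) != unlist(raw), na.rm = TRUE) +
  sum(is.na(unlist(clean)))
print(n_changed)
```

Because every change is a commented line of code applied to a copy, the script itself documents what was modified and why, and rerunning it on the untouched raw file reproduces the cleaned data.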
- Lesson: 02-literate-programming
- EP White, E Baldridge, ZT Brym, KJ Locey, DJ McGlinn, SR Supp (2013) "Nine simple ways to make it easier to (re)use your data." Ideas in Ecology and Evolution 6(2): 1–10. doi:10.4033/iee.2013.6b.6.f (in particular the section "Use standard table formats")
- Good practice guidance on releasing statistics in spreadsheets (UK Government)
This lesson was first created as a part of the Organization1 lesson at the first Reproducible Science Curriculum Hackathon, and was later split out into its own lesson. The corresponding author is Hilmar Lapp (@hlapp). See the commit log for other contributors.
Please post feedback and issues with the lesson on the repository's issue tracker. For instructor questions about teaching this lesson, you can also contact the corresponding author directly.
- Gapminder data
- Processed and subset (population size, life expectancy, GDP per capita; every 5 years starting in 1952; only complete records)
Gapminder data as an `R` package. The `data-raw` sub-directory reveals the journey from Gapminder.org's Excel workbooks to increasingly clean and tidy data.
- The clean dataset can be located in `R` in the following way (after installing the package):
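# Recent gapminder releases keep the file under extdata/; older ones had it at the package root.
pathToTsv <- system.file("extdata", "gapminder.tsv", package = "gapminder")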
{: .r}
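Once located, the file can be read and checked with a few executable assertions, which is the kernel of an automated quality assurance report. The sketch below is illustrative only: it assumes the gapminder package may or may not be installed, that the file sits either under `extdata/` or at the package root depending on the package version, and that the data has a `year` column. The function name `check_gapminder` and the specific checks are hypothetical choices derived from the dataset description above, not the lesson's own code.

```r
# Look in both places the file may live (assumption: location depends on package version).
candidates <- c(
  system.file("extdata", "gapminder.tsv", package = "gapminder"),
  system.file("gapminder.tsv", package = "gapminder")
)
found <- candidates[nzchar(candidates)]  # system.file() returns "" when nothing is found

# Each check yields TRUE/FALSE so the results can be printed in a knitted report.
check_gapminder <- function(gap) {
  c(
    complete_records = !anyNA(gap),                       # only complete records
    starts_1952      = min(gap$year) == 1952,             # series starts in 1952
    five_year_steps  = all((gap$year - 1952) %% 5 == 0)   # every 5 years
  )
}

if (length(found) > 0) {
  gap <- read.delim(found[1], stringsAsFactors = FALSE)
  print(check_gapminder(gap))
} else {
  message("gapminder package not installed; skipping checks")
}
```

Returning a named logical vector rather than stopping at the first failure lets a report list every check with its outcome, which suits a document that is knitted on each data update.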