This course teaches "the why and how" of reproducible and collaborative research, combining questions of good computational practice in science, open science, and statistical data analysis in the context of today's research environment. We will interleave practical topics in software engineering and statistical computing with broader discussions of the philosophy of science and the foundations of statistics.
More details can be found in the syllabus.
- Communication: class Piazza.
- Lectures will be recorded and posted in the Kaltura system (visible via bCourses), but attendance is mandatory: much of the pedagogical value of the class comes from participating in discussions and code reviews.
- Course readings that are not freely available on the web or through the UC Berkeley Library will be posted to bCourses.
- Computing resources
  - We will use Jupyter notebooks. We will start with hosted notebooks on our Stat 159 JupyterHub; later in the term, we will discuss installing Jupyter on your own device. The JupyterHub server will have all the packages you need pre-installed.
  - The sources for class notes and most other materials are available on GitHub, with a rendered version here.
  - Assignments should be submitted by pull request to your private repositories using GitHub Classroom.
  - Whenever you need to work with GitHub, remember to activate GitHub authentication from the JupyterHub by running the command `github-app-user-auth` at a terminal and following the instructions. If, once authenticated, you can't push to a given repo, you may have forgotten to add that repo/org to your setup of the authentication app; go here to configure the app's permissions.
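Since assignments flow through git branches and pull requests, here is a hedged sketch of the git side of that cycle. The repository, branch, and file names are invented for illustration, and it uses a local `git init` so it runs anywhere; on the JupyterHub you would instead authenticate with `github-app-user-auth` and `git clone` your private GitHub Classroom repository.

```shell
# Illustrative only: repo/branch/file names are invented for this sketch.
set -e
demo="$(mktemp -d)/demo-hw"          # stand-in for your cloned Classroom repo
git init -q "$demo" && cd "$demo"
git config user.email "student@example.edu"   # local identity for the demo
git config user.name  "Demo Student"

echo "# Homework 1" > hw1.md
git add hw1.md
git commit -q -m "Start homework 1"

git checkout -q -b my-solution       # do your work on a branch
echo "Results go here." >> hw1.md
git commit -q -am "Add results"

git log --oneline                    # two commits, now on branch my-solution
# In the real workflow you would now `git push` and open a pull request
# from my-solution against the default branch of your Classroom repo.
```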
- A note on the Berkeley Library EZProxy: Some of the resources listed here are scientific articles available only behind journal paywalls. If you haven't already, you should configure your web browser to use the campus library EZProxy so you can access them even if you are working from an off-campus network.
While not strictly a textbook for this course, we will rely heavily on the excellent, openly licensed Research Software Engineering in Python. We will complement it with these other scientific Python resources:
- Katy Huff's Effective Computation in Physics.
- Jake VanderPlas' A Whirlwind Tour of Python.
- Stefan van der Walt's Python Survival Pack and the Elegant SciPy book. The full book and all the notebooks are available.
- Josh Bloom's Python for Data Science Berkeley course.
- Getting started with Python for research, a gentle introduction to Python in data-intensive research.
- Python for Data Analysis, 2nd Edition, by Wes McKinney, creator of Pandas. Companion notebooks.
- Effective Pandas, a book by Tom Augspurger, a core Pandas developer.
And we'll use these Earth Science resources for our domain focus:
- Ryan Abernathey's research computing for Earth Sciences.
- Brian Rose's Climate Laboratory.
- Lisa Tauxe's Python for Earth Science Students.
- Git and git workflows
- Continuous integration
- Miscellaneous computing tutorials
Above is a list of books and websites focused mostly on computational skills; below is the full bibliography we'll refer to in the course. Some of these will become assigned readings, while others are available for your reference.
The PLOS Ten Simple Rules collection has many short, valuable papers full of relevant, practical advice in this space. A few that stand out, though many (if not most) are worth your time, are "Ten simple rules for ...":
Computational research
- making research software more robust.
- writing and sharing computational analyses in Jupyter Notebooks.
- effective computational research.
- reproducible computational research.
Open Source Software and Open Science
- taking advantage of Git and GitHub.
- the open development of scientific software.
- documenting scientific software.
- helping newcomers become contributors to open projects.
- cultivating open science and collaborative R&D.
Data Management
The art of research
These are key reports produced by the National Academies of Sciences, Engineering, and Medicine. They were created by teams of world experts in the field, and inform policy in multiple areas:
- Reproducibility and Replicability in Science, 2018. The previous link contains multiple resources on this topic, including overview videos, from a large effort commissioned by the National Academies of Sciences, Engineering, and Medicine. For reading, this NCBI link has both HTML and PDF download options.
- Statistical Challenges in Assessing and Fostering the Reproducibility of Scientific Results, 2016.
- Open Source Software Policy Options for NASA Earth and Space Sciences, 2018.
- Open Science by Design: Realizing a Vision for 21st Century Research, 2018.
- Developing a Toolkit for Fostering Open Science Practices, 2021.
- Millman and Pérez 2014, Developing open source scientific practice.
- Keith Baggerly and the Potti & Nevins Cancer Scandal.
- Barba 2016, Top-10 Readings in Reproducibility, a syllabus on reproducible research by Prof. Lorena Barba. Of particular interest is Barba 2012, Reproducibility PI Manifesto, with slides available here.
- Wilson et al. 2012, Best Practices for Scientific Computing.
- Granger and Pérez 2021, Jupyter: Thinking and Storytelling With Code and Data.
- The Practice of Reproducible Research: Case Studies and Lessons from the Data-Intensive Sciences. An online (and printed) book produced by Berkeley researchers. It includes the excellent Achieving Full Replication of our Own Published CFD Results, with Four Different Codes.
- Reliability and reproducibility in computational science: implementing verification, validation and uncertainty quantification in silico. A special issue of Philosophical Transactions of the Royal Society A dedicated to this topic, with multiple valuable articles, of which the following are just a few:
- One of the National Academies reports above commissioned a paper by Bush et al. (2020) titled Perspectives on Data Reproducibility and Replicability in Paleoclimate and Climate Science.
- Liu et al. 2019, improving reproducibility in Earth science research.
- Feulner 2016, Science under Societal Scrutiny: Reproducibility in Climate Science.
- Hoffimann et al. 2021, Geostatistical Learning: Challenges and Opportunities.
- Abernathey et al. 2021, Cloud-Native Repositories for Big Scientific Data.
- reproducibility
- replicability
- repeatability
- computational reproducibility
- "preproducibility"
- the role of replication in science
- "virtual witnessing" and the role(s) of scientific publishing
- data availability
  - data
  - data format
  - data dictionary
  - data cleaning and munging
  - data pre-processing
- reliance on proprietary software
- analysis
  - breadcrumbs / description
  - actual code
  - description and what was done are often different
  - scripting analyses is key, but not enough
  - software versions, libraries, compilers, environments, hardware can matter
- lack of preproducibility: what was done?
- "researcher degrees of freedom"
  - what was considered but not tried, or tried and discarded?
  - choice of hypotheses, P-hacking
  - choice of data subsets
  - choice of transformations
  - choice of models
  - choice of estimators
  - if Bayesian, choice of prior
  - if frequentist, what method and why?
  - constraints?
  - choice of measures of uncertainty
    - nonparametric / model-based / parametric / asymptotic
    - local / global
  - selective inference, P-hacking, cherry-picking, "garden of forking paths"
  - hypothesis tests: what is the full null? What does it have to do with reality?
- "file-drawer effect"
  - small $n$ studies
- ignoring multiplicity & multiple testing (including selective inference)
- intrinsic variability
- sensitivity to "influential" observations
- appropriate level of abstraction
- confirmation bias
- Foundational issues; misinterpretations of probability and uncertainty
  - Interpretation of probability
  - prior probabilities
  - Types of uncertainty
  - Epistemic and aleatory uncertainty
  - constraints versus priors
  - Bayesian and frequentist measures of uncertainty
  - Duality between minimax and Bayes estimation
  - models versus response schedules
- model mania
  - correlation (even really strong correlation) is not causation
  - fit does not imply correctness
  - familiarity does not imply appropriateness ("Fallacies do not cease to be fallacies because they become fashions." —G.K. Chesterton)
  - Statistical practice as superstition
- ritualization of Statistics, cargo-cult science
- bad incentive structure in academia
  - https://int.nyt.com/data/documenttools/transparency-rule/d1fb06c8db2b3d4a/full.pdf
  - https://www.nytimes.com/2021/01/04/climate/trump-epa-science.html
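To make the multiplicity point above concrete, here is a small simulation sketch (not part of the assigned materials; the numbers are chosen for illustration). Under a true null, each p-value is uniform on $[0, 1]$, so running $m = 20$ independent tests at level $\alpha = 0.05$ and reporting any "hit" gives a family-wise error rate of $1 - (1 - \alpha)^m \approx 0.64$, not 5%.

```python
# Illustration: why ignoring multiplicity inflates false positives.
import random

random.seed(159)
alpha, m, n_experiments = 0.05, 20, 100_000

# Analytic family-wise error rate for m independent tests of true nulls:
analytic_fwer = 1 - (1 - alpha) ** m          # ~0.64, not 0.05

# Simulate: under the null, each p-value is Uniform(0, 1).
def any_hit(level):
    return any(random.random() < level for _ in range(m))

simulated_fwer = sum(any_hit(alpha) for _ in range(n_experiments)) / n_experiments

# One classical fix: Bonferroni, testing each hypothesis at level alpha / m.
bonferroni_fwer = sum(any_hit(alpha / m) for _ in range(n_experiments)) / n_experiments

print(f"analytic: {analytic_fwer:.3f}, simulated: {simulated_fwer:.3f}, "
      f"Bonferroni-corrected: {bonferroni_fwer:.3f}")
```

Bonferroni is only one (conservative) correction; the point here is simply that the uncorrected rate is an order of magnitude above the nominal level.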
- revision/version control
- documentation, documentation, documentation
- modularity and abstraction
- scripted analyses and automation
- unit tests, regression tests, coverage tests, continuous integration
- code review
- pair programming
- consistency: APIs, calling signatures, object-oriented code
- separating data, computation, presentation
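As one concrete instance of several items above (modularity, scripted analyses, unit tests), here is a hedged sketch, with invented names, of a small analysis step written as a pure, documented function plus the unit tests a continuous-integration service could run on every push:

```python
# Illustrative sketch (not course code): a small, testable analysis unit.
def zscore(values):
    """Standardize a sequence to mean 0 and population standard deviation 1."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    if var == 0:
        raise ValueError("zero variance: cannot standardize")
    sd = var ** 0.5
    return [(v - mean) / sd for v in values]

# Unit tests: small, fast checks of behavior, run automatically under CI.
def test_zscore_centers_and_scales():
    out = zscore([1.0, 2.0, 3.0])
    assert abs(sum(out)) < 1e-12                 # mean is (numerically) zero
    assert abs(out[2] - 1.5 ** 0.5) < 1e-12      # (3 - 2) / sqrt(2/3)

def test_zscore_rejects_degenerate_input():
    for bad in ([], [5.0], [2.0, 2.0]):
        try:
            zscore(bad)
        except ValueError:
            pass
        else:
            raise AssertionError(f"expected ValueError for {bad}")
```

On the JupyterHub, tests written this way can be run with `pytest`, which discovers functions named `test_*` automatically; a CI configuration then just invokes `pytest` on each pull request.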