Patrick Cherry
- emmeans: Estimated marginal means
- DoE: Design of Experiment
- Random forest classification
- Anscombe’s Quartet
Estimated marginal means (EMMs, previously known as least-squares means in the context of traditional regression models) are derived by using a model to make predictions over a regular grid of predictor combinations (called a reference grid).
I use estimated marginal means to estimate the effect sizes when:
- interaction effects are present,
- multiple effects are present but are not scaled the same way (e.g. one effect is linear and another is reciprocal, 1/x),
- variability is not homoscedastic (e.g. as in a glm) and I could use the confidence intervals offered by emmeans that I don’t get with ordinary means,
- or the experiment’s sampling is not balanced, causing some conditions to be over-weighted in the ordinary means.
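The reference-grid idea can be sketched in base R without the emmeans package. This is only an illustration, assuming the built-in mtcars data as a stand-in for an unbalanced experiment; the variable names (cars, ref_grid, emm_am) are made up for the example. It fits a model with an interaction, predicts over every combination of the two factors, and averages those predictions with equal weight. It does not compute the confidence intervals that emmeans provides (in emmeans the comparable call would be along the lines of emmeans(fit, ~ am)).

```r
# Stand-in data: mtcars is unbalanced across cylinder count and transmission
cars <- transform(
  mtcars,
  cyl = factor(cyl),
  am  = factor(am, labels = c("automatic", "manual"))
)

# Model with an interaction between the two factors
fit <- lm(mpg ~ cyl * am, data = cars)

# Reference grid: every combination of the factor levels, one row each
ref_grid <- expand.grid(cyl = levels(cars$cyl), am = levels(cars$am))
ref_grid$pred <- predict(fit, newdata = ref_grid)

# Estimated marginal means for transmission: equal-weight average of the
# grid predictions over the cylinder levels
emm_am <- tapply(ref_grid$pred, ref_grid$am, mean)

# Ordinary means for comparison: weighted by how many cars fall in each cell
raw_am <- tapply(cars$mpg, cars$am, mean)

print(rbind(estimated_marginal = emm_am, ordinary = raw_am))
```

Because the cells are unequally populated, the two rows differ: the ordinary means lean toward the most common cylinder count within each transmission type, while the marginal means weight each cylinder level equally.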
Design of Experiment principles seek to configure the samples, variables, and controls in a scientific experimental plan to answer the question posed or test the hypothesis while controlling for known sources of variability and confounding due to the methods and materials used to carry out the experiment.
Here, I use the R function gen.factorial to make a full factorial design and then use the AlgDesign package to add blocking for two operators who will be carrying out the experiment (with n = 3 replicates for each unique sample condition).
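The shape of such a plan can be sketched in base R. This is a simplified stand-in, not the gen.factorial and AlgDesign workflow described above: it uses expand.grid for the full factorial and a plain randomized split into two equal-sized operator blocks, rather than the optimized blocking AlgDesign computes. The factor names and levels (temperature, enzyme, buffer) are hypothetical.

```r
set.seed(42)  # fixed seed so the randomization is reproducible

# Hypothetical factors and levels, for illustration only
design <- expand.grid(
  temperature = c("low", "high"),
  enzyme      = c("A", "B", "C"),
  buffer      = c("tris", "hepes"),
  stringsAsFactors = FALSE
)
n_conditions <- nrow(design)  # 2 * 3 * 2 unique conditions
n_reps <- 3                   # replicates per unique condition

# Repeat every condition n_reps times
runs <- design[rep(seq_len(n_conditions), each = n_reps), ]
runs$replicate <- rep(seq_len(n_reps), times = n_conditions)

# Randomly split the runs into two equal-sized blocks, one per operator
runs$operator <- sample(
  rep(c("operator_1", "operator_2"), length.out = nrow(runs))
)

# Randomize the run order within each operator's block
runs$run_order <- ave(
  seq_len(nrow(runs)), runs$operator,
  FUN = function(i) sample(length(i))
)

runs <- runs[order(runs$operator, runs$run_order), ]
rownames(runs) <- NULL
head(runs)
```

A random split keeps the block sizes equal but does not guarantee that each operator sees a balanced subset of the conditions, which is the part a dedicated blocking routine addresses.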
Random forest model classification of legal status of trees in San Francisco Department of Public Works data from a Tidy Tuesday project.
Here, I use the ranger engine to train a random forest model to classify the legal status of the trees using all relevant observations, use the tune package to run hyperparameter optimization on mtry and min_n, and then evaluate the accuracy of the model using AUC and by plotting the correct and incorrect predictions in a map-like format.
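A minimal sketch of that tuning loop with tidymodels is below. It assumes the tidymodels and ranger packages are installed, and it uses the built-in iris data as a stand-in because the tree data is not reproduced here; the split proportion, fold count, tree count, and grid size are arbitrary choices for the example, not the settings used in the project.

```r
library(tidymodels)

set.seed(123)

# Stand-in data: iris, with Species as the class to predict
data_split <- initial_split(iris, prop = 0.8, strata = Species)
train_data <- training(data_split)

# Random forest specification with mtry and min_n marked for tuning
rf_spec <- rand_forest(mtry = tune(), min_n = tune(), trees = 200) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_wf <- workflow() %>%
  add_formula(Species ~ .) %>%
  add_model(rf_spec)

# Cross-validation folds on the training set
folds <- vfold_cv(train_data, v = 5, strata = Species)

# Try 5 candidate (mtry, min_n) combinations, scored by ROC AUC
rf_tuned <- tune_grid(
  rf_wf,
  resamples = folds,
  grid = 5,
  metrics = metric_set(roc_auc)
)

best_params <- select_best(rf_tuned, metric = "roc_auc")

# Refit on the full training set and score once on the held-out test set
final_fit <- rf_wf %>%
  finalize_workflow(best_params) %>%
  last_fit(data_split, metrics = metric_set(roc_auc))

test_auc <- collect_metrics(final_fit)
print(test_auc)
```

Holding the test set back until last_fit keeps the reported AUC separate from the data used to choose mtry and min_n.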
Anscombe’s quartet is a group of four sets of x-y value pairs published by F. J. Anscombe in The American Statistician in 1973. The sets have nearly identical descriptive statistics, like mean, standard deviation, R^2 correlation, and least-squares regression slopes (to roughly two or three decimal places), but are clearly very different data sets when visualized by plotting.
I use unpivotr to tidy the data upon import, ggplot to make plots, and purrr’s map for functional programming on nested data frames, with broom for manipulating model objects, including those nested in data frames.
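The near-identical summary statistics can be checked with a short base R sketch. It relies on the anscombe data frame that ships with R (columns x1 to x4 and y1 to y4) and uses lapply and lm in place of the unpivotr, purrr, and broom workflow described above; the summaries table and its column names are invented for the example.

```r
# One row of summary statistics per set in the quartet
summaries <- do.call(rbind, lapply(1:4, function(i) {
  x <- anscombe[[paste0("x", i)]]
  y <- anscombe[[paste0("y", i)]]
  fit <- lm(y ~ x)  # least-squares regression of y on x
  data.frame(
    set       = i,
    mean_x    = mean(x),
    mean_y    = mean(y),
    sd_x      = sd(x),
    sd_y      = sd(y),
    cor_xy    = cor(x, y),
    intercept = coef(fit)[[1]],
    slope     = coef(fit)[[2]],
    r_squared = summary(fit)$r.squared
  )
}))

print(round(summaries, 3))
```

The rows of the printed table are nearly interchangeable, which is why plotting each set is the only way to see how different they are.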