In part 1 we wrote our first R function to compute the difference between the max and min of a numeric vector. We checked the validity of the function's only argument and, informally, we verified that it worked pretty well.
In this part, we generalize this function, learn more technical details about R functions, and set default values for some arguments.
As usual, load the Gapminder excerpt.
library(gapminder)
Let's keep our previous function around as a baseline.
mmm <- function(x) {
stopifnot(is.numeric(x))
max(x) - min(x)
}
The max and the min are special cases of a quantile. Here are other special cases you may have heard of:
- median = 0.5 quantile
- 1st quartile = 0.25 quantile
- 3rd quartile = 0.75 quantile
If you're familiar with box plots, the rectangle typically runs from the 1st quartile to the 3rd quartile, with a line at the median.
If
Let's generalize our function to take the difference between any two quantiles. We can still consider the max and min, if we like, but we're not limited to that.
The eventual inputs to our new function will be the data x
and two probabilities.
First, play around with the quantile()
function. Convince yourself you know how to use it, for example, by cross-checking your results with other built-in functions.
quantile(gapminder$lifeExp)
## 0% 25% 50% 75% 100%
## 23.5990 48.1980 60.7125 70.8455 82.6030
quantile(gapminder$lifeExp, probs = 0.5)
## 50%
## 60.7125
median(gapminder$lifeExp)
## [1] 60.7125
quantile(gapminder$lifeExp, probs = c(0.25, 0.75))
## 25% 75%
## 48.1980 70.8455
boxplot(gapminder$lifeExp, plot = FALSE)$stats
## [,1]
## [1,] 23.5990
## [2,] 48.1850
## [3,] 60.7125
## [4,] 70.8460
## [5,] 82.6030
Now write a code snippet that takes the difference between two quantiles.
the_probs <- c(0.25, 0.75)
the_quantiles <- quantile(gapminder$lifeExp, probs = the_probs)
max(the_quantiles) - min(the_quantiles)
## [1] 22.6475
IQR(gapminder$lifeExp) # hey, we've reinvented IQR
## [1] 22.6475
I'll use qdiff
as the base of our function's name. I copy the overall structure from our previous "max minus min" work but replace the guts of the function with the more general code we just developed.
qdiff1 <- function(x, probs) {
stopifnot(is.numeric(x))
the_quantiles <- quantile(x = x, probs = probs)
max(the_quantiles) - min(the_quantiles)
}
qdiff1(gapminder$lifeExp, probs = c(0.25, 0.75))
## [1] 22.6475
qdiff1(gapminder$lifeExp, probs = c(0, 1))
## [1] 59.004
mmm(gapminder$lifeExp)
## [1] 59.004
Again we do some informal tests against familiar results.
I want you to understand the importance of argument names.
I can name my arguments almost anything I like. Proof:
qdiff2 <- function(zeus, hera) {
stopifnot(is.numeric(zeus))
the_quantiles <- quantile(x = zeus, probs = hera)
return(max(the_quantiles) - min(the_quantiles))
}
qdiff2(zeus = gapminder$lifeExp, hera = 0:1)
## [1] 59.004
While I can name my arguments after Greek gods, it's usually a bad idea. Take all opportunities to make things more self-explanatory via meaningful names.
This is better:
qdiff3 <- function(my_x, my_probs) {
stopifnot(is.numeric(my_x))
the_quantiles <- quantile(x = my_x, probs = my_probs)
return(max(the_quantiles) - min(the_quantiles))
}
qdiff3(my_x = gapminder$lifeExp, my_probs = 0:1)
## [1] 59.004
If you are going to pass the arguments of your function as arguments of a built-in function, consider copying the argument names. Again, the reason is to reduce your cognitive load. This is what I've been doing all along and now you know why:
qdiff1
## function(x, probs) {
## stopifnot(is.numeric(x))
## the_quantiles <- quantile(x = x, probs = probs)
## max(the_quantiles) - min(the_quantiles)
## }
We took this detour so you could see there is no structural relationship between my arguments (x
and probs
) and those of quantile()
(also x
and probs
). The similarity or equivalence of the names accomplishes nothing as far as R is concerned; it is solely for the benefit of humans reading, writing, and using the code. Which is very important!
By this point, I expect someone will have asked about the last line in my function's body. Look above for a reminder of the function's definition.
By default, a function returns the result of the last line of the body. I am just letting that happen with the line max(the_quantiles) - min(the_quantiles)
. However, there is an explicit function for this: return()
. I could just as easily make this the last line of my function's body:
return(max(the_quantiles) - min(the_quantiles))
You absolutely must use return()
if you want to return early based on some condition, i.e. before execution gets to the last line of the body. Otherwise, you can decide your own conventions about when you use return()
and when you don't.
What happens if we call our function but neglect to specify the probabilities?
qdiff1(gapminder$lifeExp)
## Error in quantile.default(x = x, probs = probs): argument "probs" is missing, with no default
Oops! At the moment, this causes a fatal error. It can be nice to provide some reasonable default values for certain arguments. In our case, it would be crazy to specify a default value for the primary input x
, but very kind to specify a default for probs
.
We started by focusing on the max and the min, so I think those make reasonable defaults. Here's how to specify that in a function definition.
qdiff4 <- function(x, probs = c(0, 1)) {
stopifnot(is.numeric(x))
the_quantiles <- quantile(x, probs)
return(max(the_quantiles) - min(the_quantiles))
}
Again we check how the function works, in old examples and new, specifying the probs
argument and not.
qdiff4(gapminder$lifeExp)
## [1] 59.004
mmm(gapminder$lifeExp)
## [1] 59.004
qdiff4(gapminder$lifeExp, c(0.1, 0.9))
## [1] 33.5862
EXERCISE FOR THE READER: upgrade our argument validity checks in light of the new argument probs
## problems identified during class
## we're not checking that probs is numeric
## we're not checking that probs is length 2
## we're not checking that probs are in [0,1]
Here's the function we've written so far:
qdiff4
## function(x, probs = c(0, 1)) {
## stopifnot(is.numeric(x))
## the_quantiles <- quantile(x, probs)
## return(max(the_quantiles) - min(the_quantiles))
## }
What we've accomplished:
- we've generalized our first function to take a difference between arbitrary quantiles
- we've specified default values for the probabilities that set the quantiles
Where to next? In Part 3, we tackle NA
s, the special ...
argument, and formal unit testing.