blog/introduction-to-data-manipulation-in-r-with-dplyr/ #109
R is a crappy procedural programming language. R also has an unsurpassed richness of statistical tools and a clean REPL that presents to the user as a functional programming language. Pass your arguments to the right function and get a return value that has a high likelihood of being appropriate. {dplyr} and the related tidyesque tools try to overcome the functional paradigm by providing a superset to make R more procedural. In doing so, attention shifts from nouns to verbs; the user becomes more concerned with the how, less concerned with the what, and in danger of becoming oblivious to the why.
Your discussion of case_when provides an apt illustration. Assigning the values of penguins$body_mass_g to the categories NA, Low, Medium and High becomes a hot mess when attempted with ifelse(), as you note. And in the procedural toolbox of {base}, that's mainly what there is to work with. But that's the wrong place to look. It can (and should) be done functionally.
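As a minimal sketch of what a functional base R version could look like (not necessarily the code the commenter had in mind): a small classifier function is applied to each element with vapply(). The quartile cut points are an assumption borrowed from the Julia description further down, the helper name classify_mass is hypothetical, and the short vector is made-up data standing in for penguins$body_mass_g.

```r
# Made-up stand-in for penguins$body_mass_g (illustrative values only)
body_mass_g <- c(3750, 3800, NA, 3250, 4675, 5700, 4200)

# Assumed cut points: first and third quartiles, ignoring missing values
q1 <- quantile(body_mass_g, 0.25, na.rm = TRUE, names = FALSE)
q3 <- quantile(body_mass_g, 0.75, na.rm = TRUE, names = FALSE)

# Hypothetical helper: classify a single value against the two cut points
classify_mass <- function(x, lo, hi) {
  if (is.na(x)) return(NA_character_)  # missing stays missing
  if (x <= lo) return("Low")           # at or below the first quartile
  if (x >= hi) return("High")          # at or above the third quartile
  "Medium"                             # everything in between
}

# Apply the helper to every element; vapply() enforces one string per element
body_mass_cat <- vapply(body_mass_g, classify_mass, character(1),
                        lo = q1, hi = q3)
```

With the {palmerpenguins} data loaded, the same calls would take penguins$body_mass_g in place of the toy vector.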
Everything is explicit and there can be no mystery as to how the additional variable was derived. But it's tedious, principally because R lacks comprehensions. Here's what this looks like in a modern functional language, Julia.
Once given v, a vector containing the penguins$body_mass_g values, it is a two-liner to produce a vector of categories. It could be a one-liner at the cost of folding the quantile definitions into the last statement, but that is more concision than is useful. The beauty of this is that it iterates over each element in the vector. If it encounters missing (Julia's equivalent of NA), it returns missing; otherwise it checks whether the element is in the bottom quartile, in which case it returns "Low", or in the top quartile, in which case it returns "High". Otherwise, it returns "Middle". The ability to apply a ternary operator is just what is missing in {base} R.
I did this in a Julia script because it's easier to illustrate. It could also be done in an R script through the {JuliaCall} package, but that's not something that should regularly be resorted to. And for the large majority of REPL applications, this specific problem is better handled with case_when. So why do I bring it up? Because for the tasks that strain R scripting, Julia has several advantages.
This dplyr intro post is very informative!
blog/introduction-to-data-manipulation-in-r-with-dplyr/
Learn to use the dplyr R package which helps you to solve the most common data manipulation challenges such as filtering, summarizing or sorting observations
https://statsandr.com/blog/introduction-to-data-manipulation-in-r-with-dplyr/