blog/introduction-to-data-manipulation-in-r-with-dplyr/ #109

2023-12-03T08:59:54Z

giscus[bot]
bot Dec 3, 2023

Learn to use the dplyr R package which helps you to solve the most common data manipulation challenges such as filtering, summarizing or sorting observations

https://statsandr.com/blog/introduction-to-data-manipulation-in-r-with-dplyr/

technocrat · 2023-12-03T08:59:54Z

R is a crappy procedural programming language. R also has an unsurpassed richness of statistical tools and a clean REPL that presents itself to the user as a functional programming language. Pass your arguments to the right function and get a return value that has a high likelihood of being appropriate. {dplyr} and the related tidyesque tools try to overcome the functional paradigm by providing a superset that makes R more procedural. In doing so, attention shifts from nouns to verbs; the user becomes more concerned with the how, less concerned with the what, and in danger of becoming oblivious to the why. Your discussion of case_when provides an apt illustration.

Assigning the values of penguins$body_mass_g to the four categories NA, Low, Medium and High becomes a hot mess when attempting to use ifelse(), as you note. And in the procedural toolbox of {base}, that's mainly what there is to work with. But that's the wrong place to look. It can (and should) be done functionally.

library(palmerpenguins)
# I avoid tibbles when there's no need to deal
# with anything beyond vector variables and
# also avoid data frames when dealing with
# data that is all numeric or all character

dat <- as.data.frame(penguins) 

# categorize by the lower and upper hinges from fivenum(),
# which approximate the first and third quartiles
bin_mass <- function(x, y) {
    x$body_mass_cat = NA
    lo     = fivenum(y)[2]          # lower hinge
    hi     = fivenum(y)[4]          # upper hinge
    the_na = which(is.na(y))        # rows with a missing mass stay NA
    the_lo = which(y < lo)
    the_hi = which(y > hi)
    # everything that is not missing, low or high is "Medium"
    mids   = setdiff(seq_len(nrow(x)), c(the_na, the_lo, the_hi))
    x$body_mass_cat[the_lo] = "Low"
    x$body_mass_cat[the_hi] = "High"
    x$body_mass_cat[mids]   = "Medium"
    return(x$body_mass_cat)
}

dat$body_mass_cat <- bin_mass(dat,dat$body_mass_g)
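
As a quick sanity check of the binning rule, here is a minimal sketch on a small made-up vector rather than the penguins data. The helper name bin_by_hinges and the toy values are hypothetical; the logic is only meant to mirror bin_mass() above (below the lower hinge is "Low", above the upper hinge is "High", missing stays NA, everything else is "Medium").

```r
# Hypothetical helper mirroring bin_mass(), but taking only the vector.
bin_by_hinges <- function(y) {
  h   <- fivenum(y)                      # min, lower hinge, median, upper hinge, max (NAs dropped)
  out <- rep(NA_character_, length(y))   # missing values stay NA
  out[!is.na(y)]       <- "Medium"       # default for observed values
  out[which(y < h[2])] <- "Low"          # below the lower hinge
  out[which(y > h[4])] <- "High"         # above the upper hinge
  out
}

toy <- c(1, 2, 3, 4, 5, NA)              # made-up values, not penguin data
res <- bin_by_hinges(toy)
print(res)
# "Low" "Medium" "Medium" "Medium" "High" NA
```

For these toy values fivenum() returns 1, 2, 3, 4, 5, so only 1 falls below the lower hinge and only 5 falls above the upper one.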

Everything is explicit and there can be no mystery as to how the additional variable was derived. But it's tedious, principally because R lacks comprehensions. Here's what this looks like in a modern functional language, Julia.

using RCall
using DataFrames
using Statistics
R"library(palmerpenguins)"
d = rcopy(DataFrame, R"as.data.frame(penguins)")
v = d.body_mass_g
# everything to this point was in aid of bringing in the data from R
# it would be more direct bringing in from CSV
q = quantile(skipmissing(v),[0.25,0.75])
new_v= [ismissing(x) ? missing : (x < q[1] ? "Low" : (x > q[2] ? "High" : "Middle")) for x in v]

Once given v, a vector containing the penguins$body_mass_g values, it is a two-liner to produce a vector of the categories. It could be a one-liner at the cost of folding the quantile definitions into the last statement. That's more concision than is useful, however.

The beauty of this is that it iterates over each element in the vector. If it encounters a missing value (Julia's equivalent of NA), it returns missing; otherwise it checks whether the element is in the bottom quartile, in which case it returns "Low" or, if not, whether it is in the top quartile, in which case it returns "High". Otherwise, it returns "Middle". The possibility to apply a ternary operator is just what is missing in {base} R.
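
For comparison, a rough base R analogue of the comprehension might look like the sketch below. It leans on the fact that if/else is an expression in R and uses vapply() for the element-by-element pass; the values in v are made up for illustration, and the result is wordier than the Julia version.

```r
# Made-up masses in grams; NA marks a missing measurement.
v <- c(3000, 4000, NA, 5000, 6000)
q <- quantile(v, c(0.25, 0.75), na.rm = TRUE)

# if/else returns a value, so it can stand in for the ternary;
# vapply() supplies the iteration and checks each result is one string.
new_v <- vapply(
  v,
  function(x) {
    if (is.na(x)) NA_character_
    else if (x < q[1]) "Low"
    else if (x > q[2]) "High"
    else "Middle"
  },
  character(1)
)
print(new_v)
# "Low" "Middle" NA "Middle" "High"
```

With these toy values the quartiles are 3750 and 5250, so only 3000 is "Low" and only 6000 is "High".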

I did this in a Julia script, because it's easier to illustrate. It could also be done in an R script through use of the {JuliaCall} package, but that's not something that should regularly be resorted to. And, also, for the large majority of REPL applications, addressing this specific problem is better handled with case_when. So, why do I bring it up?

For the tasks that strain R scripting, however, Julia has several advantages.

The syntax is somewhat similar to Python while preserving a functional programming mindset
Scripts can easily be expressed with all dependencies and version metadata to make them completely reproducible
Robust tools for memory management, parallel and distributed processing
JIT compilation provides execution times competitive with Fortran, C/C++, Java, Go, Rust
Dependency hells are very uncommon
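
For the everyday case mentioned earlier, where case_when is the better fit, a minimal sketch might look like the following. It assumes {dplyr} is installed, uses made-up values in place of the penguins data, and borrows the quartile cutoffs from this comment rather than from the post.

```r
library(dplyr)

# Made-up masses in grams standing in for penguins$body_mass_g.
toy <- data.frame(body_mass_g = c(3000, 4000, NA, 5000, 6000))

toy <- mutate(
  toy,
  body_mass_cat = case_when(
    is.na(body_mass_g) ~ NA_character_,                               # keep missing as NA
    body_mass_g < quantile(body_mass_g, 0.25, na.rm = TRUE) ~ "Low",  # below first quartile
    body_mass_g > quantile(body_mass_g, 0.75, na.rm = TRUE) ~ "High", # above third quartile
    TRUE ~ "Medium"                                                   # everything else
  )
)
print(toy)
```

The conditions are checked in order, so the explicit is.na() clause comes first and the catch-all TRUE clause comes last.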

5 replies

AntoineSoetewey Dec 3, 2023
Maintainer

Dear @technocrat,

Many thanks for your input! Learning Julia (and Python) is on my (long term) to-do list, for the reasons you mentioned and many others.

That being said, I believe that Julia currently lacks some interesting packages compared to R, in particular for statistical tests and data visualization. Of course, it’s very likely that the gap will narrow in the future thanks to Julia’s strong and active community.

Regards,
Antoine

technocrat Dec 3, 2023

Julia can "import" R packages to expand its own Statistics library, which is pretty bare-bones. There's also a bunch of packages that can all be swept in with using StatsKit. For me the Plots package strikes a nice compromise between base R plot and ggplot and, of course, there's no bar to switching back and forth. For example, quarto will run Julia chunks (and Python). I think it will be fun for everyone.

AntoineSoetewey Dec 3, 2023
Maintainer

Very interesting to know!

What resources do you recommend for learning it, for someone who has never used it?

technocrat Dec 3, 2023

I'm using Julia for Data Analysis by Bogumił Kamiński after trying a couple of others, one a traditional academic approach building up from definitional elements and another proceeding by analogy from other languages. This one, however, takes a more practical approach that combines use cases with the underlying mechanics of how Julia works. It is extremely well organized.

AntoineSoetewey Dec 3, 2023
Maintainer

Thanks, I’ll definitely check it out!

YuxiaoLuo · 2025-01-02T16:27:50Z

This dplyr intro post is very informative!

0 replies
