+
+
+
+
+We have been running into issues with model.matrix() and model.frame() over the years. We also haven’t done anything about them because replacing these functions is quite a big task.
+I believe that a replacement is needed, so I’m gathering our requirements here so we can build it right the first time.
+Old notes from Max, which led to {recipes} but still have information about model.matrix():
+- https://rviews.rstudio.com/2017/02/01/the-r-formula-method-the-good-parts/
+- https://rviews.rstudio.com/2017/03/01/the-r-formula-method-the-bad-parts/
+
+What we don’t want
+
+Remove ancillary features (subsets, filters)
+As shown with the subset argument to model.frame():
+
+
model.frame(mpg ~ ., data = mtcars, subset = disp < 100)
+
+
+                mpg cyl disp  hp drat    wt  qsec vs am gear carb
+Fiat 128       32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
+Honda Civic    30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
+Toyota Corolla 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
+Fiat X1-9      27.3   4 79.0  66 4.08 1.935 18.90  1  1    4    1
+Lotus Europa   30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
+
+
+
+
+Don’t allow special functions
+There are special functions that can be defined and used in a formula (e.g. offset). We don’t want those.
+
+
model.frame(mpg ~ wt + vs + am + offset(disp), data = mtcars)
+
+
+                     mpg    wt vs am offset(disp)
+Mazda RX4           21.0 2.620  0  1        160.0
+Mazda RX4 Wag       21.0 2.875  0  1        160.0
+Datsun 710          22.8 2.320  1  1        108.0
+Hornet 4 Drive      21.4 3.215  1  0        258.0
+Hornet Sportabout   18.7 3.440  0  0        360.0
+Valiant             18.1 3.460  1  0        225.0
+Duster 360          14.3 3.570  0  0        360.0
+Merc 240D           24.4 3.190  1  0        146.7
+Merc 230            22.8 3.150  1  0        140.8
+Merc 280            19.2 3.440  1  0        167.6
+Merc 280C           17.8 3.440  1  0        167.6
+Merc 450SE          16.4 4.070  0  0        275.8
+Merc 450SL          17.3 3.730  0  0        275.8
+Merc 450SLC         15.2 3.780  0  0        275.8
+Cadillac Fleetwood  10.4 5.250  0  0        472.0
+Lincoln Continental 10.4 5.424  0  0        460.0
+Chrysler Imperial   14.7 5.345  0  0        440.0
+Fiat 128            32.4 2.200  1  1         78.7
+Honda Civic         30.4 1.615  1  1         75.7
+Toyota Corolla      33.9 1.835  1  1         71.1
+Toyota Corona       21.5 2.465  1  0        120.1
+Dodge Challenger    15.5 3.520  0  0        318.0
+AMC Javelin         15.2 3.435  0  0        304.0
+Camaro Z28          13.3 3.840  0  0        350.0
+Pontiac Firebird    19.2 3.845  0  0        400.0
+Fiat X1-9           27.3 1.935  1  1         79.0
+Porsche 914-2       26.0 2.140  0  1        120.3
+Lotus Europa        30.4 1.513  1  1         95.1
+Ford Pantera L      15.8 3.170  0  1        351.0
+Ferrari Dino        19.7 2.770  0  1        145.0
+Maserati Bora       15.0 3.570  0  1        301.0
+Volvo 142E          21.4 2.780  1  1        121.0
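One way a replacement could refuse offsets is sketched below. This is only an illustration: `reject_offset()` is a hypothetical helper name, not an existing function. It relies on `terms()` recording offset terms in an `offset` attribute, which is absent when the formula has no offsets.

```r
# Hypothetical validator: error if the formula uses offset()
reject_offset <- function(formula, data) {
  trms <- terms(formula, data = data)
  # terms() only sets the "offset" attribute when offset() terms are present
  if (!is.null(attr(trms, "offset"))) {
    stop("`offset()` is not supported in formulas.", call. = FALSE)
  }
  invisible(formula)
}

# Passes silently
reject_offset(mpg ~ wt + vs + am, mtcars)

# Errors because of offset(disp)
res <- try(reject_offset(mpg ~ wt + vs + am + offset(disp), mtcars), silent = TRUE)
inherits(res, "try-error")
```

Other special functions would need their own checks; this sketch only covers `offset()`.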
+
+
+
+
+
+What we want
+
+Redefine terms
+The terms attribute that comes out of model.frame() would be reorganized under the new format. Right now it looks something like this:
+
+
model.frame(~ ., data = iris) |>
+  attr("terms")
+
+
~Sepal.Length + Sepal.Width + Petal.Length + Petal.Width + Species
+attr(,"variables")
+list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
+attr(,"factors")
+             Sepal.Length Sepal.Width Petal.Length Petal.Width Species
+Sepal.Length            1           0            0           0       0
+Sepal.Width             0           1            0           0       0
+Petal.Length            0           0            1           0       0
+Petal.Width             0           0            0           1       0
+Species                 0           0            0           0       1
+attr(,"term.labels")
+[1] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width" "Species"
+attr(,"order")
+[1] 1 1 1 1 1
+attr(,"intercept")
+[1] 1
+attr(,"response")
+[1] 0
+attr(,".Environment")
+<environment: R_GlobalEnv>
+attr(,"predvars")
+list(Sepal.Length, Sepal.Width, Petal.Length, Petal.Width, Species)
+attr(,"dataClasses")
+Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species
+   "numeric"    "numeric"    "numeric"    "numeric"     "factor"
+
+
+
+
+Sparse matrix for terms
+Notice how the attr(,"factors") element tends to be a very sparse matrix? We should encode it differently, especially for very wide data, which is becoming more common.
+
+
model.frame(~ ., data = mtcars) |>
+  attr("terms")
+
+
~mpg + cyl + disp + hp + drat + wt + qsec + vs + am + gear +
+ carb
+attr(,"variables")
+list(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
+attr(,"factors")
+     mpg cyl disp hp drat wt qsec vs am gear carb
+mpg    1   0    0  0    0  0    0  0  0    0    0
+cyl    0   1    0  0    0  0    0  0  0    0    0
+disp   0   0    1  0    0  0    0  0  0    0    0
+hp     0   0    0  1    0  0    0  0  0    0    0
+drat   0   0    0  0    1  0    0  0  0    0    0
+wt     0   0    0  0    0  1    0  0  0    0    0
+qsec   0   0    0  0    0  0    1  0  0    0    0
+vs     0   0    0  0    0  0    0  1  0    0    0
+am     0   0    0  0    0  0    0  0  1    0    0
+gear   0   0    0  0    0  0    0  0  0    1    0
+carb   0   0    0  0    0  0    0  0  0    0    1
+attr(,"term.labels")
+ [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear"
+[11] "carb"
+attr(,"order")
+ [1] 1 1 1 1 1 1 1 1 1 1 1
+attr(,"intercept")
+[1] 1
+attr(,"response")
+[1] 0
+attr(,".Environment")
+<environment: R_GlobalEnv>
+attr(,"predvars")
+list(mpg, cyl, disp, hp, drat, wt, qsec, vs, am, gear, carb)
+attr(,"dataClasses")
+      mpg       cyl      disp        hp      drat        wt      qsec        vs
+"numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric" "numeric"
+       am      gear      carb
+"numeric" "numeric" "numeric"
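As a rough illustration of a different encoding, the dense factors matrix can be stored as a table of its non-zero cells only. This is a sketch in base R, not a proposed format; the name `sparse_factors` and its column names are assumptions made for the example.

```r
# Terms for a formula with 11 main effects
trms <- terms(~ ., data = mtcars)
fct <- attr(trms, "factors")

# Triplet encoding: keep only the non-zero cells of the factors matrix
nz <- which(fct != 0, arr.ind = TRUE)
sparse_factors <- data.frame(
  variable = rownames(fct)[nz[, "row"]],
  term = colnames(fct)[nz[, "col"]],
  value = fct[nz],
  row.names = NULL
)

# Number of stored rows versus number of cells in the dense matrix
c(sparse = nrow(sparse_factors), dense = length(fct))
```

For this formula the triplet form stores 11 rows where the dense matrix holds 121 cells, and the gap grows with the number of columns.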
+
+
+
+
+Allow subsets of predictors to be processed
+
+
+
+Merge model.frame() and model.matrix()
+Right now you end up using two functions. It would be nice to have just one main function.
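To make the two-step workflow concrete, here is a sketch of what happens today next to one possible single entry point. `model_design()` and the shape of its return value are hypothetical, chosen only for illustration.

```r
# Current workflow: build the model frame, then the model matrix from its terms
mf <- model.frame(mpg ~ wt + factor(cyl), data = mtcars)
mm <- model.matrix(attr(mf, "terms"), data = mf)

# Hypothetical single entry point (name and return shape are assumptions)
model_design <- function(formula, data) {
  frame <- model.frame(formula, data = data)
  trms <- attr(frame, "terms")
  list(
    predictors = model.matrix(trms, data = frame),
    outcome = model.response(frame),
    terms = trms
  )
}

res <- model_design(mpg ~ wt + factor(cyl), mtcars)
dim(res$predictors)
```

The wrapper still calls both base functions internally; a real replacement would presumably do the work in one pass.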
+
+
+API to stop dummies
+We need some API to determine which factors should be converted to dummy variables (for example, ID variables should not be).
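One possible shape for such an API is an argument naming the factors to leave alone. The sketch below is only an illustration in base R: `expand_dummies()` and its `skip` argument are hypothetical, and it uses full indicator coding (one column per level) rather than any particular contrast.

```r
# Hypothetical helper: expand factors to 0/1 indicator columns,
# except for the columns named in `skip`
expand_dummies <- function(data, skip = character()) {
  is_fct <- vapply(data, is.factor, logical(1))
  expand <- setdiff(names(data)[is_fct], skip)
  pieces <- lapply(names(data), function(nm) {
    col <- data[[nm]]
    if (!nm %in% expand) {
      # Leave non-factors and skipped factors untouched
      return(setNames(data.frame(col), nm))
    }
    # One indicator column per factor level
    ind <- outer(as.integer(col), seq_len(nlevels(col)), `==`) + 0L
    colnames(ind) <- paste(nm, levels(col), sep = "_")
    as.data.frame(ind)
  })
  do.call(cbind, pieces)
}

dat <- data.frame(
  id = factor(c("a", "b", "c")),
  group = factor(c("x", "y", "x")),
  value = c(1.5, 2.5, 3.5)
)
out <- expand_dummies(dat, skip = "id")
names(out)
```

Here `id` stays a factor while `group` becomes `group_x` and `group_y`.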
+
+
+Early returns
+This is fairly self-explanatory: many simple formulas don’t require much work and could be handled by a fast path. Having a few reasonable early exits could give us some nice speed-ups.
+Silly reprex:
+early_exit <- function(formula, data) {
+  # Fast path: `~ . - 1` on all-numeric data is just the data as a matrix
+  if (identical(rlang::f_rhs(formula), quote(. - 1)) &&
+      all(vapply(data, is.numeric, logical(1)))) {
+    data <- as.matrix(data)
+    attr(data, "assign") <- seq_len(ncol(data))
+    return(data)
+  }
+  # Anything else is out of scope for this reprex
+  stop("not right")
+}
+
+formula <- ~ . - 1
+
+bench::mark(
+  model.matrix(formula, mtcars),
+  early_exit(formula, mtcars)
+)
+#> # A tibble: 2 × 6
+#> expression min median `itr/sec` mem_alloc `gc/sec`
+#> <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl>
+#> 1 model.matrix(formula, mtcars) 236.9µs 254.2µs 3867. 315.1KB 34.4
+#> 2 early_exit(formula, mtcars) 35.4µs 37.9µs 24899. 80.6KB 39.9
+
+
+No reliance on global options
+This has bitten us a lot. Options such as contrasts should be passed in as an argument and not depend on the global state.
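A small base R example of the problem: the same `model.matrix()` call gives different columns depending on `options(contrasts = ...)`, while passing `contrasts.arg` pins the result. The data frame `dat` is made up for the illustration.

```r
dat <- data.frame(f = factor(c("a", "b", "c")))

# Treatment contrasts set globally (the usual default for unordered factors)
old <- options(contrasts = c("contr.treatment", "contr.poly"))
mm_treatment <- model.matrix(~ f, dat)

# Same call, different global state, different design matrix
options(contrasts = c("contr.sum", "contr.poly"))
mm_sum <- model.matrix(~ f, dat)

# Passing the contrasts explicitly ignores the global option
mm_explicit <- model.matrix(~ f, dat, contrasts.arg = list(f = "contr.treatment"))

# Restore the original options
options(old)

colnames(mm_treatment)
colnames(mm_sum)
```

A replacement function could make something like `contrasts.arg` the only source of this setting.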
+
+
+model_tibble() and model_data_frame()
+There are times when we want the output of model.matrix() to be a tibble or data.frame, such as in hardhat::model_matrix().
+I don’t think it would be unreasonable to keep the object as a data.frame while doing the transformations. This would allow us to handle sparse tibbles natively.
+
+
+
+
+