PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes | Julia Silge #40

utterances-bot · 2021-07-24T12:58:29Z

PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes | Julia Silge

Use tidymodels for unsupervised dimensionality reduction.

https://juliasilge.com/blog/cocktail-recipes-umap/

factorialmap · 2021-07-24T12:58:30Z

Thank you so much, Julia. I think this video and content are a great intuitive explanation of PCA and of how to implement and visualize it well in RStudio.

portolan75 · 2021-11-14T15:43:37Z

Hi Julia, this tidy workflow is very interesting and I am using it more and more.
I also tried the UMAP workflow, but how do I predict UMAP coordinates for a new set of data?
In your example, baking umap_prep on a different dataset (with the same variables) does not work, and neither does the standard 'predict' function.
Am I doing something wrong, or is it not possible to predict/bake on a new set?

juliasilge · 2021-11-15T02:20:55Z

@portolan75 Is it this problem that you are seeing? Or something else?

If it is something else, then I suggest that you create a reprex (a minimal reproducible example) for the problem you are observing, and post it on RStudio Community. The goal of a reprex is to make it easier for us to recreate your problem so that others can understand it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

portolan75 · 2021-11-18T19:17:48Z

Hi @juliasilge, thanks for your answer.
Actually, after your comment I tried again and realised I had done something wrong with my dataset, which is why I was not able to 'predict' (bake) on the test set.
So I was getting good results on the training set but was not able to bake the umap coeffs for the test set.
Anyway, it worked. Thanks for your attention and also for redirecting me to the other 'CppMethod' problem, which turned out to be useful as well.
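
For readers with the same question, here is a minimal sketch of baking a trained UMAP recipe on new data. It assumes the embed package (which provides step_umap()) is installed, and it uses the built-in iris data with an arbitrary train/test split purely for illustration; none of this comes from the original post.

```r
library(recipes)
library(embed)  # assumed to be installed; provides step_umap()

set.seed(123)
# Illustrative split: 100 rows for training, the remaining 50 as "new" data
idx <- sample(nrow(iris), 100)
train <- iris[idx, ]
test <- iris[-idx, ]

umap_rec <- recipe(~ ., data = train) %>%
  update_role(Species, new_role = "id") %>%   # keep Species out of the embedding
  step_normalize(all_predictors()) %>%
  step_umap(all_predictors(), num_comp = 2)

# prep() estimates the normalization and the UMAP embedding from the training data
umap_prep <- prep(umap_rec)

# new_data = NULL returns the processed training set
train_umap <- bake(umap_prep, new_data = NULL)

# The same trained recipe projects new rows into the learned embedding
test_umap <- bake(umap_prep, new_data = test)
```

The new data needs the same column names and types as the training data; a mismatch there is one plausible cause of bake() failing on a test set.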

Averysaurus · 2021-12-18T01:14:20Z

All this work is so brilliant, @juliasilge. Is there any literature (book chapters, articles, videos) on PCA interpretation that you can recommend?

juliasilge · 2021-12-20T15:54:26Z

One blog post + conference talk that I personally did is this one, using Stack Overflow data.
I like this Cross Validated answer.
This interactive explanation from setosa.io is one I often come back to.

Kasramhdz · 2022-01-07T14:17:43Z

Thank you for the fantastic tutorial.
I have a question: how can we change the rotation method applied in step_pca()?

juliasilge · 2022-01-07T16:32:22Z

@Kasramhdz The step_pca() function uses stats::prcomp() under the hood, which I don't believe supports that, but you can get out the loadings using tidy() and the type = "coef" argument and then apply a rotation yourself. See this Cross Validated answer for more info.
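
As a rough sketch of that suggestion, the loadings from tidy() can be reshaped into a matrix and passed to stats::varimax(). The choice of varimax, the use of USArrests, and the decision to rotate only the first two components are assumptions made for illustration. This version rotates the raw coefficients; scaling them by the component standard deviations first is a common alternative that is not shown here.

```r
library(recipes)
library(dplyr)
library(tidyr)

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)

pca_prep <- prep(rec)

# Pull the loadings and keep only the components we want to rotate
loadings_wide <- tidy(pca_prep, number = 2, type = "coef") %>%
  filter(component %in% c("PC1", "PC2")) %>%
  pivot_wider(names_from = component, values_from = value, id_cols = terms)

# varimax() expects a numeric matrix with variables in rows
loading_mat <- as.matrix(loadings_wide[, c("PC1", "PC2")])
rownames(loading_mat) <- loadings_wide$terms

rotated <- varimax(loading_mat)

rotated$loadings  # rotated loadings
rotated$rotmat    # the 2 x 2 rotation matrix that was applied
```

Other rotation functions that accept a loadings matrix, such as stats::promax(), can be swapped in at the last step.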

Kasramhdz · 2022-01-11T17:39:43Z

I have another question.
I'm new to tidymodels, but apparently step_pca() arguments such as num_comp or threshold are not applied when the recipe is trained. As in the example below, I'm still getting 4 components despite setting num_comp = 2.

library(recipes)
library(tidyr)  # for pivot_wider()

rec <- recipe( ~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2)

# Show the loadings with one column per component
prep(rec) %>% tidy(number = 2, type = "coef") %>%
    pivot_wider(names_from = component, values_from = value, id_cols = terms)

juliasilge · 2022-01-12T18:31:50Z

@Kasramhdz The full PCA is determined (so you can still compute the variances of each term) and num_comp specifies how many of the components are retained as predictors. If you want to specify the maximal rank, you can pass that through options:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
rec <- recipe( ~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))

prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#>   terms     value component id       
#>   <chr>     <dbl> <chr>     <chr>    
#> 1 Murder   -0.536 PC1       pca_T11OM
#> 2 Assault  -0.583 PC1       pca_T11OM
#> 3 UrbanPop -0.278 PC1       pca_T11OM
#> 4 Rape     -0.543 PC1       pca_T11OM
#> 5 Murder    0.418 PC2       pca_T11OM
#> 6 Assault   0.188 PC2       pca_T11OM
#> 7 UrbanPop -0.873 PC2       pca_T11OM
#> 8 Rape     -0.167 PC2       pca_T11OM

Created on 2022-01-12 by the reprex package (v2.0.1)

You could also control this via the tol argument.
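
To see the distinction between the full PCA and the retained components, the sketch below compares the variance summary with the baked output, without passing rank. through options. It assumes a version of recipes whose tidy() method for step_pca() supports type = "variance".

```r
library(recipes)

rec <- recipe(~ ., data = USArrests) %>%
  step_normalize(all_numeric()) %>%
  step_pca(all_numeric(), num_comp = 2)

pca_prep <- prep(rec)

# Variance information is reported for every component of the full PCA
pca_variance <- tidy(pca_prep, number = 2, type = "variance")
pca_variance

# Only the components retained by num_comp appear in the processed data
pca_scores <- bake(pca_prep, new_data = NULL)
names(pca_scores)
```

So num_comp limits what bake() returns, while the tidy() output still describes all of the components that were estimated.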
