PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes | Julia Silge #40

Open
utterances-bot opened this issue Jul 24, 2021 · 10 comments


@utterances-bot

PCA and UMAP with tidymodels and #TidyTuesday cocktail recipes | Julia Silge

Use tidymodels for unsupervised dimensionality reduction.

https://juliasilge.com/blog/cocktail-recipes-umap/


Thank you so much, Julia. I think this video and post are a great intuitive explanation of PCA and of how to implement and visualize it well in RStudio.


Hi Julia, this tidy workflow is very interesting and I am using it more and more.
I also tried the UMAP workflow, but how can I predict UMAP coordinates for a new set of data?
In your example, baking umap_prep on a different dataset (with the same variables) does not work, and neither does the standard 'predict' function.
Am I doing something wrong, or is it not possible to predict/bake on a new set?

@juliasilge
Owner

@portolan75 Is it this problem that you are seeing? Or something else?

If it is something else, then I suggest that you create a reprex (a minimal reproducible example) for the problem you are observing, and post it on RStudio Community. The goal of a reprex is to make it easier for us to recreate your problem so that others can understand it.

If you've never heard of a reprex before, you may want to start with the tidyverse.org help page. You may already have reprex installed (it comes with the tidyverse package), but if not you can install it with:

install.packages("reprex")

Thanks! 🙌

@portolan75

Hi @juliasilge, thanks for your answer.
After your comment I tried again and realised that I had made a mistake with my dataset, which is why I was not able to 'predict' (bake) on the test set.
I was getting good results on the training set but could not bake the UMAP coordinates for the test set.
Anyway, it works now. Thanks for your attention, and also for redirecting me to the other 'CppMethod' problem, which turned out to be useful as well.
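
For readers with the same question, here is a minimal sketch of the prep-then-bake pattern on held-out data. It uses step_pca() on USArrests rather than UMAP so that it only needs the recipes package; the train/test split and the object names are illustrative assumptions, and step_umap() from the embed package is assumed to follow the same prep()/bake() pattern, as the exchange above suggests.

```r
library(recipes)

# Illustrative split: the first 40 states as "training", the last 10 as "new" data
train <- USArrests[1:40, ]
test  <- USArrests[41:50, ]

rec <- recipe(~ ., data = train) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2)

# prep() estimates the centering, scaling, and loadings from `train` only
rec_prep <- prep(rec)

# bake() applies those stored estimates to data the recipe has not seen
new_scores <- bake(rec_prep, new_data = test)
new_scores
```

The new data must contain the same columns that the recipe was trained on; a mismatch there is one plausible way for bake() to fail on a test set.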


All this work is so brilliant, @juliasilge. Is there any literature (book chapters, articles, videos) on PCA interpretation that you can recommend?

@juliasilge
Owner


Thank you for the fantastic tutorial.
I have a question: how can we change the rotation method applied by step_pca()?

@juliasilge
Owner

@Kasramhdz The step_pca() function uses stats::prcomp() under the hood, which I don't believe supports that, but you can get out the loadings using tidy() and the type = "coef" argument and then apply a rotation yourself. See this Cross Validated answer for more info.
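
A rough sketch of that suggestion, assuming a varimax rotation is what is wanted (the thread does not name a specific rotation). The loadings are pulled out with tidy(type = "coef"), reshaped into a matrix, and passed to stats::varimax(). The choice of two components and the object names are illustrative. Note that this rotates the raw eigenvector loadings; some references scale the loadings by the component standard deviations before rotating, which gives different results.

```r
library(recipes)
library(tidyr)  # for pivot_wider()

rec <- recipe(~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2)

# One row per variable, one column per principal component
loadings_wide <- prep(rec) %>%
    tidy(number = 2, type = "coef") %>%
    pivot_wider(names_from = component, values_from = value, id_cols = terms)

# Keep the first two components as a plain numeric matrix
L <- as.matrix(loadings_wide[, c("PC1", "PC2")])
rownames(L) <- loadings_wide$terms

# varimax() returns the rotated loadings and the rotation matrix it applied
rot <- stats::varimax(L)
rot$loadings
rot$rotmat
```

The rotated loadings live outside the recipe, so scores on the rotated axes would have to be computed by hand, for example by multiplying the baked PC1/PC2 columns by rot$rotmat.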


I have another question.
I'm new to tidymodels, but the step_pca() arguments such as num_comp or threshold do not seem to take effect when the recipe is trained. In the example below, I still get 4 components despite setting num_comp = 2.

library(recipes)
library(tidyr)  # for pivot_wider()

rec <- recipe(~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2)

prep(rec) %>%
    tidy(number = 2, type = "coef") %>%
    pivot_wider(names_from = component, values_from = value, id_cols = terms)

@juliasilge
Owner

@Kasramhdz The full PCA is determined (so you can still compute the variances of each term) and num_comp specifies how many of the components are retained as predictors. If you want to specify the maximal rank, you can pass that through options:

library(recipes)
#> Loading required package: dplyr
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union
#> 
#> Attaching package: 'recipes'
#> The following object is masked from 'package:stats':
#> 
#>     step
rec <- recipe( ~ ., data = USArrests) %>%
    step_normalize(all_numeric()) %>%
    step_pca(all_numeric(), num_comp = 2, options = list(rank. = 2))

prep(rec) %>% tidy(number = 2, type = "coef")
#> # A tibble: 8 × 4
#>   terms     value component id       
#>   <chr>     <dbl> <chr>     <chr>    
#> 1 Murder   -0.536 PC1       pca_T11OM
#> 2 Assault  -0.583 PC1       pca_T11OM
#> 3 UrbanPop -0.278 PC1       pca_T11OM
#> 4 Rape     -0.543 PC1       pca_T11OM
#> 5 Murder    0.418 PC2       pca_T11OM
#> 6 Assault   0.188 PC2       pca_T11OM
#> 7 UrbanPop -0.873 PC2       pca_T11OM
#> 8 Rape     -0.167 PC2       pca_T11OM

Created on 2022-01-12 by the reprex package (v2.0.1)

You could also control this via the tol argument.
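
To illustrate what tol does, here is a small sketch that calls stats::prcomp() directly (the function step_pca() uses under the hood, per the earlier comment). The value tol = 0.5 is an arbitrary choice for illustration. Passing tol through the options argument of step_pca() is presumably the route meant above, but that combination is not tested here.

```r
# Standardize first, mirroring step_normalize() in the recipe
x <- scale(USArrests)

pca_full <- prcomp(x)
pca_tol  <- prcomp(x, tol = 0.5)

# Standard deviation of each component relative to the first one;
# prcomp() drops components whose ratio is at or below `tol`
rel_sdev <- pca_full$sdev / pca_full$sdev[1]
round(rel_sdev, 2)

# Number of components kept in each fit
ncol(pca_full$rotation)
ncol(pca_tol$rotation)
```

Unlike rank., which fixes the number of components up front, tol lets the data decide how many components clear the threshold.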

6 participants