DIABLO does not account for differently ordered variables in test set #192
Actually, I found that the order of variables does matter here, because a later step scales the new data using the training data, as follows: Lines 385 to 389 in 2b6ab06
So if all variables are present in the new data, just in a different order than in the training set, I think we can simply add a step to sort them, as in 32e9ac6.
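To illustrate why the order matters for the scaling step, here is a minimal sketch on toy data. It assumes the training means and standard deviations are applied by column position, as base R's `scale()` does; the exact code in the referenced lines is not reproduced here, and the matrices below are hypothetical.

```r
# Toy training matrix with two named variables (hypothetical data)
train <- matrix(c(1, 2, 3, 10, 20, 30), ncol = 2,
                dimnames = list(NULL, c("a", "b")))
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")  # per-variable training means
sds <- attr(train_scaled, "scaled:scale")       # per-variable training SDs

# New sample with the same variables in a different column order
new <- matrix(c(20, 2), ncol = 2, dimnames = list(NULL, c("b", "a")))

# scale() matches centers and SDs by position, not by name, so variable "b"
# is scaled with the parameters of "a" and vice versa
wrong <- scale(new, center = centers, scale = sds)

# Reordering the columns by name first gives the intended result
new_sorted <- new[, colnames(train), drop = FALSE]
right <- scale(new_sorted, center = centers, scale = sds)
```

Here the new sample sits exactly at the training means, so `right` is zero for both variables, while `wrong` is not.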
For consistency, using the template to describe the bug:
🐞 Describe the bug: When using the
While the order is important for the algorithm, having differing orders should not prevent the method from running.
🔍 reprex results from reproducible example including sessioninfo():
suppressMessages(library(mixOmics))
data(breast.TCGA) # load in the data
# extract data
X.train = list(mirna = breast.TCGA$data.train$mirna,
mrna = breast.TCGA$data.train$mrna)
X.test = list(mirna = breast.TCGA$data.test$mirna,
mrna = breast.TCGA$data.test$mrna)
Y.train = breast.TCGA$data.train$subtype
# use optimal values from the case study on mixOmics.org
optimal.ncomp = 2
optimal.keepX = list(mirna = c(10,5),
mrna = c(26, 16))
# set design matrix
design = matrix(0.1, ncol = length(X.train), nrow = length(X.train),
dimnames = list(names(X.train), names(X.train)))
diag(design) = 0
# generate model
final.diablo.model = block.splsda(X = X.train, Y = Y.train, ncomp = optimal.ncomp, # set the optimised DIABLO model
keepX = optimal.keepX, design = design)
#> Design matrix has changed to include Y; each block will be
#> linked to Y.
# create new test data with one dataframe being reordered
new.var.order = sample(1:dim(X.test$mirna)[2])
X.test.dup <- X.test
X.test.dup$mirna <- X.test.dup$mirna[, new.var.order]
predict.diablo = predict(final.diablo.model, newdata = X.test)
predict.diablo.reordered = predict(final.diablo.model, newdata = X.test.dup)
#> Error in predict.block.spls(final.diablo.model, newdata = X.test.dup): Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'
Created on 2022-03-21 by the reprex package (v2.0.1)
Session info
sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#> setting value
#> version R version 4.1.2 Patched (2021-11-16 r81220)
#> os Windows 10 x64 (build 19044)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate English_Australia.1252
#> ctype English_Australia.1252
#> tz Australia/Sydney
#> date 2022-03-21
#> pandoc 2.14.2 @ C:/Users/Work/AppData/Local/Pandoc/ (via rmarkdown)
#>
#> - Packages -------------------------------------------------------------------
#> package * version date (UTC) lib source
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.1.3)
#> BiocParallel 1.28.3 2021-12-09 [1] Bioconductor
#> cli 3.2.0 2022-02-14 [1] CRAN (R 4.1.2)
#> colorspace 2.0-3 2022-02-21 [1] CRAN (R 4.1.2)
#> corpcor 1.6.10 2021-09-16 [1] CRAN (R 4.1.1)
#> crayon 1.5.0 2022-02-14 [1] CRAN (R 4.1.2)
#> DBI 1.1.2 2021-12-20 [1] CRAN (R 4.1.3)
#> digest 0.6.29 2021-12-01 [1] CRAN (R 4.1.2)
#> dplyr 1.0.8 2022-02-08 [1] CRAN (R 4.1.2)
#> ellipse 0.4.2 2020-05-27 [1] CRAN (R 4.1.2)
#> ellipsis 0.3.2 2021-04-29 [1] CRAN (R 4.1.2)
#> evaluate 0.15 2022-02-18 [1] CRAN (R 4.1.2)
#> fansi 1.0.2 2022-01-14 [1] CRAN (R 4.1.2)
#> fastmap 1.1.0 2021-01-25 [1] CRAN (R 4.1.2)
#> fs 1.5.2 2021-12-08 [1] CRAN (R 4.1.2)
#> generics 0.1.2 2022-01-31 [1] CRAN (R 4.1.2)
#> ggplot2 * 3.3.5 2021-06-25 [1] CRAN (R 4.1.2)
#> ggrepel 0.9.1 2021-01-15 [1] CRAN (R 4.1.2)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.1.2)
#> gridExtra 2.3 2017-09-09 [1] CRAN (R 4.1.2)
#> gtable 0.3.0 2019-03-25 [1] CRAN (R 4.1.2)
#> highr 0.9 2021-04-16 [1] CRAN (R 4.1.2)
#> htmltools 0.5.2 2021-08-25 [1] CRAN (R 4.1.2)
#> igraph 1.2.11 2022-01-04 [1] CRAN (R 4.1.2)
#> knitr 1.37 2021-12-16 [1] CRAN (R 4.1.2)
#> lattice * 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#> lifecycle 1.0.1 2021-09-24 [1] CRAN (R 4.1.2)
#> magrittr 2.0.2 2022-01-26 [1] CRAN (R 4.1.2)
#> MASS * 7.3-54 2021-05-03 [2] CRAN (R 4.1.2)
#> Matrix 1.3-4 2021-06-01 [2] CRAN (R 4.1.2)
#> matrixStats 0.61.0 2021-09-17 [1] CRAN (R 4.1.2)
#> mixOmics * 6.18.1 2021-11-18 [1] Bioconductor (R 4.1.2)
#> munsell 0.5.0 2018-06-12 [1] CRAN (R 4.1.2)
#> pillar 1.7.0 2022-02-01 [1] CRAN (R 4.1.2)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.1.2)
#> plyr 1.8.6 2020-03-03 [1] CRAN (R 4.1.2)
#> purrr 0.3.4 2020-04-17 [1] CRAN (R 4.1.2)
#> R.cache 0.15.0 2021-04-30 [1] CRAN (R 4.1.2)
#> R.methodsS3 1.8.1 2020-08-26 [1] CRAN (R 4.1.1)
#> R.oo 1.24.0 2020-08-26 [1] CRAN (R 4.1.1)
#> R.utils 2.11.0 2021-09-26 [1] CRAN (R 4.1.2)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.1.2)
#> rARPACK 0.11-0 2016-03-10 [1] CRAN (R 4.1.2)
#> RColorBrewer 1.1-2 2014-12-07 [1] CRAN (R 4.1.1)
#> Rcpp 1.0.8.2 2022-03-11 [1] CRAN (R 4.1.2)
#> reprex 2.0.1 2021-08-05 [1] CRAN (R 4.1.2)
#> reshape2 1.4.4 2020-04-09 [1] CRAN (R 4.1.2)
#> rlang 1.0.2 2022-03-04 [1] CRAN (R 4.1.3)
#> rmarkdown 2.13 2022-03-10 [1] CRAN (R 4.1.3)
#> RSpectra 0.16-0 2019-12-01 [1] CRAN (R 4.1.2)
#> rstudioapi 0.13 2020-11-12 [1] CRAN (R 4.1.2)
#> scales 1.1.1 2020-05-11 [1] CRAN (R 4.1.2)
#> sessioninfo 1.2.2 2021-12-06 [1] CRAN (R 4.1.2)
#> stringi 1.7.6 2021-11-29 [1] CRAN (R 4.1.2)
#> stringr 1.4.0 2019-02-10 [1] CRAN (R 4.1.2)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.1.2)
#> tibble 3.1.6 2021-11-07 [1] CRAN (R 4.1.2)
#> tidyr 1.2.0 2022-02-01 [1] CRAN (R 4.1.2)
#> tidyselect 1.1.2 2022-02-21 [1] CRAN (R 4.1.2)
#> utf8 1.2.2 2021-07-24 [1] CRAN (R 4.1.2)
#> vctrs 0.3.8 2021-04-29 [1] CRAN (R 4.1.2)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.1.2)
#> xfun 0.30 2022-03-02 [1] CRAN (R 4.1.2)
#> yaml 2.3.5 2022-02-21 [1] CRAN (R 4.1.2)
#>
#> [1] C:/Users/Work/Documents/R/win-library/4.1
#> [2] C:/Program Files/R/R-4.1.2patched/library
#>
#> ------------------------------------------------------------------------------
🤔 Expected behavior: Error should not be raised.
💡 Possible solution: Sort the test dataframe so that its variable order matches the training dataframe.
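As a user-side workaround along the lines of the possible solution above, the blocks of the new data can be reordered before calling predict. The following is only a sketch: `align_blocks` is a hypothetical helper, not a mixOmics function, and the toy blocks stand in for `X.train` and `X.test.dup` from the reprex.

```r
# Hypothetical helper (not part of mixOmics): reorder each block of newdata
# so its columns follow the order of the matching training block.
align_blocks <- function(newdata, train) {
  Map(function(new_block, train_block) {
    missing_vars <- setdiff(colnames(train_block), colnames(new_block))
    if (length(missing_vars) > 0) {
      stop("newdata is missing variables: ",
           paste(missing_vars, collapse = ", "))
    }
    # Select training variables by name, in training order
    new_block[, colnames(train_block), drop = FALSE]
  }, newdata[names(train)], train)
}

# Toy blocks standing in for X.train and X.test.dup
train <- list(
  mirna = matrix(1:6, nrow = 2, dimnames = list(NULL, c("m1", "m2", "m3"))),
  mrna  = matrix(1:4, nrow = 2, dimnames = list(NULL, c("g1", "g2")))
)
shuffled <- list(
  mirna = train$mirna[, c("m3", "m1", "m2")],
  mrna  = train$mrna
)
aligned <- align_blocks(shuffled, train)
```

With the reprex objects, the equivalent call would presumably be `predict(final.diablo.model, newdata = align_blocks(X.test.dup, X.train))`. The helper still stops with an error when a training variable is genuinely absent, which is the case the original check is meant to catch.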
fix: predict function has updated error messages for when feature sets are different or in different order
test: added test which catches the two next error messages that can be returned
Hi mixOmics team,
Thank you for your hard work on this great package!
I use the DIABLO pipeline to perform my multi-omics analysis. After building my final model using the multiblock sPLS-DA, I want to predict new samples with it.
When using the predict function, I got this error message:
Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'
However, all the variables are present in the new data set. After checking the source code, I found the error was due to the order of variables in one block of my new data list, which is not exactly the same as in the training data set. Once I reordered the variables to match the training data set, everything worked.
I think the key point is just to ensure that all variables from the trained model are present in the new data; their order should not matter. So the check using all.equal is not appropriate here:
mixOmics/R/predict.R
Line 346 in 2b6ab06
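The difference between an order-sensitive comparison and a membership check can be seen on plain character vectors. This is a sketch with made-up variable names; it does not reproduce the actual condition at the referenced line.

```r
# Hypothetical variable names for one block
train_vars <- c("g1", "g2", "g3")
new_vars <- c("g3", "g1", "g2")  # same variables, different order

# all.equal() compares element by element, so a reordering counts as a mismatch
order_sensitive <- isTRUE(all.equal(train_vars, new_vars))

# A membership test only asks whether every training variable is present
order_insensitive <- all(train_vars %in% new_vars)
```

In this example `order_sensitive` is `FALSE` and `order_insensitive` is `TRUE`. A membership check alone is not enough, though: the columns would still need to be reordered to the training order before any position-based step such as scaling.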
I suggest replacing the if statement with the following code:
Best, Lijiao