DIABLO does not account for differently ordered variables in test set #192

Ning-L · 2022-03-16T16:22:44Z

Hi mixOmics team,

Thank you for your hard work on this great package!

I use the DIABLO pipeline for my multi-omics analysis. After building my final model with multiblock sPLS-DA, I want to use it to predict new samples.

When using the predict function, I got this error message: Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'. However, all the variables are present in the new data set.

After checking the source code, I found that this was due to the order of variables in one block of my new data list, which was not exactly the same as in the training data set. Once I reordered the variables to match the training data set, everything worked.

I think the key point is just to ensure that all variables from the trained model are present in the new data; the order doesn't matter. So checking with all.equal is not appropriate here:

mixOmics/R/predict.R

Line 346 in 2b6ab06

if(all.equal(lapply(newdata,colnames),lapply(X,colnames))!=TRUE)

I suggest replacing the if statement with the following code:

if (any(unlist(lapply(seq_along(X), function(i) length(setdiff(colnames(X[[i]]), colnames(newdata[[i]]))) > 0))))
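
To illustrate the difference between the two checks, here is a minimal base R sketch on a toy block. The objects `X` and `newdata` below are made-up stand-ins for the training and test lists, not the real mixOmics internals, and `vapply` is used in place of `unlist(lapply(...))` purely as a style choice.

```r
# Toy blocks: same variables, different column order in the new data
X <- list(mirna = matrix(0, nrow = 2, ncol = 3,
                         dimnames = list(NULL, c("a", "b", "c"))))
newdata <- list(mirna = X$mirna[, c("c", "a", "b"), drop = FALSE])

# Order-sensitive comparison, in the spirit of the current all.equal check
order_check <- isTRUE(all.equal(lapply(newdata, colnames), lapply(X, colnames)))

# Order-insensitive check: is any training variable missing from newdata?
missing_vars <- any(vapply(seq_along(X), function(i) {
  length(setdiff(colnames(X[[i]]), colnames(newdata[[i]]))) > 0
}, logical(1)))

order_check   # FALSE: the names differ position by position
missing_vars  # FALSE: every training variable is present
```

In this toy case the all.equal comparison fails even though no variable is missing, which is the situation described above.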

Best, Lijiao


Ning-L · 2022-03-17T18:38:05Z

Actually, I found that the order of variables does matter here, because a later step scales the new data based on the training data, as follows:

mixOmics/R/predict.R

Lines 385 to 389 in 2b6ab06

if (!is.null(attr(X[[1]], "scaled:center")))
    newdata[which(!is.na(ind.match))] = lapply(which(!is.na(ind.match)), function(x){sweep(newdata[[x]], 2, STATS = attr(X[[x]], "scaled:center"))})
if (scale)
    newdata[which(!is.na(ind.match))] = lapply(which(!is.na(ind.match)), function(x){sweep(newdata[[x]], 2, FUN = "/", STATS = attr(X[[x]], "scaled:scale"))})

So if all variables are present in the new data, just in a different order than in the training set, I think we can simply add a step to sort them, as in 32e9ac6
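
The following base R sketch shows why the order matters for this step. It uses made-up toy matrices (`X_train`, `new_block`) and is not the code from the commit. The point is that `sweep()` applies `STATS` by position, not by column name.

```r
# Toy training block, centred as scale() would do
X_train <- scale(matrix(c(1, 3, 10, 30), nrow = 2,
                        dimnames = list(NULL, c("a", "b"))),
                 center = TRUE, scale = FALSE)
centers <- attr(X_train, "scaled:center")  # a = 2, b = 20

# New sample with the same variables in reversed column order
new_block <- matrix(c(20, 2), nrow = 1, dimnames = list(NULL, c("b", "a")))

# sweep() matches STATS by position, so each column gets the wrong centre
wrong <- sweep(new_block, 2, STATS = centers)

# Reordering by the training column names first gives the intended result
aligned <- new_block[, colnames(X_train), drop = FALSE]
right <- sweep(aligned, 2, STATS = centers)

wrong  # b = 18, a = -18
right  # a = 0,  b = 0
```

Without the reordering step, the new sample would be centred silently with the wrong values, so relaxing the check alone would not be enough.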

Max-Bladen · 2022-03-20T23:40:54Z

For consistency, using the template to describe the bug

🐞 Describe the bug:

When using the predict() function on a DIABLO model, if one or more of the test dataframes is supplied with a variable order that differs from that of the equivalent training dataframe, the following error is raised:

Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'

While the order is important for the algorithm, having differing orders should not prevent the method from running.

🔍 reprex results from reproducible example including sessioninfo():

suppressMessages(library(mixOmics))

data(breast.TCGA) # load in the data

# extract data
X.train = list(mirna = breast.TCGA$data.train$mirna,
               mrna = breast.TCGA$data.train$mrna)

X.test = list(mirna = breast.TCGA$data.test$mirna,
              mrna = breast.TCGA$data.test$mrna)

Y.train = breast.TCGA$data.train$subtype

# use optimal values from the case study on mixOmics.org
optimal.ncomp = 2
optimal.keepX = list(mirna = c(10,5),
                     mrna = c(26, 16))

# set design matrix
design = matrix(0.1, ncol = length(X.train), nrow = length(X.train),
                dimnames = list(names(X.train), names(X.train)))
diag(design) = 0

# generate model
final.diablo.model = block.splsda(X = X.train, Y = Y.train, ncomp = optimal.ncomp, # set the optimised DIABLO model
                                  keepX = optimal.keepX, design = design)
#> Design matrix has changed to include Y; each block will be
#>             linked to Y.


# create new test data with one dataframe being reordered
new.var.order = sample(1:dim(X.test$mirna)[2])
X.test.dup <- X.test
X.test.dup$mirna <- X.test.dup$mirna[, new.var.order]

predict.diablo = predict(final.diablo.model, newdata = X.test)

predict.diablo.reordered = predict(final.diablo.model, newdata = X.test.dup)
#> Error in predict.block.spls(final.diablo.model, newdata = X.test.dup): Each 'newdata[[i]]' must include all the variables of 'object$X[[i]]'

Created on 2022-03-21 by the reprex package (v2.0.1)

Session info

sessioninfo::session_info()
#> - Session info ---------------------------------------------------------------
#>  setting  value
#>  version  R version 4.1.2 Patched (2021-11-16 r81220)
#>  os       Windows 10 x64 (build 19044)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  English_Australia.1252
#>  ctype    English_Australia.1252
#>  tz       Australia/Sydney
#>  date     2022-03-21
#>  pandoc   2.14.2 @ C:/Users/Work/AppData/Local/Pandoc/ (via rmarkdown)
#> 
#> - Packages -------------------------------------------------------------------
#>  package      * version date (UTC) lib source
#>  assertthat     0.2.1   2019-03-21 [1] CRAN (R 4.1.3)
#>  BiocParallel   1.28.3  2021-12-09 [1] Bioconductor
#>  cli            3.2.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  colorspace     2.0-3   2022-02-21 [1] CRAN (R 4.1.2)
#>  corpcor        1.6.10  2021-09-16 [1] CRAN (R 4.1.1)
#>  crayon         1.5.0   2022-02-14 [1] CRAN (R 4.1.2)
#>  DBI            1.1.2   2021-12-20 [1] CRAN (R 4.1.3)
#>  digest         0.6.29  2021-12-01 [1] CRAN (R 4.1.2)
#>  dplyr          1.0.8   2022-02-08 [1] CRAN (R 4.1.2)
#>  ellipse        0.4.2   2020-05-27 [1] CRAN (R 4.1.2)
#>  ellipsis       0.3.2   2021-04-29 [1] CRAN (R 4.1.2)
#>  evaluate       0.15    2022-02-18 [1] CRAN (R 4.1.2)
#>  fansi          1.0.2   2022-01-14 [1] CRAN (R 4.1.2)
#>  fastmap        1.1.0   2021-01-25 [1] CRAN (R 4.1.2)
#>  fs             1.5.2   2021-12-08 [1] CRAN (R 4.1.2)
#>  generics       0.1.2   2022-01-31 [1] CRAN (R 4.1.2)
#>  ggplot2      * 3.3.5   2021-06-25 [1] CRAN (R 4.1.2)
#>  ggrepel        0.9.1   2021-01-15 [1] CRAN (R 4.1.2)
#>  glue           1.6.2   2022-02-24 [1] CRAN (R 4.1.2)
#>  gridExtra      2.3     2017-09-09 [1] CRAN (R 4.1.2)
#>  gtable         0.3.0   2019-03-25 [1] CRAN (R 4.1.2)
#>  highr          0.9     2021-04-16 [1] CRAN (R 4.1.2)
#>  htmltools      0.5.2   2021-08-25 [1] CRAN (R 4.1.2)
#>  igraph         1.2.11  2022-01-04 [1] CRAN (R 4.1.2)
#>  knitr          1.37    2021-12-16 [1] CRAN (R 4.1.2)
#>  lattice      * 0.20-45 2021-09-22 [2] CRAN (R 4.1.2)
#>  lifecycle      1.0.1   2021-09-24 [1] CRAN (R 4.1.2)
#>  magrittr       2.0.2   2022-01-26 [1] CRAN (R 4.1.2)
#>  MASS         * 7.3-54  2021-05-03 [2] CRAN (R 4.1.2)
#>  Matrix         1.3-4   2021-06-01 [2] CRAN (R 4.1.2)
#>  matrixStats    0.61.0  2021-09-17 [1] CRAN (R 4.1.2)
#>  mixOmics     * 6.18.1  2021-11-18 [1] Bioconductor (R 4.1.2)
#>  munsell        0.5.0   2018-06-12 [1] CRAN (R 4.1.2)
#>  pillar         1.7.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  pkgconfig      2.0.3   2019-09-22 [1] CRAN (R 4.1.2)
#>  plyr           1.8.6   2020-03-03 [1] CRAN (R 4.1.2)
#>  purrr          0.3.4   2020-04-17 [1] CRAN (R 4.1.2)
#>  R.cache        0.15.0  2021-04-30 [1] CRAN (R 4.1.2)
#>  R.methodsS3    1.8.1   2020-08-26 [1] CRAN (R 4.1.1)
#>  R.oo           1.24.0  2020-08-26 [1] CRAN (R 4.1.1)
#>  R.utils        2.11.0  2021-09-26 [1] CRAN (R 4.1.2)
#>  R6             2.5.1   2021-08-19 [1] CRAN (R 4.1.2)
#>  rARPACK        0.11-0  2016-03-10 [1] CRAN (R 4.1.2)
#>  RColorBrewer   1.1-2   2014-12-07 [1] CRAN (R 4.1.1)
#>  Rcpp           1.0.8.2 2022-03-11 [1] CRAN (R 4.1.2)
#>  reprex         2.0.1   2021-08-05 [1] CRAN (R 4.1.2)
#>  reshape2       1.4.4   2020-04-09 [1] CRAN (R 4.1.2)
#>  rlang          1.0.2   2022-03-04 [1] CRAN (R 4.1.3)
#>  rmarkdown      2.13    2022-03-10 [1] CRAN (R 4.1.3)
#>  RSpectra       0.16-0  2019-12-01 [1] CRAN (R 4.1.2)
#>  rstudioapi     0.13    2020-11-12 [1] CRAN (R 4.1.2)
#>  scales         1.1.1   2020-05-11 [1] CRAN (R 4.1.2)
#>  sessioninfo    1.2.2   2021-12-06 [1] CRAN (R 4.1.2)
#>  stringi        1.7.6   2021-11-29 [1] CRAN (R 4.1.2)
#>  stringr        1.4.0   2019-02-10 [1] CRAN (R 4.1.2)
#>  styler         1.7.0   2022-03-13 [1] CRAN (R 4.1.2)
#>  tibble         3.1.6   2021-11-07 [1] CRAN (R 4.1.2)
#>  tidyr          1.2.0   2022-02-01 [1] CRAN (R 4.1.2)
#>  tidyselect     1.1.2   2022-02-21 [1] CRAN (R 4.1.2)
#>  utf8           1.2.2   2021-07-24 [1] CRAN (R 4.1.2)
#>  vctrs          0.3.8   2021-04-29 [1] CRAN (R 4.1.2)
#>  withr          2.5.0   2022-03-03 [1] CRAN (R 4.1.2)
#>  xfun           0.30    2022-03-02 [1] CRAN (R 4.1.2)
#>  yaml           2.3.5   2022-02-21 [1] CRAN (R 4.1.2)
#> 
#>  [1] C:/Users/Work/Documents/R/win-library/4.1
#>  [2] C:/Program Files/R/R-4.1.2patched/library
#> 
#> ------------------------------------------------------------------------------

🤔 Expected behavior:

No error should be raised. The predict() function should handle this case and produce predictions.

💡 Possible solution:

Sorting the test dataframe so that its variable order matches that of the training dataframe
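
As a rough illustration of this idea, the sketch below defines a hypothetical helper, `align_blocks()`, which is not part of mixOmics and is not the fix that was merged. It assumes both arguments are named lists of matrices or data frames with column names. It reorders each test block to the training column order and stops if a training variable is missing.

```r
# Hypothetical helper (not part of mixOmics): reorder each block of newdata
# to the column order of the matching training block.
align_blocks <- function(newdata, X) {
  stopifnot(!is.null(names(newdata)), all(names(newdata) %in% names(X)))
  lapply(stats::setNames(names(newdata), names(newdata)), function(block) {
    missing_vars <- setdiff(colnames(X[[block]]), colnames(newdata[[block]]))
    if (length(missing_vars) > 0) {
      stop("Block '", block, "' is missing variables: ",
           paste(missing_vars, collapse = ", "))
    }
    # Select by name, which both reorders and drops any extra columns
    newdata[[block]][, colnames(X[[block]]), drop = FALSE]
  })
}

# Toy usage with made-up data
X <- list(mirna = matrix(1:4, 2, dimnames = list(NULL, c("a", "b"))))
newdata <- list(mirna = matrix(5:8, 2, dimnames = list(NULL, c("b", "a"))))
aligned <- align_blocks(newdata, X)
colnames(aligned$mirna)
```

With the reprex above, a user-side workaround along these lines might be `predict(final.diablo.model, newdata = align_blocks(X.test.dup, final.diablo.model$X))`. This is untested here and assumes the training blocks are stored under `$X`, as the error message suggests.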

fix: predict function has updated error messages for when feature sets are different or in different order

test: added test which catches the two next error messages that can be returned


Ning-L mentioned this issue Mar 16, 2022

Fix variable existence check for Issue #192 #193

Closed

Max-Bladen changed the title ~~The if statement is not appropriate~~ DIABLO does not account for differently ordered variables in test set Mar 20, 2022

Max-Bladen self-assigned this Mar 20, 2022

Max-Bladen added the bug and wip labels Mar 20, 2022

Max-Bladen mentioned this issue Mar 20, 2022

Fix for Issue #192 #194

Merged

This was linked to pull requests Mar 20, 2022

Fix for Issue #192 #194

Merged

Fix for Issue #122 #195

Closed

Max-Bladen removed a link to a pull request Mar 21, 2022

Fix for Issue #122 #195

Closed

Max-Bladen removed the wip label Mar 21, 2022

Max-Bladen added a commit that referenced this issue Apr 25, 2022

Updated Fix for Issue #192

680cdf5

fix: predict function has updated error messages for when feature sets are different or in different order

Max-Bladen added a commit that referenced this issue Apr 25, 2022

Updated Fix for #192

1c76d5c

test: added test which catches the two next error messages that can be returned

Max-Bladen added the ready-to-review label Sep 7, 2022

Max-Bladen closed this as completed in #194 Sep 15, 2022

Max-Bladen added a commit that referenced this issue Sep 15, 2022

Fix for Issue #192 (#194)

2a828e3

fix: predict function has updated error messages for when feature sets are different or in different order

Max-Bladen removed the ready-to-review label Sep 22, 2022
