Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PCA looks different on current R #74

Closed
sneumann opened this issue Feb 14, 2024 · 3 comments
Closed

PCA looks different on current R #74

sneumann opened this issue Feb 14, 2024 · 3 comments

Comments

@sneumann
Copy link
Member

Hi,
when switching from R underneath MetFamily from R-3.6.3 to R-4.X, the PCA looks different:

R-3.6.3 gives:
PCA363

but R-4.X gives something different: (both 4.0.5 and 4.3.2 give identical result)
PCA405

@khabatv identified (thanks!) that the function has switched the default method (SVD, EVD), so it should help to explicitly specify the method in

switch(ms1AnalysisMethod,

Yours,
Steffen

@sneumann
Copy link
Member Author

sneumann commented Mar 1, 2024

Hi, to document the progress of catching this, it is not the actual calculation of the PCA that causes the difference.
Turns out that already the input data to the PCA in performPca() differs:
Default calculation is with

pca <- pcaMethods::pca(object = dataFrame2, method = "svd", nPcs = numberOfComponents, scale = "none", center = FALSE)

So for 3.6.3 we get

> load("dataFrame2.Rdata") ; dim(dataFrame2) ; summary(dataFrame2[,1]) ; summary(t(dataFrame2[1,]))
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-132.66 -132.42  -34.24    0.00   89.57  235.44 
     TRI03        
 Min.   : -49.16,  1st Qu.: 183.76  ,  Median : 249.74  ,  Mean   : 350.71  ,  3rd Qu.: 379.34  ,  Max.   :2156.67 

And for 4.3.2 we get for the same input

> load("dataFrame2.Rdata") ; dim(dataFrame2) ; summary(dataFrame2[,1]) ; t(summary(t(dataFrame2[1,])))
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-19.075 -18.563  -6.375   0.000  11.204  37.388                                                                              
    TRI03 
Min.   :-30.62   1st Qu.: 12.71   Median : 32.24   Mean   : 23.58  ,  3rd Qu.: 37.67   Max.   : 55.70  

And already the input data to processing is different:

preprocessDataForPca <- function(dataFrame, scaling, logTransform){

3.6.3:

load("dataFrame.Rdata") ; dim(dataFrame) ; summary(dataFrame[,1]) ; summary(dataFrame[1,])
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  314.0   352.2 15674.1 21016.8 34995.0 57758.7 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  12764   39359   64776  257063  160314 5032056 

4.3.2:

> load("dataFrame.Rdata") ; dim(dataFrame) ; summary(dataFrame[,1]) ; summary(dataFrame[1,])
[1]   6 219
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  435.0   446.8   726.5   872.8  1130.0  1731.0 
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   10.0   752.5  1357.0  1235.9  1718.5  2116.0 

The data is originally loaded into the dataList$dataFrameMeasurements and differs already there:
3.6.3

> load("filtering.Rdata") ; dim(dataList$dataFrameMeasurements) ; summary(t(dataList$dataFrameMeasurements[1,1:6])); summary(dataList$dataFrameMeasurements[,1]); 
[1] 2585   13
     85.005 /   7.42
 Min.   :2930        1st Qu.:3333        Median :4718        Mean   :5263        3rd Qu.:6252        Max.   :9578          Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 

4.3.2

load("filtering.Rdata") ; dim(dataList$dataFrameMeasurements) ; summary(t(dataList$dataFrameMeasurements[1,1:6])); summary(dataList$dataFrameMeasurements[,1]); 
[1] 2585   13
     85.005 /   7.42
 Min.   : 411.0      1st Qu.: 529.8      Median : 843.0      Mean   : 992.5      3rd Qu.:1478.8      Max.   :1742.0     
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
    1.0   357.0   915.0   945.3  1501.0  2116.0 

The difference is already present in processMS1data()

processMS1data <- function(

where the input parameter metaboliteProfile is identical, but the resulting matrixDataFrame() differs:
matrixDataFrame <- data.matrix(dataFrameMeasurements)

It seems we are getting (vastly!) different results depending on how we convert from string input to numbers:

summary(as.numeric(metaboliteProfile[, "TRI03"]))
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 
Browse[2]> summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE]))
     TRI03       
 Min.   :   1.0  
 1st Qu.: 357.0  
 Median : 915.0  
 Mean   : 945.3  
 3rd Qu.:1501.0  
 Max.   :2116.0  

@sneumann
Copy link
Member Author

sneumann commented Mar 1, 2024

So, starting a new comment, as we don't need the code in the MetFamily shiny app anymore, and not even the MetFamily container. The issue is already in the rocker/shiny base containers.
There is different behaviour between the R versions (and shiny dependency packages installed) in 3.6.3 and 4.3.2.

The minimum reproducible example needs the attached Rdata file (do not uncompress! The .gz ending is only to make GitHub happy !) in the current directory
metaboliteProfileOrig.Rdata.gz

And run it in the shiny containers with different base versions:

sneumann@msbi-corei:~/$ docker run -it --rm --entrypoint=/usr/local/bin/R -v $PWD:/out rocker/shiny:3.6.3 -e 'load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()'

R version 3.6.3 (2020-02-29) -- "Holding the Windsock"
Copyright (C) 2020 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 
     TRI03        
 Min.   :      0  
 1st Qu.:   1630  
 Median :   3342  
 Mean   :  27685  
 3rd Qu.:   8470  
 Max.   :5032056  
R version 3.6.3 (2020-02-29)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Debian GNU/Linux 10 (buster)

Matrix products: default
BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.3.5.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C             
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_3.6.3
> 

And the broken 4.3.2:

sneumann@msbi-corei:~/$ docker run -it --rm --entrypoint=/usr/local/bin/R -v $PWD:/out rocker/shiny:4.3.2 -e 'load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()'

R version 4.3.2 (2023-10-31) -- "Eye Holes"
Copyright (C) 2023 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)

R is free software and comes with ABSOLUTELY NO WARRANTY.
You are welcome to redistribute it under certain conditions.
Type 'license()' or 'licence()' for distribution details.

  Natural language support but running in an English locale

R is a collaborative project with many contributors.
Type 'contributors()' for more information and
'citation()' on how to cite R or R packages in publications.

Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.

> load("/out/metaboliteProfileOrig.Rdata.gz"); summary(as.numeric(metaboliteProfile[, "TRI03"])) ; summary(data.matrix(metaboliteProfile[, "TRI03", drop = FALSE])); sessionInfo()
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
      0    1630    3342   27685    8470 5032056 
     TRI03       
 Min.   :   1.0  
 1st Qu.: 357.0  
 Median : 915.0  
 Mean   : 945.3  
 3rd Qu.:1501.0  
 Max.   :2116.0  
R version 4.3.2 (2023-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 22.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/openblas-pthread/libblas.so.3 
LAPACK: /usr/lib/x86_64-linux-gnu/openblas-pthread/libopenblasp-r0.3.20.so;  LAPACK version 3.10.0

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

time zone: Etc/UTC
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] compiler_4.3.2

@sneumann
Copy link
Member Author

sneumann commented Mar 1, 2024

So, I pinpointed the changed bevaviour to

https://cran.r-project.org/bin/windows/base/old/4.0.0/NEWS.R-4.0.0.html

NEW FEATURES
[...]
data.matrix() now converts character columns to factors and from this to integers.

=> Now we need to fix MetFamily, occurrences of data.matrix() seem to be limited to ./R/DataProcessing.R

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant