Commit

Add files via upload
wgmao authored Aug 1, 2020
1 parent e2eaccb commit 68b37ac
Showing 4 changed files with 122 additions and 66 deletions.
124 changes: 78 additions & 46 deletions inst/doc/vignette.Rnw
@@ -3,67 +3,99 @@
\begin{document}
\SweaveOpts{concordance=TRUE}

$DataRemix()$ requires three essential inputs.

\begin{itemize}
\item \textbf{svdres}: The SVD output of the gene expression profile, $svd(\textbf{matrix})$. If the matrix is large, the SVD does not need to be computed to full rank, $min(nrow(\textbf{matrix}), ncol(\textbf{matrix}))$, because a full-rank decomposition is computationally intensive.
\item \textbf{matrix}: The gene expression profile with dimension $gene-by-sample$. If \textbf{svdres} is not full-rank, \textbf{matrix} must be included in order to calculate the residual. In general, including \textbf{matrix} makes the residual computation more efficient.
\item \textbf{objective function}: Users have to specify an objective of interest. The objective function takes the remixed data, computed from \textbf{svdres} and \textbf{matrix}, as its input. An objective function will often need further parameters of its own. The following two examples demonstrate how to pass objective-specific parameters through the $DataRemix()$ function.
\end{itemize}
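The partial-SVD case described above can be sketched in base R. This is only an illustration under stated assumptions: the helper $svd\_residual()$ and the toy matrix are hypothetical and not part of DataRemix, and the sketch does not claim to reproduce how $DataRemix()$ computes the residual internally.

```r
# Toy stand-in for a gene-by-sample expression matrix (hypothetical data).
set.seed(1)
X <- matrix(rnorm(200 * 10), nrow = 200, ncol = 10)

# Partial SVD: keep only the first k left and right singular vectors.
# Base R's svd() still returns all singular values in $d.
k <- 3
partial <- svd(X, nu = k, nv = k)

# Hypothetical helper: the part of X not captured by the rank-k approximation.
svd_residual <- function(X, svdres, k) {
  low_rank <- svdres$u[, 1:k, drop = FALSE] %*%
    diag(svdres$d[1:k], nrow = k) %*%
    t(svdres$v[, 1:k, drop = FALSE])
  X - low_rank
}

res <- svd_residual(X, partial, k)

# Sanity check: the squared residual norm equals the sum of the
# discarded squared singular values.
all.equal(sum(res^2), sum(partial$d[-(1:k)]^2))
```

The residual needs the original matrix because the truncated factors alone cannot recover the discarded components, which is why \textbf{matrix} is required when \textbf{svdres} is not full-rank.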

Here we give two examples to illustrate how to run the $DataRemix()$ function. The first example optimizes known pathway recovery based on the GTEx gene expression profile. The second is a toy example where we know the ground truth.


\section{GTEx Correlation Network}
In this section, the objective is to optimize the recovery of known pathways from the GTEx gene expression data. We formally define the objective as the average AUC across pathways, and we also keep track of the average AUPR value. $corMatToAUC()$ is the main objective function with two inputs: $data$ and $GS$. $data$ is the GTEx gene correlation matrix and $GS$ is the pathway matrix. Refer to the $corMatToAUC()$ documentation for more information.
<<echo=TRUE, eval=TRUE>>=
library(DataRemix)
@
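To make the AUC metric concrete, the following minimal sketch computes the AUC for a single pathway from per-gene scores using the Mann-Whitney rank-sum identity. The helper $rank\_auc()$ and the toy inputs are hypothetical. How $corMatToAUC()$ derives per-gene scores from the correlation matrix is not shown here, and this sketch does not claim to match its implementation.

```r
# Hypothetical helper: AUC of one pathway via the Mann-Whitney rank-sum identity.
# scores: one value per gene; labels: 1 if the gene is in the pathway, else 0.
rank_auc <- function(scores, labels) {
  n_pos <- sum(labels == 1)
  n_neg <- sum(labels == 0)
  r <- rank(scores)  # average ranks are used for ties
  (sum(r[labels == 1]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example: three of the four (member, non-member) pairs are ordered correctly.
scores <- c(0.10, 0.40, 0.35, 0.80)
labels <- c(0, 0, 1, 1)
rank_auc(scores, labels)  # 0.75
```

Averaging such per-pathway values over all columns of a pathway matrix would give a mean AUC of the kind used as the objective in this section.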
We first load the data. $GTex\_cc$ is the GTEx gene correlation matrix with dimension 7294-by-7294 and $canonical$ represents the canonical MSigDB pathways with dimension 7294-by-1330. $GTex\_cc$ and $canonical$ correspond to $data$ and $GS$, the inputs to $corMatToAUC()$. In this case, we directly remix the correlation matrix $GTex\_cc$. The alternative is to remix the gene expression profile first and then calculate the correlation matrix, which remixes $GTex\_cc$ indirectly. First we need the SVD of $GTex\_cc$. Since decomposing $GTex\_cc$ takes a long time, we pre-compute its SVD and load it as $GTex\_svdres$.
<<echo=TRUE, eval=TRUE>>=
load(url("https://www.dropbox.com/s/o949wkg76k0ccaw/GTex_cc.rdata?dl=1"))
load(url("https://www.dropbox.com/s/wsuze8w2rp0syqg/GTex_svdres.rdata?dl=1"))
load(url("https://github.com/wgmao/DataRemix/blob/master/inst/extdata/canonical.rdata?raw=true"))
#svdres <- svd(GTex_cc)
@
We first run $corMatToAUC()$ on the un-remixed correlation matrix $GTex\_cc$ to show what $corMatToAUC()$ outputs.
<<echo=TRUE, eval=TRUE>>=
GTex_default <- corMatToAUC(GTex_cc, canonical, objective = "mean.AUC")
GTex_default
@
The first value is the average AUPR across all pathways and the second value is the average AUC across all pathways. This is the default behavior of $corMatToAUC()$. We now infer the optimal combination of $k$, $p$ and $\mu$ using $DataRemix()$. In this case $GS$ is the additional input required by $corMatToAUC()$. Any additional parameter required by the objective, such as $GS$, is simply passed at the end of the $DataRemix()$ argument list.
<<echo=TRUE, eval=TRUE>>=
rownames(GTex_svdres$u) <- rownames(GTex_cc)
rownames(GTex_svdres$v) <- colnames(GTex_cc)
DataRemix.res <- DataRemix(GTex_svdres, GTex_cc, corMatToAUC,
k_limits = c(1, length(GTex_svdres$d)%/%2),
p_limits = c(-1,1), mu_limits = c(1e-12,1),
num_of_initialization = 5, num_of_thompson = 150,
basis = "omega", basis_size = 2000, verbose = F,
GS = canonical)
@
It is highly recommended to assign $rownames$ and $colnames$ to \textbf{svdres}, as done above for its $u$ and $v$ components. The other parameters are explained as follows.
\begin{itemize}
\item \textbf{k\_limits = c(1, length(GTex\_svdres\$d)\%/\%2)}: The upper limit of $k$ is half of the rank, which is 3,647 in this case.
\item \textbf{p\_limits = c(-1,1)}: This is the default range for $p$.
\item \textbf{mu\_limits = c(1e-12,1)}: This is the default range for $\mu$.
\item \textbf{num\_of\_initialization = 5}: Number of initialization steps before Thompson Sampling starts. It doesn't need to be a large number and 5 is the default option.
\item \textbf{num\_of\_thompson = 150}: Number of Thompson Sampling steps. Generally, the best objective value found improves as the number of sampling steps increases.
\item \textbf{basis = "omega"}: The default option is to use the exponential kernel. Gaussian and Laplacian kernels are also available.
\item \textbf{basis\_size = 2000}: As \textbf{basis\_size} increases, the kernel approximation becomes more accurate. 2,000 is a good trade-off in general.
\item \textbf{verbose = F}: If the computation takes a long time to finish, setting \textbf{verbose} to TRUE prints out intermediate results.
\end{itemize}

We can convert the output from $DataRemix()$ into a ranking table, which makes it easy to identify the best combinations of parameters. The $DataRemix\_display()$ parameters are explained below.
\begin{itemize}
\item \textbf{$DataRemix.res$}: This is the output of the previous step.
\item \textbf{$col.names = c("Rank", "k", "p", "mu", "mean AUPR", "mean AUC")$}: The first four values ("Rank", "k", "p", "mu") are fixed. The two additional values ("mean AUPR", "mean AUC") correspond to the output values of the objective function $corMatToAUC()$. These additional values need to be customized based on the objective function in use.
\item \textbf{$top.rank = 15$}: We want to see the 15 best-performing combinations of parameters.
\end{itemize}
<<echo=TRUE, eval=TRUE>>=
DataRemix_display(DataRemix.res, col.names = c("Rank", "k", "p", "mu",
"mean AUPR", "mean AUC"), top.rank = 15)
@
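The idea behind such a ranking table can be sketched in base R: sort the sampled parameter combinations by the objective value and keep the top rows. The matrix $para$ and its values below are hypothetical and are not real DataRemix output; this is not the implementation of $DataRemix\_display()$.

```r
# Hypothetical results: one row per sampled (k, p, mu) with its objective value.
para <- cbind(k     = c(5, 2, 9, 7),
              p     = c(0.5, -0.2, 1.0, 0.1),
              mu    = c(0.10, 0.50, 0.01, 1.00),
              value = c(0.61, 0.58, 0.66, 0.63))

# Sort by the objective, best first, and prepend a rank column.
top <- 3
ord <- order(para[, "value"], decreasing = TRUE)[1:top]
ranking <- cbind(Rank = 1:top, para[ord, , drop = FALSE])
ranking
```

Sorting in decreasing order assumes that larger objective values are better, as is the case for the mean AUC.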


\section{A Toy Example}
In this section, we define a simple objective function called $eval()$ which calculates the sum of a penalty term and the squared error between the DataRemix reconstruction and the original input matrix. This example illustrates how to include additional parameters which are necessary for the customized evaluation function fn() into DataRemix framework. The input matrix is a 100-by-9 matrix with random values. In this case, we know that when k=9,p=1 or $\mu=1$, p=1, DataRemix reconstruction is the same as the orginial matrix and the objective function achieves the minimal value which is qual to the penalty term we add.
<<echo=TRUE>>=
In this section, we define a simple objective function called $eval()$ which calculates the sum of a penalty term and the negative squared error between the DataRemix reconstruction and the original input matrix. The input matrix is a 100-by-9 matrix with random values. In this case, we know that when ($k=9$, $p=1$) or ($\mu=1$, $p=1$), the DataRemix reconstruction is the same as the original matrix and the objective function achieves its maximal value, which is equal to the penalty term we add.
<<echo=TRUE, eval=TRUE>>=
library(DataRemix)
eval <- function(X_reconstruct, X, penalty){
return(-sum((X-X_reconstruct)^2)+penalty)
}#eval
@
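The two parameter settings mentioned above can be checked numerically. The sketch below assumes that the reconstruction has the form $U_k \Sigma_k^p V_k^T + \mu (X - U_k \Sigma_k V_k^T)$, which is consistent with both settings reproducing the input. The helper $remix\_sketch()$ and the variables $X\_toy$ and $svd\_toy$ are hypothetical names, not part of the DataRemix API.

```r
# Hypothetical helper, assuming the reconstruction
#   U_k diag(d_k^p) V_k^T + mu * (X - U_k diag(d_k) V_k^T).
remix_sketch <- function(X, svdres, k, p, mu) {
  U <- svdres$u[, 1:k, drop = FALSE]
  V <- svdres$v[, 1:k, drop = FALSE]
  d <- svdres$d[1:k]
  scaled   <- U %*% diag(d^p, nrow = k) %*% t(V)       # reweighted top-k part
  residual <- X - U %*% diag(d, nrow = k) %*% t(V)     # what rank k misses
  scaled + mu * residual
}

set.seed(1)
X_toy   <- matrix(rnorm(100 * 9), nrow = 100, ncol = 9)
svd_toy <- svd(X_toy)

# Full rank with p = 1: the residual vanishes, so mu does not matter.
all.equal(remix_sketch(X_toy, svd_toy, k = 9, p = 1, mu = 0.3), X_toy)
# mu = 1 with p = 1: the residual restores whatever rank k leaves out.
all.equal(remix_sketch(X_toy, svd_toy, k = 4, p = 1, mu = 1), X_toy)
```

Under this assumed form, the squared error is zero for either setting, so an objective of the form penalty minus squared error evaluates to the penalty itself.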
First we genrate a random matrix with dimension 100-by-9 and perform the SVD decomposition.
<<echo=TRUE>>=
First we generate a random matrix with dimension 100-by-9 and perform the SVD decomposition.
<<echo=TRUE, eval=TRUE>>=
set.seed(1)
num_of_row <- 100
num_of_col <- 9
X <- matrix(rnorm(num_of_row*num_of_col), nrow = num_of_row, ncol = num_of_col)
svdres <- svd(X)
@
Set mt to be 2000.
<<echo=TRUE>>=
basis_short <- omega[1:2000,]
@
Infer the optimal combinations of k, p and $\mu$. Here $X$ and $penalty$ are additional inputs for the $eval()$ function. If we have the full SVD decomposition, we can leave matrix as NULL. For some large-scale matrices, if the SVD computation is time intensive, we don't need to finish the full SVD. Instead we can just compute the SVD decomposition up to a sufficient rank and inlcude the original gene expression profile to calculate the residual.
<<echo=TRUE>>=
Here $X$ and $penalty$ are additional inputs for the $eval()$ function. If we have the full SVD, we can leave \textbf{matrix} as NULL. For some large-scale matrices, where the SVD computation is time-intensive, we don't need to finish the full SVD. Instead we can compute the SVD up to a sufficient rank and include the original gene expression profile to calculate the residual.
<<echo=TRUE, eval=TRUE>>=
DataRemix.res <- DataRemix(svdres, matrix = NULL, eval,
k_limits = c(1, length(svdres$d)), p_limits = c(-1,1),
mu_limits = c(1e-12,1), num_of_initialization = 5,
num_of_thompson = 50, basis = basis_short, xi = 0.1,
full = T, verbose = F, X = X, penalty = 100)
knitr::kable(cbind(1:55,DataRemix.res$para), align = "l",
col.names = c("Iteration","k","p","mu","Eval"))
@
\section{GTex Correlation Network}
In this section, we define a different task of optimizing the known pathway recovery based on the GTex gene expression data. $corMatToAUC()$ is the main objective function with two inputs: $data$ and $GS$. We formally define the objective as the average AUC across pathways and we also keep track of the average AUPR value. You can refer to the $corMatToAUC()$ document for more information.
<<echo=TRUE>>=
library(DataRemix)
@
Load the data. $GTex\_cc$ stands for the GTex gene correlation matrix with dimension 7294-by-7294 and $canonical$ represents the canonical mSigDB pathways with dimension 7294-by-1330. It takes time to decompose $GTex\_cc$, thus we pre-compute the SVD decomposition of $GTex\_cc$ and load it as $GTex\_svdres$.
<<echo=TRUE>>=
load(url("https://www.dropbox.com/s/o949wkg76k0ccaw/GTex_cc.rdata?dl=1"))
load(url("https://www.dropbox.com/s/wsuze8w2rp0syqg/GTex_svdres.rdata?dl=1"))
load(url("https://github.com/wgmao/DataRemix/blob/master/inst/extdata/canonical.rdata?raw=true"))
#svdres <- svd(GTex_cc)
@
Run corMatToAUC() on the default correlation matrix $GTex\_cc$.
<<echo=TRUE>>=
GTex_default <- corMatToAUC(GTex_cc, canonical)
GTex_default
@
Set mt to be 2000.
<<echo=TRUE>>=
basis_short <- omega[1:2000,]
num_of_thompson = 50, basis = "omega", basis_size = 2000,
xi = 0.1, full = T, verbose = F, X = X, penalty = 100)
@
Infer the optimal combinations of k, p and $\mu$. Here $GS$ is the additional input for the $corMatToAUC()$ function.
<<echo=TRUE>>=
DataRemix.res <- DataRemix(GTex_svdres, GTex_cc,corMatToAUC,
k_limits = c(1, length(GTex_svdres$d)%/%2),
p_limits = c(-1,1), mu_limits = c(1e-12,1),
num_of_initialization = 5, num_of_thompson = 150,
basis = basis_short, xi = 0.1, full = T, verbose = F,
GS = canonical)
knitr::kable(cbind(1:15,DataRemix.res$full[order(DataRemix.res$para[,4],
decreasing = T)[1:15],]), align = "l", col.names = c("Rank",
"k","p","mu","mean AUPR", "mean AUC"))
We can convert the output from $DataRemix()$ into a ranking table with the help of $DataRemix\_display()$. Here we want to check the performance of all 55 sampling steps, which comprise the 5 initialization steps and the 50 Thompson Sampling steps.
<<echo=TRUE, eval=TRUE>>=
DataRemix_display(DataRemix.res, col.names = c("Rank", "k", "p", "mu", "Eval")
, top.rank = 55)
@

\end{document}
64 changes: 44 additions & 20 deletions inst/doc/vignette.log
@@ -1,4 +1,4 @@
This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (preloaded format=pdflatex 2018.10.12) 2 FEB 2019 01:44
This is pdfTeX, Version 3.14159265-2.6-1.40.16 (TeX Live 2015/Debian) (preloaded format=pdflatex 2018.10.12) 1 AUG 2020 16:01
entering extended mode
restricted \write18 enabled.
%&-line parsing enabled.
@@ -261,34 +261,58 @@ e
))
(./vignette-concordance.tex)
LaTeX Font Info: External font `cmex10' loaded for size
(Font) <7> on input line 8.
(Font) <7> on input line 7.
LaTeX Font Info: External font `cmex10' loaded for size
(Font) <5> on input line 8.
LaTeX Font Info: Try loading font information for T1+aett on input line 10.

(Font) <5> on input line 7.
LaTeX Font Info: Try loading font information for TS1+aer on input line 10.
(/usr/local/lib/R/share/texmf/tex/latex/ts1aer.fd
File: ts1aer.fd
)
LaTeX Font Info: Font shape `TS1/aer/m/n' in size <10> not available
(Font) Font shape `TS1/cmr/m/n' tried instead on input line 10.
LaTeX Font Info: Try loading font information for T1+aett on input line 21.
(/usr/share/texlive/texmf-dist/tex/latex/ae/t1aett.fd
File: t1aett.fd 1997/11/16 Font definitions for T1/aett.
) [1

{/var/lib/texmf/fonts/map/pdftex/updmap/pdftex.map}] [2] [3] [4] (./vignette.au
x) )
{/var/lib/texmf/fonts/map/pdftex/updmap/pdftex.map}]
Overfull \hbox (1.34096pt too wide) in paragraph at lines 62--63
[]\T1/aer/bx/n/10 num_of_initialization = 5\T1/aer/m/n/10 : Num-ber of ini-tial
-iza-tion steps be-fore Thomp-
[]

[2]
Overfull \hbox (9.23392pt too wide) in paragraph at lines 72--73
[]$\OML/cmm/m/it/10 col:names \OT1/cmr/m/n/10 = \OML/cmm/m/it/10 c\OT1/cmr/m/n/
10 ("\OML/cmm/m/it/10 Rank\OT1/cmr/m/n/10 "\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 "
\OML/cmm/m/it/10 k\OT1/cmr/m/n/10 "\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 "\OML/cmm
/m/it/10 p\OT1/cmr/m/n/10 "\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 "\OML/cmm/m/it/10
mu\OT1/cmr/m/n/10 "\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 "\OML/cmm/m/it/10 meanAU
PR\OT1/cmr/m/n/10 "\OML/cmm/m/it/10 ; \OT1/cmr/m/n/10 "\OML/cmm/m/it/10 meanAUC
\OT1/cmr/m/n/10 ")$\T1/aer/m/n/10 : The
[]

[3] [4] [5] (./vignette.aux) )
Here is how much of TeX's memory you used:
2669 strings out of 494910
38187 string characters out of 6179835
2698 strings out of 494910
38614 string characters out of 6179835
88955 words of memory out of 5000000
5925 multiletter control sequences out of 15000+600000
12533 words of font info for 30 fonts, out of 8000000 for 9000
5945 multiletter control sequences out of 15000+600000
15654 words of font info for 36 fonts, out of 8000000 for 9000
36 hyphenation exceptions out of 8191
37i,4n,23p,657b,276s stack positions out of 5000i,500n,10000p,200000b,80000s
</usr/share/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx12.pfb></
usr/share/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi10.pfb></usr/sh
are/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmr10.pfb></usr/share/tex
live/texmf-dist/fonts/type1/public/amsfonts/cm/cmsltt10.pfb></usr/share/texlive
/texmf-dist/fonts/type1/public/amsfonts/cm/cmtt10.pfb>
Output written on vignette.pdf (4 pages, 86826 bytes).
37i,4n,23p,695b,280s stack positions out of 5000i,500n,10000p,200000b,80000s
</home/wem26/.texmf-var/fonts/pk/ljfour/jknappen
/ec/tcrm1000.600pk></usr/share/texlive/texmf-dist/fonts/type1/public/amsfonts/c
m/cmbx10.pfb></usr/share/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmbx
12.pfb></usr/share/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmmi10.pfb
></usr/share/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmr10.pfb></usr/
share/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmsltt10.pfb></usr/shar
e/texlive/texmf-dist/fonts/type1/public/amsfonts/cm/cmsy10.pfb></usr/share/texl
ive/texmf-dist/fonts/type1/public/amsfonts/cm/cmtt10.pfb>
Output written on vignette.pdf (5 pages, 117431 bytes).
PDF statistics:
38 PDF objects out of 1000 (max. 8388607)
25 compressed objects within 1 object stream
54 PDF objects out of 1000 (max. 8388607)
37 compressed objects within 1 object stream
0 named destinations out of 1000 (max. 500000)
5 words of extra memory for PDF output out of 10000 (max. 10000000)

Binary file modified inst/doc/vignette.pdf
Binary file not shown.
Binary file modified inst/doc/vignette.synctex.gz
Binary file not shown.
