Explain how summary stats works for TWAS #6

gaow · 2020-08-11T19:27:27Z

@hsun3163 please explain what the model looks like and what approximations have been made.

hsun3163 · 2020-08-18T00:37:57Z

For each gene there are two vectors: the expression weights of the SNPs (W_twas) and the GWAS Z-scores, i.e. the standardized effect sizes, of the same SNPs (Z_gwas). Their inner product gives an unnormalized score for the association between predicted expression and trait (WZ):

W_twas %*% Z_gwas

Under the null (no association) and a multivariate normal assumption, Z_gwas ~ N(0, Σs,s), where Σs,s is the LD matrix among the SNPs.

The variance of this weighted sum is the quadratic form of W_twas with the correlation matrix among all SNPs (the LD matrix):

W_twas %*% Cor.LD %*% W_twas

The imputed Z-score of the cis-genetic effect on the trait is therefore

Z_twas = (W_twas %*% Z_gwas) / (W_twas %*% Cor.LD %*% W_twas)^(1/2)
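The calculation above can be sketched in plain Python. This is only an illustration under stated assumptions: the function name `twas_z` and all toy numbers are made up, and it is not the FUSION implementation.

```python
import math

def twas_z(weights, z_gwas, ld):
    """Summary-statistics TWAS Z-score: (W . Z) / sqrt(W LD W^T)."""
    m = len(weights)
    # Numerator: weighted sum of the GWAS Z-scores.
    numerator = sum(w * z for w, z in zip(weights, z_gwas))
    # Denominator: quadratic form W LD W^T, the null variance of the numerator.
    variance = sum(weights[i] * ld[i][j] * weights[j]
                   for i in range(m) for j in range(m))
    return numerator / math.sqrt(variance)

# Toy inputs (hypothetical): three SNPs in one gene.
w_twas = [0.5, -0.2, 0.1]
z_gwas = [2.0, 1.0, -0.5]
cor_ld = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]

z_twas = twas_z(w_twas, z_gwas, cor_ld)
print(round(z_twas, 3))
```

With a single SNP the weight cancels up to its sign, so Z_twas reduces to plus or minus the GWAS Z-score of that SNP, which is a quick sanity check for any implementation.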

It occurs to me that the Z_twas score is in fact still based on the effects of the SNPs on the trait, which may act largely via their impact on protein structure, while also taking into account how those same SNPs affect the expression level of the protein, which will no doubt affect the trait as well.

This understanding may raise issues when translating the method to other molecular phenotypes. The effect of a cis SNP on expression directly changes the abundance of mRNA and hence of the protein, but SNP effects on methylation and polyA tails manifest first in mRNA abundance. In other words, the information from methylation and polyA tails is already encoded in the expression weights, which renders them useless in this case.

gaow · 2020-08-18T14:49:33Z

@hsun3163 good work on the summary statistics version explanation. Two suggestions:

- Typically when we look into summary stats methods we ask 1) how it is connected to the version using the full data, and 2) what assumptions are made, if the summary stats version is an approximation of the full data version. 2) is very important because that's usually where the limitations of summary stats versions are.
- Please add references -- I suggest you do that as a good habit for all scientific text you write, even just on github.

> In other words, the information from methylation and polyA tails is already encoded in the expression weights, which renders them useless in this case.

I agree -- it seems you already have some causality model / graph in your head which is very good. But the idea is, that if our model is good enough, the model itself should be able to tell this and decide to "use" additional information or not. This is what the multivariate-TWAS model is meant for.

hsun3163 · 2020-08-27T05:02:44Z

For each gene, the TWAS Z-score is denoted $Z_{\text{twas}}$:

$Z_{\text{twas}} = \frac{\mathbf{W}\mathbf{Z}_{\text{gwas}}}{\sqrt{\text{Var}\left( \mathbf{W}\mathbf{Z}_{\text{gwas}} \right)}} = \frac{\mathbf{W}\mathbf{Z}_{\text{gwas}}}{\sqrt{\mathbf{W}\,\text{Var}\left( \mathbf{Z}_{\text{gwas}} \right)\mathbf{W}^{T}}}$

where $\mathbf{W} = \mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1}$. $\mathbf{\Sigma}_{e,s}$ is the covariance matrix between the SNPs and gene expression, which is estimated by one of several algorithms in Fusion_Weight_Compute.R, and $\mathbf{\Sigma}_{s,s}$ is the LD matrix of all SNPs.

$\mathbf{Z}_{\text{gwas}}$ is a vector with one element per SNP of the gene, containing the GWAS association statistics of those SNPs (Gusev 2016):

$Z_{j,\text{gwas}} = \frac{\beta_{j,\text{gwas}}}{\text{SE}\left( \beta_{j,\text{gwas}} \right)}$

The exact form of $Z_{j,\text{gwas}}$ depends on how β was estimated. For a GWAS that regresses the phenotype on one SNP at a time (Maier 2017):

$\beta_{j,\text{gwas}} = \frac{\mathbf{X}_{j}^{T}y}{\mathbf{X}_{j}^{T}\mathbf{X}_{j}}$

$\text{SE}\left( \beta_{j,\text{gwas}} \right) = \sqrt{\frac{\text{Var}\left( y \right) - \text{Var}\left( \mathbf{X}_{j} \right)\beta_{j,\text{gwas}}^{2}}{\text{Var}\left( \mathbf{X}_{j} \right)N}}$

where y is the phenotype of interest, N is the GWAS sample size (the number of individuals, not the number of SNPs), and $\mathbf{X}_{j}$ is the vector of mean-centered genotypes for the $j^{\text{th}}$ SNP in the same individuals as y.
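A minimal sketch of these single-SNP statistics in plain Python, assuming population variances (dividing by N) and a mean-centered phenotype for Var(y). The function name `single_snp_stats` and the toy data are hypothetical.

```python
import math

def single_snp_stats(x, y):
    """Return (beta, se, z) for the regression of y on one SNP x."""
    n = len(y)                      # GWAS sample size N
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    xc = [xi - x_mean for xi in x]  # mean-centered genotype X_j
    yc = [yi - y_mean for yi in y]  # mean-centered phenotype (assumption)
    var_x = sum(v * v for v in xc) / n
    var_y = sum(v * v for v in yc) / n
    # beta = X_j^T y / X_j^T X_j
    beta = sum(a * b for a, b in zip(xc, yc)) / sum(v * v for v in xc)
    # SE = sqrt((Var(y) - Var(X_j) * beta^2) / (Var(X_j) * N))
    se = math.sqrt((var_y - var_x * beta ** 2) / (var_x * n))
    return beta, se, beta / se

# Toy data (hypothetical): allele counts and phenotype for six individuals.
genotype = [0, 1, 2, 0, 1, 2]
phenotype = [1.0, 3.0, 5.0, 2.0, 2.0, 4.0]

beta, se, z = single_snp_stats(genotype, phenotype)
print(beta, se, z)
```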

With the assumption that $\mathbf{Z}_{\text{gwas}} \sim N(0, \mathbf{\Sigma}_{s,s})$, the full model is (Gusev 2016):

$Z_{\text{twas}} = \frac{\mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1}\left\{ \frac{\frac{\mathbf{X}_{j}^{T}y}{\mathbf{X}_{j}^{T}\mathbf{X}_{j}}}{\sqrt{\frac{\text{Var}\left( y \right) - \text{Var}\left( \mathbf{X}_{j} \right)\left( \frac{\mathbf{X}_{j}^{T}y}{\mathbf{X}_{j}^{T}\mathbf{X}_{j}} \right)^{2}}{\text{Var}\left( \mathbf{X}_{j} \right)N}}} \right\}}{\sqrt{\mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1}\,\mathbf{\Sigma}_{s,s}\left( \mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1} \right)^{T}}}$
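As a side note on the denominator of the full model: assuming $\mathbf{\Sigma}_{s,s}$ is symmetric and invertible (as an LD correlation matrix is taken to be here), the middle factor cancels one of the inverses, so the variance term can be written more compactly:

$\mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1}\,\mathbf{\Sigma}_{s,s}\left( \mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1} \right)^{T} = \mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1}\mathbf{\Sigma}_{e,s}^{T}$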

Under the further assumptions that:

- Each SNP explains only a negligible portion of the total phenotypic variance, $\text{Var}\left( \mathbf{X}_{j} \right)\left( \frac{\mathbf{X}_{j}^{T}y}{\mathbf{X}_{j}^{T}\mathbf{X}_{j}} \right)^{2} \approx 0$
- The phenotype is standardized such that Var(y) = 1
- Genotypes are standardized such that Var(Xj) = 1
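Spelling out what these three assumptions do to the single-SNP statistics (a short worked step that uses only the SE formula from Maier 2017 given earlier):

$\text{SE}\left( \beta_{j,\text{gwas}} \right) = \sqrt{\frac{\text{Var}\left( y \right) - \text{Var}\left( \mathbf{X}_{j} \right)\beta_{j,\text{gwas}}^{2}}{\text{Var}\left( \mathbf{X}_{j} \right)N}} \approx \sqrt{\frac{1 - 0}{1 \cdot N}} = \frac{1}{\sqrt{N}}$

$Z_{j,\text{gwas}} = \frac{\beta_{j,\text{gwas}}}{\text{SE}\left( \beta_{j,\text{gwas}} \right)} \approx \beta_{j,\text{gwas}}\sqrt{N}$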

The model can be simplified as follows (Maier 2017):

$Z_{\text{twas}} = \frac{\mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1}\left\{ \left( \frac{\mathbf{X}_{j}^{T}y}{\mathbf{X}_{j}^{T}\mathbf{X}_{j}} \right)\sqrt{N} \right\}}{\sqrt{\mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1}\,\mathbf{\Sigma}_{s,s}\left( \mathbf{\Sigma}_{e,s}\mathbf{\Sigma}_{s,s}^{-1} \right)^{T}}}$
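To get a feel for how much the simplification costs for one SNP, here is a small simulation in plain Python. It is only a sketch: the sample size, allele frequencies, and effect size are arbitrary assumptions, and the data are simulated rather than taken from any real GWAS.

```python
import math
import random

random.seed(0)
n = 2000  # hypothetical GWAS sample size

def standardize(values):
    """Center to mean 0 and scale to population variance 1."""
    mean = sum(values) / len(values)
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))
    return [(v - mean) / sd for v in values]

# Simulated allele counts (0/1/2) and a phenotype with a weak SNP effect.
genotype = [random.choice([0, 1, 1, 2]) for _ in range(n)]
phenotype = [0.05 * g + random.gauss(0.0, 1.0) for g in genotype]

x = standardize(genotype)   # Var(x) = 1
y = standardize(phenotype)  # Var(y) = 1

beta = sum(a * b for a, b in zip(x, y)) / sum(a * a for a in x)
z_exact = beta / math.sqrt((1.0 - beta ** 2) / n)  # full SE formula with Var(x) = Var(y) = 1
z_approx = beta * math.sqrt(n)                      # simplified form: beta * sqrt(N)

print(z_exact, z_approx)
```

With standardized x and y the two statistics differ exactly by a factor of sqrt(1 - beta^2), so the approximation is tight whenever a single SNP explains little of the phenotypic variance and degrades for large-effect SNPs.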

Gusev et al. "Integrative approaches for large-scale transcriptome-wide association studies." 2016 Nature Genetics

Maier "A practical introduction to some theoretical concepts in quantitative genetics." 2017 https://rawgit.com/uqrmaie1/statgen_equations/master/statgen_equations.html#summary-of-equations

gaow · 2020-08-27T13:52:39Z

@hsun3163 looks great! let's discuss it later today.

gaow · 2020-09-23T15:01:53Z

@hsun3163 food for thought -- given we have multivariate TWAS predictions (e.g. joint coefficients predicted for gene expression in 3 tissues, plus other phenotypes such as splicing etc.), how do you then perform a TWAS-type test for all the molecular features combined? It is kind of obvious if you work with the full data, but not exactly obvious in the summary statistics space. I think you can try to derive the method, though.

gaow added the Discussions, New ideas, and Later labels on Feb 14, 2021

gaow assigned nickluo1313 on Feb 17, 2021

gaow added the Current action label and removed the Later label on Feb 17, 2021
