Explain how summary stats works for TWAS #6

Open
gaow opened this issue Aug 11, 2020 · 5 comments
Labels
Current action New feature or request Discussions Extra attention is needed New ideas This doesn't seem right

Comments

@gaow
Owner

gaow commented Aug 11, 2020

@hsun3163 please explain what the model looks like, and what approximations have been made.

@hsun3163
Collaborator

For each gene there are two vectors: the expression weights of the SNPs (W_twas) and the standardized GWAS effect sizes (Z-scores) of the same SNPs (Z_gwas). The inner product of the two vectors gives the unnormalized statistic linking expression and trait (WZ), where under null data (no association) and a multivariate normal assumption Z_gwas ~ N(0, Σs,s):

W_twas %*% Z_gwas

The variance of this weighted sum is the quadratic form of W_twas with the correlation matrix among all SNPs (the LD matrix):

t(W_twas) %*% Cor.LD %*% W_twas
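
A short sketch of why the variance takes this form. The shorthand is introduced here and is not from the original comment: $w$ stands for W_twas, $z$ for Z_gwas, and $R$ for the LD matrix Cor.LD, with $w$ treated as fixed.

```latex
z \sim N(0, R)
\quad\Longrightarrow\quad
\operatorname{Var}\!\left(w^{\top} z\right)
  = w^{\top} \operatorname{Var}(z)\, w
  = w^{\top} R\, w
```

Dividing $w^{\top} z$ by the square root of this quantity therefore gives a statistic with unit variance under the null.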

The imputed Z-score of the cis-genetic effect on the trait is therefore:

Z_twas = (W_twas %*% Z_gwas) / sqrt(t(W_twas) %*% Cor.LD %*% W_twas)
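
The formula can be checked numerically with a minimal sketch in plain Python. The function names `twas_z` and `two_sided_p` and the toy numbers are hypothetical, chosen only for illustration; this is not the FUSION implementation.

```python
import math


def twas_z(weights, z_gwas, ld):
    """Summary-statistics TWAS Z-score: (w' z) / sqrt(w' R w)."""
    n = len(weights)
    if len(z_gwas) != n or len(ld) != n or any(len(row) != n for row in ld):
        raise ValueError("weights, z_gwas and ld must have matching dimensions")
    # Numerator: weighted sum of the GWAS Z-scores.
    numerator = sum(w * z for w, z in zip(weights, z_gwas))
    # Denominator: variance of that sum under the null, the quadratic form w' R w.
    variance = sum(
        weights[i] * ld[i][j] * weights[j] for i in range(n) for j in range(n)
    )
    if variance <= 0:
        raise ValueError("w' R w must be positive")
    return numerator / math.sqrt(variance)


def two_sided_p(z):
    """Two-sided p-value for a statistic that is N(0, 1) under the null."""
    return math.erfc(abs(z) / math.sqrt(2))


# Toy example with three SNPs (made-up values).
weights = [0.5, -0.2, 0.1]
z_gwas = [2.0, -1.0, 0.5]
ld = [
    [1.0, 0.3, 0.0],
    [0.3, 1.0, 0.2],
    [0.0, 0.2, 1.0],
]
z_twas = twas_z(weights, z_gwas, ld)
print(z_twas, two_sided_p(z_twas))
```

Because the weights appear in both numerator and denominator, rescaling them by a positive constant leaves the Z-score unchanged.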

Note that the Z_twas score is still based on the effects of the SNPs on the trait, which may act largely through their impact on protein structure, while taking into account how those same SNPs may also affect the expression level of the protein, which in turn affects the trait as well.

This understanding may raise issues when translating the method to other molecular phenotypes. The effect of a cis SNP on expression directly changes the abundance of mRNA and hence of the protein, but SNP effects on methylation and polyA tails manifest first through mRNA abundance. In other words, the information from methylation and polyA tails is already encoded in the expression weights, which renders them useless in this case.

@gaow
Owner Author

gaow commented Aug 18, 2020

@hsun3163 good work on the summary statistics version explanation. Two suggestions:

  1. Typically when we look into summary stats methods we ask 1) how it is connected to the version using the full data, and 2) what assumptions are made, if the summary stats version is an approximation of the full data version. 2) is very important because that's usually where limitations of summary stats versions are.
  2. Please add references -- I suggest you do that as a good habit for all scientific text you write, even just on github.
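
One way to make point 1 concrete is a small simulation. This is a minimal sketch under stated assumptions and is not taken from FUSION or the cited papers. It assumes genotypes and phenotype are standardized, each GWAS Z-score is approximated as `x_j' y / sqrt(n)`, and LD is computed from the same sample. Under those assumptions, a score-type test on predicted expression from the full data and the summary-statistics formula give the same number. All function names are hypothetical.

```python
import math
import random


def standardize(values):
    """Center to mean 0 and scale to variance 1 (dividing by n)."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def simulate(n_samples=500, n_snps=4, seed=1):
    """Made-up standardized genotype columns, phenotype and weights."""
    rng = random.Random(seed)
    genotypes = []
    for _ in range(n_snps):
        # Allele counts 0/1/2 with allele frequency 0.3, then standardized.
        counts = [(rng.random() < 0.3) + (rng.random() < 0.3) for _ in range(n_samples)]
        genotypes.append(standardize(counts))
    phenotype = standardize([rng.gauss(0, 1) for _ in range(n_samples)])
    weights = [rng.gauss(0, 1) for _ in range(n_snps)]
    return genotypes, phenotype, weights


def full_data_z(genotypes, phenotype, weights):
    """Score-type test of predicted expression against the phenotype.

    Assumes residual variance is about 1, i.e. negligible variance explained.
    """
    n = len(phenotype)
    predicted = [sum(w * col[i] for w, col in zip(weights, genotypes)) for i in range(n)]
    return dot(predicted, phenotype) / math.sqrt(dot(predicted, predicted))


def summary_z(genotypes, phenotype, weights):
    """Same test using only per-SNP Z-scores and the LD matrix."""
    n = len(phenotype)
    z_gwas = [dot(col, phenotype) / math.sqrt(n) for col in genotypes]
    ld = [[dot(a, b) / n for b in genotypes] for a in genotypes]
    k = len(weights)
    variance = sum(weights[i] * ld[i][j] * weights[j] for i in range(k) for j in range(k))
    return dot(weights, z_gwas) / math.sqrt(variance)


genotypes, phenotype, weights = simulate()
print(full_data_z(genotypes, phenotype, weights), summary_z(genotypes, phenotype, weights))
```

In practice the two versions differ because the LD matrix usually comes from a reference panel rather than the GWAS sample, and because reported Z-scores are `beta / se` rather than exactly `x_j' y / sqrt(n)`. Those are the places where the approximation enters.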

> In other words, the information from methylation and polyA tails is already encoded in the expression weights, which renders them useless in this case.

I agree -- it seems you already have some causality model / graph in your head which is very good. But the idea is, that if our model is good enough, the model itself should be able to tell this and decide to "use" additional information or not. This is what the multivariate-TWAS model is meant for.

@hsun3163
Collaborator

hsun3163 commented Aug 27, 2020

For each gene, the Z-score after TWAS is denoted $Z_{TWAS} = \frac{W Z}{(W \Sigma_{s,s} W^{T})^{1/2}}$,

where $W = \Sigma_{e,s} \Sigma_{s,s}^{-1}$. $\Sigma_{e,s}$ is the covariance matrix between the SNPs and gene expression; in practice the weights are populated by the various algorithms in Fusion_Weight_Compute.R. $\Sigma_{s,s}$ is the LD matrix for all SNPs.

$Z$ is a vector of N elements containing the GWAS association statistics for each of the SNPs of said gene (Gusev 2016).

The exact components that specify $Z$ vary based on how β was estimated. In the case of a GWAS between one SNP and the phenotype (Maier 2017):

[equation missing]

where y is the phenotype of interest, N is the total number of SNPs, and $X_j$ is a vector of the population-mean-centered genotypes for the $j$-th SNP corresponding to y.

With the assumption that [equation missing], the full model is (Gusev 2016):

[equation missing]

Under the further assumptions that

  1. Each SNP explains only a negligible portion of the total variance,

  2. Phenotypes are standardized such that Var(y) = 1,

  3. Genotypes are standardized such that Var(Xj) = 1,

the model can be simplified as follows (Maier 2017):
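
A standard way such a simplification works out under the three assumptions, given as a sketch and not as the exact expression from Maier (2017). The symbols are assumptions introduced here: $n$ is the GWAS sample size (distinct from N above), $\hat\beta_j$ is the marginal least-squares estimate for SNP $j$, and $\hat\sigma_e^2$ is the residual variance.

```latex
\begin{aligned}
\hat\beta_j &= \frac{X_j^{\top} y}{X_j^{\top} X_j} = \frac{X_j^{\top} y}{n}
  && \text{since } \operatorname{Var}(X_j) = 1 \\
\operatorname{se}(\hat\beta_j) &= \sqrt{\frac{\hat\sigma_e^{2}}{X_j^{\top} X_j}} \approx \frac{1}{\sqrt{n}}
  && \text{since } \operatorname{Var}(y) = 1 \text{ and SNP } j \text{ explains negligible variance} \\
Z_j &= \frac{\hat\beta_j}{\operatorname{se}(\hat\beta_j)} \approx \sqrt{n}\,\hat\beta_j = \frac{X_j^{\top} y}{\sqrt{n}}
\end{aligned}
```

Under this approximation each GWAS Z-score depends on the data only through $X_j^{\top} y$, which is what allows the test to be written in terms of summary statistics and an LD matrix.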


Gusev et al. (2016). "Integrative approaches for large-scale transcriptome-wide association studies." Nature Genetics.

Maier (2017). "A practical introduction to some theoretical concepts in quantitative genetics." https://rawgit.com/uqrmaie1/statgen_equations/master/statgen_equations.html#summary-of-equations

@gaow
Owner Author

gaow commented Aug 27, 2020

@hsun3163 looks great! let's discuss it later today.

@gaow
Owner Author

gaow commented Sep 23, 2020

@hsun3163 food for thought -- given we have multivariate TWAS predictions (e.g. joint coefficients predicted for gene expression in 3 tissues, plus other phenotypes such as splicing etc.), how do you then perform a TWAS-type test for all the molecular features combined? It is kind of obvious if you work with the full data, but not exactly obvious in the summary statistics space. I think you can try to derive the method, though.

@gaow gaow added Discussions Extra attention is needed New ideas This doesn't seem right Later This issue or pull request already exists labels Feb 14, 2021
@gaow gaow added Current action New feature or request and removed Later This issue or pull request already exists labels Feb 17, 2021