Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Quality control protocol for TWAS #2

Open
gaow opened this issue Jun 30, 2020 · 4 comments
Open

Quality control protocol for TWAS #2

gaow opened this issue Jun 30, 2020 · 4 comments
Assignees
Labels
Current action New feature or request Discussions Extra attention is needed Reference Improvements or additions to documentation

Comments

@gaow
Copy link
Owner

gaow commented Jun 30, 2020

No description provided.

@hsun3163
Copy link
Collaborator

hsun3163 commented Jul 14, 2020

QC, Normalization, and batch effect correction of RNASeq data for TWAS

Reference: NIHMS1518564, detailed protocol they used are included.

This article summarized the RNASeq QC and data preprocessing steps to be feed into TWAS. For RNASeq data that passed the initial fastQC procedures, they shall be normalized into CPM via TMM and then passed into sva comBat procedure to control for batch effect. The outcome then can be feeds into the FUSION for weight generation and TWAS

Target for Input to the TWAS pipeline:

RNA Seq data:

  1. Passed both RNA-seq and genotyping quality control (as in raw read by fastQC )
  2. The log-transformed, SSVA-corrected expression data
  3. Inverse-normal transformed (rank offset = 3/8)

SNP data: For each gene, all the SNPs [within 1 Mb of its start or end site] (Optional) that had GWAS statistics

RNA Seq data preprocessing protocols

RNA-seq and genotyping quality control

  1. Normalization was performed using Trimmed Mean of M-values (TMM) in Counts per Million (CPM) using edgeR (version 3.18.1). then converted into log2 CPM with an offset of 1. normal quantile transformation was applied instead to log2(CPM) values.
  2. RPKM actually builds on CPM by adding feature length normalization, edgeR's implementation calculates RPKM by simply dividing each feature's CPM by that feature's length multiplied by one thousand. This adds feature length normalization to sequencing depth-normalized counts.
  3. From RPKM to CPM : RPKM*1000/(Feature length), need to know feature length

SSVA-corrected expression data (Batch correction)

  1. Compute the sva corrected data using comBat
    i. Find the batch effect (Primary variable: SNP, can include other potential covariate like ages, the idea is to remove the )
    ii. Apply comBat function
  2. Exclusion criteria for negative control genes in SSVA based on known literature, Examples are
    i. Genes within 100 Kb of linkage disequilibrium of known 34 AMD susceptibility loci identified in the most recent GWAS study for AMD20
    ii. RetNet (retinal Information Network) genes (see URLs),
    iii. AMD candidate genes from PubMed literature search over the last five years (see Weighted Gene-correlation Network Analysis in Methods),
    iv. aging- and gender-associated genes from GTEx analysis21,
    v. X and Y chromosomal genes, and
    vi. genes that did not meet the expression-level threshold >= 1 CPM in >= 10% of all samples.
  3. Gave up PEER because it can not be installed in anyway including dockers and no more maintenance are available

NIHMS1518564-supplement-1.docx

@gaow
Copy link
Owner Author

gaow commented Jul 29, 2020

What ROSMAP eQTL analysis did for covariates, according to Annie Lee at Columbia Neurology, is the following:

we started with a model including batch and then successively add more technical variables if the technical variables are significantly associated with the PCs calculated from the residuals of the previous model. The technical variables we considered are "pmi, study, Batch, LOG_ESTIMATED_LIBRARY_SIZE, LOG_PF_READS_ALIGNED, LogAlignedReads, PCT_PF_READS_ALIGNED, PCT_USABLE_BASES, PCT_CODING_BASES, PCT_UTR_BASES, PCT_MRNA_BASES, PCT_INTRONIC_BASES, PCT_INTERGENIC_BASES, PCT_RIBOSOMAL_BASES, MEDIAN_CV_COVERAGE, MEDIAN_5PRIME_BIAS, MEDIAN_3PRIME_BIAS, MEDIAN_5PRIME_TO_3PRIME_BIAS, PERCENT_DUPLICATION". Somehow we did not included the RIN score for consideration. Hans may know the reason for this.

@gaow
Copy link
Owner Author

gaow commented Jul 31, 2020

Turns out the reason they dont control for hidden covariates are:

  1. They are confident that the carefully curated technical variables are good enough to remove artifacts
  2. They would like to "clean" the RNA-seq data only once and use it for various other types of analysis. Although removing hidden confounders is justified in QTL analysis, for other tasks where the biological signals captured by hidden covariates actually matters, they don't want to remove such covariates.

@hsun3163 I think we can possibly get away with just using these explicit technical variables for reasons discussed above. However I think we should add one more explicit covariate -- Alzheimer's disease status. Because if some SNPs are correlated with AD status, and if some gene experssion are also correlated with AD status, then conditional on AD status the SNP and expression should not have a correlation, thus the SNP is not an eQTL. However without controlling for AD status the SNP will have a spurious association with the expression level. This would not be a concern if we use SVA type of analysis because some factors learned should already capture and account for AD status. When we use explicit features we just have to be more careful.

@gaow gaow added Discussions Extra attention is needed Reference Improvements or additions to documentation Current action New feature or request labels Feb 14, 2021
@gaow
Copy link
Owner Author

gaow commented Feb 14, 2021

@hsun3163 no rush but let's organize this into a notebook as you revamp your repo, then close the issue. We can visit this after next week.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Current action New feature or request Discussions Extra attention is needed Reference Improvements or additions to documentation
Projects
None yet
Development

No branches or pull requests

2 participants