Quality control protocol for TWAS #2

gaow · 2020-06-30T17:51:07Z

No description provided.

hsun3163 · 2020-07-14T17:56:37Z

QC, Normalization, and batch effect correction of RNASeq data for TWAS

Reference: NIHMS1518564, detailed protocol they used are included.

This article summarized the RNASeq QC and data preprocessing steps to be feed into TWAS. For RNASeq data that passed the initial fastQC procedures, they shall be normalized into CPM via TMM and then passed into sva comBat procedure to control for batch effect. The outcome then can be feeds into the FUSION for weight generation and TWAS

Target for Input to the TWAS pipeline:

RNA Seq data:

Passed both RNA-seq and genotyping quality control (as in raw read by fastQC )
The log-transformed, SSVA-corrected expression data
Inverse-normal transformed (rank offset = 3/8)

SNP data: For each gene, all the SNPs [within 1 Mb of its start or end site] (Optional) that had GWAS statistics

RNA Seq data preprocessing protocols

RNA-seq and genotyping quality control

Normalization was performed using Trimmed Mean of M-values (TMM) in Counts per Million (CPM) using edgeR (version 3.18.1). then converted into log2 CPM with an offset of 1. normal quantile transformation was applied instead to log2(CPM) values.
RPKM actually builds on CPM by adding feature length normalization, edgeR's implementation calculates RPKM by simply dividing each feature's CPM by that feature's length multiplied by one thousand. This adds feature length normalization to sequencing depth-normalized counts.
From RPKM to CPM : RPKM*1000/(Feature length), need to know feature length

SSVA-corrected expression data (Batch correction)

Compute the sva corrected data using comBat
i. Find the batch effect (Primary variable: SNP, can include other potential covariate like ages, the idea is to remove the )
ii. Apply comBat function
Exclusion criteria for negative control genes in SSVA based on known literature, Examples are
i. Genes within 100 Kb of linkage disequilibrium of known 34 AMD susceptibility loci identified in the most recent GWAS study for AMD20
ii. RetNet (retinal Information Network) genes (see URLs),
iii. AMD candidate genes from PubMed literature search over the last five years (see Weighted Gene-correlation Network Analysis in Methods),
iv. aging- and gender-associated genes from GTEx analysis21,
v. X and Y chromosomal genes, and
vi. genes that did not meet the expression-level threshold >= 1 CPM in >= 10% of all samples.
Gave up PEER because it can not be installed in anyway including dockers and no more maintenance are available

NIHMS1518564-supplement-1.docx

gaow · 2020-07-29T17:09:16Z

What ROSMAP eQTL analysis did for covariates, according to Annie Lee at Columbia Neurology, is the following:

we started with a model including batch and then successively add more technical variables if the technical variables are significantly associated with the PCs calculated from the residuals of the previous model. The technical variables we considered are "pmi, study, Batch, LOG_ESTIMATED_LIBRARY_SIZE, LOG_PF_READS_ALIGNED, LogAlignedReads, PCT_PF_READS_ALIGNED, PCT_USABLE_BASES, PCT_CODING_BASES, PCT_UTR_BASES, PCT_MRNA_BASES, PCT_INTRONIC_BASES, PCT_INTERGENIC_BASES, PCT_RIBOSOMAL_BASES, MEDIAN_CV_COVERAGE, MEDIAN_5PRIME_BIAS, MEDIAN_3PRIME_BIAS, MEDIAN_5PRIME_TO_3PRIME_BIAS, PERCENT_DUPLICATION". Somehow we did not included the RIN score for consideration. Hans may know the reason for this.

gaow · 2020-07-31T03:35:32Z

Turns out the reason they dont control for hidden covariates are:

They are confident that the carefully curated technical variables are good enough to remove artifacts
They would like to "clean" the RNA-seq data only once and use it for various other types of analysis. Although removing hidden confounders is justified in QTL analysis, for other tasks where the biological signals captured by hidden covariates actually matters, they don't want to remove such covariates.

@hsun3163 I think we can possibly get away with just using these explicit technical variables for reasons discussed above. However I think we should add one more explicit covariate -- Alzheimer's disease status. Because if some SNPs are correlated with AD status, and if some gene experssion are also correlated with AD status, then conditional on AD status the SNP and expression should not have a correlation, thus the SNP is not an eQTL. However without controlling for AD status the SNP will have a spurious association with the expression level. This would not be a concern if we use SVA type of analysis because some factors learned should already capture and account for AD status. When we use explicit features we just have to be more careful.

gaow · 2021-02-14T15:23:30Z

@hsun3163 no rush but let's organize this into a notebook as you revamp your repo, then close the issue. We can visit this after next week.

gaow added Discussions Extra attention is needed Reference Improvements or additions to documentation Current action New feature or request labels Feb 14, 2021

gaow assigned hsun3163 Feb 14, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Quality control protocol for TWAS #2

Quality control protocol for TWAS #2

gaow commented Jun 30, 2020

hsun3163 commented Jul 14, 2020 •

edited

Loading

gaow commented Jul 29, 2020

gaow commented Jul 31, 2020

gaow commented Feb 14, 2021

Quality control protocol for TWAS #2

Quality control protocol for TWAS #2

Comments

gaow commented Jun 30, 2020

hsun3163 commented Jul 14, 2020 • edited Loading

QC, Normalization, and batch effect correction of RNASeq data for TWAS

Target for Input to the TWAS pipeline:

RNA Seq data preprocessing protocols

RNA-seq and genotyping quality control

SSVA-corrected expression data (Batch correction)

gaow commented Jul 29, 2020

gaow commented Jul 31, 2020

gaow commented Feb 14, 2021

hsun3163 commented Jul 14, 2020 •

edited

Loading