-
Notifications
You must be signed in to change notification settings - Fork 11
Example 4
This example illustrates how to fit an RSS-BVSR model using variational Bayes (VB) approximation.
Based on the theoretical derivations, we further set the input values of se
and R
such that the VB analysis of summary-level data is equivalent to the VB analysis of individual-level data (Carbonetto and Stephens, 2012). Hence, this example provides an in silico sanity check for our theoretical work.
The summary-level data are computed from a simulated GWAS dataset. The GWAS data are simulated by enrich_datamaker.m
, which makes "signal-enriched" data based on the gene set. Specifically, SNPs outside the gene set are selected to be causal ones with probability 10^theta0
, whereas SNPs inside the gene set are selected with a higher probability 10^(theta0+theta)
, where theta>0
. For more details, see Carbonetto and Stephens (2013).
After generating the data, we feed the individual-level and summary-level data to varbvs
and rss-varbvsr
respectively, and then compare the VB output from these two methods.
To reproduce results of Example 4, please use example4.m.
Before running example4.m, please make sure the VB subroutines of RSS are installed. See instructions here.
Step 1. Download the input data genotype.mat
and AH_chr16.mat
for enrich_datamaker.m
. Please contact us if you have trouble accessing this file.
Step 2. Install the MATLAB
implementation of varbvs
. Please follow the instruction here. After the installation, add varbvs
to the search path (see the example below).
addpath('/home/xiangzhu/varbvs-master/varbvs-MATLAB/');
Step 3. Extract the SNPs that inside the gene set. This step is where we need the input data AH_chr16.mat
. The index of SNPs inside the gene set is stored as snps
.
AH = matfile('AH_chr16.mat');
H = AH.H; % hypothesis matrix 3323x3160
A = AH.A; % annotation matrix 12758x3323
paths = find(H(:,end)); % signal transduction (Biosystem, Reactome)
snps = find(sum(A(:,paths),2)); % index of variants inside the pathway
Step 4. Simulate the enriched dataset. In addition to genotype.mat
and AH_chr16.mat
, four more input data are required in order to run example4.m
. You can provide them through keyboard.
% set the number of replications
prompt = 'What is the number of replications? ';
Nrep = input(prompt);
% set the genome-wide log-odds
prompt1 = 'What is the genome-wide log-odds? ';
theta0 = input(prompt1);
% set the log-fold enrichment
prompt2 = 'What is the log-fold enrichment? ';
theta = input(prompt2);
% set the true pve
prompt3 = 'What is the pve (between 0 and 1)? ';
pve = input(prompt3);
With this in place, the data generation step is as follows.
for i = 1:Nrep
myseed = 617 + i;
% generate data under enrichment hypothesis
[true_para,individual_data,summary_data] = enrich_datamaker(C,thetatype,pve,myseed,snps);
... ...
end
Step 5. Ensure that varbvs
and rss-varbvsr
run in an almost identical environment. We fix all the hyper-parameters as their true values, and give the same parameter initialization.
% fix all hyper-parameters as their true values
sigma = true_para{4}^2; % the true residual variance
sa = 1/sigma; % sa*sigma being the true prior variance of causal effects
sigb = 1; % the true prior SD of causal effects
theta0 = thetatype(1); % the true genome-wide log-odds (base 10)
theta = thetatype(2); % the true log-fold enrichment (base 10)
% initialize the variational parameters
rng(myseed, 'twister');
alpha0 = rand(p,1);
alpha0 = alpha0 ./ sum(alpha0);
rng(myseed+1, 'twister');
mu0 = randn(p,1);
Step 6. Run varbvs
on the individual-level genotypes and phenotypes. The VB analysis involves two steps: first fit the model assuming no enrichment; second fit the model using the enrichment information. We then calculate a Bayes factor based on the marginal likelihoods of these two models, and use it to test whether the gene set is enriched for genetic associations.
fit_null = varbvs(X,[],y,[],'gaussian',options_n);
fit_gsea = varbvs(X,[],y,[],'gaussian',options_e);
bf_b = exp(fit_gsea.logw - fit_null.logw);
Step 7. Run rss-varbvsr
on the single-SNP association summary statistics. Here we do not specify se
and R
as we actually do in our real data analyses. Instead, we set their values such that rss-varbvsr
and varbvs
are expected to produce the same output. Based on our derivation, we find that rss-varbvsr
is equivalent to varbvs
if and only if
- the variance of phenotype is the same as residual variance
sigma.^2
; - the input LD matrix
R
is the same as the sample correlation matrix of cohort genotypesX
Interestingly, these two assumptions are exactly the same assumptions in Proposition 2.1 of RSS paper, which guarantees that the RSS likelihood is equivalent to the Gaussian likelihood of individual-level data.
These two assumptions are implemented as follows.
% set the summary-level data for a perfect match b/t rss-varbvsr and varbvs
betahat = summary_data{1};
se = sqrt(sigma ./ diag(X'*X)); % condition 1 for perfect matching
Si = 1 ./ se(:);
R = corrcoef(X); % condition 2 for perfect matching
SiRiS = sparse(repmat(Si, 1, p) .* R .* repmat(Si', p, 1));
Now we perform the VB analysis of summary data via rss-varbvsr
.
[logw_nr,alpha_nr,mu_nr,s_nr] = rss_varbvsr(betahat,se,SiRiS,sigb,logodds_n,options);
[logw_er,alpha_er,mu_er,s_er] = rss_varbvsr(betahat,se,SiRiS,sigb,logodds_e,options);
bf_r = exp(logw_er-logw_nr);
Step 8. Compare VB output from varbvs
and rss-varbvsr
. Both varbvs
and rss-varbvsr
output an estimated posterior distribution of beta
(the multiple regression coefficients, or, the multiple-SNP effects). Specifically, the estimated distributions have the same analytical form (Equations 6-7 in Carbonetto and Stephens (2012)).
Hence, it suffices to compare the estimated [alpha, mu, s]
, and the estimated Bayes factors based on the VB output. Here is the result when I only run example4.m
with one replication.
>> run example4.m
What is the number of replications? 1
What is the genome-wide log-odds? -4
What is the log-fold enrichment? 2
What is the pve (between 0 and 1)? 0.3
The log10 maximum absolute differences of [alpha, mu, s]
between varbvs
and rss-varbvsr
, under the null model.
>> disp(log10([alpha_n_diff mu_n_diff s_n_diff]))
-4.7084 -4.5409 -7.6421
The log10 maximum absolute differences of [alpha, mu, s]
between varbvs
and rss-varbvsr
, under the enrichment model.
>> disp(log10([alpha_e_diff mu_e_diff s_e_diff]))
-3.9865 -4.0157 -7.6421
The log10 Bayes factors estimated from varbvs
and rss-varbvsr
are 4.0193 and 4.0193 respectively, and the log10 relative difference between them is -6.8345.
To see how example4.m
behave on average, we need run more replication. Below are the results from 100 replications.
>> run example4.m
What is the number of replications? 100
What is the genome-wide log-odds? -4
What is the log-fold enrichment? 2
What is the pve (between 0 and 1)? 0.3