FAQ
Welcome to the MungeSumstats wiki! Here we list frequently asked questions about formatting summary statistics with MungeSumstats, to help you run it in the way best suited to your data and needs.
MungeSumstats can be used for the standardisation and quality control of most summary statistics from GWAS (and from xQTL studies). To run MungeSumstats, download the latest release version from Bioconductor (BiocManager::install("MungeSumstats")) and run MungeSumstats::format_sumstats(path_to_sumstats). This will standardise the column headers using a mapping file, map to a different genome build if required and perform many checks, for example whether each listed SNP's reference allele matches the reference genome. More information on all quality control checks and their defaults is given in our getting started vignette.
Although MungeSumstats is designed to work by default for most users, there are cases where changing these defaults may be preferable. Moreover, it can be daunting to get through all the documentation when you have a specific question about a use case. As such, we have collated below the FAQs which have come up most often since we released MungeSumstats.
1. Do I need to reformat the column headers of my summary statistics file before running MungeSumstats?
The answer, for most cases, is no. MungeSumstats can handle many column name formats without issue (it is case insensitive and has mappings for all common synonyms, e.g. BP, Pos, Position and Base Pair Position all represent the genomic location of a SNP). However, if you run MungeSumstats and it does not recognise a column header that corresponds to one of the standardised names
SNP, BP, CHR, A1, A2, P, Z, OR, BETA, LOG_ODDS, SIGNED_SUMSTAT, N, N_CAS, N_CON, NSTUDY, INFO or FRQ,
you can either:
- update the mapping file data("sumstatsColHeaders"), following the approach in the data.R file, to add the necessary mapping, then open a Pull Request on GitHub and we will incorporate the change into the package; or
- create an issue explaining the missing mapping with a small, executable example and we will add it for you.
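To make the idea of the mapping file concrete, here is a minimal base R sketch of case-insensitive header standardisation. The small mapping table and the standardise_names() helper are hypothetical and written only for illustration; the column names Uncorrected and Corrected are assumed here, the table that ships with the package is far larger, and the package's own implementation may differ.

```r
# Hypothetical, cut-down mapping table for illustration only.
# The real mapping ships with the package as data("sumstatsColHeaders").
mapping <- data.frame(
  Uncorrected = c("BP", "POS", "POSITION", "BASE PAIR POSITION", "RSID", "CHROM"),
  Corrected   = c("BP", "BP", "BP", "BP", "SNP", "CHR"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: upper-case each header so matching is case insensitive,
# look it up in the mapping, and leave unrecognised headers unchanged.
standardise_names <- function(col_names, mapping) {
  idx <- match(toupper(col_names), mapping$Uncorrected)
  ifelse(is.na(idx), col_names, mapping$Corrected[idx])
}

raw_headers <- c("rsid", "Chrom", "Base Pair Position", "my_custom_col")
new_headers <- standardise_names(raw_headers, mapping)
print(new_headers)
```

A header such as my_custom_col, which has no entry in the mapping, passes through untouched. This is the situation in which adding a mapping or opening an issue is worthwhile.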
One caveat to the above: it may be best to rename the column headers yourself if your file contains the columns A1 and A2 for the effect and non-effect alleles. The issue with this naming is that different groups interpret A1/A2 differently. For one group A2 may be the effect allele, and for another it may be A1. There are checks for this in MungeSumstats, which will be discussed in more detail in these FAQs (TODO add Q), but if you know which column is which, it is best to rename them to effect_allele/non-effect_allele so that no misinterpretation can happen. When MungeSumstats formats your file, A2 will be the effect allele and A1 will be the non-effect allele.
2. Does MungeSumstats follow a standard allele naming convention?
Yes! MungeSumstats uses the same allele naming convention as GWAS VCF and IEU GWAS, where A2 is the effect allele and A1 is the reference (non-effect) allele. See more about this in our publication.
3. MungeSumstats removed a lot of SNPs from my summary statistics. What should I do?
If this happens to you, look into what caused so many SNPs to be removed. MungeSumstats reports exactly how many SNPs are removed after each quality control step and, at the end, the overall proportion of SNPs remaining. Some checks that are enabled by default can remove a large proportion of SNPs, such as the INFO score filter at 0.9 (MungeSumstats::format_sumstats(..., INFO_filter = 0.9)) or the removal of non-bi-allelic SNPs (MungeSumstats::format_sumstats(..., bi_allelic_filter = TRUE)). More on non-bi-allelic SNPs and what to do with them in Q.8.
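As a rough illustration of how a single filter can remove a sizeable share of SNPs, the base R sketch below applies an INFO threshold of 0.9 to a made-up table and reports what is left. The data are invented, and treating the threshold as a minimum permitted value (keeping INFO >= 0.9) is an assumption of this sketch rather than a statement of the package's internals.

```r
# Toy summary statistics (invented values, for illustration only)
sumstats <- data.frame(
  SNP  = c("rs1", "rs2", "rs3", "rs4", "rs5"),
  INFO = c(0.99, 0.95, 0.85, 0.40, 0.92),
  stringsAsFactors = FALSE
)

info_threshold <- 0.9

# Assumption: SNPs with INFO at or above the threshold are kept
keep <- sumstats$INFO >= info_threshold
n_removed <- sum(!keep)
prop_remaining <- mean(keep)

filtered <- sumstats[keep, ]
message(n_removed, " SNPs removed by the INFO filter; ",
        round(100 * prop_remaining), "% of SNPs remain.")
```

If the proportion remaining is much lower than you expected, relaxing or disabling the relevant check is the first thing to try.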
4. Can I use MungeSumstats to have my summary statistics formatted for use with LDSC?
Yes! MungeSumstats standardises summary statistics, which makes them usable for a whole host of downstream analyses. If your goal is to run LDSC or one of its variants such as s-LDSC, just set MungeSumstats::format_sumstats(..., save_format = "LDSC"). This will update the parameters set for the run to ensure the output is compatible with the LDSC format.
5. Which genome builds and dbSNP versions does MungeSumstats support?
MungeSumstats works for GRCh37 and GRCh38 data and uses dbSNP to validate SNPs. The dbSNP versions available are 144 and 155. To change the version used, set MungeSumstats::format_sumstats(..., dbSNP = 155). Note that we plan to let users supply their own dbSNP build, but this has not yet been implemented. Keep up to date on this here.
6. Does the FRQ column hold the effect allele frequency (EAF) or the minor allele frequency (MAF)?
MungeSumstats represents EAF as FRQ in its output. In some cases EAF and MAF will be the same: when none of the effect alleles being tested is the allele most commonly found in the population at its position (MAF = EAF). However, some SNPs in a GWAS may test the most frequently found allele (the reference/major allele). MungeSumstats allows for this: it does not assume MAF = EAF and imposes no hard constraint on FRQ. It will, however, print a message to let the user know how many SNPs have FRQ values that likely relate to the major allele, for example:
427,945 SNPs (4.6%) have FRQ values > 0.5. Conventionally the FRQ column is intended to show the minor/effect allele frequency.
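The count in a message like this can be reproduced with a couple of lines of base R. The sketch below uses an invented FRQ vector and is only meant to show the arithmetic, not the package's own code.

```r
# Invented effect allele frequencies, for illustration only
frq <- c(0.12, 0.48, 0.73, 0.05, 0.91, 0.50, 0.33, 0.27)

# SNPs whose effect allele is likely the major allele
n_major <- sum(frq > 0.5)
pct_major <- round(100 * n_major / length(frq), 1)

msg <- sprintf("%d SNPs (%.1f%%) have FRQ values > 0.5.", n_major, pct_major)
message(msg)
```

Note that a value of exactly 0.5 is not counted, since the comparison is strict.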
7. How does MungeSumstats handle allele discrepancies? What if one SNP in my file has mislabeled alleles?
Once MungeSumstats has inferred the effect and non-effect alleles, it performs a SNP-level check to ensure that every SNP's reference allele matches the reference genome. For SNPs whose reference allele does not match but whose effect allele does (for example where the allele being tested is the major allele), MungeSumstats will flip the alleles and the effect columns so that they are consistent with the other SNPs (MungeSumstats::format_sumstats(..., allele_flip_check = TRUE)).
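The flipping logic can be sketched in base R as follows. The table and the ref_allele vector are invented, and the choice of which columns to adjust (here BETA is negated and FRQ becomes 1 - FRQ) is an assumption made for illustration; the package handles more effect columns and looks the reference allele up in the reference genome itself.

```r
# Toy summary statistics (A1 = non-effect/reference allele, A2 = effect allele)
sumstats <- data.frame(
  SNP  = c("rs1", "rs2", "rs3"),
  A1   = c("A", "G", "C"),
  A2   = c("G", "T", "T"),
  BETA = c(0.10, -0.20, 0.05),
  FRQ  = c(0.30, 0.80, 0.10),
  stringsAsFactors = FALSE
)

# Hypothetical reference genome bases at each SNP's position
ref_allele <- c("A", "T", "C")

# Flip only where A1 disagrees with the reference but A2 agrees with it
needs_flip <- sumstats$A1 != ref_allele & sumstats$A2 == ref_allele

# Swap the alleles
old_a1 <- sumstats$A1[needs_flip]
sumstats$A1[needs_flip] <- sumstats$A2[needs_flip]
sumstats$A2[needs_flip] <- old_a1

# Adjust the effect columns so they refer to the new effect allele
sumstats$BETA[needs_flip] <- -sumstats$BETA[needs_flip]
sumstats$FRQ[needs_flip]  <- 1 - sumstats$FRQ[needs_flip]

print(sumstats)
```

Only rs2 is flipped in this example: its A1 (G) does not match the reference base (T) but its A2 does.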
8. What are non-bi-allelic SNPs and should I remove them?
Non-bi-allelic SNPs are SNPs at a location in the genome where more than one alternative base has been seen in a population. Unsurprisingly, as more people from diverse populations are sequenced, more and more non-bi-allelic SNPs are found, so later versions of dbSNP contain more of them. MungeSumstats uses a very recent version of dbSNP (155) by default. By default, MungeSumstats removes non-bi-allelic SNPs (MungeSumstats::format_sumstats(..., bi_allelic_filter = TRUE)), as downstream analysis tools often require this. However, given the growing number of these SNPs, I don't believe this is a sensible choice if your downstream analysis does not require it: have a look at the MungeSumstats output to see just how many SNPs you lose at this step. I would keep these SNPs unless it is completely necessary to remove them. (A lot) more on this here.
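To illustrate the definition, the base R sketch below flags SNPs with more than one alternative allele in a small, invented reference table. The comma-separated ALT column is an assumption made for this example and is not a description of how the package stores dbSNP data.

```r
# Invented dbSNP-like table; ALT lists alternative alleles separated by commas
dbsnp <- data.frame(
  SNP = c("rs1", "rs2", "rs3"),
  REF = c("A", "C", "G"),
  ALT = c("G", "T,A", "A"),
  stringsAsFactors = FALSE
)

# Count the alternative alleles recorded for each SNP
n_alt <- lengths(strsplit(dbsnp$ALT, ",", fixed = TRUE))

# A SNP is non-bi-allelic when more than one alternative allele is recorded
non_bi_allelic <- dbsnp$SNP[n_alt > 1]
print(non_bi_allelic)
```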
9. Can I use MungeSumstats with summary statistics from non-European ancestries?
Yes! Overall, MungeSumstats is suitable for use with GWAS summary statistics from any ancestry/population. The reference datasets it uses are the only place where an issue could arise. These are:
- Reference genomes GRCh37/GRCh38, used for checks such as whether the SNP's reference allele is found on the reference genome, and for inferring the genome build
- dbSNP (in its different versions), used for checks such as whether the SNP is present in the dbSNP version of interest (based on RS ID/position and reference/alternative allele), and for imputing any missing data for the SNP
Overall, this will not cause an issue, as both the reference genome and the dbSNP databases account for non-European populations. If you want to be especially careful, just use the latest version of dbSNP we have, v155 (it's the default), and read the MungeSumstats output messages to make sure that a lot of SNPs aren't removed because they aren't found.
See more on this here and here.
10. Can I use MungeSumstats to lift over my summary statistics to a different genome build?
Yes! You can do this while standardising and performing quality control checks with MungeSumstats::format_sumstats(..., convert_ref_genome = 'GRCh38'), which will convert your data to GRCh38 if it is not already on that build. Or you can just lift over your data without performing any other checks:
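# Load the formatted example summary statistics that ship with MungeSumstats
sumstats_dt <- MungeSumstats::formatted_example()
# Lift the coordinates over from hg19 to hg38
sumstats_dt_hg38 <- MungeSumstats::liftover(sumstats_dt = sumstats_dt,
                                            ref_genome = "hg19",
                                            convert_ref_genome = "hg38")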
Note that MungeSumstats performs liftover using chain files from either UCSC or Ensembl. These require an internet connection to be downloaded and cached the first time they are used. Choose the source with MungeSumstats::format_sumstats(..., chain_source = 'ucsc') ("ucsc" or "ensembl"). The UCSC chain files require a license for commercial use, so the Ensembl chain is used by default. Local chain files can also be used by specifying their path with MungeSumstats::format_sumstats(..., local_chain = 'some_path').
11. Can MungeSumstats infer the genome build of my summary statistics?
Yes! If you leave MungeSumstats::format_sumstats(..., ref_genome = NULL), MungeSumstats will infer the build from your data.
If you use MungeSumstats, please cite the original authors of the GWAS as well as:
Alan E Murphy, Brian M Schilder, Nathan G Skene (2021) MungeSumstats: A Bioconductor package for the standardisation and quality control of many GWAS summary statistics. Bioinformatics, btab665, https://doi.org/10.1093/bioinformatics/btab665