Skip to content

A collection of commonly used formats for quick lookup.

Notifications You must be signed in to change notification settings

Cloufield/formatbook

Repository files navigation

FormatBook

This is a companion repo for gwaslab.

A collection of commonly used formats for GWAS summmary statistics.

All the formats are stored as json files.

Each format consists of the following info (manually curated):

  1. meta_data: meta data, inlcluding software name, source urls, version and so on.
  2. format_dict: target format to gwaslab format column-name conversion dictionary

For example : format for metal software

{
"meta_data":{"format_name":"metal",
            "format_source":"https://genome.sph.umich.edu/wiki/METAL_Documentation",
            "format_version":"20220726"
            },
"format_dict":{
            "MarkerName":"SNPID",
            "Allele1":"EA",
            "Allele2":"NEA",
            "Freq1":"EAF",
            "Effect":"BETA",
            "StdErr":"SE",
            "P-value":"P",
            "Direction": "DIRECTION"
            }
}

Supported formats:

Keyword Software Description Citation
ssf GWAS Catalog GWAS-SSF Hayhurst, J., Buniello, A., Harris, L., Mosaku, A., Chang, C., Gignoux, C. R., ... & Barroso, I. (2022). A community driven GWAS summary statistics standard. bioRxiv.
gwascatalog GWAS Catalog GWAS Catalog format (outdated; please use GWAS-SSF) Buniello, A., MacArthur, J. A. L., Cerezo, M., Harris, L. W., Hayhurst, J., Malangone, C., ... & Parkinson, H. (2019). The NHGRI-EBI GWAS Catalog of published genome-wide association studies, targeted arrays and summary statistics 2019. Nucleic acids research, 47(D1), D1005-D1012.
pgscatalog PGS Catalog PGS Catalog format Lambert, S. A., Gil, L., Jupp, S., Ritchie, S. C., Xu, Y., Buniello, A., ... & Inouye, M. (2021). The Polygenic Score Catalog as an open database for reproducibility and systematic evaluation. Nature Genetics, 53(4), 420-425.
plink PLINK PLINK output format (only unambiguous headers) Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
plink_assoc / plink_fisher PLINK .assoc, .assoc.fisher Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
plink_dosage PLINK .assoc.dosage Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
plink_logistic PLINK .assoc.logistic Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
plink_linear PLINK .assoc.linear Purcell, S., Neale, B., Todd-Brown, K., Thomas, L., Ferreira, M. A., Bender, D., ... & Sham, P. C. (2007). PLINK: a tool set for whole-genome association and population-based linkage analyses. The American journal of human genetics, 81(3), 559-575.
plink2 PLINK2 PLINK2 output format (only unambiguous headers) Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4(1), s13742-015.
plink2_linear PLINK2 .glm.linear Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4(1), s13742-015.
plink2_logistic / plink2_firth PLINK2 .glm.firth, .glm.logistic[.hybrid] Chang, C. C., Chow, C. C., Tellier, L. C., Vattikuti, S., Purcell, S. M., & Lee, J. J. (2015). Second-generation PLINK: rising to the challenge of larger and richer datasets. Gigascience, 4(1), s13742-015.
saige SAIGE output format Zhou, W., Nielsen, J. B., Fritsche, L. G., Dey, R., Gabrielsen, M. E., Wolford, B. N., ... & Lee, S. (2018). Efficiently controlling for case-control imbalance and sample relatedness in large-scale genetic association studies. Nature genetics, 50(9), 1335-1341.
regenie REGENIE output format Mbatchou, J., Barnard, L., Backman, J., Marcketta, A., Kosmicki, J. A., Ziyatdinov, A., ... & Marchini, J. (2021). Computationally efficient whole-genome regression for quantitative and binary traits. Nature genetics, 53(7), 1097-1103.
fastgwa FASTGWA output format Jiang, L., Zheng, Z., Qi, T., Kemper, K. E., Wray, N. R., Visscher, P. M., & Yang, J. (2019). A resource-efficient tool for mixed model association analysis of large-scale data. Nature genetics, 51(12), 1749-1755.
metal METAL output format Willer, C. J., Li, Y., & Abecasis, G. R. (2010). METAL: fast and efficient meta-analysis of genomewide association scans. Bioinformatics, 26(17), 2190-2191.
mrmega MRMEGA output format Mägi, R., Horikoshi, M., Sofer, T., Mahajan, A., Kitajima, H., Franceschini, N., ... & Morris, A. P. (2017). Trans-ethnic meta-regression of genome-wide association studies accounting for ancestry increases power for discovery and improves fine-mapping resolution. Human molecular genetics, 26(18), 3639-3650.
fuma FUMA input format atanabe, K., Taskesen, E., Van Bochoven, A., & Posthuma, D. (2017). Functional mapping and annotation of genetic associations with FUMA. Nature communications, 8(1), 1-11.
ldsc LDSC input format Bulik-Sullivan, B. K., Loh, P. R., Finucane, H. K., Ripke, S., Yang, J., Patterson, N., ... & Neale, B. M. (2015). LD Score regression distinguishes confounding from polygenicity in genome-wide association studies. Nature genetics, 47(3), 291-295.
locuszoom LocusZoom input format Pruim, R. J., Welch, R. P., Sanna, S., Teslovich, T. M., Chines, P. S., Gliedt, T. P., ... & Willer, C. J. (2010). LocusZoom: regional visualization of genome-wide association scan results. Bioinformatics, 26(18), 2336-2337.
vcf GWAS-VCF gwas-vcf format Lyon, M. S., Andrews, S. J., Elsworth, B., Gaunt, T. R., Hemani, G., & Marcora, E. (2021). The variant call format provides efficient and robust storage of GWAS summary statistics. Genome biology, 22(1), 1-10.
bolt_lmm BOLT-LMM output format Loh, P. R., Tucker, G., Bulik-Sullivan, B. K., Vilhjalmsson, B. J., Finucane, H. K., Salem, R. M., ... & Price, A. L. (2015). Efficient Bayesian mixed-model analysis increases association power in large cohorts. Nature genetics, 47(3), 284-290.
popcorn popcorn input format Brown, B. C., Ye, C. J., Price, A. L., & Zaitlen, N. (2016). Transethnic genetic-correlation estimates from summary statistics. The American Journal of Human Genetics, 99(1), 76-88.
cojo cojo input format Yang et al. (2012) Conditional and joint multiple-SNP analysis of GWAS summary statistics identifies additional variants influencing complex traits. Nat Genet 44(4):369-375.

Future update: To add fields in meta_data:

  1. format_cite_name : formal name of the format, e.g. GWAS-SSF v0.1
  2. format_separator : separator used in the format, e.g. \t
  3. format_na : NA notation in the format, e.g. #NA
  4. format_comment : comment line, e.g. #
  5. format_col_order: column order

About

A collection of commonly used formats for quick lookup.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages