This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Human Genome Variation Map (HGVM) Pilot Project

adamnovak edited this page Jun 1, 2016 · 35 revisions

The HGVM Pilot project aims to create a draft reference structure that represents all “common” genetic variation, providing a means to stably name and canonically identify each variant. It also aims to demonstrate that such a structure can be used to improve upon current standard methods in genomics and to create new ones. It is being run by members of the reference-variation GA4GH task team, and is discussed on their regular biweekly calls. To join, please contact @benedictpaten or @skeenan.

Submission Guidelines

Submitting Graphs

Please produce, for each of the pilot regions you are contributing, a GA4GH data set, containing the following:

  1. A ReferenceSet, containing References, Sequences, and Joins describing your graph topology.

  2. One VariantSet for each of the reference and alternate sequences in the Pilot Test Data for the pilot. Each VariantSet should contain exactly one Allele, whose name field gives the exact FASTA ID of a Pilot Test Data sequence. That Allele should trace a path in the graph that is base-for-base identical to, and oriented the same as, the test data sequence it corresponds to, giving an embedding of that sequence in the graph. It is critical that these paths be specified and that they correspond to the test data sequences, because the downstream evaluations analyze the relationships between these paths in the graph.
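
The base-for-base requirement can be checked before submission. The following is a minimal sketch, assuming a simplified in-memory graph rather than the GA4GH schema itself: `segments` (a mapping from a segment ID to its bases) and the `(segment_id, is_forward)` step format are hypothetical names introduced only for illustration.

```python
# Hypothetical, simplified model: not the GA4GH schema.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def spell_path(segments, path):
    """Concatenate the bases visited by a path.

    `segments` maps a segment ID to its bases; `path` is a list of
    (segment_id, is_forward) steps. A reverse step contributes the
    reverse complement of the segment.
    """
    parts = []
    for segment_id, is_forward in path:
        bases = segments[segment_id]
        parts.append(bases if is_forward else reverse_complement(bases))
    return "".join(parts)


def path_matches(segments, path, expected):
    """True only if the path spells `expected` exactly, base for base."""
    return spell_path(segments, path) == expected


# Toy example: three segments, the middle one traversed in reverse.
segments = {"s1": "ACG", "s2": "TT", "s3": "GA"}
path = [("s1", True), ("s2", False), ("s3", True)]
print(path_matches(segments, path, "ACGAAGA"))
```

The comparison is deliberately exact (case-sensitive, no trimming), so a path that spells the reverse complement of a test sequence, or differs by a single base, fails the check.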

This data set may be provided either as an API endpoint compatible with the GA4GH graph schemas or as an SQL database.

All graph submissions should be added below as a download or API endpoint link, or submitted to @benedictpaten directly.

Pilot Test Data

The project is starting with 6 pilot regions, which will be used to test approaches at a scale more tractable than the complete human genome. The test regions are:

  1. Major Histocompatibility Complex (8 alt haps): chr6:28510119-33480577
  2. Killer Cell Immunoglobulin-Like Receptor (KIR) Gene Cluster (35 alt haps): chr19:54025633-55084318
  3. Spinal Muscular Atrophy (SMA) locus (2 alt haps): chr5:69216818-71614443
  4. BRCA1 locus (reference, CHM1 mole, and LRG sequences): chr17:43044293-43125482
  5. BRCA2 locus (reference, CHM1 mole, and LRG sequences): chr13:32314860-32399849
  6. X chromosome centromere (CENX) reference repeat unit and reads
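
For scripting against these regions, the coordinate strings can be parsed into structured values. This is a small sketch; the `Region` type and `parse_region` helper are hypothetical names. The page does not state whether the coordinates are 0- or 1-based or whether the end is inclusive, so the sketch only parses the values and does not compute lengths.

```python
import re
from typing import NamedTuple


class Region(NamedTuple):
    """A contig name with start and end coordinates, as written on the page."""
    contig: str
    start: int
    end: int


REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d+)$")


def parse_region(text):
    """Parse a string such as 'chr6:28510119-33480577' into a Region."""
    match = REGION_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a region string: {text!r}")
    start, end = int(match["start"]), int(match["end"])
    if start > end:
        raise ValueError(f"start is after end in {text!r}")
    return Region(match["contig"], start, end)


# The five pilot regions that are given as GRCh38 coordinates above.
PILOT_REGIONS = {
    "MHC": parse_region("chr6:28510119-33480577"),
    "KIR": parse_region("chr19:54025633-55084318"),
    "SMA": parse_region("chr5:69216818-71614443"),
    "BRCA1": parse_region("chr17:43044293-43125482"),
    "BRCA2": parse_region("chr13:32314860-32399849"),
}
```

CENX is omitted because it is described by a reference repeat unit and reads rather than by a coordinate range.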

We are using the GRCh38 Human assembly, and are including available alternative haplotype sequences.

Full reference and alt haplotype sequences are available from Adam Novak here or as an archive here for all regions. You can use any data you want to build your graph, but some evaluations are looking for paths in the graph named for (and spelling out) the sequences provided here. If you do use the provided data, ignore the end coordinates given in the FASTA headers; they do not actually reflect the end coordinates of the provided sequences.

Pilot Test Data Details

For the first 5 regions, each region corresponds to a directory, and within that directory there is one ref.fa with the clipped-out GRCh38.p2 primary path sequence for the region (assigned FASTA ID "ref"), and a number of FASTAs with filenames and record IDs bearing the GI numbers of alternate sequences.
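
Since evaluations look up graph paths by FASTA ID, it can help to load the test data keyed by record ID. Below is a minimal sketch, assuming the conventional reading that the record ID is the first whitespace-delimited token of the header line; `read_fasta` is a hypothetical helper, and the example header text is made up. The rest of the header (including any end coordinates, which the page says to ignore) is discarded.

```python
import io


def read_fasta(handle):
    """Read FASTA records from a text handle into {record_id: sequence}.

    The record ID is taken to be the first whitespace-delimited token
    after '>'. Everything else on the header line is ignored.
    """
    records = {}
    current_id = None
    chunks = []
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if current_id is not None:
                records[current_id] = "".join(chunks)
            tokens = line[1:].split()
            current_id = tokens[0] if tokens else ""
            chunks = []
        elif line and current_id is not None:
            chunks.append(line)
    if current_id is not None:
        records[current_id] = "".join(chunks)
    return records


# Toy input standing in for a region's ref.fa plus one alternate sequence.
example = io.StringIO(">ref made-up description\nACGT\nTTGA\n>alt1\nGGC\n")
records = read_fasta(example)
print(sorted(records))
```

With real data, the same function can be applied to an open file, for example `with open("ref.fa") as handle: records = read_fasta(handle)`.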

MHC, SMA, and KIR sequences were extracted using this script. BRCA1 and BRCA2 sequences (including the LRG sequences) were obtained using Nancy Ouyang's script below, and formatted using this script.

The CENX data has been provided by Karen Miga, and is in a slightly different format: ref.fa contains the reference repeat unit, "DXZ1", with its FASTA ID set to ref, while reads.fa contains several independent read sequences from repeats in the region in question. The format is different because there are a few thousand reads, and giving each one its own file would not be practical.

Per-gene sequences for the first 5 regions are available from Nancy Ouyang at Curoverse here, collected using the scripts and methodology described here. FASTA files are named by gene name and NCBI Gene ID, e.g. BRCA1-672.fa. She did not mirror the IMGT HLA contents (which can be obtained here), even though that was requested in the minutes from the DWG meeting, due to their policy.

Structure of the pilot

The GA4GH API now supports a graph model of the reference in which variations can be described. A description of the reference model is contained in the common.avdl file within the schemas.

Currently the plan for the pilot has three parts. Signup for contributions is below.

  1. An implementation of the GA4GH API incorporating the graph model. A minimal graph server is being developed as a branch of the GA4GH reference server. This is being led by the reference server team (see below). The implementation effort should complete all endpoints necessary for the evaluation by the end of May 2015.

  2. The construction of a set of graph implementations, each represented by the GA4GH API. These will be provided by community members. Implementors can either create their own implementation of the GA4GH API serving their graph, or use the reference server implementation developed in (1). The data format used to create a graph genome within the reference server is described below (see Graph Format). For groups unable to host their own server, UCSC will host the server upon request.

  3. The construction of a set of analyses using the GA4GH APIs. This will be provided by members of the group. These may lead to further extension of the APIs.

At the end of the pilot, people who have made a contribution to any of these three areas will be included as authors on a marker paper describing the graphs, the implementations, and the assessments. We are seeking provisional commitments to contribute to these three aspects (please add your name/group below with a brief description).

## Graph Format

To represent a graph in the GA4GH reference server, it must be converted into the SQLite-based format that the server serves from. The format is described here. It closely reflects the AVRO schema present in the GA4GH API. An example graph in this format is shown [here](https://github.com/ga4gh/server/blob/graph/tests/data/graphs/graphData_v023.sql).

Note that for the purposes of the pilot, all sequences and joins can be reported as part of a single ReferenceSet or VariantSet. All declared CallSets are part of that one VariantSet, each CallSet representing a single original sequence used to generate the graph. Then, we can map that CallSet to its corresponding Allele by looking for the unique AlleleCall with ploidy equal to 1.

The provided example dataset demonstrates this kind of setup.

Thus, the following will not be expected of datasets provided for the pilot:

  • multiple reference/variant sets
  • allele calls with ploidy > 1
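
The CallSet-to-Allele lookup described above can be sketched with Python's built-in `sqlite3` module. The table and column names below (`Allele`, `CallSet`, `AlleleCall`, `callSetID`, `alleleID`, `ploidy`) are assumptions made for illustration only; the actual layout is defined by the format description and example dataset linked above and may differ.

```python
import sqlite3

# Hypothetical, simplified tables: check the real format description
# before relying on any of these names.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Allele (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE CallSet (ID INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE AlleleCall (callSetID INTEGER, alleleID INTEGER, ploidy INTEGER);
INSERT INTO Allele VALUES (1, 'ref'), (2, 'alt1');
INSERT INTO CallSet VALUES (1, 'ref'), (2, 'alt1');
INSERT INTO AlleleCall VALUES (1, 1, 1), (1, 2, 0), (2, 2, 1);
""")


def allele_for_callset(conn, callset_name):
    """Return (allele_id, allele_name) for the CallSet's unique ploidy-1 call."""
    rows = conn.execute(
        """
        SELECT a.ID, a.name
        FROM CallSet AS c
        JOIN AlleleCall AS ac ON ac.callSetID = c.ID
        JOIN Allele AS a ON a.ID = ac.alleleID
        WHERE c.name = ? AND ac.ploidy = 1
        """,
        (callset_name,),
    ).fetchall()
    if len(rows) != 1:
        raise LookupError(
            f"expected exactly one ploidy-1 AlleleCall for {callset_name!r}, "
            f"found {len(rows)}"
        )
    return rows[0]


print(allele_for_callset(conn, "alt1"))
```

Raising an error when the ploidy-1 call is missing or duplicated makes a dataset that breaks the pilot's simplifying assumptions fail loudly rather than map to an arbitrary Allele.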

## Timeline

Due to the exploratory and ambitious nature of the pilot, we propose to have two rounds of evaluation. In the first round, prototypes will be submitted and evaluated informally by the group without wider sharing, so there should be no problem in submitting experimental graphs. In the second round, the submitted graphs will be evaluated for publication. The timeline is as follows:

  1. Submission of 1st round prototype graphs - 1st of June 2015
  2. Completion of evaluations of 1st round of prototype graphs - 22nd of June 2015 (Monday before call).
  3. Submission of 2nd round prototype graphs - 15th of July 2015

After the 2nd round of submissions we anticipate working iteratively toward a publication targeted for September 2015.

## Contributors to the pilot (signup below!)

### Graph Contributors

  • Team-UCSC (Adam Novak, Maciek Smuga-Otto, Glenn Hickey, Benedict Paten, Karen Miga, David Haussler). Will provide implementations for all pilot regions. Plans to provide two different implementations for five of the regions: one based upon the Cactus multiple sequence aligner, and one based upon the context-based mapping scheme that we call Camel (sticking with the desert theme). For the CENX region, will provide a graph built by Karen Miga.
  • Team-Hinx - (Erik Garrison @Sanger) Will provide implementations for all pilot regions using the variation graph assembler/aligner vg.
  • Team-BDG - (Frank Nothaft @ Berkeley) Will provide implementations for parts 2 and 3 for all pilot regions using the avocado assembler.
  • Team-Oxford - (Phelim Bradley, Alexander Dilthey, Zamin Iqbal, Jerome Kelleher, Sorina Maciuca, Gil McVean). Will provide implementations for some pilot regions, plus some non-human examples, e.g. the MSP3.4 gene in P. falciparum.
  • Team-Curoverse - (Abram Connelly, Sarah Guthrie, Nancy Ouyang, Jiayong Li, Alexander (Sasha) Wait Zaranek). Graphs for the BRCA1 and BRCA2 regions. [Interactive visualization of BRCA1/2 using tiling method](http://science.curoverse.com/tiling/brca/pgp-graph); YouTube summary.

### API Implementation Contributions

  • Reference-server team (Jerome Kelleher, Danny Colligan, Maciek Smuga-Otto, et al.)

### Analysis Contributors

## Relevant Publications

Members of the group have been developing some theory and prototypes relevant to the pilot. These can be listed here (feel free to add).

Paten, Novak, Haussler paper describing approaches to constructing a reference structure

[Novak, Rosen, Haussler, Paten paper describing mapping to a reference structure (only deals with string to string case, but concept generalises)](http://arxiv.org/pdf/1501.04128v1.pdf)

[Dilthey, Cox, Iqbal, Nelson, McVean paper describing applications of a graph-based reference structure to inference in the MHC](http://biorxiv.org/content/early/2014/07/08/006973)

[Guthrie, Connelly, et al. paper describing tiling approach in 680 public whole genomes](https://dx.doi.org/10.7287/peerj.preprints.1426v1)