Path Metadata Model
Jouni Siren edited this page Nov 9, 2023 · 7 revisions
There is a `PathMetadata` interface which all `PathHandleGraph`s, and consequently all of the graph types and files used in vg, implement. This page explains the way in which we model path metadata, how that model is implemented in different graph implementations and formats, and how this affects end users of vg tools trying to do analyses.
See also: Changing References
You can see an example of path metadata by running, from the repository's `test` directory:
vg paths --metadata -x test/graphs/rgfa_with_reference.rgfa
This will produce TSV data:
#NAME SENSE SAMPLE HAPLOTYPE LOCUS PHASE_BLOCK SUBRANGE
sample1#2#chr1#0 HAPLOTYPE sample1 2 chr1 0 NO_SUBRANGE
CHM13#0#chr1#0 HAPLOTYPE CHM13 0 chr1 0 NO_SUBRANGE
coolgene[1] GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE coolgene NO_PHASE_BLOCK 1
GRCh38#0#chr1 REFERENCE GRCh38 0 chr1 NO_PHASE_BLOCK NO_SUBRANGE
GRCh37#0#chr1#0 HAPLOTYPE GRCh37 0 chr1 0 NO_SUBRANGE
sample1#1#chr1#0 HAPLOTYPE sample1 1 chr1 0 NO_SUBRANGE
coolgene[7] GENERIC NO_SAMPLE_NAME NO_HAPLOTYPE coolgene NO_PHASE_BLOCK 7
Formatted as a table, that is:
#NAME | SENSE | SAMPLE | HAPLOTYPE | LOCUS | PHASE_BLOCK | SUBRANGE |
---|---|---|---|---|---|---|
sample1#2#chr1#0 | HAPLOTYPE | sample1 | 2 | chr1 | 0 | NO_SUBRANGE |
CHM13#0#chr1#0 | HAPLOTYPE | CHM13 | 0 | chr1 | 0 | NO_SUBRANGE |
coolgene[1] | GENERIC | NO_SAMPLE_NAME | NO_HAPLOTYPE | coolgene | NO_PHASE_BLOCK | 1 |
GRCh38#0#chr1 | REFERENCE | GRCh38 | 0 | chr1 | NO_PHASE_BLOCK | NO_SUBRANGE |
GRCh37#0#chr1#0 | HAPLOTYPE | GRCh37 | 0 | chr1 | 0 | NO_SUBRANGE |
sample1#1#chr1#0 | HAPLOTYPE | sample1 | 1 | chr1 | 0 | NO_SUBRANGE |
coolgene[7] | GENERIC | NO_SAMPLE_NAME | NO_HAPLOTYPE | coolgene | NO_PHASE_BLOCK | 7 |
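For downstream analysis, the TSV output shown above can be loaded into plain records. The following is a minimal Python sketch, assuming the output is tab-separated with the header and the `NO_...` placeholder strings exactly as in the example; the function name `read_path_metadata` is hypothetical and not part of vg.

```python
import csv
import io

# Placeholder strings that appear in the example output when a field is unset.
UNSET = {"NO_SAMPLE_NAME", "NO_HAPLOTYPE", "NO_PHASE_BLOCK", "NO_SUBRANGE"}


def read_path_metadata(tsv_text):
    """Parse `vg paths --metadata` style TSV text into a list of dicts.

    Unset fields become None; all other values are kept as strings.
    (Hypothetical helper, written against the example output only.)
    """
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    # The header line starts with '#', so strip it from the first column name.
    header = [column.lstrip("#") for column in next(reader)]
    rows = []
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        rows.append({
            key: (None if value in UNSET else value)
            for key, value in zip(header, fields)
        })
    return rows


# Three rows copied from the example output above.
EXAMPLE = "\n".join([
    "#NAME\tSENSE\tSAMPLE\tHAPLOTYPE\tLOCUS\tPHASE_BLOCK\tSUBRANGE",
    "sample1#2#chr1#0\tHAPLOTYPE\tsample1\t2\tchr1\t0\tNO_SUBRANGE",
    "coolgene[1]\tGENERIC\tNO_SAMPLE_NAME\tNO_HAPLOTYPE\tcoolgene\tNO_PHASE_BLOCK\t1",
    "GRCh38#0#chr1\tREFERENCE\tGRCh38\t0\tchr1\tNO_PHASE_BLOCK\tNO_SUBRANGE",
])

rows = read_path_metadata(EXAMPLE)
for row in rows:
    print(row["NAME"], row["SENSE"], row["SAMPLE"])
```

In practice the text would come from the standard output of the `vg paths --metadata` command rather than from an inline string.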
From this, we can see that every path has:
- A name, which is a string that uniquely identifies the path. This can be in PanSN format, and may have an additional trailing `#`-delimited or `[]`-enclosed field.
- A sense. A path has exactly one sense: haplotype sense (representing a haplotype that a particular individual has for part of a contig), reference sense (representing a path taken as part of a haploid or diploid linear reference like `GRCh38` or `CHM13`), or generic sense (representing something else, like a gene or an aligned read).
- A sample. For haplotypes, this is the identifier for the sampled individual, like `NA19239` or `HG003`. For references, this is the name of the reference assembly, like `GRCh38`. For generic paths, this is unset.
- A haplotype number, identifying which haplotype of a sample the path belongs to. For haplotype paths, this would be `0` or `1` in a diploid organism. For reference paths, this is meant to be `0` in a haploid reference, and `1` or `2` as appropriate in a diploid reference. For generic paths, this is unset.
- A locus name. This indicates the chromosome or contig, within an assembly, which the path relates to. For a haplotype path derived from a VCF, this would be the VCF contig name that the haplotype is on, like `chr1`. For a haplotype path derived from an assembly, this would be the assembly contig name, like `JAHALY010000007.1`. For a reference path, this is the name of the contig within the reference assembly being expressed. For a generic path, this is the name of the thing that the generic path represents, such as a gene name or user-provided string.
- A phase block. For haplotype paths, this is used when a contig is not phased end to end. In that case, there will be multiple haplotype paths on the contig with different phase block values, with the paths cut apart where phasing is unknown. For reference and generic paths, this is unset; for those paths, you should instead use subrange when multiple pieces of some longer path are present.
- A subrange, which has a start and an optional end coordinate. Positions are 0-based, start-inclusive, and end-exclusive. When this field is used, the path in the graph is part of some larger path that is not entirely in the graph. Multiple paths in the graph can have the same values for all the other metadata fields, as long as their subranges do not overlap. This field is only used for reference and generic paths; it is always unset for haplotype paths. For haplotype paths, the phase block field does much the same thing and should be used instead.
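To make the relationship between names and fields concrete, here is a rough Python sketch that splits a path name into the fields listed above. It is inferred only from the example rows on this page and is not vg's actual parsing code: it assumes that four `#`-separated fields mean a haplotype path, three mean a reference path, and anything else is generic, and it assumes that a subrange is written as a trailing `[start]` or `[start-end]`. The names `PathMeta` and `parse_path_name` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple


class PathMeta(NamedTuple):
    """Hypothetical record holding the metadata fields described above."""
    sense: str
    sample: Optional[str]
    haplotype: Optional[int]
    locus: str
    phase_block: Optional[int]
    subrange: Optional[Tuple[int, Optional[int]]]


# Assumed subrange syntax: a trailing "[start]" or "[start-end]".
_SUBRANGE = re.compile(r"(.*)\[(\d+)(?:-(\d+))?\]")


def parse_path_name(name: str) -> PathMeta:
    """Guess path metadata from a name, using rules inferred from the examples."""
    subrange = None
    match = _SUBRANGE.fullmatch(name)
    if match:
        name = match.group(1)
        end = int(match.group(3)) if match.group(3) is not None else None
        subrange = (int(match.group(2)), end)

    parts = name.split("#")
    # sample#haplotype#locus#phase_block: treated as a haplotype path.
    if len(parts) == 4 and parts[1].isdigit() and parts[3].isdigit():
        sample, haplotype, locus, block = parts
        return PathMeta("HAPLOTYPE", sample, int(haplotype), locus, int(block), subrange)
    # sample#haplotype#locus: treated as a reference path.
    if len(parts) == 3 and parts[1].isdigit():
        sample, haplotype, locus = parts
        return PathMeta("REFERENCE", sample, int(haplotype), locus, None, subrange)
    # Anything else: a generic path whose locus is the whole remaining name.
    return PathMeta("GENERIC", None, None, name, None, subrange)


for example in ["sample1#2#chr1#0", "GRCh38#0#chr1", "coolgene[7]"]:
    print(example, parse_path_name(example))
```

For the example names, this reproduces the corresponding rows of the table: `sample1#2#chr1#0` comes out as haplotype sense with phase block 0, `GRCh38#0#chr1` as reference sense with no phase block, and `coolgene[7]` as generic sense with a subrange starting at 7. A graph may record a path's sense differently from what its name alone suggests, so treat the result as a guess, not as the authoritative metadata.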