[BUG]: hvcfs have repeated haplotype IDs #293

jesse-hill · 2025-03-12T14:59:20Z

Description

When exporting individual hvcfs from Tiledb with create-maf-vcf I've noticed that across hvcf files, I can get the same haplotype ID hashes returned. For example, if I have a Reference line, and a few members within my PHG (lines A, B, C, D), I can run into instances in individual hvcf files where I get the same hapid returned.

For example, I can get the MD5 hash: 6cdd65b8cc357b8e38f8de43908bb783 returned as the hap ID for each of those genomes:

grep "6cdd65b8cc357b8e38f8de43908bb783" *.h.vcf

A.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: A",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=A,Regions=chr26:151475374-151480693,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
A.h.vcf:chr26      151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2
B.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: B",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=B,Regions=chr26:152817069-152822388,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
B.h.vcf:chr26      151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2
C.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: C",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=C,Regions=chr26:151741696-151747015,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
C.h.vcf:chr26       151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2
D.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: D",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=D,Regions=chr26:151838955-151844274,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
D.h.vcf:chr26       151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2

My assumption was that the hapids were supposed to be unique between the different hvcfs. I think this may also be leading to some downstream issues with rPHG2 as well, specifically with readHapIds() and readHapIdMetaData() because when the hap IDs are overwritten in the hvcf, you can't tell which source genome a hapID came from, and thus, you can't get it's corresponding coordinates to do other stuff with. It also gives off the impression that when the hapIDs are the same, the haplotypes are shared/collapsed, which isn't the case. I would expect shared hapids for the RefChecksums, but not the genome specific hapids.

Expected behavior

Unique MD5 hashes for each hvcf.

PHG version

2.4.45.200

The text was updated successfully, but these errors were encountered:

zrm22 · 2025-03-12T15:13:45Z

HapIds are by design not unique across different samples. The PHGv2 will hash the sequence formed from the assembly and will then use that hash as the id. The haplotypes are in fact shared as their sequence does match based on the anchor wave alignment. You can view the Metadata as a single representative Haplotype for that ID. There is the possibility due to MD5 hashing that you could have different sequences showing the same hash ID but this is quite rare.

@btmonier can likely provide examples of how to get the information you are looking for out of rPHG2.

btmonier · 2025-03-12T22:56:29Z

@zrm22 - this leads into an issue with rPHG2 since the HaplotypeGraph object that is made from the hVCF files will only report one set of assembly coordinate regions for samples that share the same haplotype ID. I think this could be resolved if we create a data object that will create a collection of reported regions. For example:

val hapIdAsmRegions = mapOf(
    "6cdd65b8cc357b8e38f8de43908bb783" to  mapOf(
        "A" to List<Pair<Position, Position>>, // e.g., `Regions` key for sample "A"
        "B" to List<Pair<Position, Position>>,
        "C" to List<Pair<Position, Position>>,
        "D" to List<Pair<Position, Position>>
    ),
    //...
)

Doing something like this, I could evaluate that in R when I construct the HaplotypeGraph object and make something simpler like a data.frame object.

jesse-hill added the bug Something isn't working label Mar 12, 2025

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[BUG]: hvcfs have repeated haplotype IDs #293

[BUG]: hvcfs have repeated haplotype IDs #293

jesse-hill commented Mar 12, 2025 •

edited by btmonier

Loading

zrm22 commented Mar 12, 2025

btmonier commented Mar 12, 2025 •

edited

Loading

[BUG]: hvcfs have repeated haplotype IDs #293

[BUG]: hvcfs have repeated haplotype IDs #293

Comments

jesse-hill commented Mar 12, 2025 • edited by btmonier Loading

Description

Expected behavior

PHG version

zrm22 commented Mar 12, 2025

btmonier commented Mar 12, 2025 • edited Loading

jesse-hill commented Mar 12, 2025 •

edited by btmonier

Loading

btmonier commented Mar 12, 2025 •

edited

Loading