Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG]: hvcfs have repeated haplotype IDs #293

Open
jesse-hill opened this issue Mar 12, 2025 · 2 comments
Open

[BUG]: hvcfs have repeated haplotype IDs #293

jesse-hill opened this issue Mar 12, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@jesse-hill
Copy link

jesse-hill commented Mar 12, 2025

Description

When exporting individual hvcfs from Tiledb with create-maf-vcf I've noticed that across hvcf files, I can get the same haplotype ID hashes returned. For example, if I have a Reference line, and a few members within my PHG (lines A, B, C, D), I can run into instances in individual hvcf files where I get the same hapid returned.

For example, I can get the MD5 hash: 6cdd65b8cc357b8e38f8de43908bb783 returned as the hap ID for each of those genomes:

grep "6cdd65b8cc357b8e38f8de43908bb783" *.h.vcf

A.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: A",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=A,Regions=chr26:151475374-151480693,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
A.h.vcf:chr26      151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2
B.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: B",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=B,Regions=chr26:152817069-152822388,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
B.h.vcf:chr26      151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2
C.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: C",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=C,Regions=chr26:151741696-151747015,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
C.h.vcf:chr26       151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2
D.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: D",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=D,Regions=chr26:151838955-151844274,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
D.h.vcf:chr26       151778105       .       A       <6cdd65b8cc357b8e38f8de43908bb783>      .       .       END=151783421   GT:AD:DP        1:0,2,0:2

My assumption was that the hapids were supposed to be unique between the different hvcfs. I think this may also be leading to some downstream issues with rPHG2 as well, specifically with readHapIds() and readHapIdMetaData() because when the hap IDs are overwritten in the hvcf, you can't tell which source genome a hapID came from, and thus, you can't get it's corresponding coordinates to do other stuff with. It also gives off the impression that when the hapIDs are the same, the haplotypes are shared/collapsed, which isn't the case. I would expect shared hapids for the RefChecksums, but not the genome specific hapids.

Expected behavior

Unique MD5 hashes for each hvcf.

PHG version

2.4.45.200

@jesse-hill jesse-hill added the bug Something isn't working label Mar 12, 2025
@zrm22
Copy link
Collaborator

zrm22 commented Mar 12, 2025

HapIds are by design not unique across different samples. The PHGv2 will hash the sequence formed from the assembly and will then use that hash as the id. The haplotypes are in fact shared as their sequence does match based on the anchor wave alignment. You can view the Metadata as a single representative Haplotype for that ID. There is the possibility due to MD5 hashing that you could have different sequences showing the same hash ID but this is quite rare.

@btmonier can likely provide examples of how to get the information you are looking for out of rPHG2.

@btmonier
Copy link
Member

btmonier commented Mar 12, 2025

@zrm22 - this leads into an issue with rPHG2 since the HaplotypeGraph object that is made from the hVCF files will only report one set of assembly coordinate regions for samples that share the same haplotype ID. I think this could be resolved if we create a data object that will create a collection of reported regions. For example:

val hapIdAsmRegions = mapOf(
    "6cdd65b8cc357b8e38f8de43908bb783" to  mapOf(
        "A" to List<Pair<Position, Position>>, // e.g., `Regions` key for sample "A"
        "B" to List<Pair<Position, Position>>,
        "C" to List<Pair<Position, Position>>,
        "D" to List<Pair<Position, Position>>
    ),
    //...
)

Doing something like this, I could evaluate that in R when I construct the HaplotypeGraph object and make something simpler like a data.frame object.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants