You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
When exporting individual hvcfs from Tiledb with create-maf-vcf I've noticed that across hvcf files, I can get the same haplotype ID hashes returned. For example, if I have a Reference line, and a few members within my PHG (lines A, B, C, D), I can run into instances in individual hvcf files where I get the same hapid returned.
For example, I can get the MD5 hash: 6cdd65b8cc357b8e38f8de43908bb783 returned as the hap ID for each of those genomes:
grep "6cdd65b8cc357b8e38f8de43908bb783" *.h.vcf
A.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: A",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=A,Regions=chr26:151475374-151480693,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
A.h.vcf:chr26 151778105 . A <6cdd65b8cc357b8e38f8de43908bb783> . . END=151783421 GT:AD:DP 1:0,2,0:2
B.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: B",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=B,Regions=chr26:152817069-152822388,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
B.h.vcf:chr26 151778105 . A <6cdd65b8cc357b8e38f8de43908bb783> . . END=151783421 GT:AD:DP 1:0,2,0:2
C.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: C",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=C,Regions=chr26:151741696-151747015,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
C.h.vcf:chr26 151778105 . A <6cdd65b8cc357b8e38f8de43908bb783> . . END=151783421 GT:AD:DP 1:0,2,0:2
D.h.vcf:##ALT=<ID=6cdd65b8cc357b8e38f8de43908bb783,Description="haplotype data for line: D",Source="/phg_v2/vcf_dbs/assemblies.agc",SampleName=D,Regions=chr26:151838955-151844274,Checksum=6cdd65b8cc357b8e38f8de43908bb783,RefChecksum=b04415d387d6f3505c9ba736c3816a85,RefRange=chr26:151778105-151783421>
D.h.vcf:chr26 151778105 . A <6cdd65b8cc357b8e38f8de43908bb783> . . END=151783421 GT:AD:DP 1:0,2,0:2
My assumption was that the hapids were supposed to be unique between the different hvcfs. I think this may also be leading to some downstream issues with rPHG2 as well, specifically with readHapIds() and readHapIdMetaData() because when the hap IDs are overwritten in the hvcf, you can't tell which source genome a hapID came from, and thus, you can't get it's corresponding coordinates to do other stuff with. It also gives off the impression that when the hapIDs are the same, the haplotypes are shared/collapsed, which isn't the case. I would expect shared hapids for the RefChecksums, but not the genome specific hapids.
Expected behavior
Unique MD5 hashes for each hvcf.
PHG version
2.4.45.200
The text was updated successfully, but these errors were encountered:
HapIds are by design not unique across different samples. The PHGv2 will hash the sequence formed from the assembly and will then use that hash as the id. The haplotypes are in fact shared as their sequence does match based on the anchor wave alignment. You can view the Metadata as a single representative Haplotype for that ID. There is the possibility due to MD5 hashing that you could have different sequences showing the same hash ID but this is quite rare.
@btmonier can likely provide examples of how to get the information you are looking for out of rPHG2.
@zrm22 - this leads into an issue with rPHG2 since the HaplotypeGraph object that is made from the hVCF files will only report one set of assembly coordinate regions for samples that share the same haplotype ID. I think this could be resolved if we create a data object that will create a collection of reported regions. For example:
val hapIdAsmRegions =mapOf(
"6cdd65b8cc357b8e38f8de43908bb783" to mapOf(
"A" to List<Pair<Position, Position>>, // e.g., `Regions` key for sample "A""B" to List<Pair<Position, Position>>,
"C" to List<Pair<Position, Position>>,
"D" to List<Pair<Position, Position>>
),
//...
)
Doing something like this, I could evaluate that in R when I construct the HaplotypeGraph object and make something simpler like a data.frame object.
Description
When exporting individual hvcfs from Tiledb with
create-maf-vcf
I've noticed that across hvcf files, I can get the same haplotype ID hashes returned. For example, if I have a Reference line, and a few members within my PHG (lines A, B, C, D), I can run into instances in individual hvcf files where I get the same hapid returned.For example, I can get the MD5 hash: 6cdd65b8cc357b8e38f8de43908bb783 returned as the hap ID for each of those genomes:
My assumption was that the hapids were supposed to be unique between the different hvcfs. I think this may also be leading to some downstream issues with rPHG2 as well, specifically with
readHapIds()
andreadHapIdMetaData()
because when the hap IDs are overwritten in the hvcf, you can't tell which source genome a hapID came from, and thus, you can't get it's corresponding coordinates to do other stuff with. It also gives off the impression that when the hapIDs are the same, the haplotypes are shared/collapsed, which isn't the case. I would expect shared hapids for the RefChecksums, but not the genome specific hapids.Expected behavior
Unique MD5 hashes for each hvcf.
PHG version
2.4.45.200
The text was updated successfully, but these errors were encountered: