norm -m- does not split FORMAT field with missing value correctly #1818

Closed
ikarus97 opened this issue Nov 15, 2022 · 1 comment

Hi,
I have a VCF file with the following line:

##fileformat=VCFv4.2
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Samp1   Samp2
chr1    939398  .       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       G,GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCACCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA     5075.01 .       .       GT:AD:AF:DP     0/0:73,0,0:.:35 0/2:36,2,50:0.023,0.568:88

For the first sample, Samp1, the AF field in the FORMAT column is missing (.).

After running bcftools norm -m -any -f [reference], I got:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo1.norm.vcf.gz demo1.vcf.gz; Date=Tue Nov 15 15:58:25 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Samp1   Samp2
chr1    939398  .       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       G       5075.01 .       .       GT:AD:AF:DP     0/0:73,0:.:35   0/0:36,2:0.023:88
chr1    939398  .       G       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       5075.01 .       .       GT:AD:AF:DP     0/0:73,0::35    0/1:36,50:0.568:88

The Samp2 output is as I expected.
For Samp1, however, I expected both lines to have a missing value (.) for AF: the value was missing before the split, so it makes sense for it to be missing in both lines after the split. Instead, the first line has "." while the second line has an empty AF field (0/0:73,0::35).
The --force option made no difference here.
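
To make the expectation concrete, here is a minimal Python sketch of how I would expect per-allele FORMAT values to be split. This is an illustration only, not bcftools code; the function names split_number_a and split_number_r are hypothetical, and padding short vectors with "." is an assumption about the desired behaviour.

```python
def split_number_a(value, n_alt):
    """Split a Number=A FORMAT value into one value per ALT allele."""
    if value == ".":
        # A missing value stays missing in every split record.
        return ["."] * n_alt
    parts = value.split(",")
    # Assumption: pad short vectors with "." so no record ends up empty.
    parts += ["."] * (n_alt - len(parts))
    return parts[:n_alt]


def split_number_r(value, n_alt):
    """Split a Number=R FORMAT value into 'ref,alt' pairs, one per ALT allele."""
    if value == ".":
        return ["."] * n_alt
    parts = value.split(",")
    parts += ["."] * (n_alt + 1 - len(parts))
    return [f"{parts[0]},{parts[i]}" for i in range(1, n_alt + 1)]


# AF and AD values of Samp1 and Samp2 from the record above (two ALT alleles).
print(split_number_a(".", 2))            # ['.', '.']
print(split_number_a("0.023,0.568", 2))  # ['0.023', '0.568']
print(split_number_r("73,0,0", 2))       # ['73,0', '73,0']
print(split_number_r("36,2,50", 2))      # ['36,2', '36,50']
```

With this logic, Samp1 would get AF "." in both split lines, which is what the single-sample run produces.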

However, when I ran the same command on a file containing only Samp1, I got the results I expected:

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths (counting only informative reads out of the total reads) for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Allele fractions for alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##contig=<ID=chr1>
##bcftools_normVersion=1.16+htslib-1.16
##bcftools_normCommand=norm -m -any -f /media/NFS/ref/b38/Homo_sapiens_assembly38.fasta -O z -o demo2.norm.vcf.gz demo2.vcf.gz; Date=Tue Nov 15 15:58:32 2022
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  Samp1
chr1    939398  .       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       G       5075.01 .       .       GT:AD:AF:DP     0/0:73,0:.:35
chr1    939398  .       G       GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA       5075.01 .       .       GT:AD:AF:DP     0/0:73,0:.:35

I've experimented with various inputs and concluded that the issue occurs only when the field to be split is missing for some samples but not others. I had no problem when all samples had values or when all samples were missing.
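
For anyone who wants to check their own normalized output for the same symptom, a rough Python sketch like the following could flag empty FORMAT subfields. It assumes uncompressed, tab-separated VCF text lines; find_empty_format_fields is a hypothetical helper, not part of bcftools or htslib.

```python
def find_empty_format_fields(vcf_lines):
    """Yield (chrom, pos, alt, sample, key) for every empty FORMAT subfield."""
    samples = []
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        cols = line.split("\t")
        if line.startswith("#CHROM"):
            samples = cols[9:]  # sample names follow the FORMAT column
            continue
        keys = cols[8].split(":")
        for name, col in zip(samples, cols[9:]):
            for key, value in zip(keys, col.split(":")):
                if value == "":  # e.g. the AF slot in "0/0:73,0::35"
                    yield cols[0], cols[1], cols[4], name, key


# The second split record from the two-sample output above.
ALT = "GCCTCCCCAGCCACGGTGAGGACCCACCCTGGCATGATCCCCCTCATCA"
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSamp1\tSamp2"
record = "\t".join(["chr1", "939398", ".", "G", ALT, "5075.01", ".", ".",
                    "GT:AD:AF:DP", "0/0:73,0::35", "0/1:36,50:0.568:88"])

hits = list(find_empty_format_fields([header, record]))
print(hits)  # one hit: Samp1, key AF
```

For a bgzipped file, the lines could be read with gzip.open(path, "rt") from the Python standard library before being passed to the function.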

Thank you,
In-Hee Lee

@pd3 pd3 closed this as completed in b5cbcd5 Nov 16, 2022

pd3 (Member) commented Nov 16, 2022

This should be fixed now in b5cbcd5, please try it out. Thank you for the bug report and the test case.
