the result consists of the character of "RY". #24

Closed
U201412486 opened this issue May 11, 2020 · 5 comments

Comments

@U201412486

Hi,
When I run the command below, it produces an output file named "input.min1.phy".
python vcf2phylip.py -i input.vcf
However, the file "input.min1.phy" contains the characters "R" and "Y", which are not among the DNA nucleotides "A", "G", "T" and "C".
Is this normal?
input.zip
Sun

@edgardomortiz
Owner

edgardomortiz commented May 11, 2020

Hi Sun,

The script was designed to convert VCFs into phylogenetic matrices. These matrices usually contain several samples, and the alleles of each genotype are collapsed into a single ambiguity code. Your VCF has only one sample, and the R and Y characters are IUPAC ambiguity codes (R = G or A, Y = C or T). In other words, your heterozygous genotypes are now represented by ambiguity codes in the output.
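To illustrate the collapsing described above, here is a minimal sketch of how two alleles of a biallelic SNP genotype could be mapped to a single character. It is not taken from vcf2phylip.py; the names `IUPAC_PAIRS` and `collapse_genotype` are hypothetical, and the sketch assumes single-nucleotide alleles only.

```python
# Standard IUPAC codes for unordered pairs of different nucleotides.
IUPAC_PAIRS = {
    frozenset("AG"): "R",
    frozenset("CT"): "Y",
    frozenset("CG"): "S",
    frozenset("AT"): "W",
    frozenset("GT"): "K",
    frozenset("AC"): "M",
}


def collapse_genotype(allele1, allele2):
    """Hypothetical helper: collapse two single-nucleotide alleles into one character.

    Homozygous genotypes keep their nucleotide; heterozygous genotypes
    become the matching IUPAC ambiguity code.
    """
    a, b = allele1.upper(), allele2.upper()
    if a == b:
        return a
    # Raises KeyError for anything other than A, C, G, T (assumption: SNPs only).
    return IUPAC_PAIRS[frozenset((a, b))]


print(collapse_genotype("G", "A"))  # heterozygous G/A
print(collapse_genotype("C", "T"))  # heterozygous C/T
print(collapse_genotype("A", "A"))  # homozygous A/A
```

Under these assumptions, a G/A heterozygote comes out as R and a C/T heterozygote as Y, which matches the characters seen in the .phy file.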

As per another user's request (#23), I will add an option to output only the ALT nucleotide instead of representing the heterozygote as an ambiguity code, but I haven't had time to add that code yet. Is this a feature that you would find useful? Or, what kind of output were you expecting?

Edgardo

@U201412486
Author

Hi Edgardo,
Thank you for your reply.
The output represented by ambiguity codes is what I expected. However, the script does not seem to recognize the case where homozygous genotypes are represented by a point (".") in a VCF file that was produced by merging several individual VCF files.
Sun

@edgardomortiz
Owner

Would you mind elaborating? I don't quite get the "represented by the point" part. Also, is this VCF merged from several individuals? I am sure I am misunderstanding something.

@U201412486
Author

Homozygous reference genotypes are represented by "0/0" in the VCF format, which means the nucleotides are the same as the reference. However, these genotypes are represented by "." after merging multiple VCF files into a single VCF file with the vcf-merge software. The merged VCF file looks like this:
NC_0009 4411093 . G A 226.50 PASS AC=8;AN=8;BQB=0.786616;DP4=4,0,316,349;DP=778;MQ0F=0;MQ=60;MQB=0.997475;MQSB=1;RPB=0.708333;SF=25,26,27,28;SGB=-0.693147;VDB=0.156993 GT:PL . . . . . . . . . . . . . . . . . . . . . . . . . 1/1:255,255,0 1/1:255,255,0 1/1:255,132,0 1/1:255,255,0 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

@edgardomortiz
Owner

The "." represents a missing genotype in the VCF format. The genotypes are not modified by the merge; what you are seeing is that not all of your samples have the same set of genotypes. Just in case, here is the format specification:
http://samtools.github.io/hts-specs/VCFv4.2.pdf (Page 6)
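As a rough illustration of the distinction between "0/0" and ".", the sketch below decodes the GT subfield of one sample column into nucleotides and returns `None` when the genotype is missing. The function `decode_genotype` is hypothetical and is not part of vcf2phylip.py; it assumes GT is the first subfield of the sample column (as in the `GT:PL` record above) and treats partially missing calls such as "./1" as missing.

```python
def decode_genotype(sample_field, ref, alts):
    """Hypothetical helper: return the allele bases of one sample, or None if missing.

    sample_field: one sample column from a VCF record, e.g. "1/1:255,255,0" or "."
    ref:          the REF allele, e.g. "G"
    alts:         list of ALT alleles, e.g. ["A"]
    """
    gt = sample_field.split(":")[0]            # GT is the first subfield when present
    indices = gt.replace("|", "/").split("/")  # accept phased and unphased separators
    if any(i == "." for i in indices):
        return None                            # "." means no call, not "same as reference"
    alleles = [ref] + alts                     # index 0 is REF, 1..n are the ALT alleles
    return [alleles[int(i)] for i in indices]


print(decode_genotype("1/1:255,255,0", "G", ["A"]))  # homozygous ALT
print(decode_genotype("0/0", "G", ["A"]))            # homozygous REF
print(decode_genotype(".", "G", ["A"]))              # missing genotype
```

With this reading, only "0/0" yields the reference nucleotide; a bare "." carries no genotype information for that sample at that site.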
