Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Problem using the provided GFF3 annotations #31

Open
mvelinder opened this issue Aug 31, 2021 · 11 comments
Open

Problem using the provided GFF3 annotations #31

mvelinder opened this issue Aug 31, 2021 · 11 comments

Comments

@mvelinder
Copy link

mvelinder commented Aug 31, 2021

I'm having trouble using the provided GFF3 annotations from https://s3-us-west-2.amazonaws.com/human-pangenomics/T2T/CHM13/assemblies/annotation/chm13.draft_v1.1.gene_annotation.v4.gff3.gz

Here's what I'm doing:
bcftools csq $VCF -f chm13.draft_v1.1.fasta -g chm13.draft_v1.1.gene_annotation.v4.gff3.gz -O z -o $VCF.csq.vcf.gz

And here's my ouput:

Parsing chm13.draft_v1.1.gene_annotation.v4.gff3.gz ...
ignored: chr1	CAT	gene	14253	21099	.	+	.	source_gene_common_name=AC114498.1;source_gene=ENSG00000235146.2;gene_biotype=lncRNA;gene_alternate_contigs=chr6:172104635-172111468;gene_id=CHM13_G0000001;gene_name=AC114498.1;transcript_modes=transMap;ID=CHM13_G0000001;Name=AC114498.1;source_transcript=N/A;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;paralogy=N/A;unfiltered_paralogy=N/A;alignment_id=N/A;frameshift=N/A;exon_anotation_support=N/A;intron_annotation_support=N/A;transcript_class=N/A;valid_start=N/A;valid_stop=N/A;proper_orf=N/A;extra_paralog=False
ignored: chr1	CAT	transcript	14253	21099	8940	+	.	source_transcript=ENST00000423796.1;source_transcript_name=AC114498.1-201;source_gene=ENSG00000235146.2;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000423796.1-1;frameshift=nan;exon_annotation_support=1,1;intron_annotation_support=1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;transcript_support_level=5;tag=not_best_in_genome_evidence,basic;havana_gene=OTTHUMG00000002329.1;havana_transcript=OTTHUMT00000006707.1;paralogy=nan;unfiltered_paralogy=ENST00000423796.1-2;gene_alternate_contigs=chr6:172104635-172111468;source_gene_common_name=AC114498.1;transcript_id=CHM13_T0000001;gene_id=CHM13_G0000001;Parent=CHM13_G0000001;transcript_name=AC114498.1-201;ID=CHM13_T0000001;Name=AC114498.1;gene_name=CHM13_G0000001;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;extra_paralog=False
[csq.c:715 gff_id_parse] Could not parse the line, "Parent=transcript:" not present: chr1	CAT	exon	14253	14325	.	+	.	source_transcript=ENST00000423796.1;source_transcript_name=AC114498.1-201;source_gene=ENSG00000235146.2;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000423796.1-1;exon_annotation_support=1,1;intron_annotation_support=1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;transcript_support_level=5;tag=not_best_in_genome_evidence,basic;havana_gene=OTTHUMG00000002329.1;havana_transcript=OTTHUMT00000006707.1;paralogy=nan;unfiltered_paralogy=ENST00000423796.1-2;gene_alternate_contigs=chr6:172104635-172111468;source_gene_common_name=AC114498.1;transcript_id=CHM13_T0000001;gene_id=CHM13_G0000001;Parent=CHM13_T0000001;transcript_name=AC114498.1-201;ID=exon:CHM13_T0000001:0;Name=AC114498.1;rna_support=N/A;reference_support=True;gene_name=CHM13_G0000001;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False

Any guidance would be much appreciated! Thanks!

@diekhans
Copy link

diekhans commented Sep 3, 2021

This appears to be a bug in bcftools. The record it is complaining about has Parent=CHM13_T0000001 and there is a transcript record ID=CHM13_T0000001

@brentp
Copy link

brentp commented Sep 6, 2021

I have adjusted the gff3 with the following horrible code (to include "transcript:", "gene:", prefixes for Parent and ID:

zcat chm13.draft_v1.1.gene_annotation.v4.gff3.gz \
    | gawk 'BEGIN{IGNORECASE=0} $0 ~/^#/ || ($3 != "gene" && $3 != "transcript" ) { $0=gensub(/Parent=(.+?_T[^;]+)/, "Parent=transcript:\\1", "g", $0); print; next} $0 !~/^#/ { gsub(/ID=/, "ID="$3":"); $0=gensub(/Parent=(.+?_G[^;]+)/, "Parent=gene:\\1", "g", $0);  print }' \
    | bgzip -c > chm13.draft_v1.1.gene_annotation.v4.fix.gff3.gz

but then I still get:

Error: GFF3 assumption failed for transcript CHM13_T0000003, CDS=111940: phase!=len%3 (phase=2, len=379). Use the --force option to proceed anyway (at your own risk).

from bcftools csq. Then ensembl gffs have an ensembl_phase=\d.

Is it safe to use --force? What did you use to annotate VCFs called against this reference?
thanks.

@mvelinder
Copy link
Author

Using @brentp 's fix I get the following. Looks like it's just ignoring everything? I get no file written to disk either

bcftools csq -f chm13.draft_v1.1.fasta -g chm13.draft_v1.1.gene_annotation.v4.fix.gff3.gz $VCF -O z -o $VCF.csq.gz

Example tail of the output:

ignored: chr21	CAT	intron	3130264	3130424	.	+	.	source_transcript=ENST00000651813.1;source_transcript_name=FP236383.3-202;source_gene=ENSG00000280441.3;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000651813.1-8;exon_annotation_support=1,1,1,1,1;intron_annotation_support=1,1,1,1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;tag=basic;havana_gene=OTTHUMG00000189419.3;havana_transcript=OTTHUMT00000504484.1;paralogy=nan;unfiltered_paralogy=ENST00000651813.1-5;gene_alternate_contigs=chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024,chr15:2501991-2530372,chr21:6313707-6318384,chr22:5713991-5718656;possible_split_gene_locations=chr21:6313707-6318384,chr22:5713991-5718656,chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024;source_gene_common_name=FP236383.3;transcript_id=CHM13_T0140977;gene_id=CHM13_G0035621;Parent=transcript:CHM13_T0140977;transcript_name=FP236383.3-202;ID=intron:CHM13_T0140977:1;Name=FP236383.3;rna_support=N/A;reference_support=True;gene_name=CHM13_G0035621;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False;nucmer_liftover=complete;nucmer_liftover_ed=11
ignored: chr21	CAT	intron	3130517	3130744	.	+	.	source_transcript=ENST00000651813.1;source_transcript_name=FP236383.3-202;source_gene=ENSG00000280441.3;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000651813.1-8;exon_annotation_support=1,1,1,1,1;intron_annotation_support=1,1,1,1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;tag=basic;havana_gene=OTTHUMG00000189419.3;havana_transcript=OTTHUMT00000504484.1;paralogy=nan;unfiltered_paralogy=ENST00000651813.1-5;gene_alternate_contigs=chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024,chr15:2501991-2530372,chr21:6313707-6318384,chr22:5713991-5718656;possible_split_gene_locations=chr21:6313707-6318384,chr22:5713991-5718656,chr13:9846569-9851263,chr13:9891327-9896019,chr13:9936209-9940881,chr13:9985468-9990129,chr14:2095590-2099024;source_gene_common_name=FP236383.3;transcript_id=CHM13_T0140977;gene_id=CHM13_G0035621;Parent=transcript:CHM13_T0140977;transcript_name=FP236383.3-202;ID=intron:CHM13_T0140977:2;Name=FP236383.3;rna_support=N/A;reference_support=True;gene_name=CHM13_G0035621;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False;nucmer_liftover=3130750;nucmer_liftover_ed=3130523
ignored: chr22	CAT	intron	5731187	5749914	.	-	.	source_transcript=ENST00000623236.1;source_transcript_name=CR392039.4-201;source_gene=ENSG00000279501.1;transcript_modes=transMap;gene_biotype=lncRNA;transcript_biotype=lncRNA;alignment_id=ENST00000623236.1-1;exon_annotation_support=1,1;intron_annotation_support=1;transcript_class=ortholog;valid_start=True;valid_stop=True;adj_start=nan;adj_stop=nan;proper_orf=True;level=2;transcript_support_level=5;tag=basic;havana_gene=OTTHUMG00000189478.1;havana_transcript=OTTHUMT00000479772.1;paralogy=nan;unfiltered_paralogy=ENST00000623236.1-0,ENST00000623236.1-2,ENST00000623236.1-3,ENST00000623236.1-4;source_gene_common_name=CR392039.4;transcript_id=CHM13_T0143797;gene_id=CHM13_G0036414;Parent=transcript:CHM13_T0143797;transcript_name=CR392039.4-201;ID=intron:CHM13_T0143797:0;Name=CR392039.4;rna_support=N/A;reference_support=True;gene_name=CHM13_G0036414;alternative_source_transcripts=N/A;collapsed_gene_ids=N/A;collapsed_gene_names=N/A;frameshift=N/A;extra_paralog=False;nucmer_liftover=complete;nucmer_liftover_ed=3

@brentp
Copy link

brentp commented Sep 8, 2021

@mvelinder I saw that as well, but it's ignoring all introns, I think that's OK since those can be inferred.

It seems the phase error is the one that remains.

@mvelinder
Copy link
Author

@diekhans To Brent's question, if this is a bcftools problem, what is the recommended way to annotate variant calls on CHM13?

Thanks for the guidance!

@dKlee99
Copy link

dKlee99 commented Sep 17, 2021

Is there any updates on this issue?

Thank you very much!
DK

@brentp
Copy link

brentp commented Sep 23, 2021

Hi @diekhans @mschatz can you provide some guidance here?

@diekhans
Copy link

bcftools recommendation is to modify:

https://github.com/samtools/bcftools/blob/develop/misc/gff2gff.py

to support other GFF3 files. Converting the biotypes should be straightforward, as mostly the GENCODE/Ensembl ones are used. For the additional ones added by CAT, they can be mapped to Ensembl ones

@diekhans
Copy link

@diekhans To Brent's question, if this is a bcftools problem, what is the recommended way to annotate variant calls on CHM13?

I am not familiar with the current variant calling tools, so I can't make a recommendation. However, it seems that modifying gff2gff is the recommended way to use bcftools.

@mvelinder
Copy link
Author

I got this to work using VEP instead of bcftools csq
vep -gff /path/to/chm13.gff.gz
https://uswest.ensembl.org/info/docs/tools/vep/script/vep_cache.html#gff

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants
@brentp @diekhans @mvelinder @dKlee99 and others