Feature requests for VCF2DB compatible with GEMINI built-in tools #36

Open
Phillip-a-richmond opened this issue Nov 21, 2017 · 8 comments

Comments

@Phillip-a-richmond

Hello,
I would like to request that the tables used by GEMINI's built-in analysis tools be added into VCF2DB.

Ideally, all tables that are loaded by default with the command:

$ gemini load

and that are not inherently third-party annotations would be added to the resulting database.

Examples I have run into so far:
depths
sample_genotype_counts

More complex features that would be ideal for our pipeline, though we may need to fall back on standard GEMINI load to use them:
gene_summary
pathways and gene detailed analyses

Thanks,
Phil

brentp added a commit that referenced this issue Dec 11, 2017
@brentp
Member

brentp commented Dec 11, 2017

I have added the sample_genotype_counts table. I am not sure what you mean by depths table. I don't intend to add the gene tables to vcf2db, but I could be convinced to change my mind given a reasonable use-case.

@Phillip-a-richmond
Author

Thanks Brent, I'll pull this version and test the GEMINI built-in functions.

Essentially, I am trying to build a fully functional single variant database for identifying pathogenic variants underlying rare Mendelian genetic diseases (and the Quinlan lab tools are excellent for this). The problem with using only GEMINI load is the limited flexibility in which annotations are available (e.g. TrAP, FATHMM-XF, in-house variant databases, monthly updated ClinVar VCFs). Annotating after the fact with GEMINI annotate is prohibitively slow for large annotation databases (especially genome-wide databases) across WGS variant datasets.

VCFAnno+VCF2DB provides flexibility in this respect, and it's fast. However, the lack of some tables that GEMINI loads by default, like the one you fixed above, causes the GEMINI built-in functions to fail. The ideal workflow would go from VCF-->DB and then use GEMINI to query the DB for inheritance patterns, runs of homozygosity, variants within specific gene sets, and harmony between noncoding and coding variants.

Thanks for your help on this,
Phil

@brentp
Member

brentp commented Jan 18, 2018

could you enumerate what is missing for you so I can prioritize?

@naumenko-sa
Contributor

Hi Brent!

The differences between gemini load and vcf2db (loaded in bcbio 1.0.7 with vcfanno: [gemini] and by default), written as gemini load name = vcf2db name:
variants table:
pfam_domain = domains?
aaf_gnomad_all = gnomad_af
gnomad_num_het = absent, possible to add?
gnomad_num_hom = absent, possible to add?
cadd_scaled = absent, possible to add?
vep_hgvsc = hgvsc
vep_hgvsp = hgvsp
aaf_esp_aa = af_esp_aa
aaf_esp_ea = af_esp_ea
aaf_esp_all = af_esp_all
is_conserved = absent, possible to add?

variant_impacts table:
vep_canonical = canonical
vep_ccds = ccds
vep_hgvsc = hgvsc
vep_hgvsp = hgvsp
vep_maxentscan_diff = maxentscan_diff
vep_maxentscan_alt = maxentscan_alt
vep_maxentscan_ref = maxentscan_ref
vep_spliceregion = spliceregion
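
For downstream scripts that expect the gemini load names, the correspondences listed above could be captured in a lookup table and applied as SQL aliases. This is only a sketch: the dictionary copies pairs from the variants list (entries marked absent or uncertain are left out), and `table_columns` and `aliased_select` are hypothetical helpers, not part of GEMINI or vcf2db.

```python
import sqlite3

# gemini load column name -> vcf2db column name, copied from the list above.
# Columns reported as absent (or uncertain, like pfam_domain) are omitted.
VARIANTS_RENAMES = {
    "aaf_gnomad_all": "gnomad_af",
    "vep_hgvsc": "hgvsc",
    "vep_hgvsp": "hgvsp",
    "aaf_esp_aa": "af_esp_aa",
    "aaf_esp_ea": "af_esp_ea",
    "aaf_esp_all": "af_esp_all",
}


def table_columns(conn, table):
    """Return the set of column names in `table`."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    return {row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)}


def aliased_select(conn, wanted, table="variants", renames=VARIANTS_RENAMES):
    """Build a SELECT that exposes gemini load names for either kind of db."""
    present = table_columns(conn, table)
    parts = []
    for name in wanted:
        if name in present:
            # The database already uses the gemini load name.
            parts.append(name)
        elif renames.get(name) in present:
            # Fall back to the vcf2db name and alias it.
            parts.append("%s AS %s" % (renames[name], name))
        else:
            raise KeyError("column %r not available in %s" % (name, table))
    return "SELECT %s FROM %s" % (", ".join(parts), table)
```

A query built this way returns the same column labels regardless of which loader produced the database, as long as the underlying annotation is present under one of the two names.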

Is there a way for downstream scripts to determine which tool created gemini.db (gemini load or vcf2db), so that they can apply different processing logic?
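
One heuristic, sketched below, is to guess the creator from which of the column names listed above appear in the variants table. This assumes the names in the list hold for both loaders; the function name and the choice of marker columns are assumptions made for illustration, and the result depends on how the annotation was configured.

```python
import sqlite3


def guess_db_creator(conn):
    """Heuristically guess which tool built a GEMINI-style database.

    Assumption: a gemini load database has `aaf_gnomad_all` in the variants
    table, while a vcf2db database built with the setup described above has
    `gnomad_af` instead. Returns "gemini load", "vcf2db", or "unknown".
    """
    # PRAGMA table_info yields one row per column; index 1 is the name.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(variants)")}
    if "aaf_gnomad_all" in columns:
        return "gemini load"
    if "gnomad_af" in columns:
        return "vcf2db"
    return "unknown"
```

A script could call `guess_db_creator(sqlite3.connect("gemini.db"))` once and branch on the result.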

Is it possible to add gnomad_num_hemi?
https://groups.google.com/forum/#!topic/gemini-variation/knRmriYXDW4

Thanks!
Sergey

@brentp
Member

brentp commented Jan 29, 2018

but these are things that you have control over, correct?
in most cases, vcf2db.py just pulls what's present in the INFO field. You can change the vcfanno conf if you want different names. Am I missing something?

@naumenko-sa
Contributor

Thanks Brent! Yes, you are right, it is not an issue with vcfanno/vcf2db; it comes from how bcbio wraps the annotation. SN

@Phillip-a-richmond
Author

Pulled on January 22nd 2018.

Tested and confirmed to work:

  • gemini autosomal_dominant
  • gemini autosomal_recessive
  • gemini comp_hets
  • gemini de_novo
  • gemini db_info
  • gemini query
  • gemini x_linked_de_novo
  • gemini x_linked_dominant
  • gemini x_linked_recessive
  • gemini burden
  • gemini region
  • gemini stats
  • gemini lof_sieve
  • gemini mendel_errors

Tested and failed:

gemini roh

Details:
"Depths" table, as referenced from gemini roh
Example:

$ gemini roh T008.db
LOG: Querying and ordering variants by chromosomal position.
SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants WHERE type = 'snp' AND filter is NULL AND depth >= 20 ORDER BY chrom, end"]

SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants WHERE type = 'snp' AND filter is NULL AND depth >= 20 ORDER BY chrom, end"]
Traceback (most recent call last):
File "/opt/tools/gemini/bin/gemini", line 7, in <module>
gemini_main.main()
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
args.func(parser, args)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1136, in homozygosity_runs_fn
run(parser, args)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 215, in run
get_homozygosity_runs(args)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 162, in get_homozygosity_runs
gq.run(query, needs_genotypes=True)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 653, in run
self.result_proxy = res = iter(self._apply_query())
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 906, in _apply_query
res = self._execute_query()
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 883, in _execute_query
raise ValueError("The query issued (%s) has a syntax error." % self.query)
ValueError: The query issued (select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants WHERE type = 'snp' AND filter is NULL AND depth >= 20 ORDER BY chrom, end) has a syntax error.

gemini pathways

$ gemini pathways --lof -v 71 T008.db
chrom start end ref alt impact sample genotype gene transcript pathway
Traceback (most recent call last):
File "/opt/tools/gemini/bin/gemini", line 7, in <module>
gemini_main.main()
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
args.func(parser, args)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 768, in pathway_fn
tool_pathways.pathways(parser, args)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_pathways.py", line 155, in pathways
get_ind_lof_pathways(conn, metadata, args)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_pathways.py", line 143, in get_ind_lof_pathways
_report_variant_pathways(res, args, idx_to_sample)
File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_pathways.py", line 103, in _report_variant_pathways
pathlist]))
TypeError: sequence item 5: expected string or Unicode, NoneType found
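
The final frame suggests that a None value reached a string join. The code in tool_pathways.py is not shown here, so the following is only an illustration of that failure mode and of one possible guard; `join_fields`, the example row, and the "None" placeholder are hypothetical.

```python
def join_fields(fields, sep="\t", missing="None"):
    """Join values into one line, substituting `missing` for None entries."""
    return sep.join(missing if f is None else str(f) for f in fields)


# Example row with a None at index 5, mirroring "sequence item 5" above.
row = ["chr1", "100", "101", "A", "G", None, "sample1"]

try:
    "\t".join(row)  # str.join accepts only strings, so the None is rejected
except TypeError as err:
    print("plain join failed:", err)

print(join_fields(row))
```

Whether a placeholder is the right behaviour, or whether the missing value points to an annotation that vcf2db does not populate, is a separate question.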

For our application, the priority would be fixing gemini roh. The pathways-based analysis is a very low priority for us at this time.

Thanks,
Phil

@robinvanderlee

Hi,

Following up on @Phillip-a-richmond's last comment, I tried running gemini roh on a gemini database produced with vcf2db.

I am getting the same errors, indicating that the depth column is missing:

$ gemini roh gemini_db_produced_by_vcf2db.db
LOG: Querying and ordering variants by chromosomal position.
SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants               WHERE type = 'snp'               AND   filter is NULL               AND   depth >= 20 ORDER BY chrom,  end"]
SQL error: (sqlite3.OperationalError) no such column: depth [SQL: u"select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants               WHERE type = 'snp'               AND   filter is NULL               AND   depth >= 20 ORDER BY chrom,  end"]
Traceback (most recent call last):
  File "/opt/tools/gemini/bin/gemini", line 7, in <module>
    gemini_main.main()
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1244, in main
    args.func(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/gemini_main.py", line 1136, in homozygosity_runs_fn
    run(parser, args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 215, in run
    get_homozygosity_runs(args)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/tool_homozygosity_runs.py", line 162, in get_homozygosity_runs
    gq.run(query, needs_genotypes=True)
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 653, in run
    self.result_proxy = res = iter(self._apply_query())
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 906, in _apply_query
    res = self._execute_query()
  File "/opt/tools/gemini/thirdparty/anaconda/lib/python2.7/site-packages/gemini/GeminiQuery.py", line 883, in _execute_query
    raise ValueError("The query issued (%s) has a syntax error." % self.query)
ValueError: The query issued (select chrom, start, end,gts,gt_types,gt_phases,gt_depths,gt_ref_depths,gt_alt_depths,gt_quals,gt_alt_freqs FROM variants               WHERE type = 'snp'               AND   filter is NULL               AND   depth >= 20 ORDER BY chrom,  end) has a syntax error.

I think some of the previous confusion stemmed from calling depth a table, whereas it seems to be a column.
Would it be possible to add the depth column to the list of annotations that vcf2db builds into the gemini db?
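
To confirm that depth is a missing column rather than a missing table, the schema can be inspected directly with Python's sqlite3 module. This is a minimal sketch; `missing_columns` is a hypothetical helper, and the column list is taken from the failing query above.

```python
import sqlite3

# Columns named in the SELECT and WHERE clauses of the failing roh query.
ROH_COLUMNS = [
    "chrom", "start", "end", "gts", "gt_types", "gt_phases", "gt_depths",
    "gt_ref_depths", "gt_alt_depths", "gt_quals", "gt_alt_freqs",
    "type", "filter", "depth",
]


def missing_columns(conn, table="variants", required=ROH_COLUMNS):
    """Return the required column names that `table` does not have."""
    # PRAGMA table_info yields one row per column; index 1 is the name.
    present = {row[1] for row in conn.execute("PRAGMA table_info(%s)" % table)}
    return [name for name in required if name not in present]
```

Calling `missing_columns(sqlite3.connect("gemini_db_produced_by_vcf2db.db"))` would list depth if that is the only column the query cannot find.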

Thanks for all the hard work on these tools!
Robin
