GO associations not shown #398

Closed
kevinschaper opened this issue Oct 12, 2023 · 12 comments
Comments

@kevinschaper
Member

GO BP and MF are shown in figure 3 as ingest sources, but I did not find them (I only found component, under "anatomy"). Are there plans to include these two aspects?

@kevinschaper
Member Author

Bringing forward behavior from V2, we're configured to show only Gene -> MF associations by limiting to the biolink category biolink:MacromolecularMachineToMolecularActivityAssociation here

It turns out that we're missing that category completely:
select distinct category, count(*) from denormalized_edges where object_namespace = 'GO' group by 1

| category | count(*) |
| --- | --- |
| biolink:Association | 14202 |
| biolink:GeneToExpressionSiteAssociation | 40264 |
| biolink:MacromolecularMachineToBiologicalProcessAssociation | 63894 |
| biolink:MacromolecularMachineToCellularComponentAssociation | 739303 |
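
For anyone who wants to repeat this kind of check locally, here is a minimal sketch. It assumes a SQLite table shaped like denormalized_edges with only the two columns the query touches; the rows are invented for illustration and the helper name `category_counts` is hypothetical, not something from the ingest code.

```python
import sqlite3

MF_CATEGORY = "biolink:MacromolecularMachineToMolecularActivityAssociation"


def category_counts(conn, namespace):
    """Return {category: edge count} for edges whose object is in `namespace`."""
    rows = conn.execute(
        "SELECT category, count(*) FROM denormalized_edges "
        "WHERE object_namespace = ? GROUP BY category",
        (namespace,),
    ).fetchall()
    return dict(rows)


# Toy data (invented): no MF association rows at all, mirroring the symptom.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE denormalized_edges (category TEXT, object_namespace TEXT)")
conn.executemany(
    "INSERT INTO denormalized_edges VALUES (?, ?)",
    [
        ("biolink:MacromolecularMachineToBiologicalProcessAssociation", "GO"),
        ("biolink:MacromolecularMachineToCellularComponentAssociation", "GO"),
        ("biolink:MacromolecularMachineToCellularComponentAssociation", "GO"),
        ("biolink:GeneToPhenotypicFeatureAssociation", "HP"),
    ],
)

counts = category_counts(conn, "GO")
mf_missing = MF_CATEGORY not in counts  # True when the UI filter would match nothing
```

Checking for the expected category explicitly, rather than eyeballing the grouped output, makes the absence easy to turn into a QC assertion.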

@kevinschaper
Member Author

They do show up in dangling edges.

select distinct category, count(*) from dangling_edges where object like 'GO:%' group by 1

| category | count(*) |
| --- | --- |
| biolink:Association | 103134 |
| biolink:GeneToExpressionSiteAssociation | 60 |
| biolink:MacromolecularMachineToBiologicalProcessAssociation | 1082703 |
| biolink:MacromolecularMachineToCellularComponentAssociation | 94066 |
| biolink:MacromolecularMachineToMolecularActivityAssociation | 935832 |

@kevinschaper
Member Author

Looking at which side of the association is actually missing from the nodes table, it is the GO side in most cases:

select 'subject' as field,
       substr(subject, 1, instr(subject,':') -1) as prefix,
       count(*)
from dangling_edges
where category = 'biolink:MacromolecularMachineToMolecularActivityAssociation'
  and subject not in (select id from nodes)
group by 1,2
union
select 'object' as field,
       substr(object, 1, instr(object,':') -1) as prefix,
       count(*)
from dangling_edges
where category = 'biolink:MacromolecularMachineToMolecularActivityAssociation'
  and object not in (select id from nodes)
group by 1,2

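| field | prefix | count(*) |
| --- | --- | --- |
| object | GO | 935827 |
| subject | AspGD | 6740 |
| subject | FB | 29 |
| subject | MGI | 10 |
| subject | NCBIGene | 68931 |
| subject | PR | 626 |
| subject | PomBase | 291 |
| subject | RGD | 273 |
| subject | RefSeq | 1 |
| subject | SGD | 514 |
| subject | UniProtKB | 11272 |
| subject | WB | 1 |
| subject | ZFIN | 51 |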
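
The same breakdown can be tried on a toy database. This is a sketch under stated assumptions: SQLite tables named nodes and dangling_edges with only the columns the query uses, invented rows, and a hypothetical helper `missing_side_counts`. It swaps `NOT IN` for `NOT EXISTS`, which is my substitution and not what the query above does.

```python
import sqlite3

MF = "biolink:MacromolecularMachineToMolecularActivityAssociation"


def missing_side_counts(conn, category):
    """Count dangling edges of `category` by missing end and CURIE prefix."""
    sql = """
        SELECT 'subject' AS field,
               substr(subject, 1, instr(subject, ':') - 1) AS prefix,
               count(*) AS n
        FROM dangling_edges d
        WHERE category = :cat
          AND NOT EXISTS (SELECT 1 FROM nodes WHERE nodes.id = d.subject)
        GROUP BY prefix
        UNION ALL
        SELECT 'object' AS field,
               substr(object, 1, instr(object, ':') - 1) AS prefix,
               count(*) AS n
        FROM dangling_edges d
        WHERE category = :cat
          AND NOT EXISTS (SELECT 1 FROM nodes WHERE nodes.id = d.object)
        GROUP BY prefix
        ORDER BY 1, 2
    """
    return conn.execute(sql, {"cat": category}).fetchall()


# Toy data (invented): one known gene, one known GO term, MF terms absent.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE nodes (id TEXT)")
conn.execute("CREATE TABLE dangling_edges (subject TEXT, object TEXT, category TEXT)")
conn.executemany("INSERT INTO nodes VALUES (?)", [("NCBIGene:1",), ("GO:0005575",)])
conn.executemany(
    "INSERT INTO dangling_edges VALUES (?, ?, ?)",
    [
        ("NCBIGene:1", "GO:0003674", MF),
        ("NCBIGene:1", "GO:0003824", MF),
        ("UniProtKB:P1", "GO:0003674", MF),
        ("NCBIGene:1", "GO:0008150", "biolink:MacromolecularMachineToBiologicalProcessAssociation"),
    ],
)

result = missing_side_counts(conn, MF)
```

`NOT EXISTS` is used because `NOT IN` matches no rows at all if the subquery ever returns a NULL id; with a clean nodes table the two forms give the same counts.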

@kevinschaper
Member Author

@caufieldjh Is it possible that phenio is missing all GO MF terms?

@caufieldjh
Member

It's certainly not what I'd expect. Will take a look.

@kevinschaper
Member Author

kevinschaper commented Oct 12, 2023

The counts I have in the kg do look a bit low:

select category, count(*) from nodes where id like 'GO:%' group by 1

| category | count(*) |
| --- | --- |
| biolink:BiologicalProcess | 2989 |
| biolink:Cell | 1 |
| biolink:CellularComponent | 2359 |
| biolink:MacromolecularComplex | 2126 |
| biolink:MolecularActivity | 1280 |
| biolink:NamedThing | 3646 |
| biolink:Pathway | 677 |

I'll check further back to see if I have a filtering mishap.

@kevinschaper
Member Author

Maybe a filtering mishap:

cut -f 1,2 merged-kg_nodes.tsv | grep ^GO | cut -f 2 | sort | uniq -c | sort -rn
38221 biolink:Occurrent
3648 biolink:NamedThing
2862 biolink:BiologicalProcess
2356 biolink:CellularComponent
2128 biolink:MacromolecularComplex
1274 biolink:MolecularActivity
 716 biolink:Pathway
   2 biolink:related_to
   1 biolink:superclass_of
   1 biolink:subclass_of
   1 biolink:located_in
   1 biolink:causes
   1 biolink:affects
   1 biolink:Cell
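
A rough Python equivalent of the shell pipeline above, as a sketch only: it assumes the id is in the first column and the category in the second (as the `cut -f 1,2` implies), and the function name `go_category_counts` and the sample rows are made up. It matches on `GO:` rather than `^GO`, so ids with some other prefix that merely begins with "GO" are not counted.

```python
import csv
import io
from collections import Counter


def go_category_counts(tsv_file):
    """Tally column 2 (category) for rows whose column 1 (id) starts with 'GO:'."""
    # QUOTE_NONE treats quote characters literally, the way `cut` would.
    reader = csv.reader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    counts = Counter()
    for row in reader:
        if len(row) >= 2 and row[0].startswith("GO:"):
            counts[row[1]] += 1
    return counts.most_common()  # highest count first, like `sort -rn`


# Invented sample standing in for merged-kg_nodes.tsv.
sample = io.StringIO(
    "id\tcategory\tname\n"
    "GO:0000001\tbiolink:Occurrent\tx\n"
    "GO:0000002\tbiolink:Occurrent\ty\n"
    "GO:0000003\tbiolink:MolecularActivity\tz\n"
    "HP:0000001\tbiolink:PhenotypicFeature\tw\n"
)
tally = go_category_counts(sample)
```

For the real file, pass an open handle, for example `open("merged-kg_nodes.tsv", newline="")`.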

@kevinschaper
Member Author

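I think biolink:Occurrent is a mixin rather than a regular class, so the category filter treats it as invalid. That would explain the extreme filtering: 38221 GO nodes in the merged file carry that category.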
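
To illustrate the suspected mechanism, here is a minimal sketch of a category allow-list filter. It is not the ingest's actual filtering code; the function name `split_by_valid_category`, the allow-list contents, and the node ids are all hypothetical.

```python
def split_by_valid_category(nodes, valid_categories):
    """Partition (id, category) pairs into kept and excluded lists."""
    kept, excluded = [], []
    for node_id, category in nodes:
        target = kept if category in valid_categories else excluded
        target.append((node_id, category))
    return kept, excluded


# Hypothetical allow-list; biolink:Occurrent is deliberately absent from it.
VALID = {
    "biolink:BiologicalProcess",
    "biolink:CellularComponent",
    "biolink:MolecularActivity",
}

# Invented nodes: two GO terms were assigned only the Occurrent category.
nodes = [
    ("GO:0000001", "biolink:MolecularActivity"),
    ("GO:0000002", "biolink:Occurrent"),
    ("GO:0000003", "biolink:Occurrent"),
]
kept, excluded = split_by_valid_category(nodes, VALID)
```

Under this kind of filter, every node whose only category falls outside the allow-list disappears, and any edge pointing at it then dangles, which is consistent with the counts earlier in the thread.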

@kevinschaper
Member Author

The Jenkins output that I produce while filtering is maybe not the most helpful, but it does dump the full set of invalid categories.

2023-09-27_22:37:13 | ERROR    | monarch_ingest.cli_utils | Invalid node categories: {'biolink:actively_involved_in', 'biolink:has_input', 'biolink:has_sequence_location', 'biolink:has_decreased_amount', 'biolink:capable_of', 'biolink:is_input_of', 'biolink:has_variant_part', 'biolink:biological_role_mixin', 'biolink:gene_product_of', 'biolink:SequenceFeature', 'biolink:GeneProductMixin', 'biolink:author', 'biolink:superclass_of', 'biolink:has_molecular_consequence', 'biolink:expressed_in', 'biolink:transcribed_from', 'biolink:has_output', 'biolink:reaction_direction', 'biolink:is_missense_variant_of', 'biolink:has_unit', 'biolink:version_of', 'biolink:has_part', 'biolink:distribution_download_url', 'biolink:causes', 'biolink:acts_upstream_of_or_within_positive_effect', 'biolink:manifestation_of', 'biolink:regulates', 'biolink:has_plasma_membrane_part', 'biolink:derives_from', 'biolink:has_active_ingredient', 'biolink:disease_has_basis_in', 'biolink:develops_from', 'biolink:PairwiseGeneToGeneInteraction', 'biolink:ChemicalGeneInteractionAssociation', 'biolink:xenologous_to', 'biolink:Occurrent', 'biolink:produced_by', 'biolink:has_topic', 'biolink:related_condition', 'biolink:disrupts', 'biolink:transcribed_to', 'biolink:is_output_of', 'biolink:has_stressor', 'biolink:disease_has_location', 'biolink:ExposureEvent', 'biolink:treated_by', 'biolink:has_attribute', 'biolink:is_active_ingredient_of', 'biolink:location_of', 'biolink:has_phenotype', 'biolink:enables', 'biolink:acts_upstream_of_negative_effect', 'biolink:exact_synonym', 'biolink:is_synonymous_variant_of', 'biolink:directly_physically_interacts_with', 'biolink:quantifier_qualifier', 'biolink:condition_associated_with_gene', 'biolink:has_increased_amount', 'biolink:ChemicalSubstance', 'biolink:is_splice_site_variant_of', 'biolink:rights', 'biolink:narrow_match', 'biolink:license', 'biolink:produces', 'biolink:correlated_with', 'biolink:part_of', 
'biolink:temporally_related_to', 'biolink:chi_squared_statistic', 'biolink:genetically_interacts_with', 'biolink:affects', 'biolink:paralogous_to', 'biolink:publisher', 'biolink:associated_with', 'biolink:opposite_of', 'biolink:in_taxon', 'biolink:ChemicalToPathwayAssociation', 'biolink:mechanism_of_action', 'biolink:GeneToDiseaseOrPhenotypicFeatureAssociation', 'biolink:orthologous_to', 'biolink:participates_in', 'biolink:occurs_in', 'biolink:has_evidence', 'biolink:format', 'biolink:drug_regulatory_status_world_wide', 'biolink:treats', 'biolink:has_route', 'biolink:caused_by', 'biolink:mentions', 'biolink:gene_associated_with_condition', 'biolink:exacerbates', 'biolink:contraindicated_for', 'biolink:has_quantitative_value', 'biolink:retrieved_on', 'biolink:contributes_to', 'biolink:contributor', 'biolink:acts_upstream_of_positive_effect', 'biolink:interacting_molecules_category', 'biolink:biomarker_for', 'biolink:prevents', 'biolink:enabled_by', 'biolink:coexists_with', 'biolink:located_in', 'biolink:ameliorates', 'biolink:GenomicSequenceLocalization', 'biolink:narrow_synonym', 'biolink:has_completed', 'biolink:has_not_completed', 'biolink:chemically_similar_to', 'biolink:preceded_by', 'biolink:similar_to', 'biolink:overlaps', 'biolink:has_participant', 'biolink:subclass_of', 'biolink:chemical_role_mixin', 'biolink:related_synonym', 'biolink:lacks_part', 'biolink:precedes', 'biolink:broad_synonym', 'biolink:has_member', 'biolink:acts_upstream_of_or_within_negative_effect', 'biolink:model_of', 'biolink:summary', 'biolink:is_metabolite_of', 'biolink:has_gene_product', 'biolink:Association', 'biolink:homologous_to', 'biolink:related_to', 'biolink:created_with', 'biolink:colocalizes_with', 'biolink:active_in', 'biolink:GenomicEntity', 'biolink:expresses', 'biolink:same_as', 'biolink:is_frameshift_variant_of', 'biolink:broad_match', 'biolink:creation_date', 'biolink:derives_into', 'biolink:physically_interacts_with', 'biolink:close_match', 'biolink:has_receptor', 
'biolink:has_side_effect', 'biolink:synonym', 'biolink:p_value', 'biolink:PathologicalEntityMixin', 'biolink:mesh_terms', 'biolink:in_linkage_disequilibrium_with', 'biolink:has_biomarker', 'biolink:ChemicalToDiseaseOrPhenotypicFeatureAssociation'}

I had totally forgotten, but I'm also exporting the filtered out nodes: https://data.monarchinitiative.org/monarch-kg-dev/2023-09-28/qc/excluded_phenio_nodes.tsv

@kevinschaper
Member Author

GO Gene->MF associations are showing again now

@monicacecilia
Contributor

And what about BP?

monicacecilia reopened this Oct 21, 2023

@kevinschaper
Member Author

All 3 GO associations are shown now as a result of #425
