feat(release): generate descendant mapping for tissues and cells #100

Merged: 49 commits, Mar 15, 2024
Changes from 2 commits
Commits
fbc9e71
Initial commit
Bento007 Mar 8, 2024
a6d2fcc
AUTO: update ontologies
invalid-email-address Mar 8, 2024
063474e
add tests
Bento007 Mar 9, 2024
cc1723e
Add tissue and cell descendant mappings
Bento007 Mar 11, 2024
0c3faa1
Merge remote-tracking branch 'origin/tsmith/decendent-mappings' into …
Bento007 Mar 11, 2024
ff42080
fix tests
Bento007 Mar 12, 2024
6cdcbb9
fix tests
Bento007 Mar 12, 2024
ee61205
update .gitignore
Bento007 Mar 12, 2024
78c5780
at parity
Bento007 Mar 12, 2024
b01f3f8
at parity
Bento007 Mar 12, 2024
7dbb732
Merge branch 'main' into tsmith/decendent-mappings
Bento007 Mar 12, 2024
4b9590e
fix GHA to run tests
Bento007 Mar 12, 2024
800a6be
Merge remote-tracking branch 'origin/tsmith/decendent-mappings' into …
Bento007 Mar 12, 2024
33da424
fixng tests
Bento007 Mar 12, 2024
11b303c
fix GHA
Bento007 Mar 12, 2024
2ebc67a
add GHA to generate descendant mappings
Bento007 Mar 12, 2024
0300fa2
add GHA to generate descendant mappings
Bento007 Mar 12, 2024
842b66b
add GHA to generate descendant mappings
Bento007 Mar 12, 2024
7e4224c
add GHA to generate descendant mappings
Bento007 Mar 12, 2024
4bf1f45
fix name
Bento007 Mar 13, 2024
9c9b5b4
try again
Bento007 Mar 13, 2024
59930ef
suggested changes
Bento007 Mar 13, 2024
b146258
update gha dependencies
Bento007 Mar 13, 2024
f2019b9
update gha
Bento007 Mar 13, 2024
5f2a5c4
update gha
Bento007 Mar 13, 2024
645820c
update gha
Bento007 Mar 13, 2024
0913e19
update gha
Bento007 Mar 13, 2024
faa7195
Merge branch 'main' into tsmith/decendent-mappings
Bento007 Mar 13, 2024
76dfa40
Merge branch 'main' into tsmith/decendent-mappings
Bento007 Mar 13, 2024
a33f63a
fix the schema
Bento007 Mar 13, 2024
097e246
fix ontology_generator
Bento007 Mar 13, 2024
5ea7b1d
Merge branch 'main' into tsmith/decendent-mappings
Bento007 Mar 15, 2024
2c22957
Merge branch 'main' into tsmith/decendent-mappings
Bento007 Mar 15, 2024
8a513b4
fix GHA
Bento007 Mar 15, 2024
7fe8429
add caching and compare again last
Bento007 Mar 15, 2024
ceef635
remove concurrency
Bento007 Mar 15, 2024
2117b7b
default to latest version
Bento007 Mar 15, 2024
9992a38
fix gha
Bento007 Mar 15, 2024
b5dfbf0
fix gha
Bento007 Mar 15, 2024
fb3a45f
rename to decendant_mappings_generator
Bento007 Mar 15, 2024
4ed5492
install local api version
Bento007 Mar 15, 2024
859e880
fix ontology parser
Bento007 Mar 15, 2024
f3a2ffa
fix tests and GHA
Bento007 Mar 15, 2024
4bfd5bc
Merge branch 'main' into tsmith/decendent-mappings
Bento007 Mar 15, 2024
0ea3d54
fix GHA
Bento007 Mar 15, 2024
8daa30c
Merge remote-tracking branch 'origin/tsmith/decendent-mappings' into …
Bento007 Mar 15, 2024
99d56fc
fix GHA
Bento007 Mar 15, 2024
e7eaed2
fix GHA
Bento007 Mar 15, 2024
37e8763
remove todos
Bento007 Mar 15, 2024
2 changes: 1 addition & 1 deletion api/python/src/cellxgene_ontology_guide/__init__.py
@@ -1,3 +1,3 @@
-import _version
+import cellxgene_ontology_guide._version as _version

 __version__ = _version.__version__
@@ -4,9 +4,9 @@
 import os
 from typing import Any, Dict, List, Optional

-from constants import DATA_ROOT, ONTOLOGY_FILENAME_SUFFIX, ONTOLOGY_INFO_FILENAME
 from semantic_version import Version

+from cellxgene_ontology_guide.constants import DATA_ROOT, ONTOLOGY_FILENAME_SUFFIX, ONTOLOGY_INFO_FILENAME
 from cellxgene_ontology_guide.entities import Ontology

Binary file modified ontology-assets/CL-ontology-v2024-01-04.json.gz
Binary file not shown.
Binary file modified ontology-assets/EFO-ontology-v3.62.0.json.gz
Binary file not shown.
Binary file modified ontology-assets/HANCESTRO-ontology-3.0.json.gz
Binary file not shown.
Binary file modified ontology-assets/HsapDv-ontology-11.json.gz
Binary file not shown.
Binary file modified ontology-assets/MONDO-ontology-v2024-01-03.json.gz
Binary file not shown.
Binary file modified ontology-assets/MmusDv-ontology-9.json.gz
Binary file not shown.
Binary file modified ontology-assets/NCBITaxon-ontology-v2023-06-20.json.gz
Binary file not shown.
Binary file modified ontology-assets/PATO-ontology-v2023-05-18.json.gz
Binary file not shown.
Binary file modified ontology-assets/UBERON-ontology-v2024-01-18.json.gz
Binary file not shown.
279 changes: 279 additions & 0 deletions tools/ontology-builder/src/compute_descendent_mappings.py
@@ -0,0 +1,279 @@
#!/usr/bin/env python
"""
# Descendant Mappings for Tissues and Cell Types

## Overview

The ontology-aware tissue and cell type filters in the Single Cell Data Portal each require artifacts generated
by this script.

#### Descendant Mappings
To facilitate in-filter, cross-panel restriction of filter values, a descendant hierarchy dictionary is required by
the Single Cell Data Portal frontend. For example, if a user selects `hematopoietic system` in the tissue filter's
`System` panel, the values in the tissue filter's `Organ` and `Tissue` panels must be restricted by `hematopoietic
system`.

This script generates a dictionary of descendants keyed by tissue or cell type ontology term ID. The dictionary
is stored as a JSON file and copied to the cellxgene-ontology-guide/ontology-assets directory. A versioned GitHub
release is created to simplify referencing in the Single Cell Data Portal.

The descendant mappings should be updated when:

1. The ontology version is updated,
2. A new tissue or cell type is added to the production corpus, or,
3. The hand-curated systems, organs, cell classes or cell subclasses are updated.
"""

import json
import os
from typing import Any, Dict, List
from urllib.request import urlopen

import env
from cellxgene_ontology_guide.ontology_parser import OntologyParser


def load_prod_datasets() -> Any:
Collaborator
this gets us to parity with the current system, but I'm still concerned about the fact this step means the mappings become outdated as soon as a new CL term is introduced.

Collaborator
to resolve, we'd either need to make the artifacts larger (how much larger?) to map all CL terms, or perhaps we can set up a mechanism to run this script periodically and update the mappings regularly.

Collaborator Author
Based on the needs of the frontend, we can update the descendant mappings outside of a schema update. We should run this at a regular cadence.
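One way to notice staleness between scheduled runs, sketched here under assumptions: compare the term IDs currently in the corpus with the IDs the saved mapping already knows about. `find_unmapped_terms` is a hypothetical helper, not part of this PR, and the IDs below are illustrative placeholders. The input shape mirrors the JSON this script writes (term ID to list of descendant IDs).

```python
from typing import Dict, Iterable, List, Set


def find_unmapped_terms(mapping: Dict[str, List[str]], corpus_terms: Iterable[str]) -> Set[str]:
    """Return corpus term IDs that appear nowhere in the mapping, as a key or as a descendant."""
    known: Set[str] = set(mapping)
    for descendants in mapping.values():
        known.update(descendants)
    return set(corpus_terms) - known


# Toy data; the IDs are placeholders, not real corpus values.
mapping = {"CL:0000001": ["CL:0000002", "CL:0000003"]}
corpus = ["CL:0000002", "CL:0000004"]
stale = find_unmapped_terms(mapping, corpus)
print(sorted(stale))  # ['CL:0000004']
```

A non-empty result would be a prompt to regenerate rather than proof of an error, since a term with no curated ancestor never appears in the mapping.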

    """
    Request datasets from the production corpus.
    """
    response = urlopen("https://api.cellxgene.cziscience.com/dp/v1/datasets/index").read().decode("utf-8")
    return json.loads(response)


def extract_cell_types(datasets: List[Dict[str, Any]]) -> List[str]:
    """
    List the set of cell type values for the given datasets.

    :param datasets: a list of datasets from the production corpus.
    :return: a list of formatted cell type values
    """
    cell_types = set()
    for dataset in datasets:
        for cell_type in dataset["cell_type"]:
            # Replace every "_" with ":" (a count of False, i.e. 0, would replace nothing).
            cell_types.add(cell_type["ontology_term_id"].replace("_", ":"))
    return list(cell_types)


def extract_tissues(datasets: List[Dict[str, Any]]) -> List[str]:
    """
    List the set of tissue values for the given datasets.

    :param datasets: a list of datasets from the production corpus.
    :return: a list of formatted tissue values with tags for tissue type.
    """
    tissues = set()
    for dataset in datasets:
        for tissue in dataset["tissue"]:
            # Replace every "_" with ":" (a count of False, i.e. 0, would replace nothing).
            formatted_entity_name = tissue["ontology_term_id"].replace("_", ":")
            tissue_type = tissue.get("tissue_type")
            tissues.add(tag_tissue_type(formatted_entity_name, tissue_type))

    return list(tissues)


def tag_tissue_type(entity_name: str, tissue_type: str) -> str:
    """
    Append the tissue type to the given entity name if the tissue type is cell
    culture or organoid, otherwise return the entity name as is.

    :param entity_name: str entity name
    :param tissue_type: str tissue type
    :return: str entity name with tissue type appended
    """
    # Tissue types
    tissue_type_cell_culture = "cell culture"
    tissue_type_organoid = "organoid"

    # Handle error case (possible if tissue has not been migrated to 4.0.0+ schema).
    if tissue_type is None:
        return entity_name

    if tissue_type == tissue_type_cell_culture:
        # true if the given tissue type is "cell culture".
        return f"{entity_name} ({tissue_type_cell_culture})"

    if tissue_type == tissue_type_organoid:
        # true if the given tissue type is "organoid".
        return f"{entity_name} ({tissue_type_organoid})"

    return entity_name


def key_organoids_by_ontology_term_id(entity_names: List[str]) -> Dict[str, str]:
Collaborator
is there an equivalent need for cell culture? if not, why do we also tag cell culture terms?

    """
    Returns a dictionary of organoid ontology term IDs by stem ontology term ID.

    :param entity_names: List of entity names
    :return: Dict of organoid ontology term IDs by ontology term ID
    """

    organoids_by_ontology_term_id = {}
    for entity_name in entity_names:
        if "(organoid)" in entity_name:
            # Historically (i.e. before schema 4.0.0 and the introduction of
            # `tissue_type`), tissues of type "organoid" were tagged with "(organoid)"
            # in their labels and ontology IDs. The post-4.0.0 `tissue_type` value is
            # mapped to this tagged version in order to minimize downstream updates to
            # the filter functionality.
            ontology_term_id = entity_name.replace(" (organoid)", "")
            organoids_by_ontology_term_id[ontology_term_id] = entity_name
Collaborator
should this be a dictionary if the value can be derived from the key? thinking this should be a set and, where needed, we can append " (organoid)"
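A minimal sketch of the set-based alternative suggested above, assuming the tag is always the exact suffix " (organoid)". `organoid_term_ids` and `tag_organoid` are hypothetical names, not functions from this PR. The suffix check is slightly stricter than the substring test in the diff.

```python
from typing import Iterable, Set

ORGANOID_SUFFIX = " (organoid)"


def organoid_term_ids(entity_names: Iterable[str]) -> Set[str]:
    """Collect the stem term IDs of entries tagged as organoids."""
    return {
        name[: -len(ORGANOID_SUFFIX)]
        for name in entity_names
        if name.endswith(ORGANOID_SUFFIX)
    }


def tag_organoid(term_id: str) -> str:
    """Rebuild the tagged name on demand instead of storing it as a dict value."""
    return f"{term_id}{ORGANOID_SUFFIX}"


names = ["UBERON:0000955", "UBERON:0000955 (organoid)", "UBERON:0002048 (cell culture)"]
stems = organoid_term_ids(names)
print(stems)  # {'UBERON:0000955'}
print(tag_organoid("UBERON:0000955"))  # UBERON:0000955 (organoid)
```

Callers would then test `descendant in stems` and append `tag_organoid(descendant)`, which keeps the tag format in one place.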


    return organoids_by_ontology_term_id


def build_descendants_by_entity(
    entity_hierarchy: List[List[str]], ontology_parser: OntologyParser
) -> Dict[str, List[str]]:
    """
    Create descendant relationships across the given entity hierarchy.

    :param entity_hierarchy: List of lists of entity names
    :param ontology_parser: OntologyParser instance
    :return: Dict of descendants by term_id
    """
    all_descendants = {}
    for idx, entity_set in enumerate(entity_hierarchy):
        # Create the set of descendants that can be included for this entity set.
        # For example, systems can include organs or tissues,
        # organs can only include tissues, tissues can't have descendants.
        accept_lists = entity_hierarchy[idx + 1 :]

        # Tissue or cell type for example will not have any descendants.
        if not accept_lists:
            continue

        accept_list = [i for sublist in accept_lists for i in sublist]
        organoids_by_ontology_term_id = key_organoids_by_ontology_term_id(accept_list)

        # List descendants of entity in this set.
        for entity_name in entity_set:
            descendants = set(ontology_parser.get_terms_descendants(entity_name)[entity_name])
            # TODO: change get_terms_descendants to return an iterator or add a single term version.

            # Determine the set of descendants that can be included.
            descendant_accept_list = []
            for descendant in descendants:
                # Include all entities in the accept list.
                if descendant in accept_list:
                    descendant_accept_list.append(descendant)

                # Add organoid descendants, if any.
                if descendant in organoids_by_ontology_term_id:
Collaborator
might be a comp bio question but--in doing this, we will always mark a term as having both ontology_term_id and ontology_term_id (organoid) as descendants, if the ontology_term_id is in the accept list and is an organoid. Is that intended? It sounds like that could make sense, but I'm not certain
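To make the question concrete, here is a simplified, self-contained mirror of the inner loop for a single entity, run on made-up IDs. `accepted_descendants` is a hypothetical helper written for illustration. It takes the descendants as a plain list rather than calling `OntologyParser`, so output order is deterministic here, whereas the diff iterates a set.

```python
from typing import Dict, List


def accepted_descendants(entity: str, descendants: List[str], accept_list: List[str]) -> List[str]:
    """Mirror the inner loop in the diff for one entity (simplified)."""
    organoids: Dict[str, str] = {
        name.replace(" (organoid)", ""): name for name in accept_list if "(organoid)" in name
    }
    accepted: List[str] = []
    for descendant in descendants:
        # Plain term: kept only if it is itself in the accept list.
        if descendant in accept_list:
            accepted.append(descendant)
        # Organoid-tagged term: kept whenever its stem is a descendant.
        if descendant in organoids:
            accepted.append(organoids[descendant])
    if entity in organoids:
        accepted.append(organoids[entity])
    return accepted


# Made-up IDs: "T:2" is in the corpus both as a tissue and as an organoid,
# while "T:3" is in the corpus only as an organoid.
accept = ["T:2", "T:2 (organoid)", "T:3 (organoid)"]
print(accepted_descendants("T:1", ["T:2", "T:3"], accept))
# ['T:2', 'T:2 (organoid)', 'T:3 (organoid)']
```

In this sketch both forms appear only when both are present in the accept list. A term that exists only as an organoid contributes just its tagged form.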

                    descendant_accept_list.append(organoids_by_ontology_term_id[descendant])

            # Add organoid entity, if any.
            if entity_name in organoids_by_ontology_term_id:
                descendant_accept_list.append(organoids_by_ontology_term_id[entity_name])

            if not descendant_accept_list:
Collaborator
i'm assuming doing this achieves parity with the current set-up, but just to confirm--we don't want to include self as a descendant nor do we want to include an empty list? wouldn't this cause certain terms to be "orphaned" and not appear in the filters? or is that not how it works?
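As the code stands, an entity with no accepted descendants gets no key at all, so any consumer has to read a missing key as "no descendants". A minimal sketch of such a lookup follows; `descendants_of` is a hypothetical helper, and how the frontend actually handles missing keys is not shown in this PR.

```python
from typing import Dict, List


def descendants_of(mapping: Dict[str, List[str]], term_id: str) -> List[str]:
    """Look up descendants, treating an absent key as an empty list."""
    return list(mapping.get(term_id, []))


# Made-up IDs: "T:1" has descendants, "T:5" was skipped when the mapping was built.
mapping = {"T:1": ["T:2", "T:3"]}
print(descendants_of(mapping, "T:1"))  # ['T:2', 'T:3']
print(descendants_of(mapping, "T:5"))  # []
```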

                continue

            # Add descendants to dictionary.
            all_descendants[entity_name] = descendant_accept_list
    return all_descendants


def generate_cell_descendant_mapping(ontology_parser: OntologyParser, datasets: List[Dict[str, Any]]) -> None:
    """
    Extracts a descendant mapping of CL starting with a set of hand-curated cell classes and subclasses. Cell types
    from the production corpus are also included in the mapping. The resulting mapping is saved to a JSON file.

    :param ontology_parser: OntologyParser instance
    :param datasets: a list of datasets from the production corpus.
    """
    # Load curated list of cell classes and cell subclasses.
    with open(os.path.join(env.ONTOLOGY_ASSETS_DIR, "cell_class_list.json"), "r") as f:
        cell_classes = json.load(f)

    with open(os.path.join(env.ONTOLOGY_ASSETS_DIR, "cell_subclass_list.json"), "r") as f:
        cell_subclasses = json.load(f)

    # extract the cell types from the datasets in the production corpus
    prod_cell_types = extract_cell_types(datasets)
    # establish the hierarchy of terms
    hierarchy = [cell_classes, cell_subclasses, prod_cell_types]
    # build the descendants mapping
    descendant_mapping = build_descendants_by_entity(hierarchy, ontology_parser)
    # save the mapping to the cell type file (the name compared against in __main__)
    file_name = os.path.join(env.ONTOLOGY_ASSETS_DIR, "cell_type_descendants.json")
    save_json(descendant_mapping, file_name)


def generate_tissue_descendant_mapping(ontology_parser: OntologyParser, datasets: List[Dict[str, Any]]) -> None:
    """
    Extracts a descendant mapping of UBERON starting with a set of hand-curated system and organ tissues. Tissue
    types from the production corpus are also included in the mapping. The resulting mapping is saved to a JSON file.

    :param ontology_parser: OntologyParser instance
    :param datasets: a list of datasets from the production corpus.
    """
    # Load curated list of systems and organ tissues.
    with open(os.path.join(env.ONTOLOGY_ASSETS_DIR, "system_list.json"), "r") as f:
        system_tissues = json.load(f)

    with open(os.path.join(env.ONTOLOGY_ASSETS_DIR, "organ_list.json"), "r") as f:
        organ_tissues = json.load(f)

    # extract the tissue types from the datasets in the production corpus
    prod_tissues = extract_tissues(datasets)
    # establish the hierarchy of terms
    hierarchy = [system_tissues, organ_tissues, prod_tissues]
    # build the descendants mapping
    descendant_mapping = build_descendants_by_entity(hierarchy, ontology_parser)
    # save the mapping to a file
    file_name = os.path.join(env.ONTOLOGY_ASSETS_DIR, "tissue_descendants.json")
    save_json(descendant_mapping, file_name)


def compare_descendant_mappings(file_1: str, file_2: str) -> None:
    # Testing aid: print the differences between two saved mappings.
    with open(os.path.join(env.ONTOLOGY_ASSETS_DIR, file_1), "r") as f:
        mapping_1 = json.load(f)

    with open(os.path.join(env.ONTOLOGY_ASSETS_DIR, file_2), "r") as f:
        mapping_2 = json.load(f)

    print(f"In {file_1} not in {file_2}")
    print(mapping_1.keys() - mapping_2.keys())

    print(f"In {file_2} not in {file_1}")
    print(mapping_2.keys() - mapping_1.keys())

    matching_keys = mapping_1.keys() & mapping_2.keys()
    # For shared keys, list descendants present in file_1 but missing from file_2.
    print(f"Not in {file_2}")
    for key in matching_keys:
        descendants_1 = set(mapping_1[key])
        descendants_2 = set(mapping_2[key])
        if descendants_1 != descendants_2:
            print(key, descendants_1 - descendants_2)

    # For shared keys, list descendants present in file_2 but missing from file_1.
    print(f"Not in {file_1}")
    for key in matching_keys:
        descendants_1 = set(mapping_1[key])
        descendants_2 = set(mapping_2[key])
        if descendants_1 != descendants_2:
            print(key, descendants_2 - descendants_1)


def save_json(data: Any, file_name: str) -> None:
    """
    Save the given data to a JSON file.

    :param data: Any data compatible with JSON
    :param file_name: The name of the file to save the data to.
    """
    with open(file_name, "w") as f:
        json.dump(data, f, indent=2)


if __name__ == "__main__":
    ONTOLOGY_PARSER = OntologyParser("v5.0.0")  # TODO: this should default to the latest supported schema version
    PROD_DATASETS = load_prod_datasets()
    generate_cell_descendant_mapping(ONTOLOGY_PARSER, PROD_DATASETS)
    compare_descendant_mappings("cell_type_descendants.json", "cell_type_descendants_cxg.json")
    generate_tissue_descendant_mapping(ONTOLOGY_PARSER, PROD_DATASETS)
    compare_descendant_mappings("tissue_descendants.json", "tissue_descendants_cxg.json")
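
The module docstring describes cross-panel restriction: selecting a value in one panel limits the values offered in the panels below it. A rough sketch of how a consumer might apply the generated mapping follows. `restrict_values` and the IDs are hypothetical, and the real frontend logic is not part of this PR.

```python
from typing import Dict, Iterable, List, Set


def restrict_values(
    selected: Iterable[str], candidates: Iterable[str], mapping: Dict[str, List[str]]
) -> List[str]:
    """Keep only the candidates that descend from at least one selected term."""
    allowed: Set[str] = set()
    for term in selected:
        # A term without a key in the mapping contributes no descendants.
        allowed.update(mapping.get(term, []))
    return [candidate for candidate in candidates if candidate in allowed]


# Made-up IDs: one system with two descendant organs, plus an unrelated organ.
mapping = {"SYS:1": ["ORG:1", "ORG:2"]}
organ_panel = ["ORG:1", "ORG:2", "ORG:3"]
print(restrict_values(["SYS:1"], organ_panel, mapping))  # ['ORG:1', 'ORG:2']
```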