Endpoint for multi-entity association query #365

oneilsh · 2023-10-03T20:46:10Z

Now that we have a multi-entity association query on the backend, we need to expose it via the api-v3.monarchinitiative.org API :D Ref #270

glass-ships · 2023-10-04T21:38:33Z

See #369

### Related issues

- Closes #365

### Summary

- adds CLI command for multi-entity associations, for example:
  ```bash
  $ monarch multi-entity-associations -e MONDO:0012933 -e MONDO:0005439 -e MANGO:0023456 -c biolink:Gene -c biolink:Disease
  ```
- adds API endpoint for same thing, for example:
  ```
  <MONARCH_API_URL>/v3/api/association/multi?entity=MONDO%3A0012933&entity=MONDO%3A0005439&entity=MANDO%3A0001138&counterpart_category=biolink%3AGene&counterpart_category=biolink%3ADisease&limit=20&offset=0
  ```
- adds solr integration test for multi-entity associations implementation
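For anyone calling the endpoint from a script, here is a minimal sketch of how the repeated `entity` and `counterpart_category` parameters could be encoded. The base URL is a placeholder and the helper name `multi_association_url` is made up for illustration; only the path and parameter names come from the example above.

```python
from urllib.parse import urlencode

# Placeholder base URL (assumption); substitute the real deployment.
MONARCH_API_URL = "https://example.org"


def multi_association_url(entities, counterpart_categories, limit=20, offset=0):
    """Build the multi-entity association URL (hypothetical helper).

    Repeated parameters are passed as a list of (name, value) pairs so that
    each entity and each category gets its own query-string entry.
    """
    params = [("entity", e) for e in entities]
    params += [("counterpart_category", c) for c in counterpart_categories]
    params += [("limit", limit), ("offset", offset)]
    # urlencode percent-encodes the ":" in CURIEs as %3A.
    return f"{MONARCH_API_URL}/v3/api/association/multi?{urlencode(params)}"


url = multi_association_url(
    ["MONDO:0012933", "MONDO:0005439"],
    ["biolink:Gene", "biolink:Disease"],
)
print(url)
```

Passing pairs rather than a dict keeps the parameter order stable and allows the same key to appear more than once.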

oneilsh · 2023-10-11T17:22:16Z

Heya @glass-ships, @kevinschaper, would it be possible to make the API results look more like the example in #270? Here's how the result currently looks (just the beginning, requesting a limit of 5 and offset of 0 for a few entities and categories):

[
    {
      "limit": 5,
      "offset": 0,
      "total": 1,
      "id": "MONDO:0017309",
      "name": "neonatal Marfan syndrome",
      "associated_categories": [
        {
          "limit": 20,
          "offset": 0,
          "total": 1,
          "counterpart_category": "biolink:Gene",
          "items": [
            {
              "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
              "subject": "HGNC:3603",
              "original_subject": "NCBIGene:2200",
              "subject_namespace": "HGNC",
              "subject_category": "biolink:Gene",
              "subject_closure": [],
              "subject_label": "FBN1",
              "subject_closure_label": [],
              "subject_taxon": "NCBITaxon:9606",
              "subject_taxon_label": "Homo sapiens",
              "predicate": "biolink:gene_associated_with_condition",
              "object": "MONDO:0017309",
              "original_object": "Orphanet:284979",

A few notes:

- I'm not sure we need to return the requested limit and offset to the user. Maybe doing so is standard in REST, for example when defaults are used? In that case I would do so at the very top level only. (There also seems to be a bug, since it's reported at multiple levels: as 5, as requested, and as 20.)
- We might want to rename the offset and limit params to make it clear that they apply per entity/category group.
- Maybe rename 'items' to 'associations'?
- I don't think we should report subject and object info in the items. That requires logic to determine whether the info I'm looking for is in the subject or the object (which depends on the direction of the association), and it repeats a lot of info. Rather, we should report just the counterpart info; we could call it 'counterpart', 'counterpart_label', etc. to make it clear. We could also add info about the direction: for example, if the predicate is 'biolink:gene_associated_with_condition', then the gene is the subject, so we could report something like 'counterpart_role': 'subject'.
- All the other association-specific info (onset, stage, sex, etc.) can stay at the lowest association level.
- If we want to keep the metadata for the searched entity, rather than having it as subject/object info at the lowest level, it could sit at the level of the entity: 'entity_category', 'entity_label', 'entity_closure', etc.
- Let's not include totals that can be computed by summing over lower-level totals, to avoid confusion.

So, I'm thinking a search on association/multi?entity=MONDO%3A0017309&entity=HP%3A0000973&counterpart_category=biolink%3AGene&counterpart_category=biolink%3APhenotypicFeature&limit=5&offset=0 might look something like:

{
    "offset_per_associated_category": 0,
    "limit_per_associated_category": 5,
    "entities": [
        {
          "entity": "MONDO:0017309",
          "entity_name": "neonatal Marfan syndrome",
          "original_entity": "Orphanet:284979",
          "entity_namespace": "MONDO",
          "entity_category": "biolink:Disease",
          "entity_closure": [
            "MONDO:0000001",
            "MONDO:0018230",
            ...
          ],
          "entity_label": "neonatal Marfan syndrome",
          "entity_closure_label": [
            "entity",
            "continuant",
            ...
          ],
          "entity_taxon": null,
          "entity_taxon_label": null,

          "associated_categories": [
            {
              "total": 1,
              "counterpart_category": "biolink:Gene",
              "associations": [
                {
                  "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",

                  "counterpart": "HGNC:3603",
                  "original_counterpart": "NCBIGene:2200",
                  "counterpart_namespace": "HGNC",
                  "counterpart_category": "biolink:Gene",
                  "counterpart_closure": [],
                  "counterpart_label": "FBN1",
                  "counterpart_closure_label": [],
                  "counterpart_taxon": "NCBITaxon:9606",
                  "counterpart_taxon_label": "Homo sapiens",

                  "counterpart_role": "subject",
                  "predicate": "biolink:gene_associated_with_condition",

                  "primary_knowledge_source": "infores:orphanet",
                  "aggregator_knowledge_source": [
                    "infores:monarchinitiative"
                  ],
                  "category": "biolink:CorrelatedGeneToDiseaseAssociation",
                  "negated": null,
                  "provided_by": "hpoa_gene_to_disease_edges",
                  "provided_by_link": {
                    "id": "hpoa_gene_to_disease",
                    "url": "https://monarch-initiative.github.io/monarch-ingest/Sources/hpoa/#gene_to_disease"
                  },
                  "publications": [],
                  "qualifiers": [],
                  "frequency_qualifier": null,
                  "has_evidence": [],
                  "onset_qualifier": null,
                  "sex_qualifier": null,
                  ... # other association-specific data
                  "stage_qualifier_closure": [],
                  "stage_qualifier_closure_label": []
                }
              ]
            },
            {
              "total": 43,
              "counterpart_category": "biolink:PhenotypicFeature",
              "associations": [
                ... (as above, showing only first 5)
              ]
            }
          ]
        },
        {
          "entity": "HP:0000973",
          "entity_name": "Cutis laxa (HPO)",
          ... (as above, including entity_namespace, entity_category, etc)
          
          "associated_categories": [
            {
              "total": 75,
              "counterpart_category": "biolink:Gene",
              "associations": [
                ... (as above, showing only 5)
              ]
            },
            {
              "total": 3,
              "counterpart_category": "biolink:PhenotypicFeature",
              "associations": [
                ... (as above, showing only 5)
              ]
            }
          ]
        }
    ]
}
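To make the intent concrete, here is a rough sketch of the re-keying I have in mind, written against plain dicts rather than the real monarch-py classes. The function names are hypothetical, and it assumes the searched entity appears literally as the `subject` or `object` of the association (a match that only comes through a closure is not handled).

```python
def _rekey(key, side, new_name):
    """Map 'subject', 'original_subject', or 'subject_x' onto new_name.

    Returns None when the key does not belong to the given side.
    """
    if key == side:
        return new_name
    if key == f"original_{side}":
        return f"original_{new_name}"
    if key.startswith(f"{side}_"):
        return f"{new_name}_{key[len(side) + 1:]}"
    return None


def to_counterpart_view(association, entity_id):
    """Return the association re-keyed relative to the searched entity (sketch)."""
    if association.get("subject") == entity_id:
        entity_side, counterpart_side = "subject", "object"
    elif association.get("object") == entity_id:
        entity_side, counterpart_side = "object", "subject"
    else:
        raise ValueError(f"{entity_id} is neither subject nor object")

    view = {"counterpart_role": counterpart_side}
    for key, value in association.items():
        if _rekey(key, entity_side, "entity") is not None:
            continue  # entity metadata would be reported once, at the entity level
        view[_rekey(key, counterpart_side, "counterpart") or key] = value
    return view


association = {
    "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
    "subject": "HGNC:3603",
    "original_subject": "NCBIGene:2200",
    "subject_label": "FBN1",
    "predicate": "biolink:gene_associated_with_condition",
    "object": "MONDO:0017309",
    "original_object": "Orphanet:284979",
}
view = to_counterpart_view(association, "MONDO:0017309")
print(view)
```

Association-level keys such as `id` and `predicate` pass through unchanged, so qualifiers and provenance would stay where they are today.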

I could possibly see some other minor UX improvements in the future, e.g. closure as a list of dict instead of entity_closure and entity_closure_label as two separate lists, but those would be for another day; here I'm just looking for the data model to match the intent of directionless-association search :)

glass-ships · 2023-10-11T18:40:59Z

> I'm not sure if we need to return the requested limit and offset to the user. But maybe doing so is standard in REST though, like if defaults are used? In which case I would do so at the very top level only (there seems to be a bug as well, since it's reported at multiple levels, and as 5 as requested and 20)

Reporting it at multiple levels is a side effect of the LinkML data model we use: both MultiEntityAssociationResults and CategoryGroupedAssociationResults extend the Results class, which contains the required limit, offset, and total fields.
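A stripped-down illustration of that inheritance, using hand-written dataclasses as stand-ins (these are not the LinkML-generated classes, and the slot lists are reduced to the ones visible in the response above):

```python
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class Results:
    # Required on every class that extends Results.
    limit: int
    offset: int
    total: int


@dataclass
class CategoryGroupedAssociationResults(Results):
    counterpart_category: str = ""
    items: List[dict] = field(default_factory=list)


@dataclass
class MultiEntityAssociationResults(Results):
    id: str = ""
    name: str = ""
    associated_categories: List[CategoryGroupedAssociationResults] = field(
        default_factory=list
    )


group = CategoryGroupedAssociationResults(
    limit=20, offset=0, total=1, counterpart_category="biolink:Gene"
)
result = MultiEntityAssociationResults(
    limit=5,
    offset=0,
    total=1,
    id="MONDO:0017309",
    name="neonatal Marfan syndrome",
    associated_categories=[group],
)
payload = asdict(result)
print(payload["limit"], payload["associated_categories"][0]["limit"])
```

Because both levels inherit the same three required fields, each level serializes its own copy, which is why the response shows a limit of 5 at the entity level and 20 inside the category group.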

> We might want to rename the offset and limit params to make it clear that those are per entity/category group

Same thing here.

To address both of these concerns, we could create entirely new data classes for each of these, deviating from the established Results model to which all other monarch-py responses adhere. For example, we'd need a secondary Entities class and a secondary Associations class, each containing slots that are similar to, but distinct from, those of their primary counterparts, since LinkML doesn't really allow "renaming" slots depending on which class they're part of (@kevinschaper feel free to correct me if I'm wrong on that last point).
But we definitely could.

> Maybe rename 'items' to associations?

This should be doable without the above-mentioned changes.

> I don't think we should report subject and object info in the items, which requires logic to determine if the info I'm looking for is in the subject or the object (which depends on the direction of the association, and repeats a lot of info). Rather, we should report just the counterpart info; we could call it 'counterpart', 'counterpart_label', etc. to make it clear. We could add info about the direction maybe, for example if the predicate is 'biolink:gene_associated_with_condition', then the gene is the subject, so we could report something like 'counterpart_role': 'subject'.

This is a tricky part [for me, anyway], as I don't know the best way to generalize figuring out the "direction", unless we have a complete set of all predicates in the Monarch KG and their predefined directions.
Or maybe it's as simple as "if entity is subject, counterpart_role is object, and vice versa"?

> All the other association-specific info (onset, stage, sex etc) can stay at the lowest association level

This is another tricky part, as Associations have a lot of slots:

classes:
  Association:
    slots:
      - id
      - category
      - subject
      - original_subject
      - subject_namespace
      - subject_category
      - subject_closure
      - subject_label
      - subject_closure_label
      - subject_taxon
      - subject_taxon_label
      - predicate
      - object
      - original_object
      - object_namespace
      - object_category
      - object_closure
      - object_label
      - object_closure_label
      - object_taxon
      - object_taxon_label
      - primary_knowledge_source
      - aggregator_knowledge_source
      - negated
      - pathway
      - provided_by
      - provided_by_link
      - publications
      - qualifiers
      - has_evidence
      - evidence_count
      - frequency_qualifier
      - onset_qualifier
      - sex_qualifier
      - stage_qualifier
      - frequency_qualifier_label
      - frequency_qualifier_namespace
      - frequency_qualifier_category
      - frequency_qualifier_closure
      - frequency_qualifier_closure_label
      - onset_qualifier_label
      - onset_qualifier_namespace
      - onset_qualifier_category
      - onset_qualifier_closure
      - onset_qualifier_closure_label
      - sex_qualifier_label
      - sex_qualifier_namespace
      - sex_qualifier_category
      - sex_qualifier_closure
      - sex_qualifier_closure_label
      - stage_qualifier_label
      - stage_qualifier_namespace
      - stage_qualifier_category
      - stage_qualifier_closure
      - stage_qualifier_closure_label

So I'm not immediately sure what would be copy-paste and what would be changed, or what a good name for this secondary Association class would be. I'll mull that over a bit.

> If we want to keep the metadata for the searched entity, rather than having it as subject/object info in the lowest level, it could be at the level of the entity, so 'entity_category', 'entity_label', 'entity_closure' etc.

> Let's not include totals that can be computed by summing over lower-level totals to avoid confusion

These last two points may just get handled as a consequence of creating these new dataclasses to use, but the logic will definitely take some chewing on to get right.

oneilsh · 2023-10-11T19:43:48Z

Thanks for thinking it over :) I see what you are getting at - I'm asking for something that isn't a great fit for the existing data models. Creating new classes might be needed, but let me think a bit as well to see if there's a more data-modely way to organize what I'm trying for.

One easy answer:

> Or maybe it's as simple as "if entity is subject, counter_part role is object, and vice versa"?

Yes, that's exactly what I was thinking :) But then, with a different data model this might be straightforward.

oneilsh · 2023-10-11T19:52:30Z

RE slots in classes: can slots be optional for some classes and required for others? I've been thinking about how to do a "light" return for common cases vs. a full-data return for everything-and-the-kitchen-sink. Consider a very straightforward use where I want to do a keyword search to get a list of Entities back: if I set full_info to True in the query I get everything, but if I set it to False I just get back ID and label. But for another type of query maybe the description slot is included no matter what.

Edit: I suppose this could be done with like an EntityMinimal and EntityFull that inherits from it...
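Something like the following is what I'm picturing, sketched with plain Python dataclasses rather than LinkML. The class names come from the idea above; the slot choices, the `serialize` helper, and the example values are all made up for illustration.

```python
from dataclasses import asdict, dataclass, fields
from typing import Optional


@dataclass
class EntityMinimal:
    # The "light" slots returned when full_info is False.
    id: str
    label: Optional[str] = None


@dataclass
class EntityFull(EntityMinimal):
    # Extra slots only returned when full_info is True (illustrative subset).
    category: Optional[str] = None
    description: Optional[str] = None


def serialize(entity: EntityFull, full_info: bool) -> dict:
    """Return every slot, or only those declared on EntityMinimal."""
    data = asdict(entity)
    if full_info:
        return data
    minimal_names = {f.name for f in fields(EntityMinimal)}
    return {name: value for name, value in data.items() if name in minimal_names}


entity = EntityFull(
    id="MONDO:0017309",
    label="neonatal Marfan syndrome",
    category="biolink:Disease",
    description="placeholder description",
)
light = serialize(entity, full_info=False)
full = serialize(entity, full_info=True)
print(light)
```

A query type that always needs the description could declare its own intermediate class between the two instead of adding a flag.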

glass-ships · 2023-10-11T20:00:43Z

In the meantime I just opened a PR (#385) where I've just gone ahead and started making those required classes, and am in the process of changing up the logic

oneilsh · 2024-01-02T21:40:17Z

Asked @sagehrke to put a pin in this. My understanding of the KG data model has evolved, and I now better understand why this is a tricky ask. What I'm looking for may be better answered via the new neo4j deployment, and if what I come up with is more broadly useful, perhaps it can go to the API from there.

oneilsh assigned glass-ships Oct 3, 2023

glass-ships added this to the 2023-10 Release milestone Oct 3, 2023

glass-ships mentioned this issue Oct 4, 2023

Add multi-entity associations CLI cmd and api endpoint #369

Merged

kevinschaper closed this as completed in #369 Oct 6, 2023

oneilsh reopened this Oct 11, 2023

glass-ships mentioned this issue Oct 11, 2023

Update multi-entity endpoint #385

Closed

sagehrke modified the milestones: 2023-10 Release, 2023-12 Release Oct 23, 2023

glass-ships modified the milestones: 2023-12 Release, 2024-01 Release Nov 21, 2023

sagehrke removed this from the 2024-01 Release milestone Jan 2, 2024

sagehrke unassigned glass-ships Nov 14, 2024
