Endpoint for multi-entity association query #365

Open
oneilsh opened this issue Oct 3, 2023 · 7 comments · Fixed by #369

Comments

@oneilsh

oneilsh commented Oct 3, 2023

Now that we have a multi-entity association query on the backend, we need to expose it via the api-v3.monarchinitiative.org API :D Ref #270

@glass-ships
Collaborator

See #369

kevinschaper pushed a commit that referenced this issue Oct 6, 2023
### Related issues

- Closes #365 

### Summary

- adds CLI command for multi-entity associations, for example:  
```bash
$ monarch multi-entity-associations -e MONDO:0012933 -e MONDO:0005439 -e MANGO:0023456 -c biolink:Gene -c biolink:Disease 
```
- adds API endpoint for same thing, for example:  
```
<MONARCH_API_URL>/v3/api/association/multi?entity=MONDO%3A0012933&entity=MONDO%3A0005439&entity=MANDO%3A0001138&counterpart_category=biolink%3AGene&counterpart_category=biolink%3ADisease&limit=20&offset=0
```
- adds solr integration test for multi-entity associations implementation
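
The endpoint takes `entity` and `counterpart_category` as repeated query parameters. A minimal sketch of building such a URL from Python follows; the base URL is a placeholder for `<MONARCH_API_URL>`, and the helper name `build_multi_association_url` is hypothetical, not part of monarch-py.

```python
from urllib.parse import urlencode

# Placeholder standing in for <MONARCH_API_URL>; substitute the real host.
BASE_URL = "https://example.org"


def build_multi_association_url(entities, counterpart_categories,
                                limit=20, offset=0, base_url=BASE_URL):
    """Build a query URL with one `entity` / `counterpart_category`
    parameter per value (hypothetical helper for illustration)."""
    params = [("entity", e) for e in entities]
    params += [("counterpart_category", c) for c in counterpart_categories]
    params += [("limit", limit), ("offset", offset)]
    # urlencode percent-encodes the ":" in CURIEs as %3A.
    return f"{base_url}/v3/api/association/multi?{urlencode(params)}"


url = build_multi_association_url(
    ["MONDO:0012933", "MONDO:0005439"],
    ["biolink:Gene", "biolink:Disease"],
)
print(url)
```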
@oneilsh
Author

oneilsh commented Oct 11, 2023

Heya @glass-ships, @kevinschaper, would it be possible to make the API results look more like the example in #270? Here's how the result currently looks (just the beginning, requesting a limit of 5 and offset of 0 for a few entities and categories):

[
    {
      "limit": 5,
      "offset": 0,
      "total": 1,
      "id": "MONDO:0017309",
      "name": "neonatal Marfan syndrome",
      "associated_categories": [
        {
          "limit": 20,
          "offset": 0,
          "total": 1,
          "counterpart_category": "biolink:Gene",
          "items": [
            {
              "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",
              "subject": "HGNC:3603",
              "original_subject": "NCBIGene:2200",
              "subject_namespace": "HGNC",
              "subject_category": "biolink:Gene",
              "subject_closure": [],
              "subject_label": "FBN1",
              "subject_closure_label": [],
              "subject_taxon": "NCBITaxon:9606",
              "subject_taxon_label": "Homo sapiens",
              "predicate": "biolink:gene_associated_with_condition",
              "object": "MONDO:0017309",
              "original_object": "Orphanet:284979",

A few notes:

  • I'm not sure we need to return the requested limit and offset to the user, though maybe doing so is standard in REST, for example when defaults are used? In that case I would do so at the very top level only (there also seems to be a bug, since it's reported at multiple levels: as 5, as requested, at the entity level and as 20 within the category group)
  • We might want to rename the offset and limit params to make it clear that those are per entity/category group
  • Maybe rename 'items' to associations?
  • I don't think we should report subject and object info in the items: that requires logic to determine whether the info I'm looking for is in the subject or the object (which depends on the direction of the association), and it repeats a lot of info. Rather, we should report just the counterpart info; we could call it 'counterpart', 'counterpart_label', etc. to make it clear. We could maybe add info about the direction; for example, if the predicate is 'biolink:gene_associated_with_condition', then the gene is the subject, so we could report something like 'counterpart_role': 'subject'.
  • All the other association-specific info (onset, stage, sex etc) can stay at the lowest association level
  • If we want to keep the metadata for the searched entity, rather than having it as subject/object info in the lowest level, it could be at the level of the entity, so 'entity_category', 'entity_label', 'entity_closure' etc.
  • Let's not include totals that can be computed by summing over lower-level totals to avoid confusion
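
To make the counterpart idea in the notes above concrete, here is a rough sketch of reshaping one association dict into a counterpart-only view. The function name `to_counterpart_view` is hypothetical, and it assumes the searched entity matches `subject` or `object` exactly; an entity that only matches through a closure would need extra handling.

```python
def to_counterpart_view(association: dict, entity_id: str) -> dict:
    """Rename the non-entity side of an association to counterpart_* keys
    and drop the entity side (hypothetical helper, exact-match only)."""
    if association.get("subject") == entity_id:
        entity_role, counterpart_role = "subject", "object"
    elif association.get("object") == entity_id:
        entity_role, counterpart_role = "object", "subject"
    else:
        raise ValueError(f"{entity_id} is neither subject nor object")

    view = {}
    for key, value in association.items():
        if key == counterpart_role:
            view["counterpart"] = value
        elif key == f"original_{counterpart_role}":
            view["original_counterpart"] = value
        elif key.startswith(f"{counterpart_role}_"):
            suffix = key[len(counterpart_role) + 1:]
            view[f"counterpart_{suffix}"] = value
        elif (key == entity_role
              or key == f"original_{entity_role}"
              or key.startswith(f"{entity_role}_")):
            continue  # entity-side info would live at the entity level
        else:
            view[key] = value  # predicate, qualifiers, sources, etc.
    view["counterpart_role"] = counterpart_role
    return view
```

With the excerpt above (subject `HGNC:3603`, object `MONDO:0017309`) and `entity_id="MONDO:0017309"`, the gene becomes the counterpart and `counterpart_role` is `"subject"`.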

So, I'm thinking a search on association/multi?entity=MONDO%3A0017309&entity=HP%3A0000973&counterpart_category=biolink%3AGene&counterpart_category=biolink%3APhenotypicFeature&limit=5&offset=0 might look something like:

{
    "offset_per_associated_category": 0,
    "limit_per_associated_category": 5,
    "entities": [
        {
          "entity": "MONDO:0017309",
          "entity_name": "neonatal Marfan syndrome",
          "original_entity": "Orphanet:284979",
          "entity_namespace": "MONDO",
          "entity_category": "biolink:Disease",
          "entity_closure": [
            "MONDO:0000001",
            "MONDO:0018230",
            ...
          ],
          "entity_label": "neonatal Marfan syndrome",
          "entity_closure_label": [
            "entity",
            "continuant",
            ...
          ],
          "entity_taxon": null,
          "entity_taxon_label": null,

          "associated_categories": [
            {
              "total": 1,
              "counterpart_category": "biolink:Gene",
              "associations": [
                {
                  "id": "uuid:c30c59d2-5d8c-11ee-9b27-2b20ed86a9d9",

                  "counterpart": "HGNC:3603",
                  "original_counterpart": "NCBIGene:2200",
                  "counterpart_namespace": "HGNC",
                  "counterpart_category": "biolink:Gene",
                  "counterpart_closure": [],
                  "counterpart_label": "FBN1",
                  "counterpart_closure_label": [],
                  "counterpart_taxon": "NCBITaxon:9606",
                  "counterpart_taxon_label": "Homo sapiens",

                  "counterpart_role": "subject",
                  "predicate": "biolink:gene_associated_with_condition",

                  "primary_knowledge_source": "infores:orphanet",
                  "aggregator_knowledge_source": [
                    "infores:monarchinitiative"
                  ],
                  "category": "biolink:CorrelatedGeneToDiseaseAssociation",
                  "negated": null,
                  "provided_by": "hpoa_gene_to_disease_edges",
                  "provided_by_link": {
                    "id": "hpoa_gene_to_disease",
                    "url": "https://monarch-initiative.github.io/monarch-ingest/Sources/hpoa/#gene_to_disease"
                  },
                  "publications": [],
                  "qualifiers": [],
                  "frequency_qualifier": null,
                  "has_evidence": [],
                  "onset_qualifier": null,
                  "sex_qualifier": null,
                  ... # other association-specific data
                  "stage_qualifier_closure": [],
                  "stage_qualifier_closure_label": []
                }
              ]
            },
            {
              "total": 43,
              "counterpart_category": "biolink:PhenotypicFeature",
              "associations": [
                ... (as above, showing only first 5)
              ]
            }
          ]
        },
        {
          "entity": "HP:0000973",
          "entity_name": "Cutis laxa (HPO)",
          ... (as above, including entity_namespace, entity_category, etc)
          
          "associated_categories": [
            {
              "total": 75,
              "counterpart_category": "biolink:Gene",
              "associations": [
                ... (as above, showing only 5)
              ]
            },
            {
              "total": 3,
              "counterpart_category": "biolink:PhenotypicFeature",
              "associations": [
                ... (as above, showing only 5)
              ]
            }
          ]
        }
    ]
}
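
Since the proposed shape leaves out totals that can be derived, a client would sum the per-category totals itself. A small sketch of that, assuming the field names from the proposed JSON above (which is only a proposal, not the current API output); the function names are hypothetical.

```python
def total_for_entity(entity: dict) -> int:
    """Sum the per-category totals for one entity in the proposed response."""
    return sum(group["total"] for group in entity["associated_categories"])


def grand_total(response: dict) -> int:
    """Sum totals across all entities in the proposed response."""
    return sum(total_for_entity(e) for e in response["entities"])


# Trimmed copy of the proposed response, keeping only the totals.
response = {
    "offset_per_associated_category": 0,
    "limit_per_associated_category": 5,
    "entities": [
        {"entity": "MONDO:0017309", "associated_categories": [
            {"total": 1, "counterpart_category": "biolink:Gene"},
            {"total": 43, "counterpart_category": "biolink:PhenotypicFeature"},
        ]},
        {"entity": "HP:0000973", "associated_categories": [
            {"total": 75, "counterpart_category": "biolink:Gene"},
            {"total": 3, "counterpart_category": "biolink:PhenotypicFeature"},
        ]},
    ],
}

for entity in response["entities"]:
    print(entity["entity"], total_for_entity(entity))
print("all entities:", grand_total(response))
```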

I could possibly see some other minor UX improvements in the future, e.g. closure as a list of dicts instead of entity_closure and entity_closure_label as two separate lists, but those would be for another day; here I'm just looking for the data model to match the intent of directionless-association search :)

oneilsh reopened this Oct 11, 2023
@glass-ships
Collaborator

> I'm not sure we need to return the requested limit and offset to the user, though maybe doing so is standard in REST, for example when defaults are used? In that case I would do so at the very top level only (there also seems to be a bug, since it's reported at multiple levels: as 5, as requested, at the entity level and as 20 within the category group)

It being reported at multiple levels is a side effect of the LinkML data model we use: both MultiEntityAssociationResults and CategoryGroupedAssociationResults extend the Results class, which contains the required limit, offset, and total fields.
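
As a plain-Python analogy (not the actual LinkML-generated classes), the inheritance described here can be sketched with dataclasses. Only the class names and the three inherited fields come from this thread; the remaining fields and defaults are assumptions taken from the JSON excerpt earlier in the issue.

```python
from dataclasses import dataclass, field, asdict


@dataclass
class Results:
    # Required on every class that extends Results.
    limit: int
    offset: int
    total: int


@dataclass
class CategoryGroupedAssociationResults(Results):
    counterpart_category: str = ""
    items: list = field(default_factory=list)


@dataclass
class MultiEntityAssociationResults(Results):
    id: str = ""
    name: str = ""
    associated_categories: list = field(default_factory=list)


# Values copied from the excerpt: 5 at the entity level, 20 in the group.
result = MultiEntityAssociationResults(
    limit=5, offset=0, total=1,
    id="MONDO:0017309", name="neonatal Marfan syndrome",
    associated_categories=[
        CategoryGroupedAssociationResults(
            limit=20, offset=0, total=1,
            counterpart_category="biolink:Gene",
        )
    ],
)
serialized = asdict(result)
print(serialized)
```

Because both classes inherit the fields, every nesting level serializes its own `limit`, `offset`, and `total`, which matches the repetition seen in the output.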

> We might want to rename the offset and limit params to make it clear that those are per entity/category group

Same thing here.

To address both these concerns, we could switch to creating entirely unique data classes for each of these, which would deviate from the established Results model to which all other monarch-py responses adhere. For example, we'd need a secondary Entities class and a secondary Associations class, each containing slots that are similar to, but distinct from, those of their primary counterparts, since LinkML doesn't really allow "renaming" slots depending on what class they're part of (@kevinschaper feel free to correct me if I'm wrong on that last point).
But we definitely could.

> Maybe rename 'items' to associations?

This should be doable without the above-mentioned changes.

> I don't think we should report subject and object info in the items: that requires logic to determine whether the info I'm looking for is in the subject or the object (which depends on the direction of the association), and it repeats a lot of info. Rather, we should report just the counterpart info; we could call it 'counterpart', 'counterpart_label', etc. to make it clear. We could maybe add info about the direction; for example, if the predicate is 'biolink:gene_associated_with_condition', then the gene is the subject, so we could report something like 'counterpart_role': 'subject'.

This is a tricky part (for me, anyway), as I don't know the best way to generalize figuring out the "direction", unless we have a complete set of all predicates in the Monarch KG and their predefined directions.
Or maybe it's as simple as "if the entity is the subject, counterpart_role is object, and vice versa"?

> All the other association-specific info (onset, stage, sex etc) can stay at the lowest association level

This is another tricky part, as Associations have a lot of slots:

See here:

    classes:
      Association:
        slots:
          - id
          - category
          - subject
          - original_subject
          - subject_namespace
          - subject_category
          - subject_closure
          - subject_label
          - subject_closure_label
          - subject_taxon
          - subject_taxon_label
          - predicate
          - object
          - original_object
          - object_namespace
          - object_category
          - object_closure
          - object_label
          - object_closure_label
          - object_taxon
          - object_taxon_label
          - primary_knowledge_source
          - aggregator_knowledge_source
          - negated
          - pathway
          - provided_by
          - provided_by_link
          - publications
          - qualifiers
          - has_evidence
          - evidence_count
          - frequency_qualifier
          - onset_qualifier
          - sex_qualifier
          - stage_qualifier
          - frequency_qualifier_label
          - frequency_qualifier_namespace
          - frequency_qualifier_category
          - frequency_qualifier_closure
          - frequency_qualifier_closure_label
          - onset_qualifier_label
          - onset_qualifier_namespace
          - onset_qualifier_category
          - onset_qualifier_closure
          - onset_qualifier_closure_label
          - sex_qualifier_label
          - sex_qualifier_namespace
          - sex_qualifier_category
          - sex_qualifier_closure
          - sex_qualifier_closure_label
          - stage_qualifier_label
          - stage_qualifier_namespace
          - stage_qualifier_category
          - stage_qualifier_closure
          - stage_qualifier_closure_label

So I'm not immediately sure what would be copy-paste and what would be changed, or what a good name for this secondary Association class would be. I'll mull that over a bit.

> If we want to keep the metadata for the searched entity, rather than having it as subject/object info in the lowest level, it could be at the level of the entity, so 'entity_category', 'entity_label', 'entity_closure' etc.

> Let's not include totals that can be computed by summing over lower-level totals to avoid confusion

These last two points may just get handled as a consequence of creating these new dataclasses to use, but the logic will definitely take some chewing on to get right.

@oneilsh
Author

oneilsh commented Oct 11, 2023

Thanks for thinking it over :) I see what you are getting at - I'm asking for something that isn't a great fit for the existing data models. Creating new classes might be needed, but let me think a bit as well to see if there's a more data-modely way to organize what I'm trying for.

One easy answer:

> Or maybe it's as simple as "if the entity is the subject, counterpart_role is object, and vice versa"?

Yes, that's exactly what I was thinking :) But then, with a different data model this might be straightforward.

@oneilsh
Author

oneilsh commented Oct 11, 2023

Re slots in classes: can slots be optional for some classes and required for others? I've been thinking about how to do a "light" return for common cases vs. a full-data return for everything and the kitchen sink. Consider a very straightforward use where I want to do a keyword search to get a list of Entities back: if I set full_info to True in the query I get everything, but if I set it to False I just get back ID and label. But for another type of query maybe the description slot is included no matter what.

Edit: I suppose this could be done with like an EntityMinimal and EntityFull that inherits from it...
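
A rough sketch of that idea in plain Python dataclasses, rather than LinkML: `EntityMinimal`, `EntityFull`, and `full_info` are the names floated above, while the extra fields on `EntityFull` and the `serialize` helper are assumptions made purely for illustration.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class EntityMinimal:
    # The "light" return: just ID and label.
    id: str
    label: str


@dataclass
class EntityFull(EntityMinimal):
    # Hypothetical extra slots; the real model has many more.
    category: Optional[str] = None
    description: Optional[str] = None


def serialize(entity: EntityFull, full_info: bool) -> dict:
    """Return every field when full_info is True, else only ID and label."""
    if full_info:
        return asdict(entity)
    return asdict(EntityMinimal(id=entity.id, label=entity.label))


entity = EntityFull(
    id="MONDO:0017309",
    label="neonatal Marfan syndrome",
    category="biolink:Disease",
)
print(serialize(entity, full_info=False))
print(serialize(entity, full_info=True))
```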

@glass-ships
Collaborator

In the meantime I opened a PR (#385) where I've gone ahead and started making those required classes, and I'm in the process of changing up the logic.

sagehrke removed this from the 2024-01 Release milestone Jan 2, 2024
@oneilsh
Author

oneilsh commented Jan 2, 2024

Asked @sagehrke to put a pin in this; my understanding of the KG data model has evolved and I now better understand why this is a tricky ask. What I'm looking for may be better suited to answering via the new neo4j deployment, and if what I come up with is more broadly useful, perhaps it can go to the API from there.


3 participants