General OnToma module for PySpark parsers + PhenoDigm & PanelApp implementation #94

tskir · 2021-09-08T09:00:48Z

No description provided.

common/ontology.py

DSuveges

All looks good and reasonable, I have only a minor comment/question.

common/ontology.py

ireneisdoomed · 2021-09-09T13:59:24Z

I have just submitted a PR to your branch that uses the ontology.py common util. I have been able to successfully run the scripts using your class with a very similar number of generated evidence strings, although I have some notes:

As I mentioned in the previous comment, I explicitly caught the AttributeError exceptions to address the cases where diseaseFromSourceId is not there.
In the current implementation, to avoid running into the ZOOMA API error, we retry the query up to 5 times. This makes the disease mapping extremely time-consuming, even more so than with the old OnToma. It took almost 4 hours to generate Gene2Phenotype's 4000 evidence strings. This is also causing trouble for PanelApp's parser.
There is a bug in the current class: whenever the mapping is unsuccessful, the returned value is diseaseFromSourceMappedId == 'nan'.

Orphanet and Clingen runtimes were fantastic, you were right about pandarallel's performance.
I have the feeling that these mappings are not throwing errors because the labels are very clean (Orphanet and MONDO labels, respectively). As soon as the queries get harder, e.g. with PanelApp labels, the process takes forever.

tskir · 2021-09-09T14:47:12Z

@ireneisdoomed Great, thank you for all of the comments & for testing the module! I'll take a look at your PR right away.

Regarding the performance issue, I presume this is only because records with missing diseaseFromSourceId are being needlessly retried. Once this is fixed (either way), I imagine the problem should disappear.

ireneisdoomed

Can you confirm that the fix for the diseaseFromSourceMappedId == 'nan' bug is not yet implemented?

On the performance issue, I don't think it's -just- a matter of null diseaseFromSourceId. You mentioned yesterday there might be problems with strings containing semicolons.

I tested the class using this df:

+-------------------------------------------------------------------------------------------------+-------------------+
|diseaseFromSource                                                                                |diseaseFromSourceId|
+-------------------------------------------------------------------------------------------------+-------------------+
|MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE|614894             |
+-------------------------------------------------------------------------------------------------+-------------------+

Here's how the disease mapping step behaves: the query is retried 5 times returning nan. It seems that it doesn't even try with the ID.

INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
ERROR:root:OnToma lookup failed for 'MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE' / '614894'

I tested the semicolon hypothesis by removing the semicolons from the string; the behaviour and the result were the same.

When you were developing OnToma v1, I remember you were using PanelApp's labels for benchmarking, which are also very dirty. Did you experience such performance issues then? I want to know if it's a problem with OnToma or with the implementation of the tool.

ireneisdoomed · 2021-09-09T18:20:33Z

I think I know what the problem is, at least what is happening on my side.

def _ontoma_udf(row, ontoma_instance):
    disease_name, disease_id = row['diseaseFromSource'], row['diseaseFromSourceId']
    for attempt in range(1, ONTOMA_MAX_ATTEMPTS + 1):
        # Try to map first by disease name (because that branch of OnToma is more stable), then by disease ID.
        try:
            mappings = []
            if disease_name:
                mappings = ontoma_instance.find_term(query=disease_name, code=False)
            if disease_id and not mappings:
                mappings = ontoma_instance.find_term(query=disease_id, code=True)
            return [m.id_ot_schema for m in mappings]
        except:
            # If this is not the last attempt, wait until the next one
            if attempt != ONTOMA_MAX_ATTEMPTS:
                time.sleep(10 + 30 * random.random())
    logging.error(f'OnToma lookup failed for {disease_name!r} / {disease_id!r}')
    return []

Every time a label is queried without a valid result (mappings = []), we fall into the second if block, which queries using the ID. This query fails, sending execution to the except block, and that is what causes the delay.
This is the traceback of the exception:

HTTPError: 500 Server Error: Internal Server Error for url: https://www.ebi.ac.uk/spot/oxo/api/search

There are two problems here:

Related to OnToma. The query is failing not because of the label but because of the ID. When mapping by code with an identifier such as 6148940, the query will always fail because no ontology prefix is given. By contrast, otmap.find_term('ORDO:455', code=True) works great. This scenario should be fixed in OnToma.
Related to the ontology class. As the previous error will always happen, there's no point in retrying this query. This is what makes the process extremely inefficient. We should stop querying IDs without a prefix.
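
To put a rough number on the delay, here is a back-of-the-envelope sketch using the constants from the snippet above (5 attempts, a sleep of 10 + 30 * random.random() seconds between attempts). It assumes records are handled one after another and ignores request time and any parallelism, so treat it as an order-of-magnitude estimate only. The helper name failing_records_for is made up for this illustration.

```python
# Constants taken from the discussion and the snippet above.
ONTOMA_MAX_ATTEMPTS = 5          # "retried 5 times"
SLEEP_MIN, SLEEP_SPAN = 10, 30   # time.sleep(10 + 30 * random.random())

# No sleep happens after the last attempt, so a record that always fails sleeps 4 times.
sleeps_per_failure = ONTOMA_MAX_ATTEMPTS - 1
expected_sleep = SLEEP_MIN + SLEEP_SPAN / 2            # mean of a uniform [10, 40) s wait
expected_delay = sleeps_per_failure * expected_sleep    # seconds lost per failing record
worst_case_delay = sleeps_per_failure * (SLEEP_MIN + SLEEP_SPAN)


def failing_records_for(total_seconds):
    """Hypothetical helper: how many always-failing records would add up to this runtime."""
    return total_seconds / expected_delay


print(expected_delay, worst_case_delay, failing_records_for(4 * 3600))
```

Under these assumptions, each record that fails on every attempt costs about 100 seconds of sleeping alone, so a few hundred such records are enough to account for a multi-hour run.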

tskir · 2021-09-10T05:50:21Z

@ireneisdoomed Thank you for the additional comments!

Regarding performance issues, there are several things to unpack:

The entire semicolon thing is only a hypothesis; I don't yet know what exactly causes ZOOMA lookup to fail in some cases. It is true that retrying those queries will have some performance hit, however, it's important to note that (1) those cases are very rare and (2) we can't easily distinguish them from other failure modes such as transient ZOOMA server errors. I will investigate this issue separately and report it to SPOT. In the meantime, retrying those queries with some delay seems like a sensible approach. However, to make the performance hit of even those rare cases less severe, I have reduced the number of attempts and intervals between them in the latest version of this PR.
Indeed, if lookup by label fails, then lookup by ID is not performed at all. I agree this is not ideal, and this is changed in the latest version of this PR, with lookups by label and ID being retried separately. The long term solution will be to implement proper retry policy directly within OnToma.
Identifier queries without an ontology namespace, such as 6148940, are not and should not be supported by OnToma, because this does not represent a complete description of an ontology term. If it is known that the identifiers in the data definitely come from OMIM, this should be pre-processed before feeding into OnToma by adding OMIM: prefix. However, this is indeed a case which should fail fast and not attempt to retry anything. This will be eventually solved properly by the OnToma internal retry policy, but for now I have added an additional check to not even attempt to map identifiers without a namespace.
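
The three points above can be sketched roughly as follows. This is not the code from the PR: simple_retry, map_disease, MAX_ATTEMPTS and base_delay are hypothetical names and values chosen for illustration, and find_term stands for any callable that accepts query= and code= keyword arguments the way ontoma_instance.find_term is called in the snippet earlier in this thread.

```python
import logging
import random
import time

MAX_ATTEMPTS = 3  # hypothetical value; the thread only says the number was reduced


def simple_retry(func, max_attempts=MAX_ATTEMPTS, base_delay=1.0, **kwargs):
    """Call func(**kwargs), retrying on any exception; return [] if every attempt fails."""
    for attempt in range(1, max_attempts + 1):
        try:
            return func(**kwargs)
        except Exception as error:  # broad on purpose: failure modes cannot be told apart yet
            logging.warning('Attempt %d/%d failed: %s', attempt, max_attempts, error)
            if attempt != max_attempts:
                # Short randomised pause before the next attempt.
                time.sleep(base_delay * (1 + random.random()))
    return []


def map_disease(find_term, disease_name, disease_id, **retry_options):
    """Map by label first, then by ID, retrying each lookup independently."""
    mappings = []
    if disease_name:
        mappings = simple_retry(find_term, query=disease_name, code=False, **retry_options)
    # Fail fast: an identifier without a namespace is never sent to the service.
    if not mappings and disease_id and ':' in disease_id:
        mappings = simple_retry(find_term, query=disease_id, code=True, **retry_options)
    return mappings
```

With this shape, a label lookup that keeps failing no longer prevents the ID lookup from running, and a bare identifier such as '614894' is skipped without any waiting.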

I've pushed all related updates just now: 2707c71.

As for the nan issue, I'm still investigating. It strangely happens only for PhenoDigm, but not for PanelApp.

tskir · 2021-09-10T07:27:30Z

@ireneisdoomed The nan thing turned out to be an open bug in Pandas: pandas-dev/pandas#25353

For now, fixed it using a workaround: d36380d

All pending comments from this PR are now addressed, resubmitting for review. PhenoDigm evidence generation is underway, should take about an hour.

ireneisdoomed · 2021-09-10T16:00:23Z

common/ontology.py

+    mappings = []
+    if disease_name:
+        mappings = _simple_retry(ontoma_instance.find_term, query=disease_name, code=False)
+    if not mappings and disease_id and ':' in disease_id:


This way of looking for an ontology prefix won't work for all sources. For example, Orphanet's are represented with underscores (Orphanet_2301).

I suggest translating all underscores to a colon for now and when the bugs are fixed at the OnToma level, manually account for all the different ontology prefixes.

tskir · 2021-09-15

You're right, thank you for pointing it out. Once we fix OnToma properly, this problem will disappear, because OnToma already performs identifier normalisation internally.
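
As an interim illustration of the underscore suggestion, a minimal sketch might look like the following. The helper names normalise_identifier and has_namespace are hypothetical and not part of OnToma or of this PR. The sketch only rewrites the first underscore that follows an alphanumeric prefix, which is slightly narrower than replacing every underscore.

```python
import re

# Matches identifiers written as 'Prefix_LocalId', e.g. 'Orphanet_2301'.
_UNDERSCORE_ID = re.compile(r'^([A-Za-z][A-Za-z0-9]*)_(\w+)$')


def normalise_identifier(identifier):
    """Rewrite 'Prefix_123' as 'Prefix:123'; return anything else unchanged."""
    if not identifier:
        return identifier
    match = _UNDERSCORE_ID.match(identifier)
    if match:
        return f'{match.group(1)}:{match.group(2)}'
    return identifier


def has_namespace(identifier):
    """True if the identifier carries an ontology prefix after normalisation."""
    return bool(identifier) and ':' in normalise_identifier(identifier)


print(normalise_identifier('Orphanet_2301'))  # Orphanet:2301
print(has_namespace('614894'))                # False
```

A check like has_namespace could stand in for the plain ':' test in the diff above until OnToma's own normalisation is used.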

tskir requested review from DSuveges and ireneisdoomed September 8, 2021 09:00

ireneisdoomed reviewed Sep 8, 2021

DSuveges requested changes Sep 8, 2021

tskir added 7 commits September 9, 2021 17:49

Update execution environments

e23f1a5

Implement OnToma mapping for PhenoDigm

9042668

Added the module for OnToma mapping

6394b03

Simplify OnToma instance construction

7d5859d

Fix PySpark dataframe construction issue

69ce776

Implement EFO mapping for PanelApp

7599e82

Address review comments

b103ca8

tskir force-pushed the tskir-1703-ontoma-implementation branch from 101c532 to b103ca8 Compare September 9, 2021 14:49

tskir requested review from ireneisdoomed and DSuveges September 9, 2021 14:51

tskir changed the title ~~General OnToma module for PySpark parsers + PhenoDigm implementation~~ General OnToma module for PySpark parsers + PhenoDigm & PanelApp implementation Sep 9, 2021

ireneisdoomed requested changes Sep 9, 2021

Improve performance of the temporary retry policy

2707c71

Workaround for Pandas null conversion bug

d36380d

tskir requested a review from ireneisdoomed September 10, 2021 07:27

ireneisdoomed reviewed Sep 10, 2021

ireneisdoomed requested changes Sep 13, 2021

Update common/ontology.py

e8b66bc

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>

tskir requested a review from ireneisdoomed September 15, 2021 07:16

ireneisdoomed approved these changes Sep 15, 2021

ireneisdoomed merged commit fc48086 into master Sep 15, 2021

ireneisdoomed deleted the tskir-1703-ontoma-implementation branch September 15, 2021 07:40

ireneisdoomed added a commit that referenced this pull request Feb 11, 2025

Merge pull request #94 from opentargets/tskir-1703-ontoma-implementation

1292b7a

General OnToma module for PySpark parsers + PhenoDigm & PanelApp implementation
