Skip to content

General OnToma module for PySpark parsers + PhenoDigm & PanelApp implementation #94

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Sep 15, 2021

Conversation

tskir
Copy link
Contributor

@tskir tskir commented Sep 8, 2021

No description provided.

Copy link
Contributor

@DSuveges DSuveges left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All looks good and reasonable, I have only a minor comment/question.

@ireneisdoomed
Copy link
Contributor

I have just submitted a PR to your branch that uses the ontology.py common util. I have been able to successfully run the scripts using your class with a very similar number of generated evidence strings, although I have some notes:

  • As I mentioned in the previous comment, I explicitly caught the AttributeError exceptions to address the cases where diseaseFromSourceId is not there.
  • In the current implementation to avoid running into the Zooma API error we are retrying the query a maximum of 5 times. This causes the disease mapping to be extremely time consuming, even more than with the old OnToma. It took almost 4 hours to generate Gene2Phenotype's 4000 evidence strings. This is also causing trouble for PanelApp's parser.
  • There is a bug in the current class where whenever mapping is not successful, the returned mapping is diseaseFromSourceMappedId == 'nan'

Orphanet and Clingen runtimes were fantastic, you were right about pandarallel's performance.
I have the feeling that these mappings are not throwing errors because the labels are very clean (Orphanet and MONDO labels, respectively). By the time the query is more difficult, e.g. with PanelApp labels, the process is everlasting.

@tskir
Copy link
Contributor Author

tskir commented Sep 9, 2021

@ireneisdoomed Great, thank you for all of the comments & for testing the module! I'll take a look at your PR right away.

Regarding the performance issue, I presume this is only because records with missing diseaseFromSourceId are being needlessly retried. Once this is fixed (either way), I imagine the problem should disappear.

@tskir tskir force-pushed the tskir-1703-ontoma-implementation branch from 101c532 to b103ca8 Compare September 9, 2021 14:49
@tskir tskir changed the title General OnToma module for PySpark parsers + PhenoDigm implementation General OnToma module for PySpark parsers + PhenoDigm & PanelApp implementation Sep 9, 2021
Copy link
Contributor

@ireneisdoomed ireneisdoomed left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you confirm the bug to fix diseaseFromSourceMappedId == 'nan' cases is not yet implemented?

On the performance issue, I don't think it's -just- a matter of null diseaseFromSourceId. You mentioned yesterday there might be problems with strings containing semicolons.

I tested the class using this df:

+-------------------------------------------------------------------------------------------------+-------------------+
|diseaseFromSource                                                                                |diseaseFromSourceId|
+-------------------------------------------------------------------------------------------------+-------------------+
|MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE|614894             |
+-------------------------------------------------------------------------------------------------+-------------------+

Here's how the disease mapping step behaves: the query is retried 5 times returning nan. It seems that it doesn't even try with the ID.

INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO     - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
ERROR:root:OnToma lookup failed for 'MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE' / '614894'

I tested the semicolon hypothesis by removing them from the string with the same behaviour and result.

When you were developing OnToma v1, I remember you were using PanelApp's labels for benchmarking, which are also very dirty. Did you experience such performance issues then? I want to know if it's a problem with OnToma or with the implementation of the tool.

@ireneisdoomed
Copy link
Contributor

I think I know what the problem is, at least what is happening on my side.

def _ontoma_udf(row, ontoma_instance):
     disease_name, disease_id = row['diseaseFromSource'], row['diseaseFromSourceId']
     for attempt in range(1, ONTOMA_MAX_ATTEMPTS + 1):
         # Try to map first by disease name (because that branch of OnToma is more stable), then by disease ID.
         try:
             mappings = []
             if disease_name:
                 mappings = ontoma_instance.find_term(query=disease_name, code=False)
             if disease_id and not mappings:
                 mappings = ontoma_instance.find_term(query=disease_id, code=True)
             return [m.id_ot_schema for m in mappings]
         except:
             # If this is not the last attempt, wait until the next one
             if attempt != ONTOMA_MAX_ATTEMPTS:
                 time.sleep(10 + 30 * random.random())
     logging.error(f'OnToma lookup failed for {disease_name!r} / {disease_id!r}')
     return []

Every time a label is queried without a valid result (mappings = []) we are falling in the second if block, querying using the ID. This query fails causing it to fall in the exception block, which is the one causing the delay.
This is the traceback of the exception:

HTTPError: 500 Server Error: Internal Server Error for url: https://www.ebi.ac.uk/spot/oxo/api/search

There are two problems here:

  1. Related with Ontoma. The query is not failing because of the label, but because of the ID. When using the steps for mapping codes (eg. 6148940), this will always fail because no ontology prefix is given. On the contrary, otmap.find_term('ORDO:455', code=True) works great. This scenario should be fixed in OnToma.
  2. Related with the ontology class. As the previous error will always happen, there's no point in retrying this query. This is what is making the process extremely inefficient. We should stop querying IDs without a prefix.

@tskir
Copy link
Contributor Author

tskir commented Sep 10, 2021

@ireneisdoomed Thank you for the additional comments!

Regarding performance issues, there are several things to unpack:

  1. The entire semicolon thing is only a hypothesis; I don't yet know what exactly causes ZOOMA lookup to fail in some cases. It is true that retrying those queries will have some performance hit, however, it's important to note that (1) those cases are very rare and (2) we can't easily distinguish them from other failure modes such as transient ZOOMA server errors. I will investigate this issue separately and report it to SPOT. In the meantime, retrying those queries with some delay seems like a sensible approach. However, to make the performance hit of even those rare cases less severe, I have reduced the number of attempts and intervals between them in the latest version of this PR.
  2. Indeed, if lookup by label fails, then lookup by ID is not performed at all. I agree this is not ideal, and this is changed in the latest version of this PR, with lookups by label and ID being retried separately. The long term solution will be to implement proper retry policy directly within OnToma.
  3. Identifier queries without an ontology namespace, such as 6148940, are not and should not be supported by OnToma, because this does not represent a complete description of an ontology term. If it is known that the identifiers in the data definitely come from OMIM, this should be pre-processed before feeding into OnToma by adding OMIM: prefix. However, this is indeed a case which should fail fast and not attempt to retry anything. This will be eventually solved properly by the OnToma internal retry policy, but for now I have added an additional check to not even attempt to map identifiers without a namespace.

I've pushed all related updates just now: 2707c71.

As for the nan issue, I'm still investigating. It strangely happens only for PhenoDigm, but not for PanelApp.

@tskir
Copy link
Contributor Author

tskir commented Sep 10, 2021

@ireneisdoomed The nan thing turned out to be an open bug in Pandas: pandas-dev/pandas#25353

For now, fixed it using a workaround: d36380d

All pending comments from this PR are now addressed, resubmitting for review. PhenoDigm evidence generation is underway, should take about an hour.

@tskir tskir requested a review from ireneisdoomed September 10, 2021 07:27
mappings = []
if disease_name:
mappings = _simple_retry(ontoma_instance.find_term, query=disease_name, code=False)
if not mappings and disease_id and ':' in disease_id:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This way of looking for an ontology prefix won't work for all sources. For example, Orphanet's are represented with underscores (Orphanet_2301).

I suggest translating all underscores to a colon for now and when the bugs are fixed at the OnToma level, manually account for all the different ontology prefixes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, thank you for pointing it out. Once we fix OnToma properly, this problem will disappear, because OnToma already performs identifier normalisation internally

Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>
@tskir tskir requested a review from ireneisdoomed September 15, 2021 07:16
@ireneisdoomed ireneisdoomed merged commit fc48086 into master Sep 15, 2021
@ireneisdoomed ireneisdoomed deleted the tskir-1703-ontoma-implementation branch September 15, 2021 07:40
ireneisdoomed added a commit that referenced this pull request Feb 11, 2025
General OnToma module for PySpark parsers + PhenoDigm & PanelApp implementation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants