General OnToma module for PySpark parsers + PhenoDigm & PanelApp implementation #94
All looks good and reasonable, I have only a minor comment/question.
I have just submitted a PR to your branch that uses the
Orphanet and Clingen runtimes were fantastic, you were right about |
@ireneisdoomed Great, thank you for all of the comments & for testing the module! I'll take a look at your PR right away. Regarding the performance issue, I presume this is only because records with missing |
101c532 to b103ca8
Can you confirm that the fix for the diseaseFromSourceMappedId == 'nan' cases is not yet implemented?
On the performance issue, I don't think it's -just- a matter of null diseaseFromSourceId. You mentioned yesterday there might be problems with strings containing semicolons.
I tested the class using this df:
+-------------------------------------------------------------------------------------------------+-------------------+
|diseaseFromSource |diseaseFromSourceId|
+-------------------------------------------------------------------------------------------------+-------------------+
|MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE|614894 |
+-------------------------------------------------------------------------------------------------+-------------------+
Here's how the disease mapping step behaves: the query is retried 5 times, returning nan. It seems that it doesn't even try with the ID.
INFO - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO - ontoma.interface - Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
INFO:ontoma.interface:Processed: MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE → []
ERROR:root:OnToma lookup failed for 'MONOCYTE AND DENDRITIC CELL DEFICIENCY, AUTOSOMAL RECESSIVE;;IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE' / '614894'
I tested the semicolon hypothesis by removing the semicolons from the string: the behaviour and the result were the same.
When you were developing OnToma v1, I remember you were using PanelApp's labels for benchmarking, which are also very dirty. Did you experience such performance issues then? I want to know if it's a problem with OnToma or with the implementation of the tool.
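A minimal sketch of the retry-then-fallback flow discussed in this comment, using a stub in place of OnToma. The names retry_lookup, map_disease and fake_find_term are hypothetical and are not the module's API; the sketch also assumes that a retry is triggered by an empty result, which the log suggests but the thread does not confirm. It illustrates one possible reason the ID is never tried: the ID lookup in the reviewed code is guarded by a ':' in disease_id check, and 614894 contains no colon.

```python
from typing import Callable, List, Optional


def retry_lookup(func: Callable[..., List[str]], attempts: int = 5, **kwargs) -> List[str]:
    """Call func up to `attempts` times, stopping at the first non-empty result.

    Assumption: an empty list counts as a failed attempt.
    """
    for _ in range(attempts):
        result = func(**kwargs)
        if result:
            return result
    return []


def map_disease(find_term: Callable[..., List[str]],
                disease_name: Optional[str],
                disease_id: Optional[str]) -> List[str]:
    """Try the disease name first, then fall back to the ID if it has a prefix."""
    mappings: List[str] = []
    if disease_name:
        mappings = retry_lookup(find_term, query=disease_name, code=False)
    # The fallback only fires for IDs containing a colon, so a bare
    # identifier such as '614894' is never looked up.
    if not mappings and disease_id and ':' in disease_id:
        mappings = retry_lookup(find_term, query=disease_id, code=True)
    return mappings


# Stub lookup that records every call and never finds a match.
calls = []


def fake_find_term(query: str, code: bool) -> List[str]:
    calls.append((query, code))
    return []


result = map_disease(fake_find_term, "IRF8 DEFICIENCY, AUTOSOMAL RECESSIVE", "614894")
print(result, len(calls))
```

With the stub, all recorded calls are name lookups (code=False); no call is made with the ID.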
I think I know what the problem is, at least what is happening on my side.
Every time a label is queried without a valid result (
There are two problems here:
@ireneisdoomed Thank you for the additional comments! Regarding performance issues, there are several things to unpack:
I've pushed all related updates just now: 2707c71. As for the |
@ireneisdoomed The … For now, fixed it using a workaround: d36380d. All pending comments from this PR are now addressed, resubmitting for review. PhenoDigm evidence generation is underway, should take about an hour.
mappings = []
if disease_name:
    mappings = _simple_retry(ontoma_instance.find_term, query=disease_name, code=False)
if not mappings and disease_id and ':' in disease_id:
This way of looking for an ontology prefix won't work for all sources. For example, Orphanet identifiers are written with underscores (Orphanet_2301).
I suggest translating all underscores to a colon for now and when the bugs are fixed at the OnToma level, manually account for all the different ontology prefixes.
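A minimal sketch of the interim workaround suggested here, assuming plain string identifiers. The helper names normalise_disease_id and has_ontology_prefix are hypothetical and not part of OnToma or the module under review.

```python
def normalise_disease_id(disease_id: str) -> str:
    """Translate underscores to colons, e.g. 'Orphanet_2301' -> 'Orphanet:2301'."""
    return disease_id.replace('_', ':')


def has_ontology_prefix(disease_id: str) -> bool:
    """Check for a prefix after normalisation, so both separators are accepted."""
    return ':' in normalise_disease_id(disease_id)


for raw in ("Orphanet_2301", "OMIM:614894", "614894"):
    print(raw, normalise_disease_id(raw), has_ontology_prefix(raw))
```

Identifiers that already use a colon pass through unchanged, and bare numeric identifiers are still reported as having no prefix.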
You're right, thank you for pointing it out. Once we fix OnToma properly, this problem will disappear, because OnToma already performs identifier normalisation internally.
Co-authored-by: Irene López <45119610+ireneisdoomed@users.noreply.github.com>