Pull in all of UMLS #119
We could make use of the semantic types and biolink mappings, but I would want to review those pretty carefully first.
@colleenXu can you comment on any particular types of entities that give you the most trouble?
I didn't find as many examples as I expected but here are some:
SEMMEDDB has associations with outdated identifiers, so these show up as well... EDIT: the semmeddb data files do have pipe-delimited identifiers, so basically mapping to entrez gene IDs...
More examples that SRI Node Normalizer doesn't fetch labels for. These map to biolink PhysiologicalProcess or MolecularActivity...
More examples of semmeddb semantic types already mentioned:
from other semmeddb semantic types:
Plant is particularly useful since it's used to annotate supplements in idisk...
Many of the missing UMLS IDs are taxa (plants, birds, etc.). It looks like there are good mappings in the Metathesaurus to both MeSH and NCBI.
Looking at some aapp types that don't currently map to other things in nodenorm, it looks like there are at least some decent mappings to MeSH. We should review whether we want these mappings; I seem to recall them occasionally giving some trouble...
Here's one way in which we could proceed:
@cbizon Do you think this would work?
Yes, I think this is a very good plan.
Note that there are UMLS IDs that seem to map to more than one UMLS semantic type. For example, rosiglitazone maps to both Organic Chemical and Pharmacologic Substance.
Sorry it took me a while to get back to this! It looks like there are 1,339,426 UMLS IDs in the latest Babel run. I found 1,037,476 UMLS IDs not already present in Babel that have a single Biolink type. There were also 3,110,863 UMLS IDs without a Biolink type, which are:
I'll be trying to map those to Biolink types next. There are also some UMLS IDs that have multiple Biolink types:
I suspect that I can remove those
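The three-way split described above (UMLS IDs with a single Biolink type, with none, and with several) can be sketched roughly as follows. This is a minimal illustration, not Babel's actual code: the function name, the example identifiers, and the semantic-type-to-Biolink mapping are all assumptions made for the example.

```python
# Hypothetical mapping from UMLS semantic type (TUI) to Biolink type.
# These assignments are assumptions for illustration, not a reviewed mapping.
SEMTYPE_TO_BIOLINK = {
    "T109": "biolink:SmallMolecule",     # Organic Chemical (assumed)
    "T121": "biolink:Drug",              # Pharmacologic Substance (assumed)
    "T031": "biolink:AnatomicalEntity",  # Body Substance (assumed)
}


def partition_by_biolink_type(umls_semtypes):
    """Split UMLS IDs into single-typed, multi-typed and untyped groups."""
    single, multiple, untyped = {}, {}, []
    for umls_id, semtypes in umls_semtypes.items():
        # Collect the distinct Biolink types reachable from this ID's semantic types.
        biolink_types = {
            SEMTYPE_TO_BIOLINK[s] for s in semtypes if s in SEMTYPE_TO_BIOLINK
        }
        if not biolink_types:
            untyped.append(umls_id)
        elif len(biolink_types) == 1:
            single[umls_id] = next(iter(biolink_types))
        else:
            multiple[umls_id] = sorted(biolink_types)
    return single, multiple, untyped


# Made-up identifiers; EXAMPLE_A mimics the rosiglitazone case with two semantic types.
example = {
    "UMLS:EXAMPLE_A": {"T109", "T121"},
    "UMLS:EXAMPLE_B": {"T031"},
    "UMLS:EXAMPLE_C": {"T167"},  # Substance: no Biolink mapping in this sketch
}
single, multiple, untyped = partition_by_biolink_type(example)
```

IDs that land in `multiple` are the ones that would need a tie-breaking rule or manual review before they could be given one type.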
This all looks good to me. I do have a few questions. For all the different organism types (eukaryote, bird, etc.), would it be feasible to go ahead and include them in the taxon concordance? I don't really understand why C1971594 is a device. Also, I would say that BZl101 is a small molecule but not a drug (in the Biolink sense of drug) unless we are doing molecule/drug conflation. That said, the numbers on these are small, and I'm not concerned enough about them to hold this up.
This might be doable by calling umls.write_umls_ids() and sending in the identifiers we need from the missing table above. Not worth more than a day.
After the fixes described above, we're now down to 36313 UMLS IDs without UMLS types and 0 UMLS IDs with multiple UMLS types. The UMLS IDs without types are:
The full list is available on Hatteras at /projects/babel/babel-outputs/2022sep6/reports/umls.txt. I'll dig into this a bit more. At a quick look, it looks like T031 Body Substance might need to be added to anatomy (it includes things like peritoneal fluid, aqueous humor, bronchoalveolar lavage fluid), but T167 Substance doesn't (it includes things like elementary particles, fossils and hashish).
We identified many UMLS IDs that correspond to taxa but aren't currently imported in TranslatorSRI/NodeNormalization#119 (comment). This PR updates the taxon compendia to include taxa from UMLS. Since Biolink Model v3.0.2 doesn't include UMLS as an identifier prefix for OrganismTaxon, this PR adds an `extra_prefixes` argument to `write_compendium()` so that `UMLS` can be added to that list. However, `UMLS` has since been added to the Biolink Model OrganismTaxon class (biolink/biolink-model#1084), so we don't actually use this functionality here. I'm leaving it in since we might need it in the future.
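As a rough sketch of the idea behind `extra_prefixes`: the prefixes the Biolink Model allows for a class are extended with any extras the caller passes in. The helper below is hypothetical and is not the real `write_compendium()` signature; the prefix list in the example is also an assumption.

```python
def allowed_prefixes(biolink_prefixes, extra_prefixes=None):
    """Return the model's prefixes followed by any extras not already present.

    Hypothetical helper for illustration; Babel's write_compendium() may
    handle this differently.
    """
    prefixes = list(biolink_prefixes)
    for prefix in extra_prefixes or []:
        if prefix not in prefixes:
            prefixes.append(prefix)
    return prefixes


# Assumed prefix list for a model version that lacks UMLS for this class.
taxon_prefixes = allowed_prefixes(["NCBITaxon", "MESH"], extra_prefixes=["UMLS"])
# Once the model itself lists UMLS, passing it again changes nothing.
unchanged = allowed_prefixes(["NCBITaxon", "MESH", "UMLS"], extra_prefixes=["UMLS"])
```

Appending extras after the model's own prefixes keeps the model's preference order intact, which is why the argument becomes a no-op once the model includes `UMLS`.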
This PR will implement the plan in TranslatorSRI/NodeNormalization#119 (comment) to (1) create a compendium of UMLS IDs, labels and types for UMLS IDs that weren't included in other compendia, and (2) produce a report that will help us find sections of the UMLS that should be incorporated into Babel in other compendia. It also fixes a minor bug in which "UMLS" was not listed as a prerequisite for "chemical", which caused "chemical" in some cases to run before the UMLS ids, labels and synonyms had been produced.
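The prerequisite bug can be illustrated with a small dependency graph: unless "chemical" declares "UMLS" as a prerequisite, nothing forces the UMLS step to finish first. This sketch uses Python's standard `graphlib` purely for illustration; the step names and the `run_order` helper are assumptions, and Babel's pipeline may declare its prerequisites in a different way.

```python
from graphlib import TopologicalSorter


def run_order(prerequisites):
    """Return one valid execution order for steps given their prerequisites."""
    return list(TopologicalSorter(prerequisites).static_order())


# Before the fix (as described): "chemical" has no declared dependency on
# "umls", so a scheduler is free to run the two in either order.
before_fix = {"umls": set(), "chemical": set()}

# After the fix: "chemical" lists "umls" as a prerequisite, so "umls" always runs first.
after_fix = {"umls": set(), "chemical": {"umls"}}

order = run_order(after_fix)
```

With `after_fix`, every valid order places "umls" ahead of "chemical"; with `before_fix`, both orders are valid, which matches the "in some cases" behaviour described above.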
We're doing a bit better now:
From the KG2 graph in Matrix, here are the classes of what are not resolving:
Pretty similar set. I'm a little surprised by the pharmacologic and other substances though. I think we should work on this a bit.
Nodenormalizer has a few roles:
One problem is that people rely on it for 3, so if it can't do 1&2 for some chunk of data, then you don't get any labels.
Could we / should we pull in all of UMLS, either by 1) correctly merging everything in there into cliques/types or 2) just bringing in the unmapped parts as NamedThings so that we can at least serve labels?
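Option 2 could look something like the sketch below: look the identifier up in the existing cliques first, and fall back to a bare `biolink:NamedThing` entry carrying only the UMLS label. The function, the dictionary shapes, and the identifiers are all hypothetical; this is not NodeNormalization's actual API or response format.

```python
def lookup(curie, cliques, umls_labels):
    """Return a clique if one exists, else a label-only NamedThing fallback."""
    if curie in cliques:
        return cliques[curie]
    if curie in umls_labels:
        # No clique or specific type known: serve the label under the root type.
        return {
            "id": curie,
            "label": umls_labels[curie],
            "type": "biolink:NamedThing",
            "equivalent_identifiers": [curie],
        }
    return None


# Made-up data for illustration only.
cliques = {
    "UMLS:EXAMPLE_1": {
        "id": "MESH:EXAMPLE",
        "label": "example drug",
        "type": "biolink:Drug",
        "equivalent_identifiers": ["MESH:EXAMPLE", "UMLS:EXAMPLE_1"],
    }
}
umls_labels = {"UMLS:EXAMPLE_2": "peritoneal fluid"}

mapped = lookup("UMLS:EXAMPLE_1", cliques, umls_labels)
fallback = lookup("UMLS:EXAMPLE_2", cliques, umls_labels)
missing = lookup("UMLS:EXAMPLE_3", cliques, umls_labels)
```

The fallback only guarantees a label; callers that rely on equivalent identifiers or a specific type would still see the difference between a real clique and a `NamedThing` placeholder.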