Retired CUIs in semmedVER43_2022_R_PREDICATION.csv
#2
@andrewsu initiated the following policies toward retired CUIs and piped CUIs:
Confirmed.
Confirmed.
Discussion Pending.
Negative.
New analysis required. |
Piped vs Non-Piped Rows
File
Rows with Retired CUIs
The distributions of the counts of rows containing retired CUIs among the two types of rows are listed below, where the ratios are calculated against the total number of rows (
Note that if we create a new predication for each mapped CUI, those 150,634 rows with one-to-many mapped, non-piped CUIs will expand to
Impact of Splitting Policies on Piped CUIs
Note that in this section, retired CUIs (or the replacement plans) are not taken into consideration. The current splitting policies were proposed here:
Following these policies, the
If we change the first policy and do not discard any of the numeric IDs, we will find
In summary:
P.S. current https://biothings.ncats.io/semmeddb API has |
Great, let's handle these by group:
Also, regarding piping, there are 6,993,449 predications with some piping. Can you calculate/estimate the number of predications those would turn into if you created a new predication for each ID in the pipe? |
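To make the request concrete, here is a minimal sketch of one way to estimate the expansion. It assumes that each piped predication expands into one record per (subject ID, object ID) combination; the function names and the toy rows are hypothetical and are not taken from the actual pipeline or data.

```python
def expanded_count(subject_id: str, object_id: str) -> int:
    """Records produced if every ID in each pipe gets its own predication."""
    return len(subject_id.split("|")) * len(object_id.split("|"))


def estimate_expansion(rows):
    """Return (number of piped rows, number of rows after expanding them).

    `rows` is an iterable of (subject_id, object_id) string pairs.
    Rows without any pipe are ignored.
    """
    piped = [(s, o) for s, o in rows if "|" in s or "|" in o]
    return len(piped), sum(expanded_count(s, o) for s, o in piped)


# Toy rows for illustration only (not real SemMedDB predications).
rows = [
    ("C0085828|2353|2354", "C0033684"),  # 3 subject IDs x 1 object ID
    ("C0056207|3075", "C0007082|1048"),  # 2 subject IDs x 2 object IDs
    ("C0033371", "C0017968"),            # not piped, ignored
]
n_piped, n_expanded = estimate_expansion(rows)
print(n_piped, n_expanded)
```

Multiplying the two sides also covers the case where both the subject and the object are piped, which is where the expansion grows fastest.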
Using the same numbering as Andrew did in his post:
|
With piping:
Point A
@andrewsu the scope of the issue was somewhat discussed here. However, the full effect on predications wasn't clear. For example, are there cases where both the subject + object have piped IDs - and how much expansion would then happen?
Point B
I think there's still some vagueness: are there any combos of IDs in a piped thing where the IDs represent "equivalent" things, to the point where we don't want to expand to multiple records? For example: when there's 1 Entrez ID and 1 CUI, are those two IDs "equivalent" enough that we just want a record with 1 of the IDs (probably the Entrez one)? Maybe one way to tell "equivalent" is when it's easy to find a cross-mapping between the Entrez ID and the CUI (in MyGene for instance)?
Point C
On the other hand, I'm starting to be less concerned about the chance of having "duplicated information" from expanding piped IDs that are basically equivalent into multiple records (each record = 1 combo of subject ID and object ID). At least, I think BTE can kinda handle it. For example, semmeddb currently has 3 records corresponding to the exact same triple + pmid. But when BTE is queried for that triple (see query details below), the edge only has one instance of that PMID (8959933) in its response from querying only semmeddb through BTE (POST to http://localhost:3000/v1/smartapi/1d288b3a3caf75d541ffaae3aab386c8/query locally):
only one instance of PMID:8959933 in this response
|
@andrewsu @colleenXu, please find my updated comments above. |
Fantastic, I think we are very close here. @erikyao, In this comment, you mention there are three classes of piped IDs:
Can you post a sampling (maybe 20 examples) of the "1 UMLS + N Entrez" group? I'd just like to understand that group a bit better... |
1 UMLS + 13 Entrez, 7 examples
'C0074479|4489|4490|4493|4494|4495|4496|4498|4499|4500|4501|4543|56052|644314'
'Antigens,CD43|MT1A|MT1B|MT1E|MT1F|MT1G|MT1H|MT1JP|MT1M|MT1L|MT1X|MTNR1A|ALG1|MT1IP'
'C0682972|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'G-Protein-Coupled Receptors|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0597298|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Protein Isoforms|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0079427|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Tumor Suppressor Genes|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0017968|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Glycoproteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0033684|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Proteins|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
'C0033371|5241|5541|5555|140738|449619|619465|100616101|100616102|100616103|100775105|100862683|100862684|100862685'
'Prolactin|PGR|PR@|PRH2|TMEM37|ERVK-7|ERVK-8|ERVK-10|ERVK-9|ERVK-21|ERVK-18|ERVK-25|ERVK-24|ERVK-19'
1 UMLS + 8 Entrez, 4 examples
'C0002210|250|470|6590|10850|26033|27295|55226|80150'
'alpha-Fetoproteins|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0212691|1523|4791|4940|6490|9733|22974|27044|84164'
'lyt-10 protein|CUX1|NFKB2|OAS3|PMEL|SART3|TPX2|SND1|ASCC2'
'C0126732|250|470|6590|10850|26033|27295|55226|80150'
'I Kappa B-Alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
'C0600251|250|470|6590|10850|26033|27295|55226|80150'
'Interleukin-1 alpha|ALPP|ATHS|SLPI|CCL27|ATRNL1|PDLIM3|NAT10|ASRGL1'
1 UMLS + 5 Entrez, 3 examples
'C0085828|2353|2354|3725|3726|3727'
'Transcription Factor AP-1|FOS|FOSB|JUN|JUNB|JUND'
'C0083957|3854|3872|5126|5311|8535'
'Proprotein Convertase 2|KRT6B|KRT17|PCSK2|PKD2|CBX4'
'C0135615|3853|5122|7832|10120|57332'
'Proprotein Convertase 1|KRT6A|PCSK1|BTG2|ACTR1B|CBX8'
1 UMLS + 3 Entrez, 7 examples
'C1141639|1081|3342|93659'
'Human Chorionic Gonadotropin|CGA|HTC2|CGB5'
'C0007082|1048|1084|5670'
'Carcinoembryonic Antigen|CEACAM5|CEACAM3|PSG2'
'C0968902|2167|2971|7020'
'Transcription Factor AP-2 Alpha|FABP4|GTF3A|TFAP2A'
'C1335440|100616102|100862685|100862688'
'Polymerase Gene|ERVK-9|ERVK-19|ERVK-11'
'C1335439|100616102|100862685|100862688'
'Polymerase|ERVK-9|ERVK-19|ERVK-11'
'C0035681|100616102|100862685|100862688'
'DNA-Directed RNA Polymerase|ERVK-9|ERVK-19|ERVK-11'
'C0012892|100616102|100862685|100862688'
'DNA-Directed DNA Polymerase|ERVK-9|ERVK-19|ERVK-11' |
For "1 UMLS + N Entrez", it seems like the UMLS ID and the Entrez IDs are not equivalent. Then maybe we want to change the current splitting policy: "In cases where a UMLS CUI is followed by one or more numeric IDs (presumed to be NCBI Gene IDs) e.g., C0056207|3075, discard the numeric IDs and process as usual"? Change to not discarding the numeric IDs? Has "Point B" above been explored? I was wondering if the "1 UMLS + 1 Entrez" are equivalent. |
Perhaps a generic way of handling the case of "1 UMLS + N Entrez" (including "1 UMLS + 1 Entrez") is to keep all Entrez IDs and create multiple records unless an Entrez ID also maps to the UMLS ID according to the Node Normalizer. Thoughts? |
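A minimal sketch of one reading of this proposal, for discussion only. It assumes the equivalence information (which Entrez IDs map to which CUI) has already been fetched into a plain dictionary; `equivalent_genes` and `split_piped_id` are hypothetical names, and the lookup stands in for whatever the Node Normalizer would return. It also assumes that when an Entrez ID is equivalent to the CUI, the CUI record is the one that is kept.

```python
import re

CUI_PATTERN = re.compile(r"^C\d{7}$")


def split_piped_id(piped: str, equivalent_genes: dict) -> list:
    """Return the IDs that would each get their own record.

    For "1 UMLS + N Entrez" strings, keep the CUI and every Entrez ID,
    except Entrez IDs listed as equivalent to that CUI.
    `equivalent_genes` maps a CUI to a set of Entrez IDs (hypothetical
    stand-in for a Node Normalizer lookup).
    """
    parts = piped.split("|")
    cuis = [p for p in parts if CUI_PATTERN.match(p)]
    genes = [p for p in parts if p.isdigit()]
    if len(cuis) != 1:
        # Other shapes (e.g. several CUIs) are outside this sketch.
        return parts
    cui = cuis[0]
    same = equivalent_genes.get(cui, set())
    return [cui] + [g for g in genes if g not in same]


# No known equivalence: one record per ID.
print(split_piped_id("C1141639|1081|3342|93659", {}))
# Illustrative (made-up) equivalence: the Entrez ID is dropped.
print(split_piped_id("C0056207|3075", {"C0056207": {"3075"}}))
```

If the group decides to keep the Entrez ID instead of the CUI for equivalent pairs, only the final `return` line would need to change.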
I think it's an interesting idea. Would we want to use MyGene, rather than Node Normalizer? For example, one can query either the entrezgene field and then look at the umls field or vice versa... Here's an example using the
POST to https://mygene.info/v3/query?fields=entrezgene,umls,symbol,name,taxid:
Response. Notice that none of the umls ids returned match C0012892 / DNA-Directed DNA Polymerase
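For reference, a rough sketch of how such a check could be scripted. It only builds the request payload and inspects a response that has already been parsed from JSON, so it runs without network access. The payload layout (`q`, `scopes`, `fields`) and the `umls.cui` field layout are assumptions about the MyGene batch query API that should be verified against its documentation; the helper names and the mocked hits (including the CUI `C9999999`) are made up for illustration and are not real MyGene output.

```python
MYGENE_URL = "https://mygene.info/v3/query"
FIELDS = "entrezgene,umls,symbol,name,taxid"


def build_payload(entrez_ids):
    """Form data for a batch POST, e.g. requests.post(MYGENE_URL, data=payload)."""
    return {"q": ",".join(entrez_ids), "scopes": "entrezgene", "fields": FIELDS}


def cuis_in_hit(hit):
    """Collect CUIs from one hit; 'umls' is assumed to be a dict or a list of dicts."""
    umls = hit.get("umls", [])
    if isinstance(umls, dict):
        umls = [umls]
    return {u["cui"] for u in umls if "cui" in u}


def genes_matching_cui(hits, cui):
    """Entrez IDs whose hit lists the given CUI."""
    return [
        str(h["entrezgene"])
        for h in hits
        if "entrezgene" in h and cui in cuis_in_hit(h)
    ]


payload = build_payload(["100616102", "100862685", "100862688"])

# Mocked hits with a made-up CUI, standing in for a parsed JSON response.
mock_hits = [
    {"query": "100616102", "entrezgene": "100616102", "umls": {"cui": "C9999999"}},
    {"query": "100862685", "notfound": True},
]
print(genes_matching_cui(mock_hits, "C0012892"))  # no gene lists this CUI
```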
|
I think we should use Node Normalizer (assuming we can figure out batch querying via POST). Unless there is any other discussion or dissent, @erikyao please implement this behavior that I described in this comment. |
Noting how RTX-KG2 is doing it:
|
Hi @colleenXu, I think @andrewsu suggested replacement with all the mapped new IDs. Quote:
|
Given the small expansion in triples based on Yao's updated comment, yes, I think we proceed with the plan that @erikyao quoted in the comment above...
|
File semmedVER43_2022_R_PREDICATION.csv contains 117,589,597 rows. After removing rows with SUBJECT_NOVELTY == 0 or OBJECT_NOVELTY == 0, 81,282,024 rows remained. Among those rows, there are 303,080 unique subject CUIs and 262,268 unique object CUIs (piped CUIs decomposed and counted).

Following MRCUI.RRF data analysis, we found that, for subject CUIs, the counts and ratios of retired CUIs are:
and for object CUIs,
It's a safe bet to consider only the deleted and bijectively mapped CUIs. It's also worth considering only mappings with the SY relationship.
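As a rough illustration of that classification, the sketch below groups retired CUIs from MRCUI.RRF-style lines into deleted, one-to-one mapped, and one-to-many mapped, using only SY mappings. It assumes the pipe-delimited column order CUI1, VER, REL, RELA, MAPREASON, CUI2, MAPIN; the function name and the sample lines (with made-up CUIs) are hypothetical. "One-to-one" here means a retired CUI with a single SY target, which is looser than a strict bijection because two retired CUIs may share a target.

```python
from collections import defaultdict


def classify_retired(mrcui_lines):
    """Split retired CUIs into deleted, one-to-one and one-to-many (SY only)."""
    deleted = set()
    sy_targets = defaultdict(set)
    for line in mrcui_lines:
        # Lines end with a trailing "|", so only the first 7 fields are used.
        fields = line.rstrip("\n").split("|")[:7]
        cui1, _ver, rel, _rela, _reason, cui2, _mapin = fields
        if rel == "DEL":
            deleted.add(cui1)
        elif rel == "SY" and cui2:
            sy_targets[cui1].add(cui2)
    one_to_one = {c: next(iter(t)) for c, t in sy_targets.items() if len(t) == 1}
    one_to_many = {c: t for c, t in sy_targets.items() if len(t) > 1}
    return deleted, one_to_one, one_to_many


# Made-up sample lines in the assumed MRCUI.RRF layout.
sample = [
    "C9000001|2022AA|DEL|||||",
    "C9000002|2022AA|SY|||C9000010|Y|",
    "C9000003|2022AA|SY|||C9000011|Y|",
    "C9000003|2022AA|SY|||C9000012|Y|",
    "C9000004|2022AA|RO|||C9000013|Y|",  # non-SY mapping, ignored
]
deleted, one_to_one, one_to_many = classify_retired(sample)
print(deleted, one_to_one, one_to_many)
```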