Handling letter-case in IDs in a consistent, easy-to-maintain way #735

colleenXu · 2023-10-04T07:29:37Z

Background

During the JQ work, we encountered an issue with an external API (CTD) and the letter-case of the bioentity IDs in its responses (ref: Jackson's comment and my reply):

- KEGG.PATHWAY IDs should have lower-case letters (ex: hsa05323), but our api-response-transform module code was transforming the response's ID strings to all-caps.
- IDs in other namespaces should have upper-case letters, but all bioentity IDs in CTD responses had lower-case letters. That's why the all-caps transformation was useful in those other cases...

Issue

We realized that this was a larger issue (ref: my braindump here, a "current situation" post here):

- what should be the identifier-case formats for each of the namespaces we use?
- handling these identifier-case formats in a consistent way, in one place that's easy to update and maintain. Right now, it may be hidden in some api-response-transform modules.
  - we should be able to simplify: it seems like most ID namespaces with letters in their identifiers use all-caps, with a few exceptions that use all-lowercase. But we'd need to check first (see first main point)
  - we'd want to confirm: what code is related to identifier-case format? aka what would we refactor? where is this info used?
    - definitely in the api-response-transform module, basically every transformer
    - it seems like the rest of BTE is working fine...so maybe other modules aren't affected?
    - maybe this info could be useful for building sub-queries to APIs (templating in SmartAPI yaml request info)?
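To make the "one place that's easy to update" idea concrete, here is a minimal sketch of a central rule table with a default plus per-namespace exceptions. Everything in it is an assumption for illustration: the names `CaseRule`, `CASE_RULE_EXCEPTIONS`, and `normalizeCurieCase` are hypothetical and not existing BTE code, the all-caps default reflects the unconfirmed observation above, and the `FB` entry is based only on the mixed-case FlyBase example that appears later in this thread.

```typescript
// Sketch only: rule names and table entries are illustrative assumptions.
type CaseRule = "upper" | "lower" | "preserve";

// Assumed default, based on the (unverified) observation that most
// namespaces with letters in their IDs use all-caps.
const DEFAULT_CASE_RULE: CaseRule = "upper";

// Exceptions keyed by CURIE prefix. This is the single place to update.
const CASE_RULE_EXCEPTIONS = new Map<string, CaseRule>([
  ["KEGG.PATHWAY", "lower"], // ex: hsa05323
  ["FB", "preserve"], // mixed-case IDs, ex: FBgn0266114
]);

// Apply the rule for the CURIE's prefix to the local ID only;
// the prefix itself is left untouched.
function normalizeCurieCase(curie: string): string {
  const sep = curie.indexOf(":");
  if (sep === -1) return curie; // not a CURIE, leave as-is
  const prefix = curie.slice(0, sep);
  const localId = curie.slice(sep + 1);
  const rule = CASE_RULE_EXCEPTIONS.get(prefix) ?? DEFAULT_CASE_RULE;
  if (rule === "upper") return `${prefix}:${localId.toUpperCase()}`;
  if (rule === "lower") return `${prefix}:${localId.toLowerCase()}`;
  return curie; // "preserve"
}
```

With a table like this, each transformer could call one shared function instead of applying its own blanket `toUpperCase()`. Note that "preserve" cannot repair an ID that arrives in the wrong case; it only avoids making things worse.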


colleenXu · 2023-10-04T07:40:37Z

The first step is figuring out the identifier-case formats for each of the namespaces we use. I'm likely the lead on this.

It'll take a bit of time, but not be too difficult, to review the namespaces we use and find those with lower-case letters in their IDs (aka the exceptions). (It's unnecessary to make a list of all namespaces we use, which ones of those have IDs w/ letters, which ones of those have all-caps...)

But there's an issue that I think needs untangling: I'm not clear on Translator's standards for identifier-case format. It's not clear from a quick look at the biolink-model repo...

- I get the sense that we're supposed to follow the bioregistry for ID-value format...
  - but it can be confusing to find the ID-value format by namespace-name/prefix because the biolink-model often uses namespace-names/prefixes that are different from bioregistry's
  - an example of this confusion is "Confusion on PMC vs PMCID" (biolink/biolink-model#1366): do the ID-values start with PMC or not?
- we are also following the identifier-case format in the NodeNorm responses...but that doesn't help us when we're using ID namespaces that aren't covered by NodeNorm...
- we're probably also following some historic standards (from back when we used our own id-resolver setup / based on the core BioThings APIs / based on how the resource provides their IDs)
- I dunno how stable these standards are. That's a reason to make whatever we code easy to update / maintain...

gaurav · 2023-10-08T04:41:55Z

FWIW, NodeNorm doesn't expect identifiers to be purely numerical. Here is the current distribution of CURIE prefixes with non-numerical identifiers in NodeNorm:

Count	Prefix	Example
4	KEGG.REACTION	KEGG.REACTION:R06368
6	KEGG.DISEASE	KEGG.DISEASE:H00484
22	TCDB	TCDB:3.A.1.105.1
175	PANTHER.PATHWAY	PANTHER.PATHWAY:P06210
222	EC	EC:2.1.1.n11
271	OMIM	OMIM:PS278300
1244	ComplexPortal	ComplexPortal:CPX-1047
2438	ICD10	ICD10:H44.44
7153	SGD	SGD:S000028545
13892	dictyBase	dictyBase:DDB_G0289229
14544	DRUGBANK	DRUGBANK:DB16588
19277	KEGG.COMPOUND	KEGG.COMPOUND:C18469
26128	PANTHER.FAMILY	PANTHER.FAMILY:PTHR24082:SF38
30248	SMPDB	SMPDB:SMP0052819
30293	FB	FB:FBgn0266114
38059	ZFIN	ZFIN:ZDB-LINCRNAG-131127-634
48781	WormBase	WormBase:WBGene00172747
50770	NCIT	NCIT:C123904
109255	REACT	REACT:R-HSA-5617833.2
119457	UNII	UNII:GC91Z6YS52
171624	PR	PR:Q8TF17-5
217920	HMDB	HMDB:HMDB0014933
341463	MESH	MESH:C000609086
2399666	CHEMBL.COMPOUND	CHEMBL.COMPOUND:CHEMBL4247838
3195515	UMLS	UMLS:C2334787
32043961	ENSEMBL	ENSEMBL:ENSCURG00000012216
110286527	INCHIKEY	INCHIKEY:WCVKZUDHHMZXCB-UHFFFAOYSA-N
248913563	UniProtKB	UniProtKB:A0A8H5S4H9

If the list of all those identifiers will be useful to you (it's 2.9G compressed), please let me know and I can send it over!

colleenXu · 2023-10-09T18:58:47Z

@gaurav

This table of namespaces + example IDs is great. We don't need a list of all IDs for a namespace; the examples here are fine.

colleenXu · 2023-10-11T05:59:27Z

Note to self: there are examples where there's a mix of lettercase in the ID itself - FB (flybase) and WormBase. See Gaurav's list above.
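A quick way to spot such cases is to classify the letter-case of each example's local ID. The sketch below is only an illustration: `classifyLocalIdCase` is a hypothetical helper, the inputs are a handful of example CURIEs copied from Gaurav's table, and a single example per prefix cannot prove the rule for a whole namespace.

```typescript
type LetterCase = "upper" | "lower" | "mixed" | "none";

// Classify the letters in the local ID (everything after the first colon).
// Only ASCII letters are considered; digits and punctuation are ignored.
function classifyLocalIdCase(curie: string): LetterCase {
  const sep = curie.indexOf(":");
  const localId = sep === -1 ? curie : curie.slice(sep + 1);
  const hasUpper = /[A-Z]/.test(localId);
  const hasLower = /[a-z]/.test(localId);
  if (hasUpper && hasLower) return "mixed";
  if (hasUpper) return "upper";
  if (hasLower) return "lower";
  return "none";
}

// A few example CURIEs taken from the NodeNorm table above.
const exampleCuries: string[] = [
  "KEGG.REACTION:R06368",
  "EC:2.1.1.n11",
  "DRUGBANK:DB16588",
  "FB:FBgn0266114",
  "WormBase:WBGene00172747",
  "PANTHER.FAMILY:PTHR24082:SF38",
];

// Keep only the examples whose local ID mixes upper- and lower-case letters.
const mixedCaseExamples = exampleCuries.filter(
  (curie) => classifyLocalIdCase(curie) === "mixed",
);
```

For these inputs, `mixedCaseExamples` contains only the FB and WormBase entries, and the EC example is the only one classified as "lower".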

colleenXu · 2023-10-16T18:51:56Z

I had a discussion with Sierra Moxon (Translator data-modeling team), on my concerns with Translator standards:

- short answer: use bioregistry as the authoritative source for "format of IDs" (Pattern for Local Unique Identifiers, Example Local Unique Identifier)
  - I think it can still be useful to see how the "original namespace / resource" formats their IDs
- the wider issue can be described as "what is the regex pattern for local unique identifiers"
- if I notice issues with this method, bring it up with data-modeling
  - it may be a one-off, but I brought one up already: "PMC/PMCID aren't identical in biolink-model but they are in bioregistry, which makes it unclear whether these IDs in Translator should start with 'PMC' or not"
- prefixes for namespaces are a separate issue. Generally, biolink-model is the authoritative resource for the prefix-format. If it's not there, ask their team (note that they may defer to bioregistry's format).
  - ontologies tend to have all-caps prefixes; non-ontologies (structured vocabs, resource-specific namespaces) don't
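If the format question is framed as "what is the regex pattern for local unique identifiers", a check against per-prefix patterns could look roughly like the sketch below. The patterns are assumptions written only to match example IDs from this thread; they are NOT copied from bioregistry, and a real table would need to be filled in from bioregistry's "Pattern for Local Unique Identifiers" field. `LOCAL_ID_PATTERNS` and `checkLocalId` are hypothetical names.

```typescript
// Illustrative patterns only: written to fit examples in this thread,
// not taken from bioregistry.
const LOCAL_ID_PATTERNS = new Map<string, RegExp>([
  ["KEGG.PATHWAY", /^[a-z]+\d+$/], // ex: hsa05323
  ["DRUGBANK", /^DB\d+$/], // ex: DB16588
]);

type CheckResult = "valid" | "invalid" | "unknown-prefix";

// Check a CURIE's local ID against the pattern registered for its prefix.
function checkLocalId(curie: string): CheckResult {
  const sep = curie.indexOf(":");
  if (sep === -1) return "invalid"; // no prefix at all
  const pattern = LOCAL_ID_PATTERNS.get(curie.slice(0, sep));
  if (pattern === undefined) return "unknown-prefix";
  return pattern.test(curie.slice(sep + 1)) ? "valid" : "invalid";
}
```

Since the patterns are case-sensitive, a check like this would flag `KEGG.PATHWAY:HSA05323` as invalid, which is the kind of letter-case problem described in the Background.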

However, it's unclear how to resolve the issues/discrepancies (Translator-wide):

- who's using this namespace / is affected? (not easy to tell when the namespace isn't used for nodes/bioentities)
- getting everyone to agree on what to do / when. Although if the UI has requirements...that helps a lot
- finding these issues in the first place can be tricky (especially Translator-wide)
- standards may not be stable: the resource itself can change its ID formats / prefix, or bioregistry can update

colleenXu added the enhancement label Oct 4, 2023

colleenXu mentioned this issue Oct 11, 2023

AGR API Yaml NCATS-Tangerine/translator-api-registry#127

Merged
