Handling letter-case in IDs in a consistent, easy-to-maintain way #735

Open
colleenXu opened this issue Oct 4, 2023 · 5 comments

@colleenXu
Collaborator

colleenXu commented Oct 4, 2023

Background

During the JQ work, we encountered an issue with an external API (CTD) and the letter-case of the bioentity IDs in its responses (ref: Jackson's comment and my reply):

  • KEGG.PATHWAY IDs should have lower-case letters (ex: hsa05323) but our api-response-transform module code was transforming the response's ID strings to all-caps.
  • other namespaces' IDs should have upper-case letters, but all bioentity IDs in CTD responses had lower-case letters. That's why the all-caps transformation was useful in these other cases...
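
To make the problem concrete, here is a minimal Python sketch. It is not the actual api-response-transform code: the function name `blanket_upper` is hypothetical, and the lower-cased MESH value is an assumed example of what a CTD-style response might contain. It shows why a blanket all-caps transform helps for most namespaces but breaks KEGG.PATHWAY:

```python
def blanket_upper(curie: str) -> str:
    """Upper-case the local ID of a CURIE, leaving the prefix untouched."""
    prefix, _, local_id = curie.partition(":")
    return f"{prefix}:{local_id.upper()}"


# Assumed lower-cased input of the kind described above.
print(blanket_upper("MESH:c000609086"))        # MESH:C000609086 (desired)
print(blanket_upper("KEGG.PATHWAY:hsa05323"))  # KEGG.PATHWAY:HSA05323 (wrong, should stay lower-case)
```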

Issue

We realized that this was a larger issue (ref: my braindump here, a "current situation" post here):

  • what should be the identifier-case formats for each of the namespaces we use?
  • handling these identifier-case formats in a consistent way, in one place that's easy to update and maintain. Right now, it may be hidden in some api-response-transform modules.
    • we should be able to simplify: it seems like most ID namespaces with letters in their identifiers use all-caps, with a few exceptions that use all-lowercase. But we'd need to check first (see first main point)
    • we'd want to confirm: what code is related to identifier-case format? aka what would we refactor? where is this info used?
      • definitely in api-response-transform module, basically every transformer
      • it seems like the rest of BTE is working fine... so maybe other modules aren't affected?
      • maybe this info could be useful for building sub-queries to APIs (templating in SmartAPI yaml request info)?
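
One possible shape for the "one place that's easy to update" idea, as a hedged Python sketch rather than a proposal for the real module layout: a single set of exceptions, with all-caps as the default. The names `LOWER_CASE_PREFIXES` and `normalize_case` are hypothetical, and the exception set holds only the one namespace confirmed in this issue.

```python
# Namespaces whose local IDs use lower-case letters. Only KEGG.PATHWAY is
# confirmed in this issue; the rest would come from the namespace review.
LOWER_CASE_PREFIXES = frozenset({"KEGG.PATHWAY"})


def normalize_case(prefix: str, local_id: str) -> str:
    """Lower-case the ID for listed exceptions, upper-case it otherwise."""
    if prefix in LOWER_CASE_PREFIXES:
        return local_id.lower()
    return local_id.upper()


print(normalize_case("KEGG.PATHWAY", "HSA05323"))  # hsa05323
print(normalize_case("DRUGBANK", "db16588"))       # DB16588
```

With this layout, adding a newly found exception is a one-line change. An all-caps default would be wrong for any namespace whose IDs mix upper- and lower-case letters, so a real version would likely need a third "leave as-is" option.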
colleenXu added the enhancement label on Oct 4, 2023
@colleenXu
Collaborator Author

colleenXu commented Oct 4, 2023

The first step is figuring out the identifier-case formats for each of the namespaces we use. I'm likely the lead on this.

It'll take a bit of time, but shouldn't be too difficult, to review the namespaces we use and find those with lower-case letters in their IDs (aka the exceptions). (We don't need a full list of every namespace we use, which of those have IDs with letters, and which of those are all-caps...)

But there's an issue that I think needs untangling: I'm not clear on Translator's standards for identifier-case format. It's not clear from a quick look at the biolink-model repo...

  • I get the sense that we're supposed to follow the bioregistry for ID-value format...
    • but it can be confusing to find the ID-value format by namespace-name/prefix because the biolink-model often uses namespace-name/prefixes that are different from bioregistry's
    • an example of this confusion is biolink/biolink-model#1366, "Confusion on PMC vs PMCID" (do the ID-values start with PMC or not?)
  • we are also following the identifier-case format in the NodeNorm responses...but that doesn't help us when we're using ID namespaces that aren't covered by NodeNorm...
  • we're probably also following some historic standards (from back when we used our own id-resolver setup / based on the core BioThings APIs / based on how the resource provides their IDs)
  • I dunno how stable these standards are. That's a reason to make whatever we code easy to update / maintain...

@gaurav

gaurav commented Oct 8, 2023

FWIW, NodeNorm doesn't expect identifiers to be purely numerical. Here is the current distribution of CURIE prefixes with non-numerical identifiers in NodeNorm:

Count      Prefix           Example
4          KEGG.REACTION    KEGG.REACTION:R06368
6          KEGG.DISEASE     KEGG.DISEASE:H00484
22         TCDB             TCDB:3.A.1.105.1
175        PANTHER.PATHWAY  PANTHER.PATHWAY:P06210
222        EC               EC:2.1.1.n11
271        OMIM             OMIM:PS278300
1244       ComplexPortal    ComplexPortal:CPX-1047
2438       ICD10            ICD10:H44.44
7153       SGD              SGD:S000028545
13892      dictyBase        dictyBase:DDB_G0289229
14544      DRUGBANK         DRUGBANK:DB16588
19277      KEGG.COMPOUND    KEGG.COMPOUND:C18469
26128      PANTHER.FAMILY   PANTHER.FAMILY:PTHR24082:SF38
30248      SMPDB            SMPDB:SMP0052819
30293      FB               FB:FBgn0266114
38059      ZFIN             ZFIN:ZDB-LINCRNAG-131127-634
48781      WormBase         WormBase:WBGene00172747
50770      NCIT             NCIT:C123904
109255     REACT            REACT:R-HSA-5617833.2
119457     UNII             UNII:GC91Z6YS52
171624     PR               PR:Q8TF17-5
217920     HMDB             HMDB:HMDB0014933
341463     MESH             MESH:C000609086
2399666    CHEMBL.COMPOUND  CHEMBL.COMPOUND:CHEMBL4247838
3195515    UMLS             UMLS:C2334787
32043961   ENSEMBL          ENSEMBL:ENSCURG00000012216
110286527  INCHIKEY         INCHIKEY:WCVKZUDHHMZXCB-UHFFFAOYSA-N
248913563  UniProtKB        UniProtKB:A0A8H5S4H9

If the list of all those identifiers will be useful to you (it's 2.9G compressed), please let me know and I can send it over!

@colleenXu
Collaborator Author

@gaurav

This table of namespaces + example IDs is great. We don't need a list of all IDs for a namespace; the examples here are fine.

@colleenXu
Collaborator Author

Note to self: some IDs mix upper- and lower-case letters within the ID itself, e.g. FB (FlyBase, FBgn0266114) and WormBase (WBGene00172747). See Gaurav's list above.
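
A quick way to spot such cases, sketched in Python under the assumption that one example ID per namespace is representative (which may not hold). The helper name `letter_case` is hypothetical; the example IDs are copied from the table above.

```python
def letter_case(local_id: str) -> str:
    """Classify the letters in an ID as 'upper', 'lower', 'mixed', or 'none'."""
    letters = [ch for ch in local_id if ch.isalpha()]
    if not letters:
        return "none"
    if all(ch.isupper() for ch in letters):
        return "upper"
    if all(ch.islower() for ch in letters):
        return "lower"
    return "mixed"


# Example local IDs copied from the table above.
examples = {
    "FB": "FBgn0266114",
    "WormBase": "WBGene00172747",
    "DRUGBANK": "DB16588",
    "EC": "2.1.1.n11",
}
for prefix, local_id in examples.items():
    print(prefix, letter_case(local_id))
```

On these four examples, FB and WormBase come out as mixed, DRUGBANK as upper, and EC as lower, so a simple "all-caps unless listed as lower-case" rule would not cover every namespace.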

@colleenXu
Collaborator Author

colleenXu commented Oct 16, 2023

I had a discussion with Sierra Moxon (Translator data-modeling team) about my concerns with Translator standards:

  • short answer: use bioregistry as the authoritative source for "format of IDs" (Pattern for Local Unique Identifiers, Example Local Unique Identifier)
    • I think it can still be useful to see how the "original namespace / resource" formats their IDs
  • the wider issue can be described as "what is the regex pattern for local unique identifiers"
  • if I notice issues with this method, bring it up with data-modeling
    • it may be a one-off, but I brought one up already: "PMC/PMCID aren't identical in biolink-model but they are in bioregistry, which makes it unclear whether these IDs in Translator should start with 'PMC' or not"
  • prefixes for namespaces are a separate issue. Generally, biolink-model is the authoritative resource for the prefix-format. If it's not there, ask their team (note that they may defer to bioregistry's format).
    • ontologies tend to have all-caps prefixes, non-ontologies don't (structured vocabs, resource-specific namespaces)

However, it's unclear how to resolve the issues/discrepancies Translator-wide:

  • who's using this namespace / is affected? (not easy to tell when the namespace isn't used for nodes/bioentities)
  • getting everyone to agree on what to do and when, although it helps a lot if the UI has requirements
  • finding these issues in the first place can be tricky (especially Translator-wide)
  • standards may not be stable: resource itself can change its ID formats / prefix, or bioregistry can update
