Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update PR regex #960

Merged
merged 5 commits into from
Oct 13, 2023
Merged

Update PR regex #960

merged 5 commits into from
Oct 13, 2023

Conversation

cthoyt
Copy link
Member

@cthoyt cthoyt commented Oct 12, 2023

Closes #959

@nataled can you provide a few more example local unique identifiers? I guess that the regex you supplied also includes UniProt isoforms

Closes #959

Co-Authored-By: Darren A. Natale <13770634+nataled@users.noreply.github.com>
@cthoyt
Copy link
Member Author

cthoyt commented Oct 12, 2023

@nataled it also appears that P0DP23 does not match your regex, is this because it only applies to a subset of UniProt?

Co-Authored-By: Darren A. Natale <13770634+nataled@users.noreply.github.com>
@nataled
Copy link
Contributor

nataled commented Oct 12, 2023

In addition to the one you already have (a nine-digit string), there are the following four examples:
Q15796
Q15796-1
A0A0G2QC33
A0A0G2QC33-2

I do not see how P0DP23 fails the regex:

^(\d+ covers the pure digit part
[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)? covers the six-place UniProtKB accessions
[A-NR-Z][0-9]([A-Z][A-Z0-9]{2}[0-9]){1,2}(-\d+)?)$ covers the ten-place UniProtKB accessions

Thus we need to look only at the six-place part:

[OPQ][0-9][A-Z0-9]{3}[0-9](-\d+)?
[OPQ][0-9][A-Z0-9][A-Z0-9][A-Z0-9][0-9](-\d+)?   (expanding the {3} for ease of aligning)
  P    0    D       P          2    3

Co-Authored-By: David Linke <2648874+dalito@users.noreply.github.com>
@cthoyt
Copy link
Member Author

cthoyt commented Oct 12, 2023

It appears the other ones also fail against this regex

@nataled
Copy link
Contributor

nataled commented Oct 12, 2023

What's crazy is that I pulled the regex from the OBO Foundry PURL system. In that context they do not fail.

@nataled
Copy link
Contributor

nataled commented Oct 12, 2023

I just tried it with a small script and they all pass. How are you assessing the pass/fail?

Co-Authored-By: David Linke <2648874+dalito@users.noreply.github.com>
@cthoyt
Copy link
Member Author

cthoyt commented Oct 13, 2023

The test run with Python's builtin unittest module's https://docs.python.org/3/library/unittest.html#unittest.TestCase.assertRegex. This apparently uses re.search. The error message is as follows:

    self.assertRegex(example, regex, msg=f"[{prefix}] invalid LUID: {example}")
E   AssertionError: Regex didn't match: '^(?:\\d{9}|[OPQ][0-9][A-Z0-9]{3}0-9?|[A-NR-Z]0-9{1,2}(?:-\\d+)?)$' not found in 'P0DP23' : [pr] invalid LUID: P0DP23

I confirmed that if you run the following, nothing is returned

>>> import re
>>> x = re.compile(r'^(?:\\d{9}|[OPQ][0-9][A-Z0-9]{3}0-9?|[A-NR-Z]0-9{1,2}(?:-\\d+)?)$')
>>> x.search('P0DP23')

Similarly, x.match('P0DP23') also returns none

@nataled
Copy link
Contributor

nataled commented Oct 13, 2023

I see the problem. Somehow my pasted regex is losing lots of characters, so the one you're testing is an inaccurate regex. I'm pasting the regex into a text file (attached) to avoid this issue. Aligning the two, I see that github didn't much care for the (?: notation, plus did a few other odd removals.

PR_regex.txt

@codecov
Copy link

codecov bot commented Oct 13, 2023

Codecov Report

All modified lines are covered by tests ✅

Files Coverage Δ
src/bioregistry/version.py 68.42% <100.00%> (ø)

... and 12 files with indirect coverage changes

📢 Thoughts on this report? Let us know!.

@cthoyt
Copy link
Member Author

cthoyt commented Oct 13, 2023

@nataled looks like the last message did the trick! Thanks!

@cthoyt cthoyt merged commit 86c8ef9 into main Oct 13, 2023
12 of 13 checks passed
@cthoyt cthoyt deleted the update-pr-regex branch October 13, 2023 17:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Update Protein Ontology regex specification
2 participants