Outdated Unicode data in the re module #91575
Closed
2 of 4 tasks
Labels
3.9
only security fixes
3.10
only security fixes
3.11
only security fixes
topic-regex
type-bug
An unexpected behavior, bug, or error
type-feature
A feature request or enhancement
Comments
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue on Apr 15, 2022
I will do these.
I am already working on this.
serhiy-storchaka added a commit that referenced this issue on Apr 18, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue on Apr 18, 2022
serhiy-storchaka added a commit to serhiy-storchaka/cpython that referenced this issue on Apr 18, 2022
…latest Unicode version (pythonGH-91580). (cherry picked from commit 1c2fceb) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue on Apr 22, 2022
…ing in re (GH-91660) Also test that all extra cases are in BMP.
miss-islington pushed a commit to miss-islington/cpython that referenced this issue on Apr 22, 2022
…latest Unicode version (pythonGH-91580). (pythonGH-91661) (cherry picked from commit 1c2fceb) (cherry picked from commit 1748816) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
serhiy-storchaka added a commit that referenced this issue on Apr 22, 2022
serhiy-storchaka added a commit that referenced this issue on Apr 22, 2022
hello-adam pushed a commit to hello-adam/cpython that referenced this issue on Jun 2, 2022
…atest Unicode version (pythonGH-91580). (pythonGH-91661) (pythonGH-91837) (cherry picked from commit 1c2fceb) (cherry picked from commit 1748816) Co-authored-by: Serhiy Storchaka <storchaka@gmail.com>
The re module contains a table of pairs of characters c1 and c2 for which c1.upper() == c2.upper() and c1 != c2 and c1.lower() == c1 and c2.lower() == c2. For example, 'ς' and 'σ': 'ς'.upper() == 'σ'.upper() == 'Σ'.

The table was generated for 3.5. But newer Python versions support newer Unicode standards, and more such characters have been added. For example, 'в' and 'ᲀ': 'в'.upper() == 'ᲀ'.upper() == 'В'. See "Python re lib fails case insensitive matches on Unicode data" #56937.
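
The condition above can be checked directly against the Unicode data of the running interpreter. The following is only a minimal sketch, not the script CPython uses to build its table: the function name extra_case_groups and its stop parameter are made up for illustration, and it relies on str.upper() and str.lower(), which apply full case mappings and so may group characters slightly differently from the re internals.

```python
import sys
from collections import defaultdict


def extra_case_groups(stop=sys.maxunicode + 1):
    """Group lowercase-stable characters that share one uppercase form.

    Hypothetical helper for illustration; `stop` limits the scan so that
    a quick check does not have to walk the whole code space.
    """
    by_upper = defaultdict(list)
    for code in range(stop):
        c = chr(code)
        # Keep only characters that are their own lowercase form,
        # mirroring c.lower() == c in the condition from the issue.
        if c.lower() == c:
            by_upper[c.upper()].append(c)
    # A group matters only if at least two distinct characters share it.
    return {up: chars for up, chars in by_upper.items() if len(chars) > 1}


if __name__ == "__main__":
    groups = extra_case_groups()
    for up, chars in sorted(groups.items()):
        # ascii() keeps the output printable on any terminal encoding.
        print(ascii(up), [f"U+{ord(c):04X}" for c in chars])
```

Running it on two Python versions with different Unicode databases and comparing the output is one way to see which pairs a table generated for an older version would miss.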
The code depends on an assumption about characters outside the BMP range. The comment says that there are only two ranges of cased non-BMP characters and that RANGE_UNI_IGNORE works with them. Now there are more ranges of cased non-BMP characters. The assumption still seems to hold and RANGE_UNI_IGNORE still works, but the comment is outdated. See "IGNORECASE breaks unicode literal range matching" #61583.
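
To see how many ranges of cased non-BMP characters the current Unicode data contains, one can scan the code points above U+FFFF. This is a rough sketch under stated assumptions: cased_ranges is a hypothetical helper, and "cased" is approximated here as "str.lower() or str.upper() changes the character", which is not necessarily the definition the re source comment has in mind.

```python
import sys


def cased_ranges(start=0x10000, stop=sys.maxunicode + 1):
    """Return inclusive (first, last) runs of code points with a case mapping.

    Hypothetical helper; by default it scans everything outside the BMP.
    """
    ranges = []
    for code in range(start, stop):
        c = chr(code)
        if c.lower() == c and c.upper() == c:
            continue  # no case mapping, skip
        if ranges and ranges[-1][1] == code - 1:
            ranges[-1][1] = code  # extend the current run
        else:
            ranges.append([code, code])  # start a new run
    return [tuple(r) for r in ranges]


if __name__ == "__main__":
    for first, last in cased_ranges():
        print(f"U+{first:05X}..U+{last:05X}")
```

The number of runs printed depends on the Unicode version bundled with the interpreter, which is why a comment that hardcodes a count goes stale.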
The plan is:
make target for generating that table with the actual Unicode version (the version under development only).