Python re lib fails case insensitive matches on Unicode data #56937
Comments
The Python re library is broken in its approach to case-insensitive matches. It erroneously attempts to compare lowercase mappings. This is wrong. You must compare the Unicode casefolds, not the Unicode casemaps. Otherwise you get wrong answers.
I include a small test case that illustrates this bug. The bug exists on both 2.7 and 3.2, and on both wide builds and narrow builds. For comparison, I also show results using Matthew Barnett's regex library, which gets all 5 tests correct where re gets all 5 tests wrong.
A sample run is: FAIL: re pattern Ι is not the same as string ͅ re lib passed 0 of 5 tests
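The distinction drawn above between lowercase mappings and case folds can be sketched as follows. This is a minimal illustration, not the test case attached to the report. It assumes Python 3.3 or later, where `str.casefold()` exists (it does not on the 2.7 and 3.2 builds named above), and the helper names `lower_equal` and `casefold_equal` are hypothetical. Note also that `str.casefold()` applies full case folding to whole strings, whereas a regex engine typically compares one character at a time.

```python
def lower_equal(a: str, b: str) -> bool:
    """Compare two strings by their lowercase mappings."""
    return a.lower() == b.lower()


def casefold_equal(a: str, b: str) -> bool:
    """Compare two strings by their Unicode case folds (Python 3.3+)."""
    return a.casefold() == b.casefold()


# Character pairs that share a case fold but have different lowercase forms.
pairs = [
    ("\u0399", "\u0345"),  # GREEK CAPITAL LETTER IOTA vs COMBINING GREEK YPOGEGRAMMENI
    ("\u03c3", "\u03c2"),  # GREEK SMALL LETTER SIGMA vs GREEK SMALL LETTER FINAL SIGMA
    ("\u017f", "s"),       # LATIN SMALL LETTER LONG S vs LATIN SMALL LETTER S
]

for a, b in pairs:
    print(
        "U+%04X vs U+%04X: lower_equal=%s casefold_equal=%s"
        % (ord(a), ord(b), lower_equal(a, b), casefold_equal(a, b))
    )
```

The first pair is the one shown in the sample run: capital iota against the combining ypogegrammeni.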
I am not sure that everyone will agree that this is a bug rather than a feature request, or, if it is a bug, that it should be changed in existing releases at the risk of breaking running code. The doc just says, somewhat vaguely, that IGNORECASE "works for Unicode characters as expected". I have added others as nosy for their opinions. The test file should have omitted the gratuitous and distracting warnings, especially the one that effectively scolds Windows users for running Windows. With those omitted, the test cases given would form the basis for an added TestCase.
Working as expected for Unicode characters means it must the Unicode's
Part of those functional character specifications can be found in the three One is not allowed to make up one's own rules that run counter to Unicode's
I have absolutely no idea what on earth you could possibly be referring to. Let me make perfectly clear that I have never in my life come anywhere near a If you don't like my test cases, you know where to find vi. I suppose I could always send you the program that writes these programs --tom
This bug could do with a little less attitude. That said, I think it is a bug and should be fixed, at the very least for Python 3.3. As always, it is a matter of much debate to what extent bugs can be fixed in previous Python versions (specifically, 2.7 and 3.2) without breaking more code than it fixes, and I don't want to jump the gun on that issue. Let's first see what it takes to fix this for 3.3.
Here is a preliminary patch that fixes case-insensitive regular expression matching of unicode strings. It is incomplete: it also needs the patches from bpo-17381, which fix other aspects of case-insensitive matching. One bug is left, for Turkish letters, where matching is not transitive. Three pairs of letters should match: ı ~ I ~ i ~ İ. All other combinations should not match (ı !~ i, I !~ İ, ı !~ İ). This patch doesn't fix this bug.
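The non-transitivity described above can be made concrete with a small sketch. It uses only the three pairs listed in the comment and does not call the re module; the names `SHOULD_MATCH`, `pair_match`, and `equivalence_classes` are hypothetical. It shows that merging the pairs into equivalence classes puts all four letters into a single class, which would wrongly make ı match i.

```python
# The three pairs that should match, as listed in the comment above.
SHOULD_MATCH = {
    frozenset(("\u0131", "I")),  # ı ~ I
    frozenset(("I", "i")),       # I ~ i
    frozenset(("i", "\u0130")),  # i ~ İ
}


def pair_match(a, b):
    """Return True only for identical letters or an explicitly listed pair."""
    return a == b or frozenset((a, b)) in SHOULD_MATCH


def equivalence_classes(pairs):
    """Merge pairs into classes, i.e. take the transitive closure."""
    classes = []
    for pair in pairs:
        merged = set(pair)
        untouched = []
        for cls in classes:
            if cls & merged:
                merged |= cls  # overlapping class: absorb it
            else:
                untouched.append(cls)
        classes = untouched + [merged]
    return classes


classes = equivalence_classes(SHOULD_MATCH)
print(len(classes), sorted(classes[0]))
print(pair_match("\u0131", "i"))  # pairwise relation keeps ı and i apart
```

A table of equivalence classes alone cannot express the pairwise relation, which may be why this case was left open in the preliminary patch.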
Here are the complete patch and the script used to generate the equivalence table.
Could anyone please review this? The script is updated so that it is now compatible with 2.7. There are some differences in the equivalence table between 2.7 and 3.4 (e.g. 'ΐ' (U+0390) is not equivalent to 'ΐ' (U+1FD3) in 2.7).
New changeset 4caa695af94c by Serhiy Storchaka in branch '2.7':
New changeset 47b3084dd6aa by Serhiy Storchaka in branch '3.4':
New changeset 09ec09cfe539 by Serhiy Storchaka in branch 'default':
This solution (with a hardcoded table of equivalent lowercases) is temporary. In the future, the re engine will be changed to support correct caseless matching of different lowercase forms internally.
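As a rough idea of what a table of equivalent lowercases might contain, the sketch below groups lowercase characters that share the same uppercase form. It is not the script attached to this issue, and the function name `lowercase_equivalents` is hypothetical. It relies on `str.lower()` and `str.upper()`, which apply full case mappings in Python 3, so it skips any character whose mapping expands to more than one character and will therefore miss some entries (U+0390 among them). The result also depends on the Unicode version of the running interpreter.

```python
import sys
from collections import defaultdict


def lowercase_equivalents():
    """Group distinct lowercase characters that share one uppercase form."""
    groups = defaultdict(set)
    for cp in range(sys.maxunicode + 1):
        low = chr(cp).lower()
        if len(low) != 1:
            continue  # full lowercase mapping expands; skipped in this sketch
        up = low.upper()
        if len(up) != 1:
            continue  # full uppercase mapping expands; skipped in this sketch
        groups[up].add(low)
    # Keep only groups with more than one distinct lowercase form.
    return sorted(tuple(sorted(g)) for g in groups.values() if len(g) > 1)


table = lowercase_equivalents()
for group in table:
    print(" ".join("U+%04X" % ord(ch) for ch in group))
```

Each printed line is one group of lowercase code points that plain lowercase comparison would treat as different even though they differ only by case.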