WIP: Allow overriding all categories #27
base: master
If the overall idea behind this is acceptable, I'll tighten down how the overrides work and add more tests.
self.categories = CATEGORIES.copy()
if flags & re.DOTALL:
    self.categories[sre_constants.CATEGORY_LINEBREAK] = ""
    self.categories[sre_constants.CATEGORY_NOT_LINEBREAK] = CHARSET
I think this should be = charset (lowercase).
Good catch. Otherwise, does this approach seem reasonable, and worth polishing?
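For readers following the thread, here is a minimal standalone sketch of the per-instance override discussed above, with the reviewer's lowercase `charset` fix applied. It is not the library's code: `build_categories` and the string keys are hypothetical stand-ins for the `sre_constants` category constants, and an empty list is used where the diff uses an empty string.

```python
import re

# Hypothetical stand-ins for the sre_constants category keys used in the diff.
CATEGORY_LINEBREAK = "category_linebreak"
CATEGORY_NOT_LINEBREAK = "category_not_linebreak"


def build_categories(charset, flags=0):
    """Return a fresh category map for one instance (hypothetical helper)."""
    charset = list(charset)
    categories = {
        CATEGORY_LINEBREAK: [c for c in charset if c == "\n"],
        CATEGORY_NOT_LINEBREAK: [c for c in charset if c != "\n"],
    }
    if flags & re.DOTALL:
        # With DOTALL, "." matches everything, so no character is a line break.
        categories[CATEGORY_LINEBREAK] = []
        # Use the instance's charset here, not a module-level global.
        categories[CATEGORY_NOT_LINEBREAK] = charset
    return categories
```

Because the map is built per call, a caller can replace any entry afterwards without affecting other instances, which seems to be the point of copying `CATEGORIES` in the diff.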
I'm not sure yet? Let me explain some reasoning, and we can see if it lines up.
Prehistory
When making subscriptions, people tended to write regexes like What I'm trying to say, is I approach the I think that's where you're headed with this, although I haven't had my coffee yet and the test doesn't give me much of a hint.
Difficulties
There were a couple straightforward reasons I never solved this:
{
    sre_constants.CATEGORY_WORD: set(c for c in charset if re.match(r"\w", c))
    ...
}
With Python 3 (
CHARSET_ASCII = [chr(i) for i in range(256)]  # maybe still true?
CHARSET_UNICODE_BMP_NO_SURROGATES = [chr(i) for i in range(65536) if i < 0xd800 or i > 0xdfff]
CHARSET_UNICODE_EVERYTHING = [chr(i) for i in range(sys.maxunicode)]
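The idea in the snippet above, deriving each category by probing every character of the charset with `re.match`, can be sketched as a runnable function. This is only an illustration: `derive_categories` and its plain-string keys are assumptions, not names from the library.

```python
import re


def derive_categories(charset, flags=0):
    """Compute category membership by testing each character of the charset.

    Hypothetical helper: the keys are plain strings rather than
    sre_constants values, to keep the sketch self-contained.
    """
    probes = {
        "word": r"\w",
        "not_word": r"\W",
        "digit": r"\d",
        "not_digit": r"\D",
    }
    compiled = {name: re.compile(pattern, flags) for name, pattern in probes.items()}
    return {
        name: [c for c in charset if regex.match(c)]
        for name, regex in compiled.items()
    }


ascii_chars = [chr(i) for i in range(128)]
ascii_categories = derive_categories(ascii_chars)
```

The cost is one `re.match` call per character per category, so building the map is linear in the charset size. That is cheap for ASCII but not free for a charset covering all of Unicode.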
Let me explain my use case, and how it relates to the underlying set of problems which sparked #14.
Most important, I cannot manually handle crappy regex. I can have crappy regex identified automatically, and submit PRs upstream to fix them, but upstream requires a single PR for each file, which often equated to each crappy regex, and their review process is molasses sprinkled with large rocks to reduce the flow. I would like to detect the crappiest, to prioritise for PRs, but I need to make sense of all but the very crappiest. If I filter them too hard, it becomes unjustified to say my library is compatible. Given that my library is about security/privacy, skipping too many potentially harms my users.
To detect the crappiest, I need to verify the crappy regex doesn't actually map to real hosts. That process of checking all hosts is a time sink if I can't reduce the search space initially and expand it incrementally until I find the real limits.
I see lots of regex which have Problem is that a So I would like to be able to map As a result, having this approach in the library isn't critical for me. I can always subclass and inject this into my own subclass if I need it. There might be some redundant code in my In addition, mapping
With that said, the reason I proposed this PR is that it doesn't seem to introduce any significant hacky code into the core code. Users can redefine these categories, or not, and there is little code here to support that. If unicode support isn't added soon, it has the distinct benefit of allowing users to add their own unicode support because they can change any of these categories, and importantly they can also use "ascii + the unicode bits I need" as opposed to the insane "all of unicode most of which I know is impossible in my input stream". Unless full unicode can be added with negligible perf impact, users will appreciate subsets of unicode, e.g. they might want only Arabic support, and they know for certain they only need Arabic. Having
Some of this discussion probably belongs on the re.UNICODE issue. I am more than happy to hash out a useful
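As a rough illustration of the "ascii + the unicode bits I need" idea, the sketch below builds a charset from ASCII plus the Unicode Arabic block (U+0600 to U+06FF) and derives a word category from it. `charset_from_ranges` and the constant names are hypothetical; how such a charset would be passed to the library is not shown here.

```python
import re


def charset_from_ranges(*ranges):
    """Build a charset from inclusive (first, last) code point ranges."""
    chars = []
    for first, last in ranges:
        chars.extend(chr(cp) for cp in range(first, last + 1))
    return chars


ASCII_RANGE = (0x00, 0x7F)
ARABIC_RANGE = (0x0600, 0x06FF)  # the Unicode "Arabic" block

charset = charset_from_ranges(ASCII_RANGE, ARABIC_RANGE)

# Word category restricted to the chosen subset instead of all of Unicode.
word_chars = [c for c in charset if re.match(r"\w", c)]
```

This charset has 384 characters, compared with more than a million code points for a charset that covers everything, which is the search-space reduction argued for above.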
I could be convinced that this is useful. I'm not wild about the double use of
Closes #2