WIP: Allow overriding all categories #27

Open

jayvdb wants to merge 1 commit into master

Conversation

@jayvdb (Contributor) commented Sep 22, 2020

Closes #2

@jayvdb (Contributor, Author) commented Sep 22, 2020

If the overall idea behind this is acceptable, I'll tighten down how the overrides work and add more tests.

self.categories = CATEGORIES.copy()
if flags & re.DOTALL:
    self.categories[sre_constants.CATEGORY_LINEBREAK] = ""
    self.categories[sre_constants.CATEGORY_NOT_LINEBREAK] = CHARSET

Collaborator (review comment on the diff above):

I think this should be = charset (lowercase).

Contributor (Author) replied:

Good catch. Otherwise, does this approach seem reasonable, and worth polishing?

@thatch (Collaborator) commented Sep 22, 2020

I'm not sure yet? Let me explain some reasoning, and we can see if it lines up.

Prehistory

sre_yield was originally written for matching keys in a monitoring system that have roughly the same shape as ASCII DNS entries ([a-z0-9.\-]+ and suffixes are common).

When making subscriptions, people tended to write regexes like (service1|service2)-requests-(min|max|avg) and this worked fine until one of the services decided they wanted sets of strings instead of regexes. So this was initially designed for well-behaved regexes that were only literal strings and had maybe 10-100 possible matches. The . and \d or [0-7] classes were supported because of a goal of being a general regex expander, but we didn't really need that -- if there were too many matches, we would ask the human to go look at what keys were in use and write the set of strings themselves.

What I'm trying to say is that I approach the charset <-> . handling from a more mathematical perspective, rather than a pragmatic need. My initial idea was to just intersect them. Say you wanted a smoke test with charset="abc123"; then \d would match [123]. The negated ones, I think, you could intersect in the same way (so that \D would continue to match whatever non-digits you had, but would never gain [4-9]). At the risk of re-re-stating the obvious, charset would only restrict from some ideal maximum charset that Python was using internally. But any literal string would be passed through even if it was outside the charset, so .*\.com with charset="123" would return .com through 3333[snip]3333.com
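
A minimal sketch of that intersection idea (the CATEGORIES table and helper below are illustrative, not sre_yield's actual internals):

import sre_constants

# Illustrative category table; a real one would cover every sre category.
CATEGORIES = {
    sre_constants.CATEGORY_DIGIT: "0123456789",
    sre_constants.CATEGORY_NOT_DIGIT: "abcdefghijklmnopqrstuvwxyz",
}

def intersect_categories(categories, charset):
    # Restrict each category to characters that also appear in the user's charset.
    return {cat: "".join(c for c in chars if c in charset)
            for cat, chars in categories.items()}

print(intersect_categories(CATEGORIES, "abc123"))
# \d shrinks to "123"; \D keeps only "abc" and never gains [4-9].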

I think that's where you're headed with this, although I haven't had my coffee yet and the test doesn't give me much of a hint.

Difficulties

There were a couple straightforward reasons I never solved this:

  1. Let's say you handle a restricted charset like "abc123" with \d -- do you also restrict [0-9]? What about [0]? Is that different from a bare 0 outside a charclass? Should it be a parse error if a charclass is ever "empty", and can we defer that to only when it's used? Can we trust the sre_parse implementation to not ever optimize [0] -> 0?
  2. When we support Unicode (Support flags=re.UNICODE #3), I'd like that to be a very straightforward thing, such as passing charset=CHARSET_UNICODE_BMP_NO_SURROGATES. So a near-term goal, once thinning works on categories, is to make the category chars come from something like
{
  sre_constants.CATEGORY_WORD: set(c for c in charset if re.match(r"\w", c)),
  ...
}

With Python 3 (re.UNICODE being the default, PEP 393 eliminating narrow builds, and being less concerned with Jython limitations), this is in a better state to give it a try than in 2014. The combinatorics get out of hand quickly, so I wouldn't make it the default, but I think it's important that we support it. Some breadcrumbs:

import sys

CHARSET_ASCII = [chr(i) for i in range(256)]  # maybe still true?
CHARSET_UNICODE_BMP_NO_SURROGATES = [chr(i) for i in range(65536) if i < 0xd800 or i > 0xdfff]
CHARSET_UNICODE_EVERYTHING = [chr(i) for i in range(sys.maxunicode + 1)]
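
A hedged, runnable sketch combining the two ideas above -- deriving each category's characters from a chosen charset by testing every character against the corresponding pattern (the function name is illustrative, not sre_yield API):

import re
import sre_constants

def categories_from_charset(charset):
    patterns = {
        sre_constants.CATEGORY_DIGIT: r"\d",
        sre_constants.CATEGORY_NOT_DIGIT: r"\D",
        sre_constants.CATEGORY_SPACE: r"\s",
        sre_constants.CATEGORY_NOT_SPACE: r"\S",
        sre_constants.CATEGORY_WORD: r"\w",
        sre_constants.CATEGORY_NOT_WORD: r"\W",
    }
    # Each candidate is a single character, so re.match is sufficient here.
    return {cat: set(c for c in charset if re.match(pat, c))
            for cat, pat in patterns.items()}

CHARSET_UNICODE_BMP_NO_SURROGATES = [chr(i) for i in range(65536) if i < 0xd800 or i > 0xdfff]
cats = categories_from_charset(CHARSET_UNICODE_BMP_NO_SURROGATES)
print(len(cats[sre_constants.CATEGORY_WORD]))  # tens of thousands of \w characters in the BMP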

@jayvdb (Contributor, Author) commented Sep 22, 2020

Let me explain my use case, and how it relates to the underlying set of problems which sparked #14.

Most importantly, I cannot manually handle crappy regexes. I can have crappy regexes identified automatically and submit PRs upstream to fix them, but upstream requires a single PR for each file, which often equates to one per crappy regex, and their review process is molasses sprinkled with large rocks to reduce the flow.

I would like to detect the crappiest, to prioritise them for PRs, but I need to make sense of all but the very crappiest. If I filter them too hard, it becomes unjustified to say my library is compatible. Given that my library is about security/privacy, skipping too many potentially harms my users. To detect the crappiest, I need to verify that the crappy regex doesn't actually map to real hosts. That process of checking all hosts is a time sink if I can't reduce the search space initially and expand it incrementally until I find the real limits.

I see lots of regexes which have . when they should have \w. This is already handled with charset.
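
For illustration, using sre_yield's existing AllStrings with a charset (output order may differ slightly):

import sre_yield

# charset restricts what '.' expands to
print(list(sre_yield.AllStrings(r"a.c", charset="abc")))
# something like ['aac', 'abc', 'acc']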

The problem is that a \w or \w{3} in a domain name almost certainly does not correspond to 26 or 26^3 different domains that are actually registered and actively receiving connections. It is typically shorthand for [a-c] or [a-c]{3}, or some other similar subset, except for a few large farms. Likewise, \d{3} is a large perf hit and is usually shorthand for 00\d or something like that.
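
To put rough numbers on that (using sre_yield's length support; counts are exact for these patterns):

import sre_yield

# A broad class explodes combinatorially, while a thinned pattern stays tiny.
print(len(sre_yield.AllStrings(r"[a-z]{3}")))  # 17576
print(len(sre_yield.AllStrings(r"[a-c]{3}")))  # 27
print(len(sre_yield.AllStrings(r"00[0-9]")))   # 10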

So I would like to be able to map \w and \d to subsets. For me, it isn't about correctness; it is lossy heuristics. You suggested "pragmatic", and that is probably an appropriate term.

As a result, having this approach in the library isn't critical for me. I can always subclass and inject this into my own subclass if I need it. There might be some redundant code in my __init__, because RegexMembershipSequence.__init__ has a lot of code which probably should be in a method, but this isn't a blocker for me. I can use tools like patchy to remove that redundancy if it really bothers me.

In addition, mapping \w and \d down to almost random subsets is a quick win, but it is still ugly. It buys me some time to focus elsewhere, and it allows some algorithms to become feasible to use or explore. But it doesn't lead to code that is approaching pretty/done, and very likely I will still need to occasionally pick that can back up and kick it down the road a bit more to keep the temporary solution usable.

With that said, the reason I proposed this PR is that it doesn't seem to introduce any significant hacky code into the core code. Users can redefine these categories, or not, and there is little code here to support that.

If unicode support isn't added soon, it has the distinct benefit of allowing users to add their own unicode support, because they can change any of these categories; importantly, they can also use "ascii + the unicode bits I need" as opposed to the insane "all of unicode, most of which I know is impossible in my input stream". Unless full unicode can be added with negligible perf impact, users will appreciate subsets of unicode. E.g. they might want only Arabic support, and they know for certain they only need Arabic. Having \w expand to all characters in all languages/scripts/etc. becomes annoying even if it were without any perf impact in the library: the user code then needs to filter out the unwanted chars, and that won't be perf-neutral.
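
As a concrete illustration of "ascii + the unicode bits I need" (the constant name is invented; the Arabic block is U+0600-U+06FF):

# ASCII plus just the Arabic block, instead of all of Unicode.
CHARSET_ASCII_PLUS_ARABIC = (
    [chr(i) for i in range(128)]                # plain ASCII
    + [chr(i) for i in range(0x0600, 0x0700)]   # Arabic
)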

Some of this discussion probably belongs on the re.UNICODE issue. I am more than happy to hash out a useful re.UNICODE design first, if it helps determine whether this PR is a help or a hindrance to that, as I do appreciate that re.UNICODE is a correctness-type issue and probably shouldn't be delayed much longer. I've seen other packages similar to sre_yield being written, with crappy unicode support but claiming it works, so having at least a roadmap there seems important, and I wouldn't want this PR to create a feature which needs to be rewritten in order to get unicode support.

@thatch (Collaborator) commented Sep 26, 2020

I could be convinced that this is useful. I'm not wild about the double use of the charset param, though; it breaks typing. If you can make it a different param, so that charset still does the sensible thing that mostly works for unicode, then great, I'll merge.

Successfully merging this pull request may close these issues: Implement thinning consistently.