Use regex in addition to re? #590
Comments
Hi, I'm certainly open to supporting the `regex` module. Is it fully compatible with `re`? How does it compare in terms of performance? I think it might be nice to support something like
Not sure about making it the default though.
Hello, If you like, I could make a pull request and we could then check performance, etc. from there?
Sure, that sounds like a good idea!
Thanks! Done in #593.
Closed by #593.
This fixes python-poetry#221 by replacing re with the regex package, which has fixed a long-standing bug in re (see [here](lark-parser/lark#590) for details). Not sure if this will work for poetry (i.e., whether adding regex as a dependency is acceptable), but I thought I would propose this change and wait for feedback to take it from there. Thanks!
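One way to avoid a hard dependency, in the spirit of the optional import suggested in the linked Lark issue, is a guarded import. The following is a minimal sketch only, not the code from #593; the names `re_engine` and `HAS_REGEX` are hypothetical and chosen for illustration.

```python
# Hypothetical sketch: prefer the third-party `regex` package when it is
# installed, otherwise fall back to the standard-library `re` module.
try:
    import regex as re_engine  # optional dependency
    HAS_REGEX = True
except ImportError:
    import re as re_engine
    HAS_REGEX = False

PATTERN = r"^[\w\s][\w\s]*"
TEXT = "किशोरी"

match = re_engine.match(PATTERN, TEXT, re_engine.UNICODE)
matched = match.group(0) if match else ""

# With `re`, the match is expected to stop at the first vowel mark;
# with `regex`, it is expected to cover the whole word.
print(f"engine={re_engine.__name__} matched={matched!r} full={matched == TEXT}")
```

Because both modules expose `match` and `UNICODE`, the calling code stays the same whichever one is imported, and nothing breaks when `regex` is absent.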
Suggestion
While using `Lark`, I came across an already-documented bug in the built-in `re` module, whereby abugidas with vowel marks (that is, most of South Asia and Oceania) fail to match the `\w` directive (e.g., as in `re.match("^[\w\s][\w\s]*", "किशोरी", re.UNICODE)`, as taken from this question here).
It seems that the recommended "solution" is to simply use the `regex` module instead of `re`. However, this would mean adding a dependency to `Lark`, so I thought I would ask whether you would look favourably upon such a suggestion before submitting a pull request. I was thinking of proposing something along the lines of:
so that `regex` is not required and nothing breaks if the dependency is not installed.
Describe alternatives you've considered
An alternative, which I am currently using as a workaround, is to generate the tokens manually. For instance, for a Python parser, I can use:
and then define my `NAME` token as `ID_START ID_CONTINUE*`. However, I think that it would be much nicer to be able to use `\w` according to the Unicode standard. A side benefit of using `regex` is that projects using Lark and specifying `regex` as a dependency could then also explicitly use Unicode categories in their regexes.
Please do let me know!
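The manual workaround described above could be approximated with the standard library alone. This is a sketch under assumptions, not the author's original snippet: the helper `char_class` is hypothetical, and the category sets follow the identifier definition in the Python language reference, leaving out the `Other_ID_Start` and `Other_ID_Continue` extras.

```python
import re
import sys
import unicodedata

# Assumed category sets, taken from Python's identifier definition.
ID_START_CATEGORIES = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}
ID_CONTINUE_CATEGORIES = ID_START_CATEGORIES | {"Mn", "Mc", "Nd", "Pc"}


def char_class(categories, extra=""):
    """Build a regex character class covering every code point whose
    Unicode category is in `categories`, plus any characters in `extra`."""
    ranges = []
    start = prev = None
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        if unicodedata.category(ch) in categories or ch in extra:
            if start is None:
                start = cp
            prev = cp
        elif start is not None:
            ranges.append((start, prev))
            start = None
    if start is not None:
        ranges.append((start, prev))
    # Emit \UXXXXXXXX escapes so the class is safe to embed in a pattern.
    parts = [
        f"\\U{lo:08x}" if lo == hi else f"\\U{lo:08x}-\\U{hi:08x}"
        for lo, hi in ranges
    ]
    return "[" + "".join(parts) + "]"


ID_START = char_class(ID_START_CATEGORIES, extra="_")
ID_CONTINUE = char_class(ID_CONTINUE_CATEGORIES)

# NAME is ID_START followed by any number of ID_CONTINUE characters.
NAME = re.compile(f"{ID_START}{ID_CONTINUE}*")

print(NAME.fullmatch("किशोरी") is not None)
```

The generated classes are long, which is part of why matching `\w` according to the Unicode standard, or using Unicode categories directly in `regex`, is the more convenient option.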