New \p{Letter} Unicode property escape #1688
I guess you followed http://www.regular-expressions.info/unicode.html with your names. Do you also support short forms (like \p{Ll})? And what about the Unicode block syntax (\p{InXXX})?
Yes, short and long forms are supported.
I haven't done Unicode blocks yet, but that'd be easy to add.
OK, added support for Unicode blocks with names prefixed with "In". Since block and script names overlap, …
OK, rebased and ready for review! (Hopefully all the tests pass.)
boom, jack!
Awesomesauce.
@mike-lischke asked for this in his review of #1633. I think it's a great way to show the power of the new full Unicode functionality in ANTLR4.
This PR adds two new lexer escapes suitable for use in a charset, so:
[a-z]
could become:
[\p{Ll}]
or, equivalently:
[\p{Lowercase_Letter}]
to match both a-z as well as exciting Unicode code points like 𝐚 (U+1D41A) through 𝐳 (U+1D433).
I included both matching and non-matching variants:
\p{Letter}: Include all Unicode code points with the general category "Letter"
\P{Letter}: Include all Unicode code points which do not have the general category "Letter"
This also works for:
Unicode scripts (\p{Latin}, \p{Hiragana}, \p{Cyrillic})
Unicode binary properties (\p{Emoji}, \p{Changes_When_Uppercased}, \p{Quotation_Mark})
Unicode blocks (\p{InHiragana}, \p{InArabic_Ext_A}, \p{InGreek})
The names of properties are case-insensitive. In addition, - and _ are treated the same.
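For anyone who wants to sanity-check the semantics outside ANTLR, here is a rough Python sketch of how the matching and name normalization described above could behave. It is not the code in this PR: the ALIASES table and the normalize, resolve, and matches helpers are hypothetical names made up for illustration, the table covers only a handful of general categories, and Python's unicodedata module stands in for ANTLR's own Unicode data.

```python
import unicodedata

# Hypothetical alias table (not ANTLR's own): a few short and long
# general-category names mapped to the codes unicodedata.category() returns.
ALIASES = {
    "l": "L", "letter": "L",
    "ll": "Ll", "lowercase_letter": "Ll",
    "lu": "Lu", "uppercase_letter": "Lu",
    "nd": "Nd", "decimal_number": "Nd",
}


def normalize(name):
    # Fold case and treat '-' like '_' so spelling variants compare equal.
    return name.replace("-", "_").lower()


def resolve(name):
    # Map a short or long property name to a general-category code.
    key = normalize(name)
    if key not in ALIASES:
        raise ValueError(f"unknown property name: {name!r}")
    return ALIASES[key]


def matches(ch, name, negate=False):
    # Emulates the \p form for one code point, or the \P form when negate=True.
    # A one-letter code such as "L" covers every subcategory (Lu, Ll, Lo, ...).
    hit = unicodedata.category(ch).startswith(resolve(name))
    return hit != negate


print(matches("\U0001D41A", "Lowercase_Letter"))  # 𝐚 via the long form
print(matches("\U0001D41A", "lowercase-letter"))  # same property, looser spelling
print(matches("7", "Letter", negate=True))        # \P{Letter} applied to a digit
```

Scripts, binary properties, and blocks are left out because the Python standard library does not expose that data; they would need their own lookup tables.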