New extended Unicode escape \u{10ABCD} to support Unicode literals > U+FFFF #1633
📝 I'm looking into the possibility of implementing this without changing the serialized ATN at all. I'm not quite done but I've started the work in my unicode-stream branch. Note that I haven't implemented the |
Thanks Sam! I thought about that, but was worried it would break
backwards-compatibility with existing serialized ATNs (there's no guarantee
they are UTF-16 today).
|
@bhamiltoncx To preserve compatibility, I put the decode behind a deserialization option which defaults to false. I'll have to make the option available for the ATN instance which appears in generated code (so it can be set to true), but that step would appear after the initial step of just getting things working. |
@sharwell: I took a look at your branch. I think there are two places we need to support Unicode values > U+FFFF:
Without changing the serialized ATN format, I'm not convinced we can store Unicode SMP values > U+FFFF in either of these locations. Even using UTF-16 to encode Unicode SMP values as a series of 16-bit values, the problem is that we can only store a single UTF-16 value as a start/end value of a set or as an argument to an edge. As far as I can tell, the existing ATN format doesn't have any way we can bypass that restriction. Please let me know if I missed something! |
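To make the 16-bit limit concrete, here is a minimal Python sketch (not ANTLR code; the helper name `to_utf16_units` is invented for illustration). It shows that a code point above U+FFFF needs two UTF-16 code units, so it cannot occupy a single 16-bit slot such as a set boundary or an edge argument:

```python
def to_utf16_units(code_point):
    """Return the list of UTF-16 code units that encode one code point."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point: %r" % (code_point,))
    if 0xD800 <= code_point <= 0xDFFF:
        raise ValueError("surrogates are not valid scalar values")
    if code_point <= 0xFFFF:
        # Basic Multilingual Plane: one 16-bit unit is enough.
        return [code_point]
    # Supplementary planes: split the 20-bit offset into two surrogates.
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)
    low = 0xDC00 + (offset & 0x3FF)
    return [high, low]
```

For example, `to_utf16_units(0x1D41A)` yields the pair `[0xD835, 0xDC1A]`, while `to_utf16_units(0x61)` yields the single unit `[0x61]`.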
I'm guessing we actually don't need to handle them in the ATN for
Serialization
AtomTransition
RangeTransition
SetTransition, NotSetTransition
NotSetTransition
WildcardTransition
Serialize as-is.
Deserialization
📝 There is no need to reconstruct the fully optimized set transitions. Literal values above U+FFFF would likely be infrequent in the grammar, so the number of excess range transitions would likely be minimal. Even if the input contained these values, they are most likely to be handled by NotSetTransition and WildcardTransition. However, it is also possible to collapse these transitions, a feature I've actually implemented in my fork. |
If I understand correctly, changing the sequence from one 16-bit value to two 16-bit values means we'll break compatibility with existing deserializers, which expect every sequence to hold a single 16-bit value. If that's correct, we might as well change the UUID, right? It seems like the main advantage of your proposal is a smaller serialized ATN for grammars which contain no values > U+FFFF. Overall, it seems like your proposal will work, but it's a lot more complicated than extending set and transition values from 16 to 32 bits; implementing it in all the runtimes in particular will be a good chunk of work. We can definitely improve the size of the serialized ATN in a few ways if that becomes a bottleneck. |
I mean serialize two transitions instead of just one, as a sequential pair. |
I see! Yes, that proposal would definitely maintain binary compatibility with the existing ATN format. I still think keeping a 16-bit restriction on sets and transition arguments may not be the most useful goal. Do we know if it's worthwhile? Are there any tests we can run to determine this? |
If we're changing the ATN format, we could still keep the overhead down for all existing grammars to a single int. When serializing the
Then we always serialize transitions containing an explicit value over U+FFFF as a SetTransition (this is trivial, since AtomTransition ⊆ RangeTransition ⊆ SetTransition). During deserialization, if a set only has one element we construct an AtomTransition, and if it only has one interval we construct a RangeTransition. |
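A rough Python sketch of the deserialization rule described above. The class names mirror ANTLR's transition types, but these dataclasses and the `collapse` helper are stand-ins written for illustration, not the runtime's actual code:

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AtomTransition:
    value: int


@dataclass(frozen=True)
class RangeTransition:
    start: int
    end: int  # inclusive


@dataclass(frozen=True)
class SetTransition:
    intervals: Tuple[Tuple[int, int], ...]  # inclusive (start, end) pairs


def collapse(intervals):
    """Pick the narrowest transition type for a deserialized interval set."""
    intervals = tuple(intervals)
    if len(intervals) == 1:
        start, end = intervals[0]
        if start == end:
            # One interval holding one element: an atom.
            return AtomTransition(start)
        # One interval with several elements: a range.
        return RangeTransition(start, end)
    # Anything else stays a general set.
    return SetTransition(intervals)
```

With this, `collapse([(0x1F600, 0x1F600)])` gives an `AtomTransition`, `collapse([(0x1D41A, 0x1D433)])` gives a `RangeTransition`, and a set with two or more intervals is left as a `SetTransition`.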
Definitely agreed!
Interesting. Is there a reason we don't do this today in general for all transitions? Is it another storage optimization? |
Yes, we avoid storing interval sets for atoms and ranges by inlining those values instead. |
I haven't yet implemented @sharwell's suggestions to reduce the size of the serialized ATN, but the WIP branch should now work for atoms > U+FFFF, sets containing values > U+FFFF, and ranges containing values > U+FFFF. Lots of low-hanging fruit to fix, but the basic tests are now passing! |
Great news Ben! I'll be poking around this weekend. |
While we're at it, guys, should we open ANTLR up to parsing arbitrary 32-bit int streams? I can see limiting 32-bit Unicode to \u{...}, but perhaps \xABCDABCD for 32-bit values? Seems like we may need a grammar-level option like encoding=unicode32 or some such anyway, so perhaps encoding=int? In that case, getText() would be meaningless, or we could just define it as the little- or big-endian sequence of 16-bit words. |
This is a great question. I think it'd be fine to have a grammar option to say "this is not unicode, but a stream of x-bit-wide units" so people could specify what type of input they intend to parse. That would improve life for folks parsing 8-bit binary formats as well. Probably that work should be separate from this work. |
Agreed, let's just keep 32-bit int parsing in the back of our minds. |
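As a small illustration of the "sequence of 16-bit words" idea floated above, a 32-bit value could be split as follows. This is only a sketch in Python; `split_to_16bit_words` is a hypothetical helper and nothing like it is confirmed to exist in ANTLR:

```python
def split_to_16bit_words(value, big_endian=True):
    """Split an unsigned 32-bit value into two 16-bit words."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("value does not fit in 32 bits: %r" % (value,))
    high = value >> 16      # upper 16 bits
    low = value & 0xFFFF    # lower 16 bits
    return (high, low) if big_endian else (low, high)
```

For `0x12345678`, big-endian order gives `(0x1234, 0x5678)` and little-endian order gives `(0x5678, 0x1234)`.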
481c749 to 9d92975
Man, Windows is not fun! Python 3.5 on Windows doesn't actually support writing Unicode to stdout via http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console/32176732#32176732 |
Oh, Windows.. https://msdn.microsoft.com/en-us/library/system.console(v=vs.110).aspx
Ugh, well, okay, at least there's a workaround..
Geez, I swear I tried that.. reads further
headdesk (FYI, Mono gets this right; setting |
Thank goodness this stuff is open-source. Looks like for some reason I'll try |
Oh man, those rascals! |
Hmm, even I guess that's because under the hood, Windows' interprocess I/O (in our case, stdout) for C# <-> Java communication still has to convert Unicode to bytes, so it's going to use whatever is the system default code page (probably CP1252 if this is US English Windows). Mono doesn't seem to have a problem getting this right, but Windows is proving to be a challenge. My options are:
|
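The code-page problem described above can be reproduced in isolation. The following Python sketch (an illustration only, not the C# test rig) encodes one supplementary-plane character with UTF-8 and with CP1252, the code page guessed at in the comment:

```python
# U+1D41A MATHEMATICAL BOLD SMALL A, a code point above U+FFFF.
text = "\U0001D41A"

# UTF-8 can represent every Unicode code point (four bytes here).
utf8_bytes = text.encode("utf-8")

# CP1252 is a single-byte code page and has no mapping for this character.
try:
    text.encode("cp1252")
    cp1252_ok = True
except UnicodeEncodeError:
    cp1252_ok = False

# With a lossy error handler the character is silently replaced.
lossy = text.encode("cp1252", errors="replace")
```

Here `cp1252_ok` ends up `False`, and `lossy` no longer contains the original character, which is the kind of corruption a pipe using the system default code page can introduce.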
For (1.) is there a generic way to set the console to UTF-8 that works on all platforms? |
I'm trying to figure that out now.
|
OK, I'm going to give up on trying to get C# tests on Windows to write Unicode via stdout, and write to a file instead. |
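The file-based workaround can be sketched as below. This is a Python stand-in for the idea, not the actual change to the C# tests; the function names and the file name are made up for illustration. The point is that the encoding is chosen explicitly instead of being inherited from the console:

```python
import os
import tempfile


def write_output(path, text):
    """Write test output as UTF-8, independent of the console code page."""
    with open(path, "w", encoding="utf-8", newline="\n") as handle:
        handle.write(text)


def read_output(path):
    """Read the output back using the same explicit encoding."""
    with open(path, "r", encoding="utf-8", newline="") as handle:
        return handle.read()


def round_trip(text):
    """Write text to a temporary file and return what is read back."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "output.txt")
        write_output(path, text)
        return read_output(path)
```

Calling `round_trip("a-z and \U0001D41A-\U0001D433\n")` returns the same string unchanged, including the characters above U+FFFF.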
ok, how big a change will it be in the test rig? |
Yes, just saw it and love it. You are really going the extra round to get this fully done. Respect. |
Yeah, it's amazing work! I think we'll need to update the doc too. Do you have a complete list of new features? Is it "just" \u{...} and \p{...}? |
Those plus the new ANTLRInputStream replacement CodePointInputStream and equivalents for each runtime language.
|
Sorry, I must have missed something. I don't see a CodePointInputStream class in the patch nor can I find it in the current code (everything is merged already to ANTLR4 master, right?). What is this class for? I don't think we need to change e.g. the C++ implementation of the ANTLRInputStream class, as it already handles UTF-32. |
Sorry, I mistyped. It's CodePointCharStream, and only some runtimes needed it (Java and C#). As you noted, C++ was already in good shape.
|
BTW, Sam suggested we verify:
|
OK! 1. should be pretty easy to add a unit test for, I'll do that. |
I can run 2 using a test rig from our paper. :) Actually, one of the tests already does a big lexing job, I think. Oh, LargeLexer tests a large lexer, not large input. I can still do a test for 2 once we integrate, I guess. It shouldn't have affected anything. It would be good to check the size difference in the serialized ATN before/after on the LargeLexer test.
|
OK, sent out WIP #1688 to add new lexer charset escapes, so:
matches both a-z as well as 𝐚-𝐳. I'll add the tests for the "binary" parsing feature next. |
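The grammar snippet itself is not preserved above, so the following is only a Python illustration of what "matches both a-z as well as 𝐚-𝐳" means in terms of code points. It assumes the second range is U+1D41A (𝐚) through U+1D433 (𝐳); the `RANGES` table and `matches` helper are invented for this sketch:

```python
# Inclusive code point ranges: ASCII a-z and mathematical bold small a-z.
RANGES = [
    (ord("a"), ord("z")),
    (0x1D41A, 0x1D433),
]


def matches(character):
    """Return True if the single character falls in any of the ranges."""
    code_point = ord(character)
    return any(low <= code_point <= high for low, high in RANGES)
```

Under these assumptions `matches("q")` and `matches("\U0001D41A")` are both true, while `matches("A")` and `matches("\U0001D434")` are false.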
Added a test for a "binary" grammar, works fine. |
Here's the difference in size of the serialized ATN for Before change
After change
Looks pretty good to me (2 bytes difference, which is what I'd expect from the extra serialized int for the "0 SMP sets"). |
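To show why an empty second group costs exactly one extra 16-bit int, here is a deliberately simplified Python sketch. The byte layout is hypothetical and is not the real serialized ATN format; it only assumes, as discussed in this thread, that sets with values up to U+FFFF are written with 16-bit arguments and sets with larger values are written in a separate group with 32-bit arguments, each group prefixed by a 16-bit count:

```python
import struct


def _write_group(out, group, value_format):
    """Append a 16-bit set count, then each set's interval count and bounds."""
    out.extend(struct.pack("<H", len(group)))
    for intervals in group:
        out.extend(struct.pack("<H", len(intervals)))
        for start, end in intervals:
            out.extend(struct.pack(value_format, start))
            out.extend(struct.pack(value_format, end))


def serialize_sets_16bit_only(sets):
    """Old-style sketch: one group, every bound stored in 16 bits."""
    out = bytearray()
    _write_group(out, sets, "<H")
    return bytes(out)


def serialize_sets_split(sets):
    """New-style sketch: a 16-bit group followed by a 32-bit group."""
    bmp = [s for s in sets if all(end <= 0xFFFF for _, end in s)]
    smp = [s for s in sets if any(end > 0xFFFF for _, end in s)]
    out = bytearray()
    _write_group(out, bmp, "<H")
    _write_group(out, smp, "<I")  # costs 2 bytes even when empty
    return bytes(out)
```

For a grammar whose sets all stay at or below U+FFFF, `serialize_sets_split` produces output exactly 2 bytes longer than `serialize_sets_16bit_only`: the count of the empty second group.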
Ok, i see nothing super scary in the change list and @bhamiltoncx has been meticulous and fastidious. He has lots of tests on the new |
Great! I'll follow up on any issues and take a look at the documentation. |
@teverett we should repair warnings and errors in existing grammars. See issue update. |
I received a bug report about CLASSIFY_Lu and CLASSIFY_Ll ranges being too
long by 1.
I modified the code to fix this and generated new grammars.
These were committed at:
https://github.com/jlettvin/UniTree
Please make suitable updates (copies of mine) to:
classify16.g4 and classify21.g4 in your repositories.
|
so one of the built-in sets is off by one? |
@parrt: No, I think the grammars classify16.g4 and classify21.g4 predate my
Unicode work and have their own lists of code points (all < U+FFFF, since
they were built before ANTLR had full Unicode support).
|
I am the author of the two grammars.
I was notified of the bug by a user.
I made the correction in the grammar generator
then I generated the two grammars again.
I have committed the changes to my github repo.
More than one built-in set was off: at least four were, but the two-character correction
in the grammar generator fixed all four.
I suspect there were more.
|
ok, i'll let @teverett grab the latest. |
pulled
|
Fixes #276.
This used to be a WIP PR, but it's now ready for review.
This PR introduces a new extended Unicode escape \u{10ABCD} in ANTLR4 grammars to support Unicode literal values > U+FFFF.
The serialized ATN represents any atom or range with a Unicode value > U+FFFF as a set. Any such set is serialized in the ATN with 32-bit arguments.
I bumped the UUID, since this changes the serialized ATN format.
I included lots of tests and made sure everything is passing on Linux, Mac, and Windows.
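For readers unfamiliar with the escape, the sketch below shows one way a `\u{...}` sequence could be turned into a code point. It is written in Python for illustration and is not the ANTLR tool's parser; the helper name, the regular expression, and the limit of one to six hex digits are assumptions of this sketch:

```python
import re

# Assumed shape of the escape: backslash, "u", then 1-6 hex digits in braces.
_EXTENDED_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]{1,6})\}")


def decode_extended_escapes(text):
    """Replace each \\u{XXXXXX} escape in text with the character it names."""

    def replace(match):
        code_point = int(match.group(1), 16)
        if code_point > 0x10FFFF:
            raise ValueError("escape is outside the Unicode range: %s" % match.group(0))
        return chr(code_point)

    return _EXTENDED_ESCAPE.sub(replace, text)
```

For example, `decode_extended_escapes(r"\u{1D41A}")` returns the single character U+1D41A, and text without any escape is returned unchanged.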