support for utf-8 / unicode characters #249

Closed
MrBenGriffin opened this issue Dec 19, 2018 · 11 comments

Comments

@MrBenGriffin

No description provided.

@MrBenGriffin
Author

EAdd. Exp ::= Exp "⌽" Exp1 ;
-- syntax error at line 1, column 22 due to lexer error
Not being able to lex simple Unicode characters in declarations is a shortfall.
There may be some way of addressing this, but the documentation doesn't mention anything.

@MrBenGriffin MrBenGriffin reopened this Dec 19, 2018
@andreasabel andreasabel self-assigned this Jan 2, 2019
@andreasabel andreasabel added this to the 2.8.3 milestone Jan 2, 2019
andreasabel added a commit that referenced this issue Jan 2, 2019
Keywords and operators in a LBNF grammar can now contain unicode characters.

Works for Haskell, Java, Latex...
Broken for C, C++, Ocaml [NEEDS WORK]
@andreasabel
Member

By simply allowing unicode in the generated Haskell lexer, I managed to enable this feature for LBNF and some backends, including Haskell and Java.
For some backends, the testsuite reports errors: C, C++, Ocaml.

@MrBenGriffin
Author

Thanks for this Andreas. I will try again ;-D

@andreasabel
Member

To be precise, the only thing I changed is this line in the generated .x file for Alex:

$u = [\0-\255]          -- universal: any character

to

$u = [. \n]          -- universal: any character

However, I did not do anything to extend the definitions of capital and lower-case letters to unicode.
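To illustrate why the original set rejected the operator from the report, here is a minimal Haskell sketch that models the two character sets as predicates on code points. The helper names are hypothetical and are not part of BNFC or Alex; the sketch only mirrors the ranges written in the two `$u` lines above.

```haskell
import Data.Char (isUpper, ord)

-- Hypothetical helper (not BNFC or Alex code): membership in the old
-- set [\0-\255], i.e. code points 0 through 255 only.
inOldUniversal :: Char -> Bool
inOldUniversal c = ord c <= 255

-- Hypothetical helper: membership in [. \n]. '.' is every character
-- except newline, and \n is added back, so every Char is accepted.
inNewUniversal :: Char -> Bool
inNewUniversal _ = True

-- Upper-case letters that a Latin-1-only definition cannot contain.
upperOutsideLatin1 :: Char -> Bool
upperOutsideLatin1 c = isUpper c && not (inOldUniversal c)

-- The operator from the original report, U+233D (code point 9021).
circleStile :: Char
circleStile = '\x233D'
```

In GHCi, `inOldUniversal circleStile` is `False` while `inNewUniversal circleStile` is `True`. `upperOutsideLatin1 '\x03A9'` (Greek capital omega) is `True`, which is the kind of letter the unextended capital and lower-case definitions would still miss.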

@sillydan1

Changing
$u = [\0-\255] -- universal: any character
to
$u = [\x00-\xffff] -- universal: any character
Seems to handle a lot more characters.
(Mind you, I've only tested on the unicode characters , and )

@andreasabel
Member

Did you mean the following?

Changing
$u = [. \n] -- universal: any character
to
$u = [\x00-\xffff] -- universal: any character
seems to handle a lot more characters.

@sillydan1

Yes.

@andreasabel
Member

Strange, that seems to contradict the alex documentation at https://www.haskell.org/alex/doc/html/charsets.html :

.

    The built-in set ‘.’ matches all characters except newline (\n).

    Equivalent to the set [\x00-\x10ffff] # \n.

Or maybe I do not understand it. From my naive point of view \x10ffff is a bigger number than \xffff, thus, the currently implemented range should include your range.
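The range comparison can be checked with a small Haskell sketch. It assumes the two sets behave exactly as their written bounds suggest, which is the documented behaviour and not a statement about what any particular Alex version does; the function names are hypothetical.

```haskell
import Data.Char (ord)

-- Hypothetical model of the proposed set [\x00-\xffff].
inRangeFFFF :: Char -> Bool
inRangeFFFF c = ord c <= 0xFFFF

-- Hypothetical model of [. \n], documented as [\x00-\x10ffff].
inDotOrNewline :: Char -> Bool
inDotOrNewline c = ord c <= 0x10FFFF

-- True when every sampled character in the narrower set is also in
-- the wider one.
narrowerImpliesWider :: [Char] -> Bool
narrowerImpliesWider = all (\c -> not (inRangeFFFF c) || inDotOrNewline c)

-- A few boundary cases, including the largest Char, U+10FFFF.
samples :: [Char]
samples = ['A', '\n', '\x233D', '\xFFFF', '\x10000', '\x1F600', maxBound]
```

`narrowerImpliesWider samples` evaluates to `True`, and a character such as `'\x1F600'` is in the wider set only. If a lexer really accepts more input with `[\x00-\xffff]`, the cause would have to lie somewhere other than the size of the ranges.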

Could you provide me with a minimal test case, please?

@MrBenGriffin
Author

MrBenGriffin commented Apr 7, 2019

Andreas, you are correct.
Unicode extends beyond 0xffff. As I understand it, Unicode-enabled regular expressions treat ‘.’ as any character other than newline and null, bearing in mind that a Unicode character can be represented with a variable number of bytes. Another feature of multi-byte encodings (such as the pretty normative UTF-8) is that the null character is not found in a legal character string (though, depending on the implementation, it may be used to mark the end of a string; that is beyond the scope of regular expressions, which are only interested in the strings themselves). It should be clear already that regular expressions are only useful for string-like types and are not suitable for raw streams. All of this is better expressed elsewhere.

TL;DR

[. \n] is a good pattern for ‘any character’
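As a small illustration of the variable-width point, the following Haskell sketch computes how many bytes UTF-8 needs for a given code point. `utf8Length` is a hypothetical helper written for this note, using the standard UTF-8 length boundaries; it does not come from BNFC or Alex.

```haskell
import Data.Char (ord)

-- Hypothetical helper: number of bytes UTF-8 uses for one character.
-- Surrogate code points (U+D800 to U+DFFF) are not valid in UTF-8;
-- this sketch does not reject them.
utf8Length :: Char -> Int
utf8Length c
  | n < 0x80    = 1   -- ASCII
  | n < 0x800   = 2
  | n < 0x10000 = 3   -- rest of the range up to U+FFFF
  | otherwise   = 4   -- U+10000 to U+10FFFF
  where
    n = ord c
```

For example, `utf8Length 'A'` is 1 and `utf8Length '\x233D'` (the operator from the report) is 3. A lexer that matches bytes and a lexer that matches code points will therefore see the same input very differently.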

@andreasabel
Member

@sillydan1: I am closing this, but you are welcome to reopen it with a MWE.

@andreasabel
Member

andreasabel commented Jan 19, 2020

Fixed printer for C.
C++ seems to work already.
b8701c3 broke #249 for Java/ANTLR in the lexer,
c49d1fd for Java/CUP.

andreasabel added a commit that referenced this issue Jan 19, 2020
Don't use showLitChar for unicode characters!

b8701c3 broke #249 for Java/ANTLR in the lexer,
c49d1fd for Java/CUP.
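
The commit itself is not shown here, so the following is only a sketch of the problem its title names. `showLitChar` from `Data.Char` turns every character above `'\DEL'` into a decimal escape, so a Unicode operator ends up as a backslash sequence in generated code. `showCharKeepUnicode` is a hypothetical alternative written for this note; it is not necessarily what BNFC does.

```haskell
import Data.Char (showLitChar)

-- showLitChar escapes U+233D as a backslash followed by its decimal
-- code point, so this equals "\\9021".
escapedStile :: String
escapedStile = showLitChar '\x233D' ""

-- Hypothetical alternative: keep non-ASCII characters verbatim and
-- fall back to showLitChar for everything up to '\DEL'.
showCharKeepUnicode :: Char -> ShowS
showCharKeepUnicode c
  | c > '\DEL' = showChar c
  | otherwise  = showLitChar c
```

With this variant, `showCharKeepUnicode '\x233D' ""` yields the one-character string containing the operator itself, while `showCharKeepUnicode '\n' ""` still yields the two-character escape `\n`.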