"Data.Char.isSymbol" incorrect. #891

Closed
zstone1 opened this issue Oct 23, 2018 · 2 comments

zstone1 commented Oct 23, 2018

Eta is getting the Unicode "General Category" wrong. For example, Eta thinks '<' is a control character.
This matters because the lexer in haskell-src-exts uses 'isSymbol' during lexing, so it refuses to parse "\x -> x" because Eta miscategorizes '-' and '>'. As a result, many quasi-quotations don't work.

Description

Expected Behavior

isSymbol '<' == True

Actual Behavior

isSymbol '<' == False

Possible Fix

After a little debugging, it looks like Java has the following behavior:

Character.getType('<') == 25
Character.getType('-') == 20

As the code in https://github.com/typelead/eta/blob/master/libraries/base/GHC/Unicode.hs suggests, in Eta these come out as '-' -> CurrencySymbol and '<' -> Control.
Interestingly enough, both of these are exactly 7 away in the enum from their intended targets, DashPunctuation and MathSymbol.

The definition generalCategory c = toEnum $ fromIntegral $ wgencat $ fromIntegral $ ord c looks suspicious. Naively, we could shift the value coming from Java by 7. But what does that mean for the other 7 values? What does 25 really mean coming from Java? I do not know.
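For what it's worth, here is a minimal sketch of an explicit translation table instead of a fixed offset. It assumes that wgencat hands back the raw integer from java.lang.Character.getType, which I have not verified. The numeric values are the Character.* constants as I understand them from the Java documentation and are worth double-checking; javaTypeToCategory is a hypothetical name, not something in Eta's base. Under that reading, 25 is Character.MATH_SYMBOL, while position 25 in GHC's GeneralCategory enum is Control, which would explain '<' -> Control. The two orderings differ throughout, so a constant shift would not be enough.

```haskell
import Data.Char (GeneralCategory (..))

-- Hypothetical helper (not part of Eta's base): translate the integer
-- returned by java.lang.Character.getType into GHC's GeneralCategory.
-- Each comment names the Java constant assumed to carry that value.
javaTypeToCategory :: Int -> Maybe GeneralCategory
javaTypeToCategory n = lookup n table
  where
    table =
      [ (0, NotAssigned)            -- UNASSIGNED
      , (1, UppercaseLetter)        -- UPPERCASE_LETTER
      , (2, LowercaseLetter)        -- LOWERCASE_LETTER
      , (3, TitlecaseLetter)        -- TITLECASE_LETTER
      , (4, ModifierLetter)         -- MODIFIER_LETTER
      , (5, OtherLetter)            -- OTHER_LETTER
      , (6, NonSpacingMark)         -- NON_SPACING_MARK
      , (7, EnclosingMark)          -- ENCLOSING_MARK
      , (8, SpacingCombiningMark)   -- COMBINING_SPACING_MARK
      , (9, DecimalNumber)          -- DECIMAL_DIGIT_NUMBER
      , (10, LetterNumber)          -- LETTER_NUMBER
      , (11, OtherNumber)           -- OTHER_NUMBER
      , (12, Space)                 -- SPACE_SEPARATOR
      , (13, LineSeparator)         -- LINE_SEPARATOR
      , (14, ParagraphSeparator)    -- PARAGRAPH_SEPARATOR
      , (15, Control)               -- CONTROL
      , (16, Format)                -- FORMAT
                                    -- 17 is not used by Java
      , (18, PrivateUse)            -- PRIVATE_USE
      , (19, Surrogate)             -- SURROGATE
      , (20, DashPunctuation)       -- DASH_PUNCTUATION
      , (21, OpenPunctuation)       -- START_PUNCTUATION
      , (22, ClosePunctuation)      -- END_PUNCTUATION
      , (23, ConnectorPunctuation)  -- CONNECTOR_PUNCTUATION
      , (24, OtherPunctuation)      -- OTHER_PUNCTUATION
      , (25, MathSymbol)            -- MATH_SYMBOL
      , (26, CurrencySymbol)        -- CURRENCY_SYMBOL
      , (27, ModifierSymbol)        -- MODIFIER_SYMBOL
      , (28, OtherSymbol)           -- OTHER_SYMBOL
      , (29, InitialQuote)          -- INITIAL_QUOTE_PUNCTUATION
      , (30, FinalQuote)            -- FINAL_QUOTE_PUNCTUATION
      ]

main :: IO ()
main = do
  print (javaTypeToCategory 25)  -- the value Java reports for '<'
  print (javaTypeToCategory 20)  -- the value Java reports for '-'
```

Returning Maybe keeps the unused value 17 and anything out of range explicit rather than silently mapping it to some category.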

Steps to Reproduce

  1. import Data.Char
  2. print $ isSymbol '<'
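
As a slightly fuller reproduction, the sketch below prints the category and the two predicates for each of '<', '-' and '>'. It uses only Data.Char; the report helper is just an illustrative name. Under GHC I would expect '<' and '>' to be MathSymbol and '-' to be DashPunctuation, so any other answer from Eta points at the miscategorization.

```haskell
import Data.Char (generalCategory, isPunctuation, isSymbol)

-- Describe how one character is classified.
report :: Char -> String
report c =
  show c
    ++ ": " ++ show (generalCategory c)
    ++ ", isSymbol = " ++ show (isSymbol c)
    ++ ", isPunctuation = " ++ show (isPunctuation c)

main :: IO ()
main = mapM_ (putStrLn . report) "<->"
```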

Context

I cannot use any quasiquoters that touch Haskell code. String interpolation, for example, is now painful to use. See isHSymbol in https://github.com/haskell-suite/haskell-src-exts/blob/master/src/Language/Haskell/Exts/InternalLexer.hs for the actual usage.

Your Environment

Code is run with Java 8. The issue in the code is present in master/head, so I'm guessing all versions of Eta suffer from this.

rahulmutt (Member) commented

The Unicode categorization implementation is untested, so thanks for testing this out!

As you noted, in GHC.Unicode, we call out to the Java methods. One way you could debug this is to write a test case that prints out the category for all the characters, run that code in both Eta and GHC, compare the output, and tweak the implementations accordingly.
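
A minimal sketch of such a test case, assuming nothing beyond Data.Char and Numeric: it prints one line per code point so the output from Eta and GHC can be compared with a plain diff. The ASCII range and the categoryLine name are arbitrary choices for illustration. Widening the range is possible, but the JVM and GHC may ship different Unicode versions, so some differences outside ASCII could be unrelated to this bug.

```haskell
import Data.Char (chr, generalCategory)
import Numeric (showHex)

-- One line per code point: hex value, then the category name.
categoryLine :: Int -> String
categoryLine n = showHex n "" ++ " " ++ show (generalCategory (chr n))

main :: IO ()
main = mapM_ (putStrLn . categoryLine) [0x00 .. 0x7F]
```

Redirecting the output of each build to a file and diffing the two files should list exactly the code points where the implementations disagree.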

Would you be interested in contributing a patch? Would be happy to guide if you get stuck anywhere.

rahulmutt added the bug label Oct 25, 2018
zstone1 (Author) commented Oct 25, 2018

I can make an attempt this weekend. I'll update here on progress, if I make any.
