clearer parse errors for invalid characters #28339

Closed
ckoe-bccms opened this issue Jul 29, 2018 · 8 comments
Labels
error handling (Handling of exceptions by Julia or the user), parser (Language parsing and surface syntax)

Comments

@ckoe-bccms

Hello,

I would like to file an issue about the current (0.7.0-beta2.0, Linux) error reporting for Unicode characters. Copying and pasting the following numeric value from a PDF (https://arxiv.org/pdf/cond-mat/0110585) via the X11 clipboard into Julia, or reading the pasted value from a file, gives an error:

julia> -0.6626458266981849E−01
ERROR: syntax: invalid character "−"

When the value is read from a file, this becomes

julia> using ode_symplektisch
ERROR: LoadError: syntax: invalid character "−"
Stacktrace:
 [1] include at ./boot.jl:317 [inlined]
 [2] include_relative(::Module, ::String) at ./loading.jl:1034
 [3] macro expansion at ./logging.jl:312 [inlined]
 [4] _require(::Base.PkgId) at ./loading.jl:929
 [5] require(::Base.PkgId) at ./loading.jl:838
 [6] require(::Module, ::Symbol) at ./loading.jl:833
in expression starting at /home/ckoe/src/cp/ODE_Integratoren/ode_symplektisch.jl:198

I see two issues with this error message:

  1. It does not indicate which "minus" is meant (hint: it is the second one, in the exponent).
  2. It could hint that a Unicode character is involved, possibly even giving the actual character code.

The fact that Julia explicitly allows many Unicode characters in normal source makes this error message even more unsatisfying. Note: I do not at all propose that the parser should accept this Unicode character as a valid "minus"! But it would be very helpful if the error message could at least indicate the actual character (its position on the input line).
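To make the request concrete, here is a minimal sketch of the information such a message could carry. The helper name `nonascii_report` is hypothetical and not part of Julia; it only uses `isascii` and `codepoint` from Base, and it reports character positions rather than display columns.

```julia
# Hypothetical helper (not part of Julia): list every non-ASCII character in a
# line together with its character position and Unicode code point.
function nonascii_report(line::AbstractString)
    report = Tuple{Int,Char,String}[]
    for (pos, c) in enumerate(line)      # pos counts characters, not bytes
        if !isascii(c)
            hex = string(codepoint(c), base = 16, pad = 4)
            push!(report, (pos, c, "U+" * uppercase(hex)))
        end
    end
    return report
end

line = "-0.6626458266981849E−01"
for (pos, c, cp) in nonascii_report(line)
    println("character ", pos, ": '", c, "' (", cp, ")")
end
```

For the pasted value this reports the exponent sign at character 21 with code point U+2212.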

Best Regards

@JeffBezanson added the "parser" and "error handling" labels Jul 29, 2018
@stevengj
Member

stevengj commented Jul 29, 2018

See also #26193; it's not completely crazy to normalize U+2212 (minus) to hyphen, but it would require some additional parser changes discussed in #25157, so it's almost certain not to happen for Julia 1.0.
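Until something like that lands, pasted text can be cleaned up on the user side before it is parsed. The following is only a sketch of that workaround, not the parser change discussed in #25157; `normalize_minus` is a hypothetical name, and mapping U+2212 to the ASCII hyphen-minus is an assumption about what the user wants.

```julia
# Hypothetical user-side workaround: map U+2212 MINUS SIGN to the ASCII
# hyphen-minus before handing the text to the numeric parser.
normalize_minus(s::AbstractString) = replace(s, '\u2212' => '-')

pasted = "-0.6626458266981849E−01"            # exponent sign is U+2212
value = parse(Float64, normalize_minus(pasted))
println(value)
```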

I'm not exactly certain what a better error message would look like. Maybe

ERROR: syntax: invalid character "−" in: -0.6626458266981849E−01
                                                             ^

although it is somewhat tricky to get the ^ in the right place (depends on charwidth matching the terminal's font) if there are arbitrary Unicode characters in the input.
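A rough sketch of how the marker line could be built, assuming the terminal agrees with `textwidth` about the width of every character before the offending one. `caret_line` is a hypothetical helper, not an existing function.

```julia
# Hypothetical helper: build the line holding the "^" marker for the character
# at character position `pos`, padding by the display width of the text before it.
function caret_line(line::AbstractString, pos::Integer; prefix::AbstractString = "")
    before = first(line, pos - 1)        # the pos-1 characters preceding the target
    return " "^(textwidth(prefix) + textwidth(before)) * "^"
end

line = "-0.6626458266981849E−01"
prefix = "in: "
println(prefix, line)
println(caret_line(line, 21; prefix = prefix))
```

If the terminal font disagrees with `textwidth` for some character before the target, the marker will be off by that difference.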

@ckoe-bccms
Author

ckoe-bccms commented Jul 30, 2018

Hello,

> I'm not exactly certain what a better error message would look like. Maybe

In my opinion it would be sufficient to give the column number on the line. I assume most editors display the cursor column, so it would be easier to figure out which character it is. Something like
ERROR: syntax: invalid character "−" in: -0.6626458266981849E−01 column 21
or
in expression starting at /home/ckoe/src/cp/ODE_Integratoren/ode_symplektisch.jl:198 column 23
Generally, it might be useful to give a column with other error messages, too.

Also, when reading from a file, it would be helpful to indicate that, from the parser's viewpoint, the problem is a Unicode character and not some other formatting issue (caused by a typo earlier in the source file; we all know those).

Concerning 26193: I have no opinion on that.

EDIT: I should add that the error when using the module of course came first and led to quite a bit of confusion. The REPL test was just to demonstrate the issue. Having the column number there is only of limited value, but still better than nothing.

Best Regards

@stevengj changed the title from "Unhelpful error message with UTF character" to "clearer parse errors for invalid characters" Jul 30, 2018
@stevengj
Member

stevengj commented Jul 30, 2018

> it would be sufficient to give the number of the column on the line

That also depends on Julia's (utf8proc's) charwidth matching the editor's charwidths. (Some Unicode characters take up 0 columns, some take up 1 column, and some take up 2 columns, but there isn't universal agreement on which is which; see e.g. #3721.)

However, it will match for most text, and I suppose we could say "near column 21" for the rare cases where the charwidths don't agree.

@StefanKarpinski
Member

We could just approximate column number with character number. After all, most people code in fixed-width fonts and for a lot of code this approximation will be correct. It seems better than not providing any information for fear of doing so imperfectly.

@stevengj
Member

stevengj commented Jul 30, 2018

Summing the charwidths is a better approximation, and is easy to compute …

One point of confusion here, @StefanKarpinski: charwidths (as computed by e.g. the Julia textwidth function, which uses utf8proc_charwidth) are for fixed-width fonts. That's why textwidth(::Char) returns 0, 1, or 2, and not some width in points or something. Even in a fixed-width font, 🍕 should have width 2 (= 2 columns) and combining characters should have width 0. (Combining characters are not that uncommon in math-heavy Julia code.)
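The two approximations can be compared directly. This is a sketch only: `column_estimates` is a hypothetical helper, and the sample line (an `x` followed by a combining circumflex, U+0302) is an assumed example rather than code from the report.

```julia
# Hypothetical helper: two estimates of the display column of the character
# at character position `pos` (1-based).
function column_estimates(line::AbstractString, pos::Integer)
    before = first(line, pos - 1)
    by_chars = length(before) + 1        # every character counted as one column
    by_width = textwidth(before) + 1     # sum of utf8proc character widths
    return (by_chars = by_chars, by_width = by_width)
end

line = "x\u0302 = 1E−01"                     # x + combining circumflex, then U+2212
pos = findfirst(==('−'), collect(line))      # character position of the bad minus
println(column_estimates(line, pos))
```

Here the combining mark has width 0, so the character count is one larger than the width-based column. For pure ASCII text before the offending character, the two estimates coincide.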

@vtjnash
Member

vtjnash commented Jul 31, 2018

cf. #9579, which also added support for the necessary column number tracking in the parser

@stevengj
Member

stevengj commented Jul 31, 2018

#9579 counts code units, which is even cruder than counting characters… I wonder why the count wasn't incremented in ios_getutf8 instead of ios_getc? Oh, I see: flisp/read.c has its own complicated logic using ios_getc. Julia code is parsed using fl_iogetc, which calls ios_getutf8.
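To illustrate why counting code units is cruder than counting characters, here is a small sketch in plain Julia. It does not touch the flisp reader; `unit_and_char_index` is a hypothetical helper and the sample line is an assumed example.

```julia
# Hypothetical helper: locate character `c` in a UTF-8 string and return both
# its code-unit (byte) index and its character index.
function unit_and_char_index(line::AbstractString, c::Char)
    i = findfirst(isequal(c), line)      # string indices are code-unit indices
    i === nothing && return nothing
    return (code_unit = i, character = length(line, 1, i))
end

line = "α = -0.66E−01"                   # α occupies two code units in UTF-8
println(unit_and_char_index(line, '−'))
```

Every multi-byte character earlier on the line pushes the code-unit index further from the character index, so a column derived from code units drifts on lines that contain Greek letters or other non-ASCII identifiers.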

@ckoe-bccms
Author

Thank you!


5 participants