clearer parse errors for invalid characters #28339

ckoe-bccms · 2018-07-29T11:28:02Z

Hello,

I would like to file an issue about the current (0.7.0-beta2.0, Linux) error reporting for Unicode characters. Copying and pasting the following numeric value from a PDF (https://arxiv.org/pdf/cond-mat/0110585) via the X11 clipboard into Julia, or reading the pasted value from a file, gives an error:

julia> -0.6626458266981849E−01
ERROR: syntax: invalid character "−"

When reading from a file, this becomes:

julia> using ode_symplektisch
ERROR: LoadError: syntax: invalid character "−"
Stacktrace:
 [1] include at ./boot.jl:317 [inlined]
 [2] include_relative(::Module, ::String) at ./loading.jl:1034
 [3] macro expansion at ./logging.jl:312 [inlined]
 [4] _require(::Base.PkgId) at ./loading.jl:929
 [5] require(::Base.PkgId) at ./loading.jl:838
 [6] require(::Module, ::Symbol) at ./loading.jl:833
in expression starting at /home/ckoe/src/cp/ODE_Integratoren/ode_symplektisch.jl:198

I see two issues with this error message:

1. It does not indicate which "minus" is meant (hint: it is the second one, in the exponent).
2. It could hint that a Unicode character is involved, possibly even giving the actual character code.

Considering that Julia explicitly allows a lot of Unicode characters in normal source code, this error message is even more unsatisfying. Note: I am not at all proposing that the parser should accept this Unicode character as a valid "minus"! But it would be very helpful if the error message could at least indicate the actual character (its position on the input line).
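To illustrate the kind of information being asked for, here is a minimal sketch that lists each non-ASCII character in a line with its character index and code point. The helper name `nonascii_report` is hypothetical and is not part of Julia or its parser; it only uses `isascii`, `enumerate`, and `string` from Base.

```julia
# Hypothetical helper (not part of Julia): list every non-ASCII character
# in a line together with its character index and Unicode code point.
function nonascii_report(line::AbstractString)
    report = Tuple{Int,Char,String}[]
    for (i, c) in enumerate(line)          # i counts characters, not bytes
        if !isascii(c)
            # Format the code point as U+XXXX (at least four hex digits).
            cp = string("U+", uppercase(string(UInt32(c), base = 16, pad = 4)))
            push!(report, (i, c, cp))
        end
    end
    return report
end

# The value from the report above: the second "minus" is U+2212, character 21.
nonascii_report("-0.6626458266981849E−01")
```

For the pasted value this reports a single entry, the U+2212 character at index 21, which is exactly the detail missing from the current message.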

Best Regards

stevengj · 2018-07-29T20:54:48Z

See also #26193; it's not completely crazy to normalize U+2212 (minus) to hyphen, but it would require some additional parser changes discussed in #25157, so it's almost certain not to happen for Julia 1.0.

I'm not exactly certain what a better error message would look like. Maybe

ERROR: syntax: invalid character "−" in: -0.6626458266981849E−01
                                                             ^

although it is somewhat tricky to get the ^ in the right place (depends on charwidth matching the terminal's font) if there are arbitrary Unicode characters in the input.
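As a rough sketch of how the `^` could be positioned, the padding can be estimated by summing `textwidth` over everything printed before the offending character. The helper `caret_line` and its `prefix` keyword are hypothetical names for illustration, and the result is only as good as the agreement between `textwidth` and the terminal's font.

```julia
# Hypothetical helper: build a caret line pointing at the character with
# index `charidx` in `line`, using textwidth to estimate terminal columns.
function caret_line(line::AbstractString, charidx::Integer; prefix::AbstractString = "")
    before = first(line, charidx - 1)              # characters preceding the target
    pad = textwidth(prefix) + textwidth(before)    # estimated columns to skip
    return " "^pad * "^"
end

msg_prefix = "ERROR: syntax: invalid character \"−\" in: "
line = "-0.6626458266981849E−01"
println(msg_prefix, line)
println(caret_line(line, 21; prefix = msg_prefix))
```

Zero-width combining characters before the target contribute nothing to the padding, so a decomposed `ẋ` shifts the caret by one column, not two.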

ckoe-bccms · 2018-07-30T06:57:44Z

Hello,

> I'm not exactly certain what a better error message would look like. Maybe

In my opinion it would be sufficient to give the column number on the line; I assume most editors display the cursor column, so it would be easier to figure out which character it is. Something like
ERROR: syntax: invalid character "−" in: -0.6626458266981849E−01 column 21
or
in expression starting at /home/ckoe/src/cp/ODE_Integratoren/ode_symplektisch.jl:198 column 23
Generally, it might be useful to give a column with other error messages, too.

Also, when reading from a file, it would be helpful for the message to indicate that, from the parser's viewpoint, the problem is a Unicode character and not some other formatting issue (caused by a typo earlier in the source file; we all know those).

Concerning #26193: I have no opinion on that.

EDIT: I should add that, of course, the error when using the module came first and led to quite a bit of confusion. The REPL test was just to demonstrate the issue. Having the column number there is only of limited value, but still better than nothing.

Best Regards

stevengj · 2018-07-30T12:43:01Z

> it would be sufficient to give the number of the column on the line

That also depends on Julia's (utf8proc's) charwidth matching the editor's charwidths. (Some Unicode characters take up 0 columns, some take up 1 column, and some take up 2 columns, but there isn't universal agreement on which is which; see e.g. #3721.)

However, it will match for most text, and I suppose we could say "near column 21" for the rare cases where the charwidths don't agree.

StefanKarpinski · 2018-07-30T16:03:03Z

We could just approximate column number with character number. After all, most people code in fixed-width fonts and for a lot of code this approximation will be correct. It seems better than not providing any information for fear of doing so imperfectly.

stevengj · 2018-07-30T16:04:23Z

Summing the charwidths is a better approximation, and is easy to compute …

One point of confusion here, @StefanKarpinski: charwidths (as computed by e.g. the Julia textwidth function, which uses utf8proc_charwidth) are for fixed-width fonts. That's why textwidth(::Char) returns 0, 1, or 2, and not some width in points or something. Even in a fixed-width font, 🍕 should have width 2 (= 2 columns) and combining characters should have width 0. (Combining characters are not that uncommon in math-heavy Julia code for things like ẋ or x̂.)

vtjnash · 2018-07-31T01:55:18Z

cf. #9579, which also added support for the necessary column number tracking in the parser

stevengj · 2018-07-31T15:37:50Z

#9579 counts code units, which is even cruder than counting characters… I wonder why the count wasn't incremented in ios_getutf8 instead of ios_getc? ~~Oh, I see: flisp/read.c has its own complicated logic using ios_getc.~~ Julia code is parsed using fl_iogetc, which calls ios_getutf8.
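To make the difference between the three candidate measures concrete, here is a small sketch comparing code units, characters, and summed charwidths for the text preceding an invalid character. The sample line is made up for illustration (a decomposed `x̂` followed by a U+2212 minus); only Base functions are used.

```julia
# Sample line (an assumption for illustration): decomposed x̂, i.e. 'x'
# followed by combining U+0302, and a U+2212 minus in the exponent.
line = "x\u0302 = 1E\u221201"

bad = findfirst(==('\u2212'), line)    # byte (code unit) index of the bad character
prefix = line[1:prevind(line, bad)]    # everything before the bad character

code_units = ncodeunits(prefix)   # UTF-8 code units: the combining mark takes 2
characters = length(prefix)       # code points: the combining mark counts as 1
columns    = textwidth(prefix)    # estimated columns: the combining mark counts as 0

println((code_units, characters, columns))
```

Here the three measures give 8, 7, and 6 respectively, so the same character would be reported at column 9, 8, or 7 depending on which one is used; only the last matches what a terminal that renders the combining mark over the `x` would show.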

ckoe-bccms · 2018-11-11T15:34:57Z

Thank you!

JeffBezanson added the parser (Language parsing and surface syntax) and error handling (Handling of exceptions by Julia or the user) labels Jul 29, 2018

stevengj changed the title ~~Unhelpful error message with UTF character~~ clearer parse errors for invalid characters Jul 30, 2018

stevengj mentioned this issue Jul 31, 2018

add colno to invalid-char parse error #28373

Merged

stevengj closed this as completed in #28373 Nov 2, 2018
