Random error using variables with unicode characters #5712

Closed
nlebedenco opened this issue Feb 7, 2014 · 30 comments
Labels: bug, unicode

Comments

@nlebedenco

At first I thought this was related to how I copied and pasted the pi character, because after pasting it again it simply worked. But after playing with multiplications I get seemingly random errors like:

ERROR: syntax: invalid character "�"

Notice how 2 * π evaluates as expected but 2π raises an exception...

notroot@dev-mint ~ $ julia
               _
   _       _ _(_)_     |  A fresh approach to technical computing
  (_)     | (_) (_)    |  Documentation: http://docs.julialang.org
   _ _   _| |_  __ _   |  Type "help()" to list help topics
  | | | | | | |/ _` |  |
  | | |_| | | | (_| |  |  Version 0.3.0-prerelease
 _/ |\__'_|_|_|\__'_|  |  Official http://julialang.org release
|__/                   |  i686-linux-gnu

julia> ☃ = 1
1

julia> ☃
1

julia> ☃ 
1

julia> ☃ 
1

julia> ☃
1

julia> 3☃
3

julia> 5☃
5

julia> 5☃8
ERROR: ☃8 not defined

julia> 5☃
5

julia> 5☃*2
10

julia> s = "This is a string."
"This is a string."

julia> s = "âThis is a string."
"âThis is a string."

julia> s[1]
'â'

julia> s[2]
ERROR: invalid UTF-8 character index
 in getindex at utf8.jl:63

julia> s[3]
'T'

julia> @printf "%d is less than %f" 4.5 5.3 # casa
5 is less than 5.300000
julia> bla! = 2
2

julia> Bla! = 2
2

julia> Bla! = 6
6

julia> bla!
2

julia> Bla!
6

julia> 2 * π
6.283185307179586

julia> 2π
ERROR: syntax: invalid character "�"

julia> ☃
ERROR: syntax: invalid character "�"

julia> π
π = 3.1415926535897...

julia> ☃
ERROR: syntax: invalid character "�"

julia> 5☃
5

julia> ☃
1

julia> 5☃*2
10

julia> 5☃
ERROR: syntax: invalid character "�"

julia> 5☃*2
10

julia> 5☃
5

julia> 5☃*2
10

julia> 5☃
5

julia> 5☃*2
10

julia> 5☃
5

julia> ☃
1

julia> π
π = 3.1415926535897...

julia> 

Any clues?

EDIT: adding versioninfo

julia> versioninfo()
Julia Version 0.3.0-prerelease
Platform Info:
  System: Linux (i686-linux-gnu)
  CPU: Intel(R) Core(TM) i5 CPU       M 450  @ 2.40GHz
  WORD_SIZE: 32
  BLAS: libblas.so.3
  LAPACK: liblapack.so.3
  LIBM: libopenlibm
@JeffBezanson
Member

Is your terminal set to a utf8 locale?

@jiahao
Member

jiahao commented Feb 7, 2014

I've also just noticed this error happening sporadically in my IJulia notebook instance running locally.

In[173]: versioninfo()
Julia Version 0.3.0-prerelease+1388
Commit 9fa2d17* (2014-02-04 20:15 UTC)
Platform Info:
  System: Darwin (x86_64-apple-darwin13.0.2)
  CPU: Intel(R) Core(TM) i5-4258U CPU @ 2.40GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY)
  LAPACK: libopenblas
  LIBM: libopenlibm

@nlebedenco
Author

Sure, otherwise I wouldn't be able to accurately copy the character anyway...

notroot@dev-mint ~ $ echo $LANG
pt_BR.UTF-8

One thing I forgot to mention: I generally recalled and edited the last executed lines from history (using the keyboard arrows) instead of always retyping everything. So I would also consider anything related to how history is implemented in the REPL...

@stevengj
Member

stevengj commented Feb 7, 2014

This just happened for me too, but then I restarted the notebook and couldn't reproduce it. Does anyone have a reproducible test case?

@stevengj
Member

stevengj commented Feb 7, 2014

I wonder if it is a bug introduced somehow by the unicode normalization in #5462? If you have a reproducible problem, maybe try adding a line

#define normalize(s) s

before static symbol_t *mk_symbol(const char *str) in src/flisp/flisp.c and see if the problem goes away?
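
For readers unfamiliar with what normalization changes: it maps different code point sequences that render as the same character onto one canonical form, so that both spell the same identifier. A minimal illustration in Python rather than Julia or C; NFC is chosen here as an assumption, and nothing below is taken from #5462 itself.

```python
import unicodedata

# Two spellings of the same visible character "é".
precomposed = "\u00e9"   # single code point U+00E9
decomposed = "e\u0301"   # 'e' followed by combining acute accent U+0301

# Without normalization the strings differ, so they would name
# two different symbols.
same_before = precomposed == decomposed

# After NFC normalization (an assumed choice) both collapse to one form.
same_after = (unicodedata.normalize("NFC", precomposed)
              == unicodedata.normalize("NFC", decomposed))

print(same_before, same_after)
```

Disabling normalization, as suggested above, would show whether this step is what introduces the bad character.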

@nlebedenco
Author

I haven't been able to reproduce the problem in a controlled way yet. It happens eventually if I keep using a variable long enough. I even thought it could be a REPL bug related to backspace operating on half of a UTF-8 character, but couldn't confirm that. I'll be on vacation for the next few weeks. On my return I may be able to give it a try with #define normalize(s).

@stevengj
Member

I'm pretty sure it isn't a REPL bug, because both jiahao and I have seen it in IJulia.

@Keno
Member

Keno commented Feb 18, 2014

This is really, really annoying. Do we have any idea what's going on?

@jiahao
Member

jiahao commented Feb 18, 2014

fwiw, I suspect that some sort of memory corruption is resulting in characters not being parsed correctly and thus being normalized to the generic Unicode replacement character � = '\ufffd'
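
As a point of reference, a UTF-8 decoder running in "replace" mode produces exactly this character whenever it meets a byte sequence it cannot parse. The sketch below uses Python's built-in decoder, not Julia's parser, and the truncated input is made up for illustration.

```python
# "π" is U+03C0, encoded in UTF-8 as the two bytes CF 80.
pi_bytes = "π".encode("utf-8")

# Made-up damaged input: the lead byte 0xCF is followed by an ASCII
# space instead of the continuation byte 0x80.
damaged = b"2\xcf "

# errors="replace" substitutes U+FFFD for the unparseable byte.
decoded = damaged.decode("utf-8", errors="replace")

print(pi_bytes.hex(), repr(decoded))
```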

@StefanKarpinski
Member

The question is whether it's a utf8proc error, an error in how utf8proc is being used, or an unrelated memory corruption.

@jiahao
Member

jiahao commented Feb 18, 2014

I have been unable to reproduce it with one Unicode character, but the problem shows up intermittently with a second character.

@Keno
Member

Keno commented Feb 26, 2014

The most frequent error I'm seeing is "malformed expression". I just came across some code that works when loaded from a file I edited in Sublime, but fails when executed from IJulia in Chrome.
I diffed the raw bytes and there is a difference in how chi is encoded. When posted via the browser:

julia> a[254:260]
 0xed
 0xa0
 0xb5
 0xed
 0xbc
 0x92
 0x20

When loaded from a file:

julia> b[254:258]
 0xf0
 0x9d
 0x9c
 0x92
 0x20

Note that I literally copy-pasted this from Chrome into Sublime and it started working. The code is in this gist: https://gist.github.com/loladiro/9221793 (GitHub wouldn't allow me to post it). I don't have much time to debug right now, but maybe this is helpful.
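
The six bytes posted via the browser appear to be a UTF-16 surrogate pair with each half encoded as its own three-byte sequence (a form sometimes called CESU-8), while the file holds the regular four-byte UTF-8 encoding of the same code point, U+1D712 (mathematical italic small chi). A small Python check of that reading, using only the byte values quoted above:

```python
browser_bytes = bytes([0xED, 0xA0, 0xB5, 0xED, 0xBC, 0x92])
file_bytes = bytes([0xF0, 0x9D, 0x9C, 0x92])

# Strict UTF-8 decoding rejects encoded surrogates.
try:
    browser_bytes.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# "surrogatepass" lets the two surrogate halves through as code points.
halves = browser_bytes.decode("utf-8", errors="surrogatepass")

# Round-tripping through UTF-16 joins the halves into one code point.
joined = halves.encode("utf-16-le", errors="surrogatepass").decode("utf-16-le")

print(strict_ok)
print([hex(ord(c)) for c in halves])
print(hex(ord(joined)), joined.encode("utf-8") == file_bytes)
```

If this reading is right, a strict UTF-8 parser would reject the browser form, which fits the different behavior between the two inputs.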

@Keno
Member

Keno commented Apr 20, 2014

The IJulia notebook bug is fixed in IPython; see ipython/ipython#5618. I also haven't seen the original REPL bug anymore, and I do use Unicode a lot (but feel free to reopen if it does happen).

@Keno closed this as completed Apr 20, 2014

@jiahao
Member

jiahao commented Apr 20, 2014

Actually I just encountered this bug again yesterday when introducing the empty set. I haven't been able to reproduce it with a debugger attached though...

@joehuchette
Member

I have been getting this with some frequency lately. No minimal working example, as it seems nondeterministic, but it only appears at the REPL (not when running a script with a julia foo.jl invocation). E.g.

julia> (1-ɛ)/ɛ
ERROR: syntax: invalid character "�"

julia> (1-ɛ)/ɛ
ERROR: syntax: invalid character "�"

julia> (1-ɛ)/ɛ
18.999999999999996

julia> (1-ɛ)/ɛ
18.999999999999996

julia> (1-ɛ)/ɛ
18.999999999999996

Maybe this should be reopened?

@lstagner
Contributor

I noticed that if the Unicode character is sandwiched between ASCII characters, the error doesn't occur:

julia> e₁e = 2
2

julia> e₁ = 2
2

julia> e₁
ERROR: syntax: invalid character "�"

julia> e₁e
2

julia> e₁e
2

julia> e₁e
2
.
.
.

@elextr

elextr commented May 23, 2014

This last one looks like an error I've made in the past: assuming that the byte index of the last character equals the byte index of the last byte in UTF-8.
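
To make the distinction concrete with the e₁ example above, here is a minimal Python sketch (not code from Julia or flisp) that computes both indices for the UTF-8 bytes of that identifier:

```python
s = "e\u2081"              # 'e' followed by U+2081 SUBSCRIPT ONE
data = s.encode("utf-8")   # one ASCII byte plus a three-byte sequence

last_byte_index = len(data) - 1

# Walk backwards over continuation bytes (0b10xxxxxx) to find where
# the last character actually starts.
last_char_start = last_byte_index
while last_char_start > 0 and (data[last_char_start] & 0xC0) == 0x80:
    last_char_start -= 1

print(data.hex(), last_byte_index, last_char_start)
```

The two indices only coincide when the final character is ASCII, which would fit the observation that a trailing ASCII character hides the problem.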

@lstagner
Contributor

This could be nothing, but I noticed that so far the error has only occurred on my 32-bit desktop and not on my 64-bit laptop.

Edit: never mind, I got it to happen there too.

@Keno
Member

Keno commented May 24, 2014

I also see this on my (64bit) mac.

@Keno
Member

Keno commented Jun 20, 2014

Findings so far: the replacement character is introduced by u8_toutf8 directly when called from flisp. It's being passed junk values (currently they always seem to look like 0xff65bxxx[x], i.e. the ff65b is always there, but its position and the random junk that follows differ), which I can't make sense of.
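
For illustration, here is how an encoder can end up emitting the replacement character when handed a value outside the Unicode range. This is a hypothetical Python sketch, not the actual u8_toutf8 source; the function name encode_code_point and the sample junk value 0xFF65B000 (chosen only to match the reported shape) are assumptions.

```python
REPLACEMENT = 0xFFFD  # U+FFFD REPLACEMENT CHARACTER


def encode_code_point(cp: int) -> bytes:
    """Hypothetical encoder: UTF-8 bytes for one non-negative code point.

    Values above U+10FFFF are swapped for U+FFFD before encoding.
    """
    if cp > 0x10FFFF:
        cp = REPLACEMENT
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


junk = 0xFF65B000  # made-up value with the reported ff65b prefix
print(encode_code_point(junk).hex())
print(encode_code_point(0x2603).decode("utf-8"))
```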

@Keno
Member

Keno commented Jun 20, 2014

Curiously, it also sometimes seems to evaluate correctly even when hitting the replacement char case. (I did verify that the character gets introduced there by replacing the replacement character with a different one, which did indeed show up in the error message.)

@Keno
Member

Keno commented Jun 20, 2014

Valgrind with MEMDEBUG2 is very vocal: https://gist.github.com/Keno/6c52aad3b1b3a17f407e

@Keno
Member

Keno commented Jun 20, 2014

@JeffBezanson could the problem be that we are peeking into unallocated memory, which may look like a continuation byte, hence giving us the wrong character?

@JeffBezanson
Member

That sounds possible, but it does check u8_seqlen to make sure enough bytes are available.

@Keno
Member

Keno commented Jun 20, 2014

Why do you compare against seqlen-1?

@JeffBezanson
Member

I probably wrote that because the code had already looked at one byte, but it doesn't consume that byte, so yes that looks wrong. Definitely try changing that.
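
A sketch of why that comparison matters, written in Python rather than taken from the flisp source. The helper names seqlen and peek_char, the buffer layout, and the stale byte are all assumptions made up for illustration: when the lead byte has only been peeked, the full sequence length must be available, and requiring one byte fewer lets the decoder read a byte that was never filled in.

```python
def seqlen(lead: int) -> int:
    """Length of the UTF-8 sequence introduced by a lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a lead byte")


def peek_char(buf: bytes, filled: int, off_by_one: bool):
    """Peek at the first character of buf; only buf[:filled] is valid.

    Returns the character's raw bytes, or None if more input is needed.
    """
    n = seqlen(buf[0])
    # The lead byte was peeked, not consumed, so all n bytes must be
    # present. Requiring only n - 1 lets the read run into stale data.
    required = n - 1 if off_by_one else n
    if filled < required:
        return None
    return buf[:n]


# "☃" is E2 98 83; here only two bytes have arrived and the third
# slot still holds a stale 0x00 (hypothetical leftover memory).
buf = b"\xe2\x98\x00"
wrong = peek_char(buf, filled=2, off_by_one=True)
right = peek_char(buf, filled=2, off_by_one=False)
print(repr(wrong.decode("utf-8", errors="replace")), right)
```

With the stricter check the caller waits for the missing byte instead of decoding garbage, which would also explain why the failure depends on whatever happens to be in memory.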

@joehuchette
Member

💯

@juliohm
Contributor

juliohm commented Aug 1, 2014

The bug is still present, for instance with "a subscript t".

Julia Version 0.3.0-rc1+260
Commit 727733d (2014-07-29 22:14 UTC)
Platform Info:
  System: Linux (x86_64-unknown-linux-gnu)
  CPU: Intel(R) Core(TM) i7-3632QM CPU @ 2.20GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Sandybridge)
  LAPACK: libopenblas
  LIBM: libopenlibm
  LLVM: libLLVM-3.3

@Keno
Member

Keno commented Aug 1, 2014

That's a different issue, I believe: #7582

@juliohm
Contributor

juliohm commented Aug 1, 2014

Thank you @Keno, you're right.
