evaluate losing character encoding information of arguments #74

Closed
jeroen opened this issue Jun 22, 2017 · 5 comments

Comments

jeroen commented Jun 22, 2017

We have been experiencing problems due to evaluate() losing character encoding information.

The problem goes unnoticed on platforms that treat unknown as UTF-8. But it leads to serious interoperability problems when serializing data. It is important that the encoding bit is retained.

A minimal example in which eval() returns the correct answer but evaluate() returns "unknown":

# Some UTF-8 text
x <- c("寿司")
cl <- call('Encoding', x)

# Expected output: UTF-8
eval(cl)
## [1] "UTF-8"

# Evaluate returns 'unknown'
evaluate::evaluate(cl)
## ...
## [[2]]
## "[1] \"unknown\"\n"

On Windows (English locale) the strings even get garbled:

cl2 <- call('c', x)
eval(cl2)
## "寿司"

evaluate::evaluate(cl2)
##  "[1] \"<U+5BFF><U+53F8>\"\n"
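
The "encoding bit" referred to above can be inspected directly. The following is a minimal sketch, not taken from the original report: it builds the string from `\u` escapes (an assumption made here so that the result does not depend on the locale or on the encoding of the source file) and shows that two strings can hold identical bytes while carrying different encoding marks.

```r
# A string written with \u escapes is marked as UTF-8 by the parser.
x <- "\u5bff\u53f8"

# Copy it and drop the mark; the underlying bytes are left untouched.
y <- x
Encoding(y) <- "unknown"

Encoding(x)
## [1] "UTF-8"
Encoding(y)
## [1] "unknown"

# Same bytes, different declared encoding.
identical(charToRaw(x), charToRaw(y))
## [1] TRUE
```

Code that later converts `y` with something like `enc2utf8()` in a non-UTF-8 locale would interpret those bytes as native-encoded text, which is the kind of interoperability problem described in this report.
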

jimhester added a commit to jimhester/evaluate that referenced this issue Jun 22, 2017

This allows one to specify the encoding the code is parsed with; the
value is passed to `base::parse()`. Previously the encoding was implicitly
"unknown", so the code was assumed to be ASCII only.

Fixes r-lib#74

yihui commented Jun 22, 2017

Encoding(x) does not necessarily return UTF-8. I can show you two examples:

On macOS with the C locale:

$ LANG="" R -e 'x <- c("寿司"); cl <- call("Encoding", x); eval(cl)'
> x <- c("寿司"); cl <- call("Encoding", x); eval(cl)
[1] "unknown"

On Windows with the Chinese locale:

> x <- c("寿司")
> cl <- call('Encoding', x)
> 
> eval(cl)
[1] "unknown"
> Sys.getlocale()
[1] "LC_COLLATE=Chinese (Simplified)_People's Republic of China.936;LC_CTYPE=Chinese (Simplified)_People's Republic of China.936;LC_MONETARY=Chinese (Simplified)_People's Republic of China.936;LC_NUMERIC=C;LC_TIME=Chinese (Simplified)_People's Republic of China.936"

The problem with an English locale on Windows is a known one (see #59 and #66), and I don't think we can fix it in evaluate without a proper fix in base R.

I recommend that you post your original problem so we can see whether we can fix it.

jeroen commented Jun 22, 2017

@yihui it doesn't have to be UTF-8, but I think the result of evaluate(cl) should match that of eval(cl)?

yihui commented Jun 22, 2017

The root problem here is that cl is a call object. We should probably just let evaluate() fail on such objects. What it currently does is deparse() the call to get its "source" code and then evaluate that code, instead of evaluating the call right away. The reason is that evaluate() aims to simulate the REPL, which means that what you need to provide to it is the source code as a character vector. For call objects, the source code has been lost.
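
The difference between the two paths described above can be sketched in base R. This is only an illustration of the deparse-then-parse round trip, not evaluate's actual code; the `\u` escapes are an assumption made here so that the input string is marked UTF-8 on any platform. What `src` and `roundtrip` contain depends on the platform and locale, so no output is shown for them.

```r
# Build a UTF-8-marked string and embed it in a call object.
x <- "\u5bff\u53f8"
cl <- call("Encoding", x)

# Path 1: evaluate the call directly. The call still holds the
# original string object, including its encoding mark.
direct <- eval(cl)

# Path 2: turn the call back into source text, then parse and
# evaluate that text. The string is re-created by the parser, so
# its mark (and possibly its content) depends on how deparse()
# rendered it in the current locale.
src <- deparse(cl)
roundtrip <- eval(parse(text = src))

direct      # "UTF-8"
src         # locale-dependent rendering of the call
roundtrip   # may or may not match `direct`
```
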

jeroen commented Jun 22, 2017

@yihui is there any reason we cannot just evaluate the given expression, since it has already been parsed? I.e.: #76

yihui commented Jun 22, 2017

That is exactly what I was thinking. Let me think a bit longer about it (edge cases, Windows, etc).
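
The idea of evaluating already-parsed input directly might look roughly like the sketch below. `eval_parsed()` is a hypothetical helper written for illustration; it is not part of evaluate and is not the change proposed in #76. It skips the deparse step and captures the printed form of each visible result.

```r
# Hypothetical helper (not part of evaluate): evaluate parsed input
# directly instead of deparsing it back to source text first.
eval_parsed <- function(expr, envir = parent.frame()) {
  force(envir)
  # An expression vector holds several top-level expressions;
  # a single call or symbol is treated as one.
  exprs <- if (is.expression(expr)) as.list(expr) else list(expr)
  lapply(exprs, function(e) {
    res <- withVisible(eval(e, envir))
    # Record what printing the value would show, as a REPL would
    # for a visible result.
    out <- if (res$visible) {
      utils::capture.output(print(res$value))
    } else {
      character()
    }
    list(value = res$value, output = out)
  })
}

x <- "\u5bff\u53f8"
res <- eval_parsed(call("Encoding", x))
res[[1]]$output
## [1] "[1] \"UTF-8\""
```

Because the string inside the call is never re-parsed, its encoding mark survives. Edge cases such as source references, warnings, and plots are left out of this sketch.
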
