Encoding issues in Hebrew. Lost in translation #234

Closed
isteves opened this issue Jan 19, 2019 · 25 comments

@isteves

isteves commented Jan 19, 2019

reprex::reprex("א")
>Rendering reprex...
>Error in gsub("(?<=\n)(?=.|\n)", continue, x, perl = TRUE) : 
 > input string 1 is invalid UTF-8

@krlmlr
This is on a Windows 10 machine with the system language set to Hebrew.

Sys.getlocale()
> [1] "LC_COLLATE=Hebrew_Israel.1255;LC_CTYPE=Hebrew_Israel.1255;LC_MONETARY=Hebrew_Israel.1255;LC_NUMERIC=C;LC_TIME=Hebrew_Israel.1255"
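
As a side note, base R's `l10n_info()` gives a quicker read on whether the session's native encoding is UTF-8 than parsing the `Sys.getlocale()` string. This is only a sketch of a diagnostic; the variable name `info` is illustrative.

```r
# Summary of the current session's localization settings.
info <- l10n_info()

# TRUE when the native encoding is UTF-8, FALSE otherwise.
info[["UTF-8"]]

# On Windows this element holds the active code page;
# on other platforms it is NULL.
info$codepage
```
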
@yutannihilation
Member

yutannihilation commented Jan 23, 2019

@krlmlr
Member

krlmlr commented Jan 23, 2019

Thanks, @yutannihilation. Does this also happen in your locale?

I wonder if there's a string that should have been tagged as UTF-8 but isn't.
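
A minimal base-R sketch of that hypothesis, assuming nothing about reprex or knitr internals: the single byte 0xE0 is how CP1255 encodes א, and on its own it is not valid UTF-8, so code that treats such a string as UTF-8 has reason to reject it. The variable names are illustrative.

```r
# "א" written with a Unicode escape, so R marks the string as UTF-8.
x_utf8 <- "\u05d0"

# The same letter as a single CP1255 byte (0xE0), with no encoding mark.
x_native <- rawToChar(as.raw(0xE0))

# Compare the encoding marks of the two strings.
Encoding(x_utf8)
Encoding(x_native)

# validUTF8() inspects the bytes themselves, regardless of the mark.
validUTF8(c(x_utf8, x_native))
```
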

@isteves
Author

isteves commented Jan 23, 2019

@yutannihilation that seems to be an accurate assessment. When I run an RMarkdown file that looks like this with rmarkdown::render():

```{r}
"א"
```

...I get the following traceback():

> traceback()
18: gsub("(?<=\n)(?=.|\n)", continue, x, perl = TRUE)
17: paste0(prompt, gsub("(?<=\n)(?=.|\n)", continue, x, perl = TRUE))
16: line_prompt(x[which], prompt = prefix, continue = prefix)
15: comment_out(x, options$comment)
14: wrap.character(X[[i]], ...)
13: FUN(X[[i]], ...)
12: lapply(x, wrap, options)
11: wrap.list(res, options)
10: wrap(res, options)
9: unlist(wrap(res, options))
8: block_exec(params)
7: call_block(x)
6: process_group.block(group)
5: process_group(group)
4: withCallingHandlers(if (tangle) process_tangle(group) else process_group(group), 
       error = function(e) {
           setwd(wd)
           cat(res, sep = "\n", file = output %n% "")
           message("Quitting from lines ", paste(current_lines(i), 
               collapse = "-"), " (", knit_concord$get("infile"), 
               ") ")
       })
3: process_file(text, output)
2: knitr::knit(knit_input, knit_output, envir = envir, quiet = quiet, 
       encoding = encoding)
1: rmarkdown::render("test.Rmd")

@isteves
Author

isteves commented Jan 23, 2019

Update: when I run Sys.setlocale('LC_ALL','C') then reprex::reprex("א") works (the output is garbled, but that's a separate/known problem)

https://stackoverflow.com/questions/41717781/warning-input-string-not-available-in-this-locale

@krlmlr
Member

krlmlr commented Jan 23, 2019

Perhaps reprex::reprex() needs to use xfun::write_utf8() instead of writeLines() (analogously for reading)?

writeLines(src, r_file)
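
For illustration, here is a base-R sketch of what a UTF-8-safe write and read could look like. The helper names `write_lines_utf8()` and `read_lines_utf8()` are hypothetical; this is not reprex's code, and it is not claimed to be how xfun implements its functions.

```r
# Hypothetical helper: convert to UTF-8, then write the bytes as-is so the
# native locale encoding (e.g. CP1255) is never applied on output.
write_lines_utf8 <- function(text, path) {
  writeLines(enc2utf8(text), path, useBytes = TRUE)
}

# Hypothetical helper: read lines and mark them as UTF-8 instead of
# leaving them to be interpreted in the native encoding.
read_lines_utf8 <- function(path) {
  readLines(path, encoding = "UTF-8", warn = FALSE)
}

# Round trip through a temporary file.
tmp <- tempfile(fileext = ".R")
write_lines_utf8("\u05d0", tmp)
roundtrip <- read_lines_utf8(tmp)
Encoding(roundtrip)
```

The point of `useBytes = TRUE` is that the already-converted UTF-8 bytes reach the file unchanged, whatever the session locale is.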

@yutannihilation
Member

This might reproduce on my locale (Japanese_Japan.932), but I have yet to figure it out. Here's the result of reprex::reprex("א"):

"<U+05D0>"
#> [1] "<U+05D0>"

Created on 2019-01-23 by the reprex package (v0.2.1)

I'm afraid this might be a result of knitr's breaking change, which seems to require reprex to choose a different strategy for encoding.

yihui/knitr@44e92a9

@yutannihilation
Member

Ah, sorry. It seems I'm wrong if this also occurs with render().

When I run an RMarkdown file that looks like this with rmarkdown::render():

@yutannihilation
Member

yutannihilation commented Jan 23, 2019

@isteves If you add encoding, does it raise an error?

rmarkdown::render("/path/to/file.Rmd", encoding = "UTF-8")

@krlmlr
Member

krlmlr commented Jan 23, 2019

Can you try a Japanese character in the Japanese locale, please?

@yutannihilation
Member

Japanese characters don't raise errors for me.

@krlmlr
Member

krlmlr commented Jan 23, 2019

Interesting.

@isteves: What about Hebrew characters other than א? (Not looking for a comprehensive answer here.)

@isteves
Author

isteves commented Jan 23, 2019

@yutannihilation what does your locale info look like? (Sys.getlocale()). Do you have "LC_ALL"?
and nope, adding encoding = "UTF-8" doesn't make a difference

@yutannihilation
Member

Here's my locale, with a Japanese character as an example.

Sys.getlocale(); "髙"
#> [1] "LC_COLLATE=Japanese_Japan.932;LC_CTYPE=Japanese_Japan.932;LC_MONETARY=Japanese_Japan.932;LC_NUMERIC=C;LC_TIME=Japanese_Japan.932"
#> [1] "髙"

Created on 2019-01-23 by the reprex package (v0.2.1)

and nope, adding encoding = "UTF-8" doesn't make a difference

Hmm, thanks.

@isteves
Author

isteves commented Jan 23, 2019

Hmm, something is weird about Hebrew. I also tried Japanese/Arabic/Russian for good measure.

Anyone else getting the same results? (I'm still in Hebrew/Israel locale)

reprex::reprex("")
reprex::reprex("الحمص")
reprex::reprex("חומוס")
reprex::reprex("хумус")

@krlmlr
Member

krlmlr commented Jan 23, 2019

Can you please try remotes::install_github("krlmlr/reprex@f-xfun") ?

@isteves
Author

isteves commented Jan 23, 2019

@krlmlr it did the trick! 🎊

@isteves
Author

isteves commented Jan 23, 2019

@krlmlr I realized maybe I jumped the gun. The error from before no longer shows up, but I get this as a reprex output for Hebrew:

"׳—׳•׳�׳•׳¡"
#> [1] "׳—׳•׳\236׳•׳¡"

The other examples get <U+...>, for example:

"<U+9AD9>"
#> [1] "<U+9AD9>"

...I think this could be a different bug maybe?

@krlmlr
Member

krlmlr commented Jan 23, 2019

It's getting better, though. Can you please try again? I updated the branch.

@isteves
Author

isteves commented Jan 23, 2019

I'm getting the same behavior.

Btw I bumped into an old issue of mine (rmarkdown::render()) that was ultimately traced to sink: https://community.rstudio.com/t/problem-rendering-foreign-languages-in-rmd/17931/7 Not sure if these issues have started to converge or not.

Is there a way to trace all functions that any given function calls? (to quickly check if two functions are connected)

@yutannihilation
Member

Not sure if these issues have started to converge or not.

It has converged to "won't fix", sadly...
r-lib/evaluate#59 (comment)

Is there a way to trace all functions that any given function calls? (to quickly check if two functions are connected)

I don't know a good way to do this, but here's how sink() is called by rmarkdown::render():

Note that you cannot simply use debug() on sink() to investigate reprex(), since it knits the R code in a fresh R session:

reprex/R/reprex.R

Lines 383 to 399 in 0b633ad

reprex_render <- function(input, std_out_err = NULL) {
  callr::r_safe(
    function(input) {
      options(
        keep.source = TRUE,
        rlang_trace_top_env = globalenv(),
        crayon.enabled = FALSE
      )
      rmarkdown::render(input, quiet = TRUE, envir = globalenv())
    },
    args = list(input = input),
    spinner = interactive(),
    stdout = std_out_err,
    stderr = std_out_err
  )
}
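
On the earlier question of checking which functions a given function calls, one rough static option (not suggested anywhere in this thread) is `codetools::findGlobals()`, which lists the global names a function's body refers to. The sketch below assumes the codetools package is installed; `direct_calls()` and `capture()` are hypothetical names made up for the example.

```r
# Hypothetical helper: names of functions referenced directly in the body of `f`.
direct_calls <- function(f) {
  codetools::findGlobals(f, merge = FALSE)$functions
}

# Toy function standing in for real code that redirects output with sink().
capture <- function(expr) {
  sink(tempfile())
  on.exit(sink())
  force(expr)
}

# Does `capture` refer to sink() directly?
uses_sink <- "sink" %in% direct_calls(capture)
uses_sink
```

This only sees direct references in one function body. It does not follow calls into other functions, and like debug() it cannot see anything that happens in the separate R session started by callr.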

@isteves
Author

isteves commented Jan 27, 2019

@yutannihilation thanks for the detailed explanation! I guess a manual trace always works...

Yeah callr is so cool, but it does make it difficult to debug. I can't believe I only discovered it at the conference!

@jennybc
Member

jennybc commented May 16, 2019

I am hopefully doing a small reprex release at this very moment, to update a test for fs v1.3.1. That is intentionally a very low-risk release.

But assuming that goes through in a reasonable amount of time, I want to make some meatier changes soon in dev and let people accumulate some experience.

This is a long thread and the knitr/xfun context has changed a lot wrt UTF-8.

PR #237 from @krlmlr looks like the way to go.

@isteves and @yutannihilation would you consider updating your reprexes or thoughts here, after updating your entire knitr/rmarkdown stack?

@yutannihilation
Member

Thanks for the notice. On my locale, the results of the reprexes are the same with the current master.

@isteves
Author

isteves commented May 18, 2019

@jennybc on Hebrew locale, I'm getting the garbled output but no error:

"׳—׳•׳�׳•׳¡"
#> [1] "׳—׳•׳�׳•׳¡"

(strangely, the garbles are slightly different than the ones earlier in this thread)

Including my session info below in case I need to update any other packages:

R version 3.5.1 (2018-07-02)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17134)

Matrix products: default

locale:
[1] LC_COLLATE=Hebrew_Israel.1255  LC_CTYPE=Hebrew_Israel.1255   
[3] LC_MONETARY=Hebrew_Israel.1255 LC_NUMERIC=C                  
[5] LC_TIME=Hebrew_Israel.1255    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] reprex_0.3.0   rmarkdown_1.12 knitr_1.23    

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.0      ps_1.3.0        digest_0.6.18   R6_2.4.0        evaluate_0.13  
 [6] rlang_0.3.1     fs_1.2.6        callr_3.1.1     whisker_0.3-2   tools_3.5.1    
[11] xfun_0.7        compiler_3.5.1  processx_3.2.1  clipr_0.5.0     htmltools_0.3.6

@jennybc
Member

jennybc commented May 18, 2019

I think between:

reprex is handling encoding as well as its dependencies allow (most especially given the difficulties around encoding on Windows in R itself). I'm closing this. If anyone has a new challenging example, especially one that fails with dev reprex + knitr v1.23, please add it to #262.

@jennybc jennybc closed this as completed May 18, 2019