The msg from fread should mark txt as declared encoding #4747

Closed
shrektan opened this issue Oct 10, 2020 · 0 comments · Fixed by #4751
Labels
bug encoding issues related to Encoding fread

shrektan commented Oct 10, 2020

On Windows, when the input text is UTF-8 encoded and a message printed by fread() quotes part of that text, the quoted text is displayed as garbled characters. I believe the cause is that we don't mark the quoted text with the declared encoding ("UTF-8").
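To illustrate what "marking" means here, below is a minimal base-R sketch. It is not fread's internal code. It only shows that an encoding mark changes how R interprets a string's bytes, not the bytes themselves. The bytes are written out explicitly (the UTF-8 encoding of 中文3) so the example does not depend on how this snippet is saved; the variable names are made up for illustration.

```r
# UTF-8 bytes for the footer text from the example (two CJK characters plus "3").
utf8_bytes <- as.raw(c(0xE4, 0xB8, 0xAD, 0xE6, 0x96, 0x87, 0x33))

# rawToChar() returns an unmarked string, so R treats its bytes as being in
# the native encoding (CP936 in the Windows session reported in this issue).
unmarked <- rawToChar(utf8_bytes)
Encoding(unmarked)

# Declaring the encoding tells R how to interpret the very same bytes.
marked <- unmarked
Encoding(marked) <- "UTF-8"
Encoding(marked)

# The mark is metadata only: no byte is changed.
identical(charToRaw(unmarked), charToRaw(marked))

# Once marked, R counts 3 characters stored in 7 bytes.
nchar(marked, type = "chars")
nchar(marked, type = "bytes")
```

An unmarked string holding UTF-8 bytes is what I suspect ends up in the message, which would explain why a non-UTF-8 locale renders it incorrectly.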

A reproducible example on Windows

Code

txt <- "A,B\n中文1,中文2\n中文3"
txt <- enc2utf8(txt)
data.table::fread(text = txt, encoding = 'UTF-8')

Output

       A     B
1: 中文1 中文2
Warning message:
In data.table::fread(text = txt, encoding = "UTF-8") :
  Discarded single-line footer: <<涓枃3>>

In contrast, natively encoded txt is displayed correctly

Code

txt <- "A,B\n中文1,中文2\n中文3"
data.table::fread(text = txt)

Output

       A     B
1: 中文1 中文2
Warning message:
In data.table::fread(text = txt) : Discarded single-line footer: <<中文3>>
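To compare the two cases without relying on how the console renders them, the warning text can be captured and inspected directly. The sketch below assumes nothing about fread internals. `capture_warning()` is a hypothetical helper written for this issue, not part of data.table, and the fread call is only run if data.table is installed.

```r
# Hypothetical helper: evaluate an expression, return the message of the last
# warning it raised (or NULL if there was none), and suppress the warning.
capture_warning <- function(expr) {
  msg <- NULL
  withCallingHandlers(
    expr,
    warning = function(w) {
      msg <<- conditionMessage(w)
      invokeRestart("muffleWarning")
    }
  )
  msg
}

if (requireNamespace("data.table", quietly = TRUE)) {
  # Same input as the example above, written with \u escapes so the string
  # is UTF-8 regardless of how this script is saved.
  txt <- enc2utf8("A,B\n\u4e2d\u65871,\u4e2d\u65872\n\u4e2d\u65873")
  msg <- capture_warning(data.table::fread(text = txt, encoding = "UTF-8"))
  if (!is.null(msg)) {
    print(Encoding(msg))   # how R will interpret the message bytes
    print(charToRaw(msg))  # the raw bytes actually stored in the message
  }
}
```

If the bytes are valid UTF-8 but the message is not marked as such, that would be consistent with the garbled footer shown in the first output.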

Session info

> sessionInfo()
R version 4.0.2 (2020-06-22)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 17763)

Matrix products: default

locale:
[1] LC_COLLATE=Chinese (Simplified)_China.936  LC_CTYPE=Chinese (Simplified)_China.936   
[3] LC_MONETARY=Chinese (Simplified)_China.936 LC_NUMERIC=C                              
[5] LC_TIME=Chinese (Simplified)_China.936    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] magrittr_1.5      data.table_1.13.0

Another example on Mac

x <- "fa\xE7ile"
Encoding(x) <- "latin1"
txt <- sprintf("A,B\n%s,%s\n%s", x, x, x)
Encoding(txt) <- "UTF-8"

data.table::fread(text = txt, encoding = 'UTF-8')

txt2 <- iconv(txt, "UTF-8", "latin1")
data.table::fread(text = txt2, encoding = 'Latin-1')
shrektan self-assigned this Oct 10, 2020
jangorecki added the "encoding issues related to Encoding" label Oct 10, 2020
mattdowle added this to the 1.13.7 milestone Jan 4, 2021
mattdowle modified the milestones: 1.13.7, 1.14.0 Feb 20, 2021