Encoding issues on windows #73

Closed
t-kalinowski opened this issue Nov 9, 2017 · 7 comments · Fixed by #185

Comments

@t-kalinowski
Contributor

Posting as an image since GH seems to mangle it differently.
image

Not sure of the best way to fix this. The easiest fix seems to be for upstream udunits2 to return strings with the correct Encoding.

@edzer
Member

edzer commented Jun 8, 2018

Have you had a look at how this behaves now, with the udunits branch (soon to be master)?

@t-kalinowski
Contributor Author

So, as I mentioned in another post, I don't have easy access to a Windows box these days. However, this issue also manifested on Mac.

It seems that parts of it got fixed during the transition, but not completely. What remains to be done is to pull through udunits's encoding information (i.e., what was last passed to units:::ud_set_encoding()) and assign it to the character vector with Encoding<-() before returning to the user. Also, there are some minor differences between how udunits likes its encoding specification string and how R likes it ("utf8" vs "UTF-8") that we should patch over. A simple switch statement should probably do the trick.

Here is a screenshot of how this looks on the Mac currently (this is with the master branch from a few minutes ago):
image
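
A minimal sketch of the approach proposed in the comment above: translate the udunits-style encoding name into a mark that R accepts, then apply it with Encoding<-() before returning. The helper names `ud_to_r_encoding()` and `mark_encoding()` are hypothetical and not part of units, and the set of udunits-side names handled here is an assumption. How the last value passed to units:::ud_set_encoding() would be retrieved is left out, so it is passed in as an argument.

```r
# Hypothetical helper (not part of units): map a udunits-style encoding
# name to one of the marks that `Encoding<-` accepts.
# The udunits-side spellings listed here are assumptions.
ud_to_r_encoding <- function(ud_enc) {
  switch(tolower(ud_enc),
    "utf8" = ,
    "utf-8" = "UTF-8",
    "latin1" = ,
    "iso-8859-1" = "latin1",
    "unknown"  # ascii or anything unrecognised: leave the string unmarked
  )
}

# Hypothetical helper: mark a character vector returned from udunits
# with the encoding udunits was configured to use.
mark_encoding <- function(x, ud_enc) {
  Encoding(x) <- ud_to_r_encoding(ud_enc)
  x
}

# Example: the UTF-8 bytes of a micro sign followed by "m", as they might
# arrive from C code without an encoding mark.
raw_str <- rawToChar(as.raw(c(0xc2, 0xb5, 0x6d)))
marked <- mark_encoding(raw_str, "utf8")
```

Marking only changes how R interprets the bytes; it does not convert them, so this is only correct if udunits really produced the string in the encoding named by `ud_enc`.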

@Enchufa2
Member

Enchufa2 commented Jun 9, 2018

I don't think so, because character vectors don't seem to have any encoding by default:

y <- "dummy"
Encoding(y)
#> [1] "unknown"

Instead, I think that units should simply set udunits's encoding appropriately according to the current session (I don't know if there's a better way to get the encoding than utils::localeToCharset()). That means, AFAIK, UTF-8 for Unix systems and latin1 for Windows.
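
One way the locale-based choice suggested above could look, as a sketch only: take the first charset reported by utils::localeToCharset() and reduce it to a udunits-style encoding name. The function name `charset_to_ud()` is hypothetical, the returned names ("utf8", "latin1", "ascii") are assumed to be what udunits expects, and treating CP1252 as latin1 is a simplification.

```r
# Hypothetical helper (not part of units): choose a udunits-style encoding
# name from the session charset. The returned names are assumptions.
charset_to_ud <- function(charset = utils::localeToCharset()[1]) {
  # localeToCharset() can return NA when the charset cannot be determined
  if (is.na(charset)) {
    return("ascii")
  }
  cs <- toupper(charset)
  if (cs == "UTF-8") {
    "utf8"
  } else if (cs %in% c("ISO8859-1", "CP1252")) {
    "latin1"  # CP1252 is treated as latin1 here for simplicity
  } else {
    "ascii"   # conservative fallback for anything else
  }
}

# Encoding name for the current session
session_enc <- charset_to_ud()
```

Falling back to "ascii" for unrecognised charsets is a conservative design choice in this sketch, not something settled in the discussion.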

@t-kalinowski
Contributor Author

t-kalinowski commented Jun 9, 2018

> Encoding("μ")
[1] "UTF-8"
> Encoding("abc")
[1] "unknown"

R drops the encoding information for ASCII-only strings and only retains the UTF-8 encoding mark on strings that need it.

@t-kalinowski
Contributor Author

Also, setting "UTF-8" encoding on ASCII-only vectors is safe:

> x <- "abc"
> Encoding(x) <- "UTF-8"
> x
[1] "abc"
> Encoding(x)
[1] "unknown"

@Enchufa2
Member

Enchufa2 commented Jun 9, 2018

You are right. Then, I would do both things: 1) set the proper encoding according to the locale and 2) apply Encoding<-() to every string returned from udunits.

@t-kalinowski
Contributor Author

I agree with both your points.
