clean_names could transliterate accented characters #120

Closed
fernandovmacedo opened this issue May 21, 2017 · 7 comments

Comments

@fernandovmacedo
Contributor

Data frames whose column names contain accented characters (such as á, ô, ü) have to be quoted when used in dplyr. If clean_names transliterated them to ASCII, the problem would be solved.

The solution is quite simple: you would only need to add this line to the function, before the duplicate-handling part, and include stringi in the package's dependencies.

stringi::stri_trans_general(., "Latin-ASCII")
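As a minimal sketch of what that call does on its own, outside clean_names: the data frame and its column names below are made up for illustration, and the names are written with Unicode escapes so the snippet does not depend on the file encoding.

``` r
# Hypothetical example data (not from the issue).
df <- data.frame(a = 1, b = 2)
names(df) <- c("pre\u00e7o", "regi\u00e3o")  # "preço", "região"

# Transliterate the column names to ASCII with the ICU "Latin-ASCII" transform.
names(df) <- stringi::stri_trans_general(names(df), "Latin-ASCII")

names(df)
#> [1] "preco"  "regiao"
```

After this step the columns can be referred to in dplyr without quoting.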

@sfirke
Owner

sfirke commented May 22, 2017

I love this idea. I will look into it more; could you share a quick reproducible example ("reprex") I could test out?

@Tazinho
Contributor

Tazinho commented May 22, 2017

This might be of interest.
gagolews/stringi#269

@fernandovmacedo
Contributor Author

fernandovmacedo commented May 23, 2017

Well, the special characters got corrupted when I used reprex(), but I fixed them manually.

I was not aware of the platform problem pointed out by @Tazinho, but I believe this solution is a good start until we find a better one. Here it is:

library(dplyr) 
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

# Clean column names: strip quotes, spell out "%", make the names
# syntactically valid, convert to lower snake_case, then transliterate.
clean_names <- function(dat) {
    old_names <- names(dat)
    new_names <- old_names %>%
        gsub("'", "", .) %>%
        gsub("\"", "", .) %>%
        gsub("%", "percent", .) %>%
        gsub("^[ ]+", "", .) %>%
        make.names(.) %>%
        gsub("[.]+", "_", .) %>%
        gsub("[_]+", "_", .) %>%
        tolower(.) %>%
        gsub("_$", "", .) %>%
        ## here is the new line to transliterate the characters ##
        stringi::stri_trans_general("latin-ascii")

    # Count how many times each name has occurred so far, then append that
    # count to the second and later occurrences of a duplicated name.
    dupe_count <- sapply(seq_along(new_names), function(i) {
        sum(new_names[i] == new_names[1:i])
    })
    new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_")
    stats::setNames(dat, new_names)
}


tmp_df <- data_frame(a = 1, b = 2, c = 3, d = 4, e = 5)
names(tmp_df) <- c("á", "ê", "ï", "õ", "ù")

tmp_df
#> # A tibble: 1 x 5
#>       á     ê     ï     õ     ù
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4     5

df_clean <- clean_names(tmp_df)

df_clean
#> # A tibble: 1 x 5
#>       a     e     i     o     u
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4     5
<details><summary>Session info</summary>

``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#> setting value
#> version R version 3.4.0 (2017-04-21)
#> system x86_64, mingw32
#> ui RTerm
#> language (EN)
#> collate Portuguese_Brazil.1252
#> tz America/Sao_Paulo
#> date 2017-05-22
#> Packages -----------------------------------------------------------------
#> package * version date source
#> assertthat 0.2.0 2017-04-11 CRAN (R 3.4.0)
#> backports 1.0.5 2017-01-18 CRAN (R 3.4.0)
#> base * 3.4.0 2017-04-21 local
#> compiler 3.4.0 2017-04-21 local
#> datasets * 3.4.0 2017-04-21 local
#> DBI 0.6-1 2017-04-01 CRAN (R 3.4.0)
#> devtools 1.13.1 2017-05-13 CRAN (R 3.4.0)
#> digest 0.6.12 2017-01-27 CRAN (R 3.4.0)
#> dplyr * 0.5.0 2016-06-24 CRAN (R 3.4.0)
#> evaluate 0.10 2016-10-11 CRAN (R 3.4.0)
#> graphics * 3.4.0 2017-04-21 local
#> grDevices * 3.4.0 2017-04-21 local
#> htmltools 0.3.6 2017-04-28 CRAN (R 3.4.0)
#> knitr 1.16 2017-05-18 CRAN (R 3.4.0)
#> magrittr 1.5 2014-11-22 CRAN (R 3.4.0)
#> memoise 1.1.0 2017-04-21 CRAN (R 3.4.0)
#> methods * 3.4.0 2017-04-21 local
#> R6 2.2.1 2017-05-10 CRAN (R 3.4.0)
#> Rcpp 0.12.10 2017-03-19 CRAN (R 3.4.0)
#> rlang 0.1.1 2017-05-18 CRAN (R 3.4.0)
#> rmarkdown 1.5 2017-04-26 CRAN (R 3.4.0)
#> rprojroot 1.2 2017-01-16 CRAN (R 3.4.0)
#> stats * 3.4.0 2017-04-21 local
#> stringi 1.1.1 2016-05-27 CRAN (R 3.3.0)
#> stringr 1.2.0 2017-02-18 CRAN (R 3.4.0)
#> tibble 1.3.1 2017-05-17 CRAN (R 3.4.0)
#> tools 3.4.0 2017-04-21 local
#> utils * 3.4.0 2017-04-21 local
#> withr 1.0.2 2016-06-20 CRAN (R 3.4.0)
#> yaml 2.1.14 2016-11-12 CRAN (R 3.4.0)
```

</details>

@sfirke
Owner

sfirke commented May 23, 2017

Thanks @Tazinho for looking into this w/ stringi. As a Windows user I can (unfortunately) attest to the cross-platform differences. The example with á works for me on Windows, but stringi::stri_trans_general("œ", "Latin-ASCII") gets me "\u009c".

But, it looks like gagolews/stringi#270 will enable @fernandovmacedo's changes to clean_names() to work consistently across platforms, at which point I agree this would be a nice upgrade to clean_names. I subscribed to that issue and look forward to incorporating that functionality.
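One rough way to check how a given platform and stringi build handle the œ case described above is to test whether the transliterated result is pure ASCII. This is only a sketch; the variable names are illustrative, and no particular output is assumed, since the result depends on the installed stringi version.

``` r
# "\u0153" is the oe ligature (œ) mentioned above, written as an escape
# so the snippet does not depend on the file encoding.
result <- stringi::stri_trans_general("\u0153", "Latin-ASCII")

# TRUE if every byte of the result is ASCII; FALSE suggests the character
# was not transliterated on this platform / stringi build.
is_ascii <- stringi::stri_enc_isascii(result)

print(result)
print(is_ascii)
```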

@sfirke
Owner

sfirke commented Jun 23, 2017

Looks like that stringi fix for Windows worked; I was just making user errors on my end.

That fix isn't on CRAN yet. Until stringi 1.1.6 is on CRAN, adding the line @fernandovmacedo suggests to clean_names() will turn œ into \u009c for Windows users. I don't see that as a huge problem: both are terrible and require manual intervention 😆

Then, once that stringi release is on CRAN, I suggest janitor depend on the latest version of stringi.
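Until such a dependency is declared, a runtime check along these lines could flag an older stringi. This is a sketch only: the 1.1.6 threshold is the version named in this thread, and the variable names and warning text are illustrative.

``` r
# Version named in this thread as carrying the Windows fix.
required <- "1.1.6"

# Version of stringi currently installed (errors if stringi is missing).
installed <- utils::packageVersion("stringi")

if (installed < required) {
    warning("stringi ", installed, " is older than ", required,
            "; some characters may not be transliterated on Windows.")
}
```

In a package, the usual place for this kind of constraint is a minimum version in the DESCRIPTION file rather than a runtime check.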

@fernandovmacedo - would you like to add that line, and some tests (like moving your above example into a formal test with testthat), as a pull request? Happy to advise a bit if you are interested. Or I can implement it, up to you.
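A formal test of the kind suggested here might look roughly like the following. `transliterate_names()` is a hypothetical stand-in for the transliteration step, defined inline so the snippet is self-contained; a real test would call clean_names() itself.

``` r
library(testthat)

# Hypothetical helper standing in for the new line in clean_names().
transliterate_names <- function(x) {
    stringi::stri_trans_general(x, "Latin-ASCII")
}

test_that("accented column names are transliterated to ASCII", {
    # The same five characters as in the example above, written as
    # Unicode escapes: á, ê, ï, õ, ù.
    accented <- c("\u00e1", "\u00ea", "\u00ef", "\u00f5", "\u00f9")
    expect_equal(transliterate_names(accented), c("a", "e", "i", "o", "u"))
})
```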

@fernandovmacedo
Contributor Author

Sure, I will work on the Pull Request this weekend.

@Tazinho
Contributor

Tazinho commented Jul 11, 2017

Added this to snakecase; see #96 (I couldn't reference this issue from there for some reason...)
