clean_names could transliterate accented characters #120

fernandovmacedo · 2017-05-21T17:42:42Z

Data frames whose column names contain accented characters (things like á, ô, ü) force you to wrap those names in quotes when using dplyr. If clean_names transliterated them to ASCII, the problem would be solved.

The solution is quite simple: you would only need to add this line to the function before the duplicate-handling part, and include stringi in the package's dependencies.

stringi::stri_trans_general(., "Latin-ASCII")
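
To illustrate why the call belongs before the duplicate handling, here is a minimal sketch rather than janitor's actual code. It assumes only that stringi is installed; the helper `dedupe_names()` is a hypothetical name that uses a simple running count to suffix repeated names. Transliterating first means that names such as "á" and "a" are recognised as a collision and one of them gets a suffix.

```r
library(stringi)

# Hypothetical helper (not part of janitor): append _2, _3, ... to repeated names.
dedupe_names <- function(x) {
  # For each position, count how many times that name has appeared so far.
  dupe_count <- vapply(
    seq_along(x),
    function(i) sum(x[i] == x[seq_len(i)]),
    integer(1)
  )
  x[dupe_count > 1] <- paste(x[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_")
  x
}

# "á", "a", "ô", written as escapes so the file encoding does not matter.
raw_names <- c("\u00e1", "a", "\u00f4")

# Transliterate first, then resolve the collision between "á" and "a".
ascii_names <- stri_trans_general(raw_names, "Latin-ASCII")
clean <- dedupe_names(ascii_names)
print(clean)
```

If the two steps were swapped, "á" and "a" would still look distinct during the duplicate check and would only collapse to the same name afterwards.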

sfirke · 2017-05-22T19:01:02Z

I love this idea. I will look into it more; could you share a quick reproducible example ("reprex") I could test out?

Tazinho · 2017-05-22T22:23:32Z

This might be of interest.
gagolews/stringi#269

fernandovmacedo · 2017-05-23T01:51:19Z

Well, the special characters got corrupted when I used reprex(), but I fixed it manually.

I was not aware of the platform problem pointed out by @Tazinho, but I believe this solution is a good start until we find a better one. Here it is:

library(dplyr) 
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

clean_names <- function(dat) {
    old_names <- names(dat)
    new_names <- old_names %>% 
        gsub("'", "", .) %>% 
        gsub("\"","", .) %>% 
        gsub("%", "percent", .) %>% 
        gsub("^[ ]+", "", .) %>% 
        make.names(.) %>% 
        gsub("[.]+", "_", .) %>% 
        gsub("[_]+", "_", .) %>% 
        tolower(.) %>% 
        gsub("_$", "", .) %>%
        ## here is the new line to transliterate the characters ##
        stringi::stri_trans_general("latin-ascii") 
    
    dupe_count <- sapply(seq_along(new_names), function(i) {
        sum(new_names[i] == new_names[1:i])
    })
    new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_")
    stats::setNames(dat, new_names)
}


tmp_df <- data_frame(a = 1, b = 2, c = 3, d = 4, e = 5)
names(tmp_df) <- c("á", "ê", "ï", "õ", "ù")

tmp_df
#> # A tibble: 1 x 5
#>       á     ê     ï     õ     ù
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4     5

df_clean <- clean_names(tmp_df)

df_clean
#> # A tibble: 1 x 5
#>       a     e     i     o     u
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     1     2     3     4     5

<details><summary>Session info</summary>

``` r
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value
#>  version  R version 3.4.0 (2017-04-21)
#>  system   x86_64, mingw32
#>  ui       RTerm
#>  language (EN)
#>  collate  Portuguese_Brazil.1252
#>  tz       America/Sao_Paulo
#>  date     2017-05-22
#> Packages -----------------------------------------------------------------
#>  package    * version date       source
#>  assertthat   0.2.0   2017-04-11 CRAN (R 3.4.0)
#>  backports    1.0.5   2017-01-18 CRAN (R 3.4.0)
#>  base       * 3.4.0   2017-04-21 local
#>  compiler     3.4.0   2017-04-21 local
#>  datasets   * 3.4.0   2017-04-21 local
#>  DBI          0.6-1   2017-04-01 CRAN (R 3.4.0)
#>  devtools     1.13.1  2017-05-13 CRAN (R 3.4.0)
#>  digest       0.6.12  2017-01-27 CRAN (R 3.4.0)
#>  dplyr      * 0.5.0   2016-06-24 CRAN (R 3.4.0)
#>  evaluate     0.10    2016-10-11 CRAN (R 3.4.0)
#>  graphics   * 3.4.0   2017-04-21 local
#>  grDevices  * 3.4.0   2017-04-21 local
#>  htmltools    0.3.6   2017-04-28 CRAN (R 3.4.0)
#>  knitr        1.16    2017-05-18 CRAN (R 3.4.0)
#>  magrittr     1.5     2014-11-22 CRAN (R 3.4.0)
#>  memoise      1.1.0   2017-04-21 CRAN (R 3.4.0)
#>  methods    * 3.4.0   2017-04-21 local
#>  R6           2.2.1   2017-05-10 CRAN (R 3.4.0)
#>  Rcpp         0.12.10 2017-03-19 CRAN (R 3.4.0)
#>  rlang        0.1.1   2017-05-18 CRAN (R 3.4.0)
#>  rmarkdown    1.5     2017-04-26 CRAN (R 3.4.0)
#>  rprojroot    1.2     2017-01-16 CRAN (R 3.4.0)
#>  stats      * 3.4.0   2017-04-21 local
#>  stringi      1.1.1   2016-05-27 CRAN (R 3.3.0)
#>  stringr      1.2.0   2017-02-18 CRAN (R 3.4.0)
#>  tibble       1.3.1   2017-05-17 CRAN (R 3.4.0)
#>  tools        3.4.0   2017-04-21 local
#>  utils      * 3.4.0   2017-04-21 local
#>  withr        1.0.2   2016-06-20 CRAN (R 3.4.0)
#>  yaml         2.1.14  2016-11-12 CRAN (R 3.4.0)
```

</details>

sfirke · 2017-05-23T12:42:09Z

Thanks @Tazinho for looking into this w/ stringi. As a Windows user I can (unfortunately) attest to the cross-platform differences. The example with á works for me on Windows, but stringi::stri_trans_general("œ", "Latin-ASCII") gets me "\u009c".
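
A quick way to check whether a given installation is affected is sketched below. It assumes only that stringi is installed. The expected value "oe" is what ICU's Latin-ASCII transform produces for œ; any other result suggests the platform difference described above.

```r
library(stringi)

# "œ", written as an escape so the encoding of the source file does not matter.
ligature <- "\u0153"
result <- stri_trans_general(ligature, "Latin-ASCII")

# Report the installed stringi version next to the result for easier comparison.
cat("stringi", as.character(packageVersion("stringi")), "->", result, "\n")

if (!identical(result, "oe")) {
  warning("Transliteration did not produce 'oe'; this platform may be affected.")
}
```

Using the `\u` escape instead of typing the character directly rules out one common source of confusion on Windows, where a literal non-ASCII character can be misread before it ever reaches stringi.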

But, it looks like gagolews/stringi#270 will enable @fernandovmacedo's changes to clean_names() to work consistently across platforms, at which point I agree this would be a nice upgrade to clean_names. I subscribed to that issue and look forward to incorporating that functionality.

sfirke · 2017-06-23T13:56:36Z

Looks like that stringi fix for Windows worked; I was just making user errors on my end.

That fix isn't on CRAN yet. Until stringi 1.1.6 is on CRAN, if we add the line @fernandovmacedo suggests to clean_names(), for Windows users it will switch œ to \u009c. I don't see that as a huge problem, both are terrible and require manual intervention 😆

Then after stringi goes on CRAN, I suggest janitor be dependent on the latest version of stringi.

@fernandovmacedo - would you like to add that line, and some tests (like moving your above example into a formal test with testthat), as a pull request? Happy to advise a bit if you are interested. Or I can implement it, up to you.
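
For reference, a test along those lines could look like the sketch below. It is only an illustration, not the test that ended up in the package: it assumes testthat is installed and a janitor version whose clean_names() already includes the transliteration step. The column values are arbitrary.

```r
library(testthat)
library(janitor)

test_that("clean_names() transliterates accented letters", {
  dat <- data.frame(a = 1, b = 2, c = 3)
  # Escapes for "á", "ê", "õ" keep the test file ASCII-only.
  names(dat) <- c("\u00e1", "\u00ea", "\u00f5")

  expect_equal(names(clean_names(dat)), c("a", "e", "o"))
})
```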

fernandovmacedo · 2017-06-23T14:22:43Z

Sure, I will work on the Pull Request this weekend.

Tazinho · 2017-07-11T10:52:14Z

Added this to snakecase; see #96 (couldn't reference from there for some reason...)

sfirke mentioned this issue May 23, 2017

offer other case options as an argument in clean_names #96

Closed

fernandovmacedo mentioned this issue Jul 12, 2017

clean_names() transliterates accented letters #126

Merged

sfirke closed this as completed in #126 Jul 13, 2017
