clean_names could transliterate accented characters #120
Comments
I love this idea. I will look into it more; could you share a quick reproducible example ("reprex") I could test out?
This might be of interest.
Well, the special characters got corrupted when I used I was not aware of the platform problem pointed out by @Tazinho, but I believe this solution is a good start until we find a better one. Here it is:
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
clean_names <- function(dat) {
  old_names <- names(dat)
  new_names <- old_names %>%
    gsub("'", "", .) %>%
    gsub("\"", "", .) %>%
    gsub("%", "percent", .) %>%
    gsub("^[ ]+", "", .) %>%
    make.names(.) %>%
    gsub("[.]+", "_", .) %>%
    gsub("[_]+", "_", .) %>%
    tolower(.) %>%
    gsub("_$", "", .) %>%
    ## here is the new line to transliterate the characters ##
    stringi::stri_trans_general("latin-ascii")
  dupe_count <- sapply(1:length(new_names), function(i) {
    sum(new_names[i] == new_names[1:i])
  })
  new_names[dupe_count > 1] <- paste(new_names[dupe_count > 1], dupe_count[dupe_count > 1], sep = "_")
  stats::setNames(dat, new_names)
}
tmp_df <- data_frame(a = 1, b = 2, c = 3, d = 4, e = 5)
names(tmp_df) <- c("á", "ê", "ï", "õ", "ù")
tmp_df
#> # A tibble: 1 x 5
#> á ê ï õ ù
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 4 5
df_clean <- clean_names(tmp_df)
df_clean
#> # A tibble: 1 x 5
#> a e i o u
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 1 2 3 4 5
Session info
devtools::session_info()
Thanks @Tazinho for looking into this w/ stringi. As a Windows user I can (unfortunately) attest to the cross-platform differences. The example with But, it looks like gagolews/stringi#270 will enable @fernandovmacedo's changes to
Looks like that stringi fix for Windows worked; I was just making user errors on my end. That fix isn't on CRAN yet. Until stringi 1.1.6 is on CRAN, if we add the line @fernandovmacedo suggests to Then after stringi goes on CRAN, I suggest janitor depend on the latest version of stringi. @fernandovmacedo - would you like to add that line, and some tests (like moving your above example into a formal test with
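One way to bridge the gap until the fixed stringi release is on CRAN is to check the installed version at run time. The sketch below is only an illustration: `transliterate_names()` and its `min_version` argument are hypothetical names, not part of janitor or stringi, and the fallback of returning the names unchanged is an assumption about what would be acceptable.

```r
# Hypothetical helper (not part of janitor): transliterate only when the
# installed stringi is at least the version that carries the Windows fix.
transliterate_names <- function(x, min_version = "1.1.6") {
  has_fixed_stringi <- requireNamespace("stringi", quietly = TRUE) &&
    utils::packageVersion("stringi") >= min_version
  if (has_fixed_stringi) {
    stringi::stri_trans_general(x, "Latin-ASCII")
  } else {
    # Assumed fallback: leave the names untouched rather than risk corruption.
    x
  }
}

result <- transliterate_names(c("caf\u00e9", "plain"))
print(result)
```

Once the release is on CRAN, the same constraint could instead be declared in the package's DESCRIPTION file as `Imports: stringi (>= 1.1.6)`, which makes the run-time check unnecessary.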
Sure, I will work on the Pull Request this weekend.
Added this into snakecase, see #96 (couldn't reference from there for some reason...)
Data frame column names with accented characters (things like áôü) have to be wrapped in quotes when used in dplyr. If clean_names transliterated them to ASCII, the problem would be solved.
The solution is quite simple: you would only need to add this line to the function before the duplicate-handling part, and also include stringi in the dependencies of the package.
stringi::stri_trans_general(., "Latin-ASCII")
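To illustrate why the line should sit before the duplicate handling, here is a minimal sketch. It assumes stringi is installed; the names `raw_names` and `ascii_names` are made up for the example, and the de-duplication step mirrors the counting approach used in the thread's `clean_names` rather than any official API. Names that differ only by accent collapse to the same ASCII string, so de-duplication has to run afterwards to catch the collision.

```r
library(stringi)

# Unicode escapes keep the example independent of the file's encoding.
raw_names <- c("\u00e1", "a", "\u00e3")  # "á", "a", "ã"

# Step 1: transliterate accented characters to their ASCII equivalents.
ascii_names <- stri_trans_general(raw_names, "Latin-ASCII")

# Step 2: count how many times each name has appeared so far and
# suffix every repeat with its running count.
dupe_count <- sapply(seq_along(ascii_names), function(i) {
  sum(ascii_names[i] == ascii_names[1:i])
})
is_dupe <- dupe_count > 1
ascii_names[is_dupe] <- paste(ascii_names[is_dupe], dupe_count[is_dupe], sep = "_")

print(ascii_names)
#> [1] "a"   "a_2" "a_3"
```

If the two steps were swapped, all three names would look unique to the duplicate check and then collapse to the same "a" afterwards.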