Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve by=<non-ASCII> test to be more robust #6573

Open
MichaelChirico opened this issue Oct 11, 2024 · 6 comments
Open

Improve by=<non-ASCII> test to be more robust #6573

MichaelChirico opened this issue Oct 11, 2024 · 6 comments

Comments

@MichaelChirico
Copy link
Member

          Hmm, thanks. It still leaves us with the `else` branch here on FreeBSD where this is unanticipated / the behavior of `DT[,.N, by=año]` goes untested there. But let's start with un-breaking the test.

Originally posted by @MichaelChirico in #6567 (comment)

As reported in #6559, the test will not run on FreeBSD. We should come up with an example that will test non-ASCII by= on such systems.

@MichaelChirico
Copy link
Member Author

@nunotexbsd, can you help to come up with an example of running by= with a non-ASCII character? We'd like to ensure the behavior in #4708/fix in #4711 works on your system as well.

@nunotexbsd
Copy link

nunotexbsd commented Oct 11, 2024

Following example on #4708:

> library(data.table)
> dt <- data.table(a=1:3,año=c(2020,2021,2023))
> dt[,sum(a),año]
     año    V1
   <num> <int>
1:  2020     1
2:  2021     2
3:  2023     3
> dt[,,by=año]
       a   año
   <int> <num>
1:     1  2020
2:     2  2021
3:     3  2023
Warning messages:
1: In `[.data.table`(dt, , , by = año) :
  Ignoring by/keyby because 'j' is not supplied
2: In `[.data.table`(dt, , , by = año) :
  i and j are both missing so ignoring the other arguments. This warning will be upgraded to error in future.

Did try variants like month==mês and it works fine too.

I didn't apply this patch yet to check testunit.

EDIT: execute dt[,,by=año]

@aitap
Copy link
Contributor

aitap commented Oct 11, 2024

R requires symbols to exist in the native encoding, so trying to construct a call to dt[,,by=año] in an R session incapable of representing ñ in the native encoding will result in something else. Depending on the exact approach, we'll always get either a?o or a<U+00F1>o in either the name of the variable or the source code of the overall expression.

The real underlying problem of #4708 is that for x of type STRSXP the %chin% operator performs additional conversion to UTF-8, but for x of type SYMSXP it only looks at the PRINTNAME(x) pointer. The original reporter used a pre-UTF-8 build of R (LC_CTYPE=Spanish_Ecuador.1252), so enc2native('año') (from the symbol x) and enc2utf8('año') (from the converted table) gave two different entries in the CHARSXP cache, so pointer comparison failed. Forcing as.character(x) switched Cchin to the "force UTF-8 for strings" logic, making the match succeed.

@MichaelChirico
Copy link
Member Author

Hmm, so @nunotexbsd what happens for this example then?

dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]

@aitap
Copy link
Contributor

aitap commented Oct 11, 2024 via email

@nunotexbsd
Copy link

Hmm, so @nunotexbsd what happens for this example then?

dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]
> dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
> dt[,sum(a), by=año]
     año    V1
   <num> <int>
1:  2020     1
2:  2021     2
3:  2023     3

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants