Improve by=<non-ASCII> test to be more robust #6573

MichaelChirico · 2024-10-11T16:02:49Z

          Hmm, thanks. It still leaves us with the `else` branch here on FreeBSD where this is unanticipated / the behavior of `DT[,.N, by=año]` goes untested there. But let's start with un-breaking the test.

Originally posted by @MichaelChirico in #6567 (comment)

As reported in #6559, the test will not run on FreeBSD. We should come up with an example that will test non-ASCII by= on such systems.

The text was updated successfully, but these errors were encountered:

MichaelChirico · 2024-10-11T16:04:58Z

@nunotexbsd, can you help to come up with an example of running by= with a non-ASCII character? We'd like to ensure the behavior in #4708/fix in #4711 works on your system as well.

nunotexbsd · 2024-10-11T16:36:34Z

Following example on #4708:

> library(data.table)
> dt <- data.table(a=1:3,año=c(2020,2021,2023))
> dt[,sum(a),año]
     año    V1
   <num> <int>
1:  2020     1
2:  2021     2
3:  2023     3

> dt[,,by=año]
       a   año
   <int> <num>
1:     1  2020
2:     2  2021
3:     3  2023
Warning messages:
1: In `[.data.table`(dt, , , by = año) :
  Ignoring by/keyby because 'j' is not supplied
2: In `[.data.table`(dt, , , by = año) :
  i and j are both missing so ignoring the other arguments. This warning will be upgraded to error in future.

Did try variants like month==mês and it works fine too.

I didn't apply this patch yet to check testunit.

EDIT: execute dt[,,by=año]

aitap · 2024-10-11T16:40:10Z

R requires symbols to exist in the native encoding, so trying to construct a call to dt[,,by=año] in an R session incapable of representing ñ in the native encoding will result in something else. Depending on the exact approach, we'll always get either a?o or a<U+00F1>o in either the name of the variable or the source code of the overall expression.

The real underlying problem of #4708 is that for x of type STRSXP the %chin% operator performs additional conversion to UTF-8, but for x of type SYMSXP it only looks at the PRINTNAME(x) pointer. The original reporter used a pre-UTF-8 build of R (LC_CTYPE=Spanish_Ecuador.1252), so enc2native('año') (from the symbol x) and enc2utf8('año') (from the converted table) gave two different entries in the CHARSXP cache, so pointer comparison failed. Forcing as.character(x) switched Cchin to the "force UTF-8 for strings" logic, making the match succeed.

MichaelChirico · 2024-10-11T17:44:38Z

Hmm, so @nunotexbsd what happens for this example then?

dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]

aitap · 2024-10-11T17:56:27Z

I think that @nunotexbsd is running interactive examples in the "C.UTF-8" locale (where everything works) but unit tests in the "C" locale (where enc2native('año') returns 'a?o').

nunotexbsd · 2024-10-12T08:21:19Z

Hmm, so @nunotexbsd what happens for this example then?
dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
dt[,sum(a), by=año]

> dt <- data.table(a=1:3,año=c(2020,2021,2023), `a?o` = 4:6)
> dt[,sum(a), by=año]
     año    V1
   <num> <int>
1:  2020     1
2:  2021     2
3:  2023     3

MichaelChirico mentioned this issue Oct 11, 2024

tests: Detect session encoding incapable of U+00F1 #6567

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improve by=<non-ASCII> test to be more robust #6573

Improve by=<non-ASCII> test to be more robust #6573

MichaelChirico commented Oct 11, 2024

MichaelChirico commented Oct 11, 2024

nunotexbsd commented Oct 11, 2024 •

edited

Loading

aitap commented Oct 11, 2024 •

edited

Loading

MichaelChirico commented Oct 11, 2024

aitap commented Oct 11, 2024 via email

nunotexbsd commented Oct 12, 2024

Improve by=<non-ASCII> test to be more robust #6573

Improve by=<non-ASCII> test to be more robust #6573

Comments

MichaelChirico commented Oct 11, 2024

MichaelChirico commented Oct 11, 2024

nunotexbsd commented Oct 11, 2024 • edited Loading

aitap commented Oct 11, 2024 • edited Loading

MichaelChirico commented Oct 11, 2024

aitap commented Oct 11, 2024 via email

nunotexbsd commented Oct 12, 2024

nunotexbsd commented Oct 11, 2024 •

edited

Loading

aitap commented Oct 11, 2024 •

edited

Loading