Fix Windows Parsing Issue #6150
Great, thanks. Can you please ask for a review from @Anirban166?
LGTM, would be great if we could confirm it works 😅
Looking forward to trying it, fingers crossed 😄
I wonder how you would run the tests to check for this, given how long it takes to run everything. For instance, I just modified the tests and ran:
> test.data.table("inst/tests/tests.Rraw")
getDTthreads(verbose=TRUE):
OpenMP version (_OPENMP) 201811
omp_get_num_procs() 10
R_DATATABLE_NUM_PROCS_PERCENT unset (default 50)
R_DATATABLE_NUM_THREADS unset
R_DATATABLE_THROTTLE unset (default 1024)
omp_get_thread_limit() 2147483647
omp_get_max_threads() 10
OMP_THREAD_LIMIT unset
OMP_NUM_THREADS unset
RestoreAfterFork true
data.table is using 5 threads with throttle==1024. See ?setDTthreads.
test.data.table() running: /Users/anirban166/data.table/inst/tests/tests.Rraw
Running test id 2253.19
Mon May 27 21:28:15 2024 endian==little, sizeof(long double)==8, longdouble.digits==, sizeof(pointer)==8, TZ==unset, Sys.timezone()=='America/Phoenix', Sys.getlocale()=='en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8', l10n_info()=='MBCS=TRUE; UTF-8=TRUE; Latin-1=FALSE; codeset=UTF-8', getDTthreads()=='OpenMP version (_OPENMP)==201811; omp_get_num_procs()==10; R_DATATABLE_NUM_PROCS_PERCENT==unset (default 50); R_DATATABLE_NUM_THREADS==unset; R_DATATABLE_THROTTLE==unset (default 1024); omp_get_thread_limit()==2147483647; omp_get_max_threads()==10; OMP_THREAD_LIMIT==unset; OMP_NUM_THREADS==unset; RestoreAfterFork==true; data.table is using 5 threads with throttle==1024. See ?setDTthreads.', zlibVersion()==1.2.11 ZLIB_VERSION==1.2.11
10 longest running tests took 0s (100% of 0s)
ID time nTest
1: 2253 0.034 19
All 19 tests (last 2253.19) in data.table/inst/tests/tests.Rraw completed ok in 0.045s elapsed (0.037s cpu)
tests.Rraw:
require(methods)
if (exists("test.data.table", .GlobalEnv, inherits=FALSE)) {
if (!identical(suppressWarnings(packageDescription("data.table")), NA)) {
remove.packages("data.table")
stop("This is dev mode but data.table was installed. Uninstalled it. Please q() this R session and try cc() again. The installed namespace causes problems in dev mode for the S4 tests.\n")
}
if ((tt<-compiler::enableJIT(-1))>0)
cat("This is dev mode and JIT is enabled (level ", tt, ") so there will be a brief pause around the first test.\n", sep="")
rm_all = function() {}
DTfun = DT ## otherwise DT would be re-defined by many tests
} else {
require(data.table)
# Make symbols to the installed version's ::: so that we can i) test internal-only not-exposed R functions
# in the test suite when user runs test.data.table() from installed package AND ii) so that in dev the same
# tests can be used but in dev they test the package in .GlobalEnv. If we used ::: throughout tests, that
# would pick up the installed version and in dev you'd have to reinstall every time which slows down dev.
# NB: The string "data.table::" (which covers "data.table:::" too) should exist nowhere else in this file
# other than here inside this branch.
all.equal.data.table = data.table:::all.equal.data.table
allNA = data.table:::allNA
any_na = data.table:::any_na
as.data.table.array = data.table:::as.data.table.array
as.data.table.default = data.table:::as.data.table.default
as.IDate.default = data.table:::as.IDate.default
as.ITime.default = data.table:::as.ITime.default
binary = data.table:::binary
bmerge = data.table:::bmerge
brackify = data.table:::brackify
Ctest_dt_win_snprintf = data.table:::Ctest_dt_win_snprintf
chmatchdup = data.table:::chmatchdup
compactprint = data.table:::compactprint
cube.data.table = data.table:::cube.data.table
dcast.data.table = data.table:::dcast.data.table
DTfun = data.table:::DT
endsWith = data.table:::endsWith
endsWithAny = data.table:::endsWithAny
forder = data.table:::forder
forderv = data.table:::forderv
format.data.table = data.table:::format.data.table
format_col.default = data.table:::format_col.default
format_list_item.default = data.table:::format_list_item.default
getdots = data.table:::getdots
groupingsets.data.table = data.table:::groupingsets.data.table
guess = data.table:::guess
INT = data.table:::INT
is_na = data.table:::is_na
is.sorted = data.table:::is.sorted
isReallyReal = data.table:::isReallyReal
isRealReallyInt = data.table:::isRealReallyInt
is_utc = data.table:::is_utc
melt.data.table = data.table:::melt.data.table # for test 1953.4
null.data.table = data.table:::null.data.table
print.data.table = data.table:::print.data.table
replace_dot_alias = data.table:::replace_dot_alias
rollup.data.table = data.table:::rollup.data.table
rss = data.table:::rss
selfrefok = data.table:::selfrefok
setcoalesce = data.table:::setcoalesce
setdiff_ = data.table:::setdiff_
setreordervec = data.table:::setreordervec
shallow = data.table:::shallow # until exported
.shallow = data.table:::.shallow
split.data.table = data.table:::split.data.table
if (!exists('startsWith', 'package:base', inherits=FALSE)) startsWith = data.table:::startsWith
test = data.table:::test
uniqlengths = data.table:::uniqlengths
uniqlist = data.table:::uniqlist
which_ = data.table:::which_
which.first = data.table:::which.first
which.last = data.table:::which.last
`-.IDate` = data.table:::`-.IDate`
haszlib = data.table:::haszlib
# Also, for functions that are masked by other packages, we need to map the data.table one. Or else,
# the other package's function would be picked up. As above, we only need to do this because we desire
# to develop in .GlobalEnv with cc().
# This should be retained even if these packages are removed from Suggests, because the test() in this file
# checks against a data.table result which needs the data.table one to run. Otherwise the user can be
# sure by using :: themselves.
# masked by which package?
# =================================
setattr = data.table::setattr # bit
shift = data.table::shift # IRanges, GenomicRanges
between = data.table::between # plm
second = data.table::second # S4Vectors
dcast = data.table::dcast # reshape2
melt = data.table::melt # reshape2
last = data.table::last # xts
first = data.table::first # xts, S4Vectors
copy = data.table::copy # bit64 v4; bit64 offered to rename though so this is just in case bit64 unoffers
second = data.table::second # lubridate #1135
minute = data.table::minute # lubridate
hour = data.table::hour # lubridate
yday = data.table::yday # lubridate
wday = data.table::wday # lubridate
mday = data.table::mday # lubridate
week = data.table::week # lubridate
isoweek = data.table::isoweek # lubridate
month = data.table::month # lubridate
quarter = data.table::quarter # lubridate
year = data.table::year # lubridate
yearmon = data.table::yearmon # zoo
yearqtr = data.table::yearqtr # zoo
rm_all = function(env=parent.frame()) {
tt = setdiff(ls(envir=env), .do_not_rm)
rm(list=tt, envir=env)
gc()
invisible()
}
}
# strrep is used many times in tests, but is from R 3.3.0, so use this equivalent if it is missing.
if (!exists("strrep", "package:base")) {
strrep = function(x, times) mapply(function(x, times) paste(rep(x, times), collapse=""), rep_len(x, length(times)), times, USE.NAMES=FALSE)
}
local({
lc_ctype = Sys.getlocale('LC_CTYPE')
Sys.setlocale('LC_CTYPE', "en_US.UTF-8") # Japanese multibyte characters require utf8
on.exit({Sys.setlocale('LC_CTYPE', lc_ctype)})
accented_a = "\u0061\u0301"
ja_ichi = "\u4E00"
ja_ni = "\u4E8C"
ja_ko = "\u3053"
ja_n = "\u3093"
dots = "..."
clean_regex = "^\\d+:\\s+" # removes row numbering from beginning of output
# Tests for combining character latin a and acute accent, single row
DT = data.table(strrep(accented_a, 4L))
test(2253.01, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(accented_a, 4L))
test(2253.02, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(accented_a, 3L), dots))
test(2253.03, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(accented_a, 1L), dots))
# Tests for full-width japanese character ichi, single row
DT = data.table(strrep(ja_ichi, 4L))
test(2253.04, options=list(datatable.prettyprint.char = 4L), DT, output=strrep(ja_ichi, 4L))
test(2253.05, options=list(datatable.prettyprint.char = 3L), DT, output=paste0(strrep(ja_ichi, 3L), dots))
test(2253.06, options=list(datatable.prettyprint.char = 1L), DT, output=paste0(strrep(ja_ichi, 1L), dots))
# Tests for multiple, different length combining character rows
DT = data.table(strrep(accented_a, 1L:4L))
test(2253.07, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(accented_a, 1:4L))
test(2253.08, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(strrep(accented_a, 1:3), paste0(strrep(accented_a, 3L), dots)))
test(2253.09, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(accented_a, rep(paste0(accented_a, dots), 3L)))
# Tests for multiple, different length full-width characters
DT = data.table(strrep(ja_ichi, 1L:4L))
test(2253.10, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), strrep(ja_ichi, 1:4L))
test(2253.11, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(strrep(ja_ichi, 1:3), paste0(strrep(ja_ichi, 3L), dots)))
test(2253.12, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]), c(ja_ichi, rep(paste0(ja_ichi, dots), 3L)))
# Tests for combined characters, multiple columns
DT = data.table(paste0(ja_ichi), strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa")
test(2253.13, options=list(datatable.prettyprint.char = 4L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa"))
test(2253.14, options=list(datatable.prettyprint.char = 3L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2L), strrep(ja_ko, 3L), strrep(accented_a, 2L), "aaa"))
test(2253.15, options=list(datatable.prettyprint.char = 2L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, strrep(ja_ni, 2), paste0(strrep(ja_ko, 2), dots) , strrep(accented_a, 2), "aa..."))
test(2253.16, options=list(datatable.prettyprint.char = 1L), capture.output(print(DT))[-1L], paste("1:", ja_ichi, paste0(ja_ni, dots), paste0(ja_ko, dots), paste0(accented_a, dots), "a..."))
# Tests for multiple columns, multiple rows
DT = data.table(strrep(ja_ko, 1:3L), strrep(ja_n, 2:4L), strrep(accented_a, 3))
test(2253.17, options=list(datatable.prettyprint.char = 4L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
c(paste0(ja_ko, " ", strrep(ja_n, 2L), " ", strrep(accented_a, 3L)),
paste0(strrep(ja_ko, 2L), " ", strrep(ja_n, 3L), " ", strrep(accented_a, 3L)),
paste(strrep(ja_ko, 3L), strrep(ja_n, 4L), strrep(accented_a, 3L))))
test(2253.18, options=list(datatable.prettyprint.char = 3L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
c(paste0(ja_ko, " ", strrep(ja_n, 2L), " ", strrep(accented_a, 3L)),
paste0(strrep(ja_ko, 2L), " ", strrep(ja_n, 3L), " ", strrep(accented_a, 3L)),
paste(strrep(ja_ko, 3L), paste0(strrep(ja_n, 3L), dots), strrep(accented_a, 3L))))
test(2253.19, options=list(datatable.prettyprint.char = 1L), gsub(clean_regex, "", capture.output(print(DT))[-1L]),
c(paste0(ja_ko, " ", paste0(ja_n, dots), " ", paste0(accented_a, dots)),
paste0(c(ja_ko, ja_n, accented_a), dots, collapse=" "),
paste0(c(ja_ko, ja_n, accented_a), dots, collapse=" ")))
})
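The tests above compare printed output after dropping the header line and the row-number prefix. A minimal base-R sketch of that step follows; the `printed` vector is a hand-written stand-in for captured lines, not real data.table output.

```r
# Illustrative stand-in for capture.output(print(DT)); not real data.table output
printed = c("       V1", "1:      a", "2:     aa", "3:    aaa")

# Same pattern as clean_regex in the tests: leading digits, a colon, then whitespace
clean_regex = "^\\d+:\\s+"

# Drop the header line, then strip the row-number prefix from each remaining line
cells = gsub(clean_regex, "", printed[-1L])
print(cells)
```

The pattern is anchored at the start of the line, so this sketch assumes the row number is not preceded by padding.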
Closes #6141
Refactored tests to use strrep instead of strings, and wrapped tests in a UTF-8 locale.
Need to run against Windows GHA to test changes.
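The locale wrapping described here could be sketched roughly as below. `with_ctype` is a hypothetical helper, not part of data.table. It assumes the requested locale may be unavailable on some platforms, in which case it returns NULL instead of evaluating the expression.

```r
# Hypothetical helper: evaluate expr under a temporary LC_CTYPE, then restore the old one
with_ctype = function(locale, expr) {
  old = Sys.getlocale("LC_CTYPE")
  # Restore the previous locale even if expr throws an error
  on.exit(Sys.setlocale("LC_CTYPE", old), add=TRUE)
  # Sys.setlocale() returns "" (with a warning) when the request cannot be honoured
  if (!nzchar(suppressWarnings(Sys.setlocale("LC_CTYPE", locale)))) return(NULL)
  force(expr)  # expr is a promise, so it is evaluated only now, under the new locale
}

# Build multibyte strings from escapes with strrep rather than typing them literally
ja_ichi = "\u4E00"
res = with_ctype("en_US.UTF-8", strrep(ja_ichi, 1:3))
```

Returning NULL when the locale cannot be set is only one possible design; a test suite might prefer to skip or warn explicitly, which is worth checking on the Windows runner.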