IDate conversion from "YYYY-MM-DD" is slow #2503
Could you add more detail? That works fine for me:

---
When I read in a file of 53,945,186 rows x 1 column with only the date field, it works fine. But when I read in the other needed columns, so the file is 53,945,186 rows x 14 columns, it freezes. Unless there is something else I've done, which I doubt, as a sanitized version of the code looks like this:

```r
message(paste0('Now reading ', cur_year))
DT <- fread(file_name, colClasses = colClass_Z, header = TRUE, select = select_Z, key = c('ID', 'Month'))
d_rows <- d_rows + nrow(DT)
if (!is.null(t_subset) || !is.null(v_subset) || !is.null(f_subset)) {
  DT <- DT[ID %chin% valid_IDs]
}
message(paste0('Now converting dates in ', cur_year))
DT[, Month := as.IDate(Month, format = "%Y-%m-%d")]
```

It has been running far longer on the date conversion than the 1-column test would suggest. The code works just fine in 1.10.4-3, albeit much more slowly.

---
Can you reproduce on a subset of the data? Can you share a smaller version of a guilty file (perhaps anonymized)?

---
Thanks for reporting. Can you share the type of the column (e.g. output of

---
Oh dear. I tried importing the original file to see what I could cut out, and fell into stack imbalance/unprotected pointer issues multiple times.

> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-11-27 22:38:43 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> DT <- fread('2017-11-22_1999_Performance.csv', colClasses = colClass_Fred_P, header = TRUE, select = select_col_P, key = c('LoanID', 'Month'), verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file 2017-11-22_1999_Performance.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapping ... ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\r-only line endings are not allowed because \n is found in the data
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51101071551015107111111111111771111177715 Quote rule 0
Type codes (jump 001) : 511010715510151071111111111117711111777110 Quote rule 0
Type codes (jump 008) : 511010755510151071111111111117711111777110 Quote rule 0
Type codes (jump 009) : 51101075551010510711015555555717711111777510 Quote rule 0
Type codes (jump 042) : 55101075551010510711015555555717711111777510 Quote rule 0
Type codes (jump 064) : 551010755510105107110110555555717711111777510 Quote rule 0
Type codes (jump 100) : 551010755510105107110110555555717711111777510 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 4 type and 23 drop user overrides : 001010700000000000000000070775555077750
[10] Allocate memory for the datatable
Allocating 14 column slots (37 - 23 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
Read 6%. ETA 00:00
Error in fread("2017-11-22_1999_Performance.csv", colClasses = colClass_Fred_P, :
  unprotect_ptr: pointer not found

---
The most recent fix for the stack imbalance issue in #2481 seems to have fixed the IDate conversion freeze as well.

> colCLASS <- c(rep('integer', 2L), 'character', 'Date', 'numeric',
+ rep('integer', 3L), rep('character', 2L),
+ 'integer', 'Date', rep('numeric', 2L), 'Date',
+ rep('numeric', 12L), rep('integer', 5),
+ rep('numeric', 3L), 'integer', 'character')
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-12-02 12:05:42 UTC; appveyor
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> sessionInfo()
R version 3.4.2 beta (2017-09-17 r73296)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows Server >= 2012 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
[4] LC_NUMERIC=C LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.5
loaded via a namespace (and not attached):
[1] compiler_3.4.2 tools_3.4.2
> DT <- fread('LargeFile.csv', colClasses = colCLASS, select = 'Month', key = 'Month', header = TRUE, verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
[01] Check arguments
Using 40 threads (omp_get_max_threads()=40, nth=40)
NAstrings = [<<NA>>]
None of the NAstrings look like numbers.
show progress = 1
0/1 column will be read as boolean
[02] Opening the file
Opening file LargeFile.csv
File opened, size = 6.355GB (6823372783 bytes).
Memory mapped ok
[03] Detect and skip BOM
[04] Arrange mmap to be \0 terminated
\n has been found in the data so any mixture of line endings is allowed other than \r-only line endings. This is common and ideal.
[05] Skipping initial rows if needed
Positioned on line 1 starting: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
[06] Detect separator, quoting rule, and ncolumns
Detecting sep ...
sep=',' with 100 lines of 37 fields using quote rule 0
Detected 37 columns on line 1. This line is either column names or first data row. Line starts as: <<Orig_Year,Orig_Qtr,LoanID,Mont>>
Quote rule picked = 0
fill=false and the most number of columns found is 37
[07] Detect column types, good nrow estimate and whether first row is column names
'header' changed by user from 'auto' to true
Number of sampling jump points = 101 because (6823372781 bytes from row 1 to eof) / (2 * 12905 jump0size) == 264369
Type codes (jump 000) : 51AA7155A15A7111111111111771111177715 Quote rule 0
Type codes (jump 001) : 51AA7155A15A711111111111177111117771A Quote rule 0
Type codes (jump 008) : 51AA7555A15A711111111111177111117771A Quote rule 0
Type codes (jump 009) : 51AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 042) : 55AA7555AA5A71A155555557177111117775A Quote rule 0
Type codes (jump 064) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
Type codes (jump 100) : 55AA7555AA5A71A1A5555557177111117775A Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points
Bytes from first data row on line 1 to the end of last row: 6823372781
Line length: mean=126.15 sd=8.30 min=100 max=359
Estimated number of rows: 6823372781 / 126.15 = 54088821
Initial alloc = 62279495 rows (54088821 + 15%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
[08] Assign column names
[09] Apply user overrides on column types
After 0 type and 36 drop user overrides : 000A000000000000000000000000000000000
[10] Allocate memory for the datatable
Allocating 1 column slots (37 - 36 dropped) with 62279495 rows
[11] Read the data
jumps=[0..6520), chunk_size=1046529, total_size=6823372422
|--------------------------------------------------|
|==================================================|
Read 53945186 rows x 1 columns from 6.355GB (6823372783 bytes) file in 00:05.951 wall clock time
[12] Finalizing the datatable
Type counts:
36 : drop '0'
1 : string 'A'
=============================
0.000s ( 0%) Memory map 6.355GB file
0.032s ( 1%) sep=',' ncol=37 and header detection
0.001s ( 0%) Column type detection using 10049 sample rows
0.355s ( 6%) Allocation of 62279495 rows x 37 cols (0.464GB) of which 53945186 ( 87%) rows used
5.563s ( 93%) Reading 6520 chunks of 0.998MB (8295 rows) using 40 threads
= 0.097s ( 2%) Finding first non-embedded \n after each jump
+ 1.396s ( 23%) Parse to row-major thread buffers (grown 0 times)
+ 2.108s ( 35%) Transpose
+ 1.962s ( 33%) Waiting
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
5.951s Total
> str(DT)
Classes ‘data.table’ and 'data.frame': 53945186 obs. of 1 variable:
$ Month: chr "1999-02-01" "1999-02-01" "1999-02-01" "1999-02-01" ...
- attr(*, ".internal.selfref")=<externalptr>
- attr(*, "sorted")= chr "Month"
> system.time(DT[, Month := as.IDate(Month, format = "%Y-%m-%d")])
user system elapsed
43.55 1.69 45.26
> str(DT)
Classes ‘data.table’ and 'data.frame': 53945186 obs. of 1 variable:
$ Month: IDate, format: "1999-02-01" "1999-02-01" "1999-02-01" "1999-02-01" ...
- attr(*, ".internal.selfref")=<externalptr>

---
Thanks @aadler. OK, this gives us enough to work on. Yes, 45s for that conversion is pretty slow! Will look into what's going wrong.

---
@mattdowle that's similar to the times I'm seeing on similarly huge data sets (53M rows)... I guess this is a duplicate of #1451, if that's still slow 😬 cf.
Typical time at 50,000,000 entries on my machine is roughly 3 minutes. So the bottleneck is real.

---
Yes, but the latter step is substantially cheaper (the same as converting an integer stored as numeric to an integer stored as integer). It's duct tape, but IIUC there was no motivation to re-write the C-level parsing API initially.

On Dec 6, 2017, Avraham Adler wrote:

> as.IDate converts char to Date and then to IDate?
---
e.g.:
(scales roughly linearly -- with
---
@mattdowle random idea -- what do you think about keeping a lookup table to speed up char-to-IDate conversion?
Example usage:
(roughly a 20x speed-up). Could also add columns like
And if the convertee happens to be keyed, the factor goes to about 150x:
Downsides being: silent RAM usage (more of an issue if we add the other columns), and it doesn't work for dates outside some range (200 years seems like it would cover the vast majority of use cases, and we can easily revert to
In any case, @aadler, you may want to use this in your case.

---
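As a rough illustration of why a lookup or caching approach pays off, here is a hypothetical C sketch (not data.table's code; all names here are made up for illustration). Columns like the Month field above arrive in long runs of identical "YYYY-MM-DD" strings, so even a one-entry cache skips the parse on most rows:

```c
#include <stdio.h>
#include <string.h>

/* Days since 1970-01-01 for a civil date (Howard Hinnant's
   days_from_civil algorithm, valid across the proleptic Gregorian calendar). */
static int days_from_civil(int y, int m, int d) {
    y -= m <= 2;                                   /* shift year so Mar is month 0 */
    int era = (y >= 0 ? y : y - 399) / 400;
    int yoe = y - era * 400;                       /* year of era   [0, 399] */
    int doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;  /* day of year */
    int doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;           /* day of era  */
    return era * 146097 + doe - 719468;            /* rebase to 1970-01-01 */
}

/* Plain "YYYY-MM-DD" parse; returns days since epoch, or -1 if malformed. */
static int parse_iso(const char *s) {
    int y, m, d;
    if (sscanf(s, "%4d-%2d-%2d", &y, &m, &d) != 3) return -1;
    return days_from_civil(y, m, d);
}

/* One-entry memo: repeated input strings (the common case in sorted or
   monthly data) return the cached result without re-parsing.
   Note: the cache also stores -1 for malformed input, which is fine here. */
static char last[11] = "";
static int last_days = -1;
int parse_iso_cached(const char *s) {
    if (strcmp(s, last) == 0) return last_days;
    int v = parse_iso(s);
    strncpy(last, s, 10);
    last[10] = '\0';
    last_days = v;
    return v;
}
```

Here `parse_iso_cached("1999-02-01")` returns 10623, the same integer `as.IDate("1999-02-01")` stores. A real implementation would need a per-thread cache (or a hash keyed on the string) to stay correct under fread's parallel chunks.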
NB: That being said, fread would certainly benefit from being able to parse various date/time formats natively.

---
Is a clean room design of a fast parser possible?

---
Is it possible to reuse the date parser in fread.c for an as.IDate.character method? As a first pass for the ISO case, then revert to the generalized parser if that "doesn't work".

---
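A sketch of what such an ISO-only fast path could look like (this is an assumption about the approach, not fread.c's actual parser; the function names are illustrative). It validates the strict "YYYY-MM-DD" layout at fixed offsets, converts via standard civil-days arithmetic, and returns -1 so a caller can fall back to a generalized parser:

```c
/* Days since 1970-01-01 for a valid civil date
   (Howard Hinnant's days_from_civil algorithm). */
static int days_from_civil(int y, int m, int d) {
    y -= m <= 2;
    int era = (y >= 0 ? y : y - 399) / 400;
    int yoe = y - era * 400;                       /* year of era [0, 399] */
    int doy = (153 * (m + (m > 2 ? -3 : 9)) + 2) / 5 + d - 1;
    int doe = yoe * 365 + yoe / 4 - yoe / 100 + doy;
    return era * 146097 + doe - 719468;
}

/* Strict "YYYY-MM-DD" fast path: fixed offsets, no locale, no strptime.
   Writes days-since-epoch to *out and returns 0 on success, or returns -1
   so the caller can fall back to a slower generalized parser. */
int parse_iso8601_date(const char *s, int *out) {
    for (int i = 0; i < 10; i++) {
        if (i == 4 || i == 7) {
            if (s[i] != '-') return -1;            /* separators must be '-' */
        } else if (s[i] < '0' || s[i] > '9') {
            return -1;                             /* digits only elsewhere */
        }
    }
    if (s[10] != '\0') return -1;                  /* exactly 10 characters */
    int y = (s[0]-'0')*1000 + (s[1]-'0')*100 + (s[2]-'0')*10 + (s[3]-'0');
    int m = (s[5]-'0')*10 + (s[6]-'0');
    int d = (s[8]-'0')*10 + (s[9]-'0');
    if (m < 1 || m > 12 || d < 1 || d > 31) return -1;  /* coarse range check */
    *out = days_from_civil(y, m, d);
    return 0;
}
```

With this, `parse_iso8601_date("1999-02-01", &v)` sets `v` to 10623, the same integer an IDate stores, without any strptime or locale machinery; anything non-ISO gets -1 and would be handed to the general-purpose path.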
@mattdowle, this time I'm using an issue and not twitter :)
The most recent version of data.table (built 11-23, IIRC) is freezing once again at IDate conversion. The call is:
I'm sorry I don't have session info and traces but I had to downgrade to 1.10.4-3 to get the script to run and I'm up against a time limit. Then again, what happens is that it freezes at the conversion and just doesn't proceed.