[Misc] dev fread considerably slower (10x) than current CRAN fread #2107
Comments
Jumping in: I couldn't tell if you were using data.table version 1.10.5 or older. I had the same performance issue in 1.10.4, and removing it and installing from GitHub fixed it.
On my 36-core machine with NVMe, I read a 3.6M-row CSV with ~300 columns in 24 seconds with 1.10.5. With 1.10.4, pre-parallel (I think), it took 2m 42s.
@HughParsonage Hm, interesting. No, it's not expected to be that slow even with other apps open. Please run with verbose=TRUE and post the full output. Also, if it's Windows, try rebooting and installing clean. Sometimes different versions of .dll files can get mixed up on Windows. Something somewhere is going wrong.
Updated OP to include full output with
I'm just lurking on this issue... I did find that 1.10.5 on Ubuntu 16.04 with R 3.3.3 'Another Canoe' is working well for me now. I'm using this tiny benchmark.r to test perf every time I update: https://gist.github.com/scottstanfield/2e1e71f32d397309d0faf6a32d49e91c
Thanks for the verbose output. All looks good in terms of compile and threads kicking in. All I can see is that the file is small at 50MB, and it's chunking into 60 pieces given to 12 threads, so each thread only gets 5 pieces to do. 5 < 12. Maybe it's sticking when that's the case. If so, it's a bug in the logic and nothing to do with Windows, but rather with this data size and nThread combination. I should be able to reproduce it on Linux. Can you send me one of the files where you see the problem? What is the machine as well, please? (I can look it up online to see if it has 12MB of L2 cache.)
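For anyone following the arithmetic in the comment above, here is a minimal base-R sketch of the "5 pieces per thread" figure. It assumes the 60 pieces are split evenly across the 12 threads, which is a simplification; how fread actually hands out chunks is not shown in this thread.

```r
# Figures quoted in the comment above: 60 pieces handed to 12 threads.
n_chunks  <- 60L
n_threads <- 12L

# Assumption: an even split. The real scheduler may distribute work differently.
chunks_per_thread <- n_chunks %/% n_threads  # whole pieces per thread
leftover          <- n_chunks %% n_threads   # pieces left after an even split

cat("pieces per thread:", chunks_per_thread, "\n")
cat("leftover pieces:  ", leftover, "\n")
cat("fewer pieces per thread than threads:", chunks_per_thread < n_threads, "\n")
```

This only restates the suspected condition (pieces per thread smaller than the thread count); it does not demonstrate that the condition causes the slowdown.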
I'm happy to give you the files, but in the meantime the issue reproduces also with
I'm running an i7-6800K machine @ 3.40 GHz with 128 GB RAM. The hard drive is a Samsung SSD 850 EVO 1TB.
CRAN
> set.seed(1)
> n=1e8
> DT = data.table( a=sample(1:1000,n,replace=TRUE),
+ b=sample(1:1000,n,replace=TRUE),
+ c=rnorm(n),
+ d=sample(c("foo","bar","baz","qux","quux"),n,replace=TRUE),
+ e=rnorm(n),
+ f=sample(1:1000,n,replace=TRUE) )
> DT[2,b:=NA_integer_]
> DT[4,c:=NA_real_]
> DT[3,d:=NA_character_]
> DT[5,d:=""]
> DT[2,e:=+Inf]
> DT[3,e:=-Inf]
> fwrite(DT, "DT.csv")
> sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
locale:
[1] LC_COLLATE=English_Australia.1252 LC_CTYPE=English_Australia.1252 LC_MONETARY=English_Australia.1252
[4] LC_NUMERIC=C LC_TIME=English_Australia.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] data.table_1.10.4
loaded via a namespace (and not attached):
[1] tools_3.3.3
> fread("DT.csv", verbose = TRUE)
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 4.954535 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 after skip or autostart
This line is the autostart and not blank so searching up for the last non-blank ... line 1
Detecting sep ... ','
Detected 6 columns. Longest stretch was from line 1 to line 30
Starting data input on line 1 (either column names or first row of data). First 10 characters: a,b,c,d,e,
All the fields on line 1 are character fields. Treating as the column names.
Count of eol: 100000001 (including 1 at the end)
Count of sep: 500000000
nrow = MIN( nsep [500000000] / (ncol [6] -1), neol [100000001] - endblanks [1] ) = 100000000
Type codes (point 0): 113431
Type codes (point 1): 113431
Type codes (point 2): 113431
Type codes (point 3): 113431
Type codes (point 4): 113431
Type codes (point 5): 113431
Type codes (point 6): 113431
Type codes (point 7): 113431
Type codes (point 8): 113431
Type codes (point 9): 113431
Type codes (point 10): 113431
Type codes: 113431 (after applying colClasses and integer64)
Type codes: 113431 (after applying drop or select (if supplied)
Allocating 6 column slots (6 - 0 dropped)
Read 100000000 rows and 6 (of 6) columns from 4.955 GB file in 00:02:29
Read 100000000 rows. Exactly what was estimated and allocated up front
0.000s ( 0%) Memory map (rerun may be quicker)
0.000s ( 0%) sep and header detection
5.239s ( 4%) Count rows (wc -l)
0.004s ( 0%) Column type detection (100 rows at 10 points)
0.939s ( 1%) Allocation of 100000000x6 result (xMB) in RAM
141.995s ( 96%) Reading data
0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
0.000s ( 0%) Coercing data already read in type bumps (if any)
0.075s ( 0%) Changing na.strings to NA
148.252s Total
a b c d e f
1: 266 593 0.3101614 quux -0.9514252 500
2: 373 NA 1.6320205 bar Inf 253
3: 573 754 -0.5334998 -Inf 451
4: 909 965 NA baz -0.1766243 740
5: 202 654 -0.4235275 -0.4945436 494
---
99999996: 127 100 0.9068599 foo -1.1319779 481
99999997: 201 344 -3.1681951 quux 0.0405684 119
99999998: 682 864 0.1459793 bar -0.7597580 145
99999999: 580 361 -0.9512592 baz -1.1614999 971
100000000: 664 793 0.1269717 foo -0.6184478 252
Dev version
> library(data.table)
data.table 1.10.5 IN DEVELOPMENT built 2017-04-04 00:37:37 UTC; travis
The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
Release notes, videos and slides: http://r-datatable.com
> fread("DT.csv", verbose = TRUE)
Parameter na.strings == <<NA>>
None of the 1 na.strings are numeric (such as '-9999').
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 4.954535 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 starting: <<a,b,c,d,e,f>>
Detecting sep ...
sep==','(ascii 44) with 101 lines of 6 fields using quote rule 0
Detected 6 columns on line 1. This line is either column names or first data row (first 30 chars): <<a,b,c,d,e,f>>
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points = 101 because 5319891523 bytes from row 1 to eof / (2 * 5274 jump0size) == 504350
Type codes (jump 000) : 224542 Quote rule 0
Type codes (jump 100) : 224542 Quote rule 0
=====
Sampled 10049 rows (handled \n inside quoted fields) at 101 jump points including middle and very end
Bytes from first data row on line 2 to the end of last row: 5319891523
Line length: mean=53.18 sd=1.46 min=34 max=60
Estimated nrow: 5319891523 / 53.18 = 100028797
Initial alloc = 110031677 rows (100028797 + 10%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses) : 224542
Type codes (drop|select) : 224542
Allocating 6 column slots (6 - 0 dropped)
Reading 5076 chunks of 0.999MB (19706 rows) using 12 threads
Read 100000000 rows x 6 columns from 4.955GB file in 26:00.475 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Final type counts
0 : drop
0 : logical
3 : integer
0 : integer64
2 : double
1 : character
Thread buffers were grown 0 times (if all 12 threads each grew once, this figure would be 12)
=============================
0.001s ( 0%) Memory map
0.000s ( 0%) sep, ncol and header detection
0.032s ( 0%) Column type detection using 10049 sample rows
0.706s ( 0%) Allocation of 100000000 rows x 6 cols (3.689GB) plus 0.009GB of temporary buffers
1559.737s (100%) Reading data
0.000s ( 0%) Rereading 0 columns due to out-of-sample type exceptions
1560.475s Total
a b c d e f
1: 266 593 0.3101614 quux -0.9514252 500
2: 373 NA 1.6320205 bar Inf 253
3: 573 754 -0.5334998 -Inf 451
4: 909 965 NA baz -0.1766243 740
5: 202 654 -0.4235275 -0.4945436 494
---
99999996: 127 100 0.9068599 foo -1.1319779 481
99999997: 201 344 -3.1681951 quux 0.0405684 119
99999998: 682 864 0.1459793 bar -0.7597580 145
99999999: 580 361 -0.9512592 baz -1.1614999 971
100000000: 664 793 0.1269717 foo -0.6184478 252
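To put the two verbose logs side by side, the following base-R sketch copies the "Total" and "Reading data" timings from the outputs above and derives the slowdown factor and throughput. The data frame layout and column names are illustrative only.

```r
# Timings copied from the two verbose = TRUE logs above (in seconds).
timings <- data.frame(
  version   = c("1.10.4 (CRAN)", "1.10.5 (dev)"),
  total_s   = c(148.252, 1560.475),
  reading_s = c(141.995, 1559.737)
)

n_rows <- 1e8  # rows in DT.csv, as reported by both runs

# Throughput and the share of time spent in the "Reading data" phase.
timings$rows_per_s  <- n_rows / timings$total_s
timings$reading_pct <- 100 * timings$reading_s / timings$total_s

# How many times slower the dev run was overall.
slowdown <- timings$total_s[2] / timings$total_s[1]

print(timings)
cat(sprintf("dev / CRAN total time: %.1fx\n", slowdown))
```

The ratio comes out at about 10.5, consistent with the "10x" in the issue title, and in both runs almost all of the time is in the "Reading data" phase.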
I was able to reproduce the issue on a different Windows machine running an i7 4770K @ 3.40GHz. I got 1.4s vs 12.1s for the example in the
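One way to check whether a slowdown like this is tied to the multi-threaded reader is to time the same file with one thread and with the default thread count. This is a hedged sketch, not the maintainers' diagnostic: it assumes an installed data.table version whose `fread` accepts `nThread` and which provides `getDTthreads()`, and the file size and column set are made up to keep the run short.

```r
# Sketch only: requires the data.table package; sizes here are illustrative.
if (requireNamespace("data.table", quietly = TRUE)) {
  library(data.table)

  set.seed(1)
  n  <- 1e5  # far smaller than the 1e8 rows used in the thread
  DT <- data.table(a = sample(1:1000, n, replace = TRUE),
                   c = rnorm(n),
                   d = sample(c("foo", "bar", "baz"), n, replace = TRUE))

  tmp <- tempfile(fileext = ".csv")
  fwrite(DT, tmp)

  # Same file, read once single-threaded and once with the default thread count.
  t_single <- system.time(r1 <- fread(tmp, nThread = 1L))
  t_multi  <- system.time(rN <- fread(tmp, nThread = getDTthreads()))

  print(c(single = t_single[["elapsed"]], multi = t_multi[["elapsed"]]))

  # Both reads should return the same table regardless of thread count.
  same_result <- isTRUE(all.equal(r1, rN))
  unlink(tmp)
}
```

If the single-threaded read is much faster than the multi-threaded one on an affected machine, that would point at the threading path rather than at parsing itself.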
Attempt 1 did not work. No need to test.
OK, attempt 2 worked. I borrowed a Windows 8.1 laptop and have confirmed. data.table 1.10.5 IN DEVELOPMENT built 2017-04-15 11:11:18 UTC; appveyor
Confirmed. I get
I am experiencing slow speeds with the new version of fread.
CRAN version
Dev version (same code), installed today
Similarly, with the DT example provided in ?fread:
CRAN:
Dev:
I did have other programs open. Is it expected that having any program open would result in such a reduction in speed?
sessionInfo()
R version 3.3.3 (2017-03-06)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
With verbose = TRUE. First 1.10.4, then 1.10.5. (Restart R.)