fread devel error #2087

Closed
arunsrinivasan opened this issue Mar 28, 2017 · 0 comments

@arunsrinivasan
Member

Here's the link to a zipped version of the file (50MB, unzips to ~620MB).

require(data.table)
system.time(ans <- fread("~/Downloads/example.csv"))
# Read 374058 rows x 4 columns from 0.580GB file in 00:01.874 wall clock time (can be slowed down by any other open apps even if seemingly idle)
# Error in fread("example.csv") : 
#   Jump 12 did not end exactly where jump 13 found its first good line start. end(0x10b4f26ef)<<"3","8","7","2013/11/18 18:18:03">> != start(prev+35)<<"3","8","7","2013/11/18 18:19:02">>
# Timing stopped at: 4.473 0.247 1.879 

with verbose=TRUE:

None of the 1 'na.strings' are numeric (such as '-9999') which is normal and best for performance.
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 0.579656 GB.
Memory mapping ... ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Positioned on line 1 starting: <<"station_id","bikes_available">>
Detecting sep ...
  sep==','(ascii 44)  with 101 lines of 4 fields using quote rule 0
Detected 4 columns on line 1. This line is either column names or first data row (first 30 chars): <<"station_id","bikes_available">>
All the fields on line 1 are character fields. Treating as the column names.
Number of sampling jump points  = 11 because 3600 startSize * 10 NJUMPS * 2 = 72000 <= 622401253 bytes from line 2 to eof
Type codes (jump 00)    : 4444  Quote rule 0
Type codes (jump 10)    : 4444  Quote rule 0
=====
 Sampled 1031 rows (handled \n inside quoted fields) at 11 jump points including middle and very end
 Bytes from first data row on line 2 to the end of last row: 622401253
 Line length: mean=36.75 sd=0.75 min=36 max=38
 Estimated nrow: 622401253 / 36.75 = 16936648
 Initial alloc = 18630313 rows (16936648 + 10%) using bytes/max(mean-2*sd,min) clamped between [1.1*estn, 2.0*estn]
=====
Type codes (colClasses) : 4444
Type codes (drop|select): 4444
Allocating 4 column slots (4 - 0 dropped)
Reading 596 chunks of 0.996MB (28417 rows) using 4 threads
Read 374058 rows x 4 columns from 0.580GB file in 00:02.029 wall clock time (can be slowed down by any other open apps even if seemingly idle)
Thread buffers were grown 0 times (if all 4 threads each grew once, this figure would be 4)
Error in fread("example.csv", verbose = TRUE) : 
  Jump 12 did not end exactly where jump 13 found its first good line start. end(0x10b4f26ef)<<"3","8","7","2013/11/18 18:18:03">> != start(prev+35)<<"3","8","7","2013/11/18 18:19:02">>
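
As a sanity check on the sampling summary in the verbose output, the "Initial alloc" figure can be reproduced from the printed numbers with the formula the log itself states. This is only a sketch in base R: the inputs are the rounded values shown in the log, and the use of `ceiling()` for the final rounding is an assumption that happens to match the printed figure, not something confirmed from the fread source.

```r
# Figures copied from the verbose output above (as printed, so rounded).
bytes    <- 622401253   # bytes from the first data row to the end of the last row
mean_len <- 36.75       # mean sampled line length
sd_len   <- 0.75        # standard deviation of sampled line length
min_len  <- 36          # shortest sampled line
estn     <- 16936648    # "Estimated nrow" as printed in the log

# bytes / max(mean - 2*sd, min): a deliberately generous row count.
raw <- bytes / max(mean_len - 2 * sd_len, min_len)

# Clamp to [1.1*estn, 2.0*estn]. ceiling() is an assumed rounding rule.
alloc <- ceiling(min(max(raw, 1.1 * estn), 2.0 * estn))
alloc

# Compare the rows actually read before the error with the estimate.
rows_read <- 374058
rows_read / estn
```

The result matches the 18630313 rows in the log, while the 374058 rows actually read are only about 2% of the estimate, which is consistent with the read stopping early at the jump error.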

This file read fine a couple of commits ago, although about three times slower than the CRAN version (10s on CRAN vs 32s on devel with 4 threads). Now it errors.

Aside: all four columns are read as character, even though the first three could (and should) be integers. This also happens with the CRAN version. I haven't had time to check why.
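
Until the cause is tracked down, one possible workaround is to convert the columns after reading. The sketch below does not read the real file: it builds a small stand-in table with character columns, using the two rows quoted in the error message. Only `station_id` and `bikes_available` are names that appear in the verbose output; `docks_available` and `time` are hypothetical names used for illustration.

```r
library(data.table)

# Stand-in for the result of fread(): every column is character.
# docks_available and time are hypothetical column names.
ans <- data.table(
  station_id      = c("3", "3"),
  bikes_available = c("8", "8"),
  docks_available = c("7", "7"),
  time            = c("2013/11/18 18:18:03", "2013/11/18 18:19:02")
)

# Convert the first three columns to integer by reference.
int_cols <- c("station_id", "bikes_available", "docks_available")
ans[, (int_cols) := lapply(.SD, as.integer), .SDcols = int_cols]

# Inspect the resulting column classes.
sapply(ans, class)
```

This only addresses the column types after a successful read; it does not explain why the types are detected as character in the first place.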

@mattdowle mattdowle added this to the v1.10.6 milestone Mar 30, 2017

2 participants