fread - types detected within quoted fields, fixed jump alignment for windows line ending. Supplied file added as test - thanks. Closes #2087
mattdowle committed Mar 30, 2017
1 parent 26e5316 commit a061769
Showing 4 changed files with 5,046 additions and 27 deletions.
3 changes: 2 additions & 1 deletion NEWS.md
@@ -11,8 +11,9 @@
* `fread` has always jumped to the middle and to the end of the file for a much improved column type guess. This sample size is unchanged: 1,000 rows at 10 jump points. But it now **automatically rereads any columns with out-of-sample type exceptions** so you don't have to use `colClasses` yourself.
* Large number of columns support; e.g. **12,000 columns** tested.
* **Quoting rules** are more robust and flexible. See point 10 on the wiki page [here](https://github.com/Rdatatable/data.table/wiki/Convenience-features-of-fread#10-automatic-quote-escape-method-detection-including-no-escape).
* Numeric data that has been quoted is now detected and read as numeric.
* The ability to position `autostart` anywhere inside one of multiple tables in a single file is removed with warning. It used to search upwards from that line to find the start of the table based on a consistent number of columns. People appear to be using `skip="string"` or `skip=nrow` to find the header row exactly, which is retained and simpler. It was too difficult to retain search-upwards-autostart together with skipping blank lines, filling incomplete rows and parallelization. Varying format and height messy header info above the column names is still auto detected and auto skipped.
* Many thanks to @yaakovfeldman, Guillermo Ponce and more to add for testing before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073)
* Many thanks to @yaakovfeldman, Guillermo Ponce, Arun Srinivasan and more to add for testing before release to CRAN: [#2070](https://github.com/Rdatatable/data.table/issues/2070), [#2073](https://github.com/Rdatatable/data.table/issues/2073), [#2087](https://github.com/Rdatatable/data.table/issues/2087)

#### BUG FIXES

14 changes: 10 additions & 4 deletions inst/tests/tests.Rraw
@@ -5043,13 +5043,13 @@ cat('A,B,C,D,E,F
"TX",77406,"business analyst\\\\\\\\\\\\\\","the boeing co","",""
"CA",94116,"na\\none","retired","",""
', file = f<-tempfile()) # aside: notice the \\ before n of none as well
test(1336.1, fread(f), data.table(A = c("12", "TX", "CA"), B = c(0L, 77406L, 94116L), C = c("teacher private nfp\\\\\\\\\"", "business analyst\\\\\\\\\\\\\\", "na\\none"), D = c("\"jacoleman high school", "the boeing co", "retired"), E = c("", "", ""), F = c("", "", "")))
test(1336.1, fread(f), data.table(A = c("12", "TX", "CA"), B = c(0L, 77406L, 94116L), C = c("teacher private nfp\\\\\\\\\"", "business analyst\\\\\\\\\\\\\\", "na\\none"), D = c("\"jacoleman high school", "the boeing co", "retired"), E = NA, F = NA))
cat('A,B,C,D,E,F
"12",0,"teacher private nfp\\\\\\\\"","jacoleman high school","",""
"TX",77406,"business analyst\\\\\\\\\\\\\\","the boeing co","",""
"CA",94116,"na\\none","retired","",""
', file = f)
test(1336.2, fread(f), data.table(A=c("12","TX","CA"), B=c(0L,77406L,94116L),C=c('teacher private nfp\\\\\\\\"','business analyst\\\\\\\\\\\\\\','na\\none'), D=c('jacoleman high school','the boeing co','retired'),E="",F=""))
test(1336.2, fread(f), data.table(A=c("12","TX","CA"), B=c(0L,77406L,94116L),C=c('teacher private nfp\\\\\\\\"','business analyst\\\\\\\\\\\\\\','na\\none'), D=c('jacoleman high school','the boeing co','retired'),E=NA,F=NA))
unlink(f)

# file names ending with \ (quite common)
@@ -9857,10 +9857,10 @@ test(1754, fread("allchar.csv")[c(1,2,17575,17576),col2], c("AAN","BAN","YZZ","Z

# unescaped embedded quotes from here: http://stackoverflow.com/questions/42939866/fread-multiple-separators-in-a-string
test(1755, fread("unescaped.csv"),
data.table(No =c('0','1'), # should in future be integer as all columns are quoted blindly
data.table(No =c(0L,1L),
Comment=c('he said:"wonderful."', 'The problem is: reading table, and also "a problem, yes." keep going on.'),
Type =c('A','A')))

# test duplicated colClasses
txt = "A,B,C,D\n1,3,5,7\n2,4,6,8\n"
test(1756.1, fread(txt), data.table(A=1:2, B=3:4, C=5:6, D=7:8))
@@ -9870,6 +9870,12 @@ test(1756.4, fread(txt, colClasses=list('numeric'=c(1,3),'character'=2)),
data.table(A=as.double(1:2), B=c("3","4"), C=as.double(5:6), D=7:8))
test(1756.5, fread(txt, colClasses=list('numeric'=c(1,2),'character'=2)), error="Column 'B' appears more than once")

# Windows \r\n line endings when using multiple threads and detecting type within quoted fields, #2087
test(1757, fread("winallquoted.csv")[c(1,2,4998,4999)],
data.table(station_id=2L, bikes_available=c(2L,2L,11L,11L), docks_available=c(25L,25L,16L,16L),
time=c("2013/08/29 12:06:01","2013/08/29 12:07:01","2013/09/02 08:48:01","2013/09/02 08:50:01")))

test(1758, sapply(fread("A,B\n,"),class), c(A="logical",B="logical"))

##########################
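
The behaviour these tests pin down can be tried interactively. The snippet below is a minimal sketch, assuming a data.table build that includes this change. The inline `quoted` input is made up for illustration and only mirrors the all-quoted shape of `unescaped.csv` and `winallquoted.csv`. The second call repeats the input of test 1758.

```r
library(data.table)

# Hypothetical all-quoted input (not one of the test files in this commit).
# fread treats a string containing a newline as literal data, not a file name.
quoted <- '"No","Type"\n"0","A"\n"1","A"\n'
dt_quoted <- fread(quoted)

# With type detection inside quoted fields, No should come back as integer
# rather than character.
print(sapply(dt_quoted, class))

# Same input as test 1758: a single data row with two empty fields.
dt_empty <- fread("A,B\n,")
print(sapply(dt_empty, class))
```

On a build without this change the first call would be expected to return `No` as character, which is what the old expectation in test 1755 encoded.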
