fwrite(): final items #1664

Closed
15 tasks done
mattdowle opened this issue Apr 20, 2016 · 23 comments

Comments

@mattdowle
Member

mattdowle commented Apr 20, 2016

  • malloc()s and write() need a thread safe way to error() if fail
  • Add progress % and ETA to appear after 2 seconds if more than 2 seconds are remaining
    Include R_CheckUserInterrupt and test closes worker team ok 1a4263f
  • Date, IDate and update this question
  • POSIXct (see fread/fwrite *base data types* directly for efficiency #1656)
  • ITime
  • integer64
    Add "NEW:" item to startup banner
  • Confirm fwrite() writes 10GB ok on Windows (it should do) to ensure 'big' file > 4GB ok. Thanks to Hugh Parsonage for testing as we don't have Windows other than via AppVeyor for test suite.
  • quote = 'auto'
  • sep2 (see fwrite to support secondary separator (sep2) for vectors in list columns #806)
  • match write.csv's scientific/decimal format exactly and add many tests. 6c1ed96
  • add dec='.' to R level and connect to already existing option at C level
  • add row.names for data.frames, default FALSE.
  • rename setthreads to setDTthreads so as not to affect other packages using OpenMP, add reference to its manual in fwrite man
  • refine ?fwrite
  • more tests qmethod= 'double' and 'escape'
@mattdowle mattdowle added this to the v1.9.8 milestone Apr 20, 2016
@mattdowle mattdowle changed the title fwrite final items fwrite(): final items Apr 20, 2016
@eantonya
Contributor

Do people actually like having quote=TRUE when writing to csv? I find it to be a big nuisance and would much prefer for fwrite to have quote=FALSE by default.

@MichaelChirico
Member

I find quote = TRUE to be more robust -- you never know when you have a JAMES SMITH, JR in a character column and it can be a huge pain to get a .csv read when it has nuisance commas strewn about.

@mattdowle
Member Author

mattdowle commented Apr 20, 2016

@eantonya Agree. I prefer quote=FALSE too. The base R thinking I believe has numbers/ids with leading 0's stored as character format ... the default ensures they get read by Excel as character and the leading 0's not lost. But fwrite could detect that situation and quote just that situation by default. Where character columns contain letters and no embedded quotes, I really don't see why quotes are needed. Plus we save a bit on file size by saving the 2 extra quotes per field.

@mattdowle
Member Author

mattdowle commented Apr 20, 2016

@MichaelChirico Agree with you too. fwrite can detect that and put the quotes in those situations. fwrite already does a first-pass through all strings to calculate maximum line length before allocating buffer sizes. It could test if there are any sep or quote in the string at that point. So I guess I'm suggesting quote='auto' by default.
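
The per-field test described above can be sketched in base R. This is only an illustration of the idea (quote a field when it contains the separator or a quote mark), not fwrite's actual C implementation. The helper names needs_quote and quote_auto are hypothetical, and doubling embedded quotes is an assumed escaping convention.

```r
# Hypothetical helper: TRUE where a field contains the separator or a quote mark.
needs_quote <- function(x, sep = ",") {
  # fixed = TRUE matches sep and the quote mark literally, not as regex
  grepl(sep, x, fixed = TRUE) | grepl('"', x, fixed = TRUE)
}

# Hypothetical helper: quote only the fields that need it.
quote_auto <- function(x, sep = ",") {
  q <- needs_quote(x, sep)
  # double any embedded quote marks (assumed convention), then wrap the field
  x[q] <- paste0('"', gsub('"', '""', x[q], fixed = TRUE), '"')
  x
}

fields <- quote_auto(c("JAMES SMITH, JR", "plain", 'say "hi"'))
print(fields)
```

Here only the first and third fields come back wrapped in quotes; "plain" is left as it is, which is where the file size saving would come from.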

@MichaelChirico
Member

MichaelChirico commented Apr 20, 2016

@mattdowle great, good point. Should only marginally affect speed then.

PS IIRC Excel converts "001" to 1 anyway :|

@mattdowle
Member Author

@MichaelChirico Now you mention it I do seem to remember Excel doing that. I haven't used Excel for many years now thankfully.

@jangorecki
Member

jangorecki commented Apr 20, 2016

I would assume Excel behaves inconsistently on that matter (across OS versions, Office versions, OS locales, Office locales, 365s, etc.).

@rafapereirabr

Are you planning to include the append = T ? Please?! Anyway, congrats for the great job with data.table that will become even greater with fwrite() !

@MichaelChirico
Member

@rafapereirabr append = TRUE in fact works for me. Are you suggesting that should be the default?

@rafapereirabr

@MichaelChirico , I didn't know it was already implemented ! I couldn't try it as I was planning to test it tonight. Just ignore my comment then. ps. I don't think this should be the default.

@MichaelChirico
Member

MichaelChirico commented Jun 17, 2016

[ Update : quote='auto' now fully implemented ]

Can we please set quote = TRUE to default until auto is supported?

I'm being royally screwed right now by having written a data file I needed to carry remotely with quote = FALSE by accident. The file is now basically un-usable because of all the unpredictable commas scattered in some string fields, which really sucks because I have no way of fixing the mistake (other than by hand).

quote = "auto" sounds like the best solution, but I would hate for this to happen to others in the meantime.

Until then, the marginal cost of quote = TRUE (adding ", conservatively a 5% speed/file size hit) seems to be far outweighed by the cost of quote = FALSE, which can create unusably dirty data files.

@MichaelChirico
Member

MichaelChirico commented Aug 24, 2016

[ Update: now fixed and fwrite is consistent with write.csv ]

I find it a bit odd that fwrite distorts numerics, e.g.:

fwrite(data.table(a = -75.16374), "test.csv")

Has output:

a
-7.516374E1

fread and read.csv indeed recognize this as a numeric column still (fread("test.csv")$a is numeric), but I'm not sure this is robust across all readers -- I've got in mind outputting a .csv from R and sharing it with users who may be using any platform.

Not sure of the ideal approach, as floating points are always going to cause headaches...

Also note that write.csv doesn't have this effect.

@mattdowle
Member Author

integer64 implemented: 6d55d2f

mattdowle added a commit that referenced this issue Nov 3, 2016
mattdowle added a commit that referenced this issue Nov 4, 2016
mattdowle added a commit that referenced this issue Nov 5, 2016
…stimate based on sample for efficiency and to prep for sep2 now we can realloc the buffers if needed. #1664
@HughParsonage
Member

HughParsonage commented Nov 7, 2016

Can you clarify what you are looking for in "Confirm fwrite() writes 10GB ok on Windows (it should do) to ensure 'big' file > 4GB ok. User help needed please as we don't have Windows other than via AppVeyor for test suite."? I managed to do it (Windows 10) on a 14GB file, with a minor bug (there's an "s." printed on the far right of the console afterwards).

It's also tremendously fast: less than a minute (fread takes 5 minutes; readRDS takes 3:40).

@mattdowle
Member Author

mattdowle commented Nov 7, 2016

@HughParsonage Perfect - that's a pass then. Thanks! Windows has different C functions for reading from files bigger than 4GB so it was feasible that something extra was required for writing too.
I'll increase the blanking width that clears the progress status ... sounds like what the s. is.

mattdowle added a commit that referenced this issue Nov 7, 2016
…umns are present. Changed default sep2 from ; to | to distinguish it more from sep=, default. #1664
mattdowle added a commit that referenced this issue Nov 7, 2016
mattdowle added a commit that referenced this issue Nov 8, 2016
mattdowle added a commit that referenced this issue Nov 9, 2016
@MichaelChirico
Member

excellent stuff Matt, thanks so much!!

@skanskan

skanskan commented Nov 18, 2016

Do we need to use "library(bit64)" with fwrite and fread when we have long numbers or not anymore?

@stanislav-a

Thank you very much for your work, this feature is really useful.
But the latest version does not look very stable.

I caught 2 strange issues:

  • showProgress should be set explicitly:
a <- c("1", "2", "3", "4", "5")
d <- rep("2016-11-21", 5)
c <- rep("a", 5)
m <- rep("0.5", 5)

data<-data.table(a, d, c, m)

fwrite(data, "e:/tmp_buf/tmp.csv", sep="~",
       col.names=FALSE, append=FALSE, ..turbo = T, quote = F)


#Error: isLOGICAL(showProgress) is not TRUE
  • eol delimiter does not work correctly
fwrite(data, "e:/tmp_buf/tmp.csv", sep="~",
       col.names=FALSE, append=FALSE, ..turbo = T, quote = F, showProgress = T)

#result:
#"1"~"2016-11-21"~"a"~"0.5""2"~"2016-11-21"~"a"~"0.5""3"~"2016-11-21"~"a"~"0.5""4"~"2016-11-21"~"a"~"0.5""5"~"2016-11-21"~"a"~"0.5"
#No eol delimiters

fwrite(data, "e:/tmp_buf/tmp.csv", sep="~",
       eol = "\r\n",
       col.names=FALSE, append=FALSE, ..turbo = T, quote = F, showProgress = T)
#result:
#"1"~"2016-11-21"~"a"~"0.5""2"~"2016-11-21"~"a"~"0.5""3"~"2016-11-21"~"a"~"0.5""4"~"2016-11-21"~"a"~"0.5""5"~"2016-11-21"~"a"~"0.5"
#Still no eol delimiters
  • Also, I would like to know how I can write .csv files without scientific notation. For example, 93434234223523523.5 is written as 9.34342342235235E+016, which causes problems if I want to use the file for a bulk insert. Can I explicitly set the number of decimal places?

@jangorecki
Member

jangorecki commented Nov 21, 2016

@stanislav-a It would be useful if you could provide your sessionInfo() and read.dcf(system.file("DESCRIPTION", package="data.table"), "Commit"). Ideally, re-run on the latest version too, as lots of improvements were made recently. I'm on Linux and cannot reproduce the problems you reported. Re scientific notation: your number 93434234223523523.5 will be rounded on writing, similarly to write.csv. The exact floating point range is mentioned in the manual, ?fwrite.
@skanskan How do you store your long numbers without bit64? If you store them as double, they will be processed as double and you don't need bit64.
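
On the question of fixed notation with an explicit number of decimal places: one possible workaround, sketched here with base R only, is to format the column as character before writing it. This is an assumption about what would suit a bulk insert, not something fwrite does itself; the variable names x and amount are made up for illustration. Note that 93434234223523523.5 cannot be held exactly in a double, so the trailing digits of the formatted value are not meaningful.

```r
x <- c(93434234223523523.5, 0.5, 1234.5678)

# format = "f" requests fixed (non-scientific) notation;
# digits is the number of decimal places to keep
amount <- formatC(x, format = "f", digits = 2)

# none of the formatted values should contain an exponent marker
stopifnot(!any(grepl("[eE]", amount)))
print(amount)
```

The resulting character vector can then be put in the table in place of the numeric column before the file is written.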

@stanislav-a

@jangorecki I reinstalled the package from the latest commit and now it works fine, thank you.

@thvasilo

thvasilo commented Dec 2, 2016

I can confirm the first issue @stanislav-a has mentioned, I'm on the 1.9.8 release.

I use the following generated file, it's a simple csv file: https://gist.github.com/thvasilo/6edffdccda87f09572cbc4184662af47

surv_1k <- fread("surv_1k.csv")

fwrite(surv_1k, "copy.csv")

# Error: isLOGICAL(showProgress) is not TRUE

Session info:

> sessionInfo()
R version 3.3.2 (2016-10-31)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.1 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_US.UTF-8       
 [4] LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] purrr_0.2.2      caret_6.0-73     ggplot2_2.2.0    lattice_0.20-34  data.table_1.9.8

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.7        magrittr_1.5       splines_3.3.2      MASS_7.3-45       
 [5] munsell_0.4.3      colorspace_1.2-6   foreach_1.4.3      minqa_1.2.4       
 [9] stringr_1.1.0      car_2.1-4          plyr_1.8.4         tools_3.3.2       
[13] parallel_3.3.2     nnet_7.3-12        pbkrtest_0.4-6     grid_3.3.2        
[17] gtable_0.2.0       nlme_3.1-128       mgcv_1.8-16        quantreg_5.29     
[21] MatrixModels_0.4-1 iterators_1.0.8    lme4_1.1-12        lazyeval_0.2.0    
[25] assertthat_0.1     tibble_1.2         Matrix_1.2-7.1     nloptr_1.0.4      
[29] reshape2_1.4.2     ModelMetrics_1.1.0 codetools_0.2-15   stringi_1.1.1     
[33] scales_0.4.1       stats4_3.3.2       SparseM_1.74     

I haven't tried the latest master.

@MichaelChirico
Member

MichaelChirico commented Dec 2, 2016 via email

@david-awam-jansen

I just upgraded to 1.9.8 today and still have the same issue.
When I try and save a csv file using fwrite I still get "# Error: isLOGICAL(showProgress) is not TRUE"

R version 3.3.2 (2016-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 7 x64 (build 7601) Service Pack 1

locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252

attached base packages:
[1] stats graphics grDevices utils datasets methods base

other attached packages:
[1] xtable_1.8-2 lubridate_1.6.0 ggrepel_0.6.3 data.table_1.10.0 cowplot_0.7.0 ggplot2_2.2.0 RPostgreSQL_0.4-1 DBI_0.5-1

loaded via a namespace (and not attached):
[1] Rcpp_0.12.8 assertthat_0.1 grid_3.3.2 plyr_1.8.4 gtable_0.2.0 magrittr_1.5 scales_0.4.1 stringi_1.1.2 lazyeval_0.2.0 tools_3.3.2
[11] stringr_1.1.0 munsell_0.4.3 colorspace_1.3-0 knitr_1.15 tibble_1.2

@jangorecki
Member

jangorecki commented Dec 5, 2016

@MichaelChirico David is already on 1.10 according to session info.
@david-awam-jansen Please open a new issue with your report. If possible, include code to reproduce it (at least on your machine), but please include only the relevant part. I see in your session info that you have many other unrelated packages loaded. Before reporting, it is always good to check that the issue is reproducible in a clean session in the R console. #1111 is the same issue but on fread; you may try one solution from there:

what solved the problem is the closing of all R sessions running on the computer before installing data.table
