Add CSVY support for fread() #1701

arunsrinivasan · 2016-05-12T06:07:44Z

The rio package already reads CSVY files by relying on the helper package csvy (same author). It'd be great to leverage its features to read/write.

MichaelChirico · 2018-03-03T11:38:24Z

Just documenting how the .csvy reader of rio works:

1. rio encounters a .csvy file and dispatches to .import.rio_csvy.
2. .import.rio_csvy calls read_csvy from the csvy package.
3. read_csvy uses readLines to digest the file, then uses grep to identify the YAML header.
4. yaml.load from the yaml package is called to convert the YAML content string to a list of component parts. This is already implemented in C so is presumably efficient.
5. read_csvy then applies paste(., collapse = '\n') to the non-YAML portion of the file so that it can be read with fread.
6. The content of the YAML header is applied to the output.
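
The steps above can be sketched roughly as follows. This is a minimal illustration of the read-everything-then-split approach, not the actual csvy source: the function name `read_csvy_naive` and the attribute name `csvy_header` are made up here, and the header is assumed to sit between two lines that start with `---`.

```r
# Sketch only: mimics the described flow, not the real csvy implementation.
# Assumes the YAML header is delimited by two lines starting with "---".
read_csvy_naive <- function(path) {
  lines <- readLines(path)                       # digests the whole file
  fence <- grep("^---", lines)                   # locate the YAML delimiters
  if (length(fence) < 2L) stop("no YAML header found")

  # Lines strictly between the two delimiters (may be empty)
  header_idx <- seq(fence[1L] + 1L, length.out = fence[2L] - fence[1L] - 1L)
  meta <- yaml::yaml.load(paste(lines[header_idx], collapse = "\n"))

  # Glue the non-YAML part back together so fread can parse it
  body <- paste(lines[-seq_len(fence[2L])], collapse = "\n")
  dt <- data.table::fread(text = body)

  # Hypothetical attribute name, used here only to carry the metadata along
  data.table::setattr(dt, "csvy_header", meta)
  dt
}
```

Note that the data portion is held in memory twice (once in `lines`, once in `body`) before fread ever sees it.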

Major inefficiencies are:

- Using readLines to digest the whole file (slow).
- Pasting the file back into a format that fread can tackle, only after stripping out the YAML part.
- Some parts of the YAML header are intended to assist fread, but the YAML data is not fed into fread.

My proposed solution (if we decide to tackle this):

1. Stream lines of the file until the end of the YAML header is reached (or stop as soon as an out-of-format line is reached). I'm not sure of the most efficient way to read files line-by-line in R, or whether we'll have to implement that ourselves in C. Keep track of the number of lines read. I see this on StackOverflow: https://stackoverflow.com/q/9871307/3576984
2. Add the yaml package to Suggests and rely on that to parse the YAML info.
3. Extract any info relevant to fread itself (most importantly colClasses; I'll have to read the csvy standard to see how open-ended the rest is).
4. fread the remainder of the file, using skip to jump past the YAML section, and passing in the relevant info from the YAML.
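
A rough sketch of this plan in plain R, to make the idea concrete. Everything here is illustrative: `fread_csvy_sketch` and `max_header_lines` are hypothetical names, the header is assumed to start on line 1 with `---` and end at the next `---` line, and the metadata is assumed to have a top-level `fields` list whose entries carry `name` and `type`, with `type` already being an R class name.

```r
# Sketch only: read just the header through a connection, then let fread
# handle the rest of the file with skip= and colClasses=.
fread_csvy_sketch <- function(path, max_header_lines = 1000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  first <- readLines(con, n = 1L)
  if (length(first) != 1L || !grepl("^---", first)) {
    stop("no YAML header at top of file")
  }

  header <- character()
  n_read <- 1L                                   # lines consumed so far
  repeat {
    line <- readLines(con, n = 1L)
    if (length(line) == 0L) stop("end of file reached inside YAML header")
    n_read <- n_read + 1L
    if (grepl("^---", line)) break               # closing delimiter
    if (n_read > max_header_lines) stop("header longer than expected")
    header <- c(header, line)
  }

  meta <- yaml::yaml.load(paste(header, collapse = "\n"))

  # Assumed layout: meta$fields = list(list(name = ..., type = ...), ...)
  col_classes <- NULL
  if (is.list(meta) && !is.null(meta$fields)) {
    col_classes <- vapply(meta$fields, function(f) f$type, character(1L))
    names(col_classes) <- vapply(meta$fields, function(f) f$name, character(1L))
  }

  # Skip exactly the lines already consumed, including both delimiters
  data.table::fread(path, skip = n_read, colClasses = col_classes)
}
```

Only the header lines pass through readLines; the bulk of the file is read once, by fread.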

The remaining API questions for me are: (1) do we try to detect YAML automatically (harder, and beyond me to implement currently), or simply add a yaml (or similarly named) argument to fread and rely on user input? (2) What about fwrite?

For (1) I lean towards the latter, primarily out of laziness: the more sophisticated approach doesn't seem to pass the cost/benefit test given the limited user requests for this feature. We can revisit in a future issue if this becomes more popular, I suppose. No opinions on (2); I only include it since we seem to be aiming to keep fread and fwrite as each other's inverse functions.

HughParsonage · 2018-03-03T11:45:36Z

> I'm not sure the most efficient way to pass files line-by-line in R

fread(file, sep = NULL) (in dev) ?

Also readLines is likely to be much faster in R 3.5.0, possibly as fast as fread or readr::read_lines.

MichaelChirico · 2018-03-03T11:48:18Z

@HughParsonage that still leaves the issue of reading the file twice. The idea of streaming lines is to examine the file line-by-line and only read in, say, 10-20 lines of YAML metadata using readLines (or fread, or whatever) before parsing that and deploying fread on (presumably) the bulk of the file, which follows the header.

(anyway good to know they're finally getting around to improving readLines, it's silly how slow it is considering how minimal its responsibilities are)

MichaelChirico · 2018-03-03T18:13:05Z

Progress here:

https://github.com/Rdatatable/data.table/tree/csvy_support

jangorecki · 2018-03-05T16:42:01Z

I would avoid an extra dependency, even in Suggests, and use a helper function to extract fields from the YAML header, similar to how we would process a DESCRIPTION file. fwrite should be able to write the CSVY header in a way that fread can read it.
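
For illustration, a dependency-free helper along these lines might look like the sketch below. The name `parse_flat_header` is hypothetical, and it only understands top-level `key: value` lines; nested or list-valued YAML entries are silently ignored.

```r
# Sketch only: extract flat `key: value` pairs from header lines using base R.
# Indented lines, list items ("- ...") and keys without a value are skipped.
parse_flat_header <- function(header_lines) {
  keep <- grepl("^[A-Za-z_][A-Za-z0-9_.-]*:[ \t]+[^[:space:]]", header_lines)
  kv <- header_lines[keep]

  keys <- sub(":.*$", "", kv)                    # text before the first colon
  vals <- trimws(sub("^[^:]*:[ \t]*", "", kv))   # text after the first colon
  # Drop one pair of matching surrounding quotes, if present
  vals <- gsub("^(['\"])(.*)\\1$", "\\2", vals, perl = TRUE)

  stats::setNames(as.list(vals), keys)
}

header <- c("name: my-dataset", "sep: ','", "fields:",
            "  - name: id", "    type: integer")
str(parse_flat_header(header))
```

Here the `fields` block is dropped entirely, which shows the limit of this approach: per-column types live in nested entries that a flat parser does not reach.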

MichaelChirico · 2018-03-05T16:45:05Z

as yaml can be arbitrarily nested I didn't see a need to reinvent the wheel.

especially as it seems at the moment to be a rather limited use case -- happy to revisit if this format takes off.

MichaelChirico · 2019-05-03T02:50:22Z

fwrite support not done yet... will file separately

arunsrinivasan added the feature request label May 12, 2016

jangorecki added the fread label May 12, 2016

mattdowle added the fwrite label May 13, 2016

MichaelChirico mentioned this issue Jul 15, 2016

FR: colClasses could accept a data.frame/table #1773

Closed

st-pasha mentioned this issue Jul 6, 2017

Master task for fread bugs / proposals #2247

Closed

MichaelChirico mentioned this issue Oct 24, 2017

optional arg that returns a list of parse symbols fread() used to intuit raw file #2437

Open

MichaelChirico self-assigned this Mar 3, 2018

MichaelChirico pushed a commit that referenced this issue Mar 4, 2018

Closes #1701 -- fread support for csvy format

fc63f9c

jangorecki added this to the 1.12.0 milestone Jun 26, 2018

jangorecki modified the milestones: 1.12.0, 1.12.2 Jan 5, 2019

mattdowle removed this from the 1.12.2 milestone Jan 14, 2019

MichaelChirico added a commit that referenced this issue Feb 10, 2019

Closes #1701 -- fread support for csvy format

acf13b6

MichaelChirico mentioned this issue May 2, 2019

csvy reading capabilities #2656

Merged

mattdowle added this to the 1.12.4 milestone May 2, 2019

mattdowle closed this as completed in #2656 May 2, 2019

MichaelChirico changed the title ~~Add CSVY support for fread() and fwrite()~~ Add CSVY support for fread() May 3, 2019

MichaelChirico mentioned this issue May 3, 2019

Add CSVY support for fwrite #3534

Closed

MichaelChirico removed the fwrite label May 3, 2019

MichaelChirico mentioned this issue May 4, 2019

CSVY wishlist #3540

Open

7 tasks
