fread skip doesn't get the header names. #2080

skanskan · 2017-03-26T18:58:51Z

When you use fread with the skip= option, the file is read skipping the first lines...
That's OK, but there is a small problem: the first line contains the header, so you end up with no column names in your data.table.
I work around the problem by calling fread twice: first reading only the first line and saving its contents, then reading the file again with the desired skip, and then renaming the columns.
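A minimal sketch of that two-pass workaround, assuming a small hypothetical CSV written to a temporary file (the file contents and the names `tmp`, `col_names` and `dt` are illustrative only):

```r
library(data.table)

# Hypothetical sample file: header on line 1, then five data rows.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20", "3,30", "4,40", "5,50"), tmp)

# Pass 1: nrows = 0 returns a zero-row table that still carries the column names.
col_names <- names(fread(tmp, nrows = 0L))

# Pass 2: skip the header line plus the first three data rows,
# read the rest without a header, then restore the saved names.
dt <- fread(tmp, skip = 4L, header = FALSE)
setnames(dt, col_names)
print(dt)
```

Passing `col.names = col_names` to the second fread call should give the same result without the separate setnames step.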

I think it would be a good idea if fread could always read the first line and use it as the column names when you ask for that, no matter the skip value.

MichaelChirico · 2017-03-26T20:35:37Z

This is consistent with other readers (e.g. read.csv/readLines) & in my experience typically the desired behavior -- most of the time I prefer to supply my own column names anyway, and usually when I need to skip in a file, the header is not on the first line, but pushed down several lines.
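For comparison, a small base R sketch of how read.csv treats skip, using made-up inline data: the skipped lines, including the original header, are discarded before any header handling happens.

```r
csv_text <- "X1,X2\n1,2\n3,4\n5,6"

# With the default header = TRUE, the first unskipped line ("3,4")
# is consumed as the header, so only one data row remains.
df_default <- read.csv(text = csv_text, skip = 2)

# With header = FALSE, both remaining lines are data and the
# columns get the default names V1 and V2.
df_nohead <- read.csv(text = csv_text, skip = 2, header = FALSE)

print(df_default)
print(df_nohead)
```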

I suppose an option along the lines of header = 1L might work (but could also break code relying on 1L evaluating to TRUE?).

mattdowle · 2018-03-03T10:53:44Z

Thanks @skanskan. Yes I know what you mean and agree. A recent change in dev is that the skip= control determines which line the data starts on. Whether column names or not is now correctly determined by header=TRUE|FALSE|"auto" with default "auto". And the "auto" is more advanced now too.
Please check dev 1.10.5 as of now and raise a new issue if it's not as you want, including a small reproducible example too please.
@MichaelChirico If you could check the recent change to skip= too please, since you commented here.

MichaelChirico · 2018-03-03T11:04:08Z

I think current behavior is good. I notice header = 'auto' may not accomplish what was intended in this thread?

# very nice
fread('# some metadata
# created by
# created date
# column types/YAML
X1,X2,X3,X4
1,2,3,4
5,6,7,8', skip = 4L)
#    X1 X2 X3 X4
# 1:  1  2  3  4
# 2:  5  6  7  8

# not as intended?
fread('X1,X2,X3,X4
gobbledygook
lorum ipsum
spaz typing
1,2,3,4
5,6,7,8', skip = 4L)
#    V1 V2 V3 V4  <- should be X1:4, no?
# 1:  1  2  3  4
# 2:  5  6  7  8

mattdowle · 2018-04-22T06:52:28Z

That output looks as intended; i.e., no attempt made (not now nor in future) to remove junk lines between the column names and the first data row. If column names are present, they must be on the line immediately before where the data rows start (well, other than blank lines, depending on blank.lines.skip). I'm only aware of files having banners above the column names.

datafj · 2020-06-05T20:11:06Z

If you view skipped rows as junk lines, this makes sense. However, if you use skipping rows as a way to save reading time, this does not make sense.

I have million-row data saved as CSV, with a timestamp in the first column serving as a sorted index. I read the first column first to locate the rows I need, and then I read the full data between row a and row b using skip and nrows.

I'm not saying which way is right or wrong, just presenting a valid use case for keeping the first row as the header row.
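A rough sketch of that chunked-read pattern, assuming a hypothetical file with an integer timestamp column `ts` and the header on line 1 (the file, the column names and the range 41 to 50 are made up for illustration):

```r
library(data.table)

# Hypothetical file: sorted integer timestamp in the first column.
tmp <- tempfile(fileext = ".csv")
fwrite(data.table(ts = 1:100, x = (1:100) * 2L), tmp)

# Step 1: read only the index column and locate the wanted row range.
idx <- fread(tmp, select = 1L)
rows <- which(idx$ts >= 41 & idx$ts <= 50)
a <- min(rows)
b <- max(rows)

# Step 2: data row a sits on file line a + 1 (line 1 is the header),
# so skipping a lines starts the read exactly on row a.
# The header is no longer in view, so the names are supplied explicitly.
chunk <- fread(tmp, skip = a, nrows = b - a + 1L, header = FALSE,
               col.names = names(fread(tmp, nrows = 0L)))
print(chunk)
```

The extra `fread(tmp, nrows = 0L)` call is there only to recover the column names, which is the overhead this use case would like to avoid.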

jangorecki · 2020-06-05T20:22:18Z

@jflycn To reliably implement that there should be two different arguments: one for skipping junk rows, and another for chunking. I recall @st-pasha recently explained why that matters.
Not sure if there is an FR for that already, but you can always create a new one if you cannot find an existing one.

st-pasha · 2020-06-05T21:08:16Z

Currently the header argument could be TRUE, FALSE, or "auto". Conceivably, we could extend it to also accept integers, so that for example header=1 would mean "the header is on the 1st line of the file". Similarly, header=3 would mean that the header is on the 3rd line, etc.

This would be independent of skipping, so that you can say header=1 and skip=1000000 to skip the first 1M lines, while still taking the column names from the first line.

This could be taken even further: header=1:2 could be used for files with multi-line headers. I seem to recall there was a request for this not too long ago.
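Until something like that exists, the single-line case can be approximated in user code. The sketch below uses a hypothetical helper, `fread_with_header_line`, which is not part of data.table; it assumes a comma-separated header and that `skip` is at least `header_line`, so the header line itself is never read as data.

```r
library(data.table)

# Hypothetical helper, not part of data.table: emulates the proposed
# header = <line number> by reading that line separately from the data.
fread_with_header_line <- function(file, header_line = 1L, skip = 0L, ...) {
  # Read only as far as the header line and keep that line.
  hdr <- readLines(file, n = header_line)[header_line]
  # Assumes a comma-separated header; scan() also strips surrounding quotes.
  col_names <- scan(text = hdr, what = "", sep = ",", quiet = TRUE)
  # Read the data from the requested position and attach the saved names.
  fread(file, skip = skip, header = FALSE, col.names = col_names, ...)
}

# Made-up sample file for illustration.
tmp <- tempfile(fileext = ".csv")
writeLines(c("X1,X2,X3,X4", "1,2,3,4", "5,6,7,8", "9,10,11,12"), tmp)

dt <- fread_with_header_line(tmp, header_line = 1L, skip = 2L)
print(dt)
```

The multi-line variant (header=1:2) is not covered here, since how the lines should be combined into names is not specified in this thread.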

trashbirdecology · 2021-11-20T00:35:29Z

@st-pasha, is there another open or resolved issue associated with this? I see it was closed but has no reference to a PR.

I am very much interested in this feature.

arunsrinivasan added the fread label Mar 28, 2017

st-pasha added the enhancement label Jul 6, 2017

st-pasha mentioned this issue Jul 6, 2017

Master task for fread bugs / proposals #2247

Closed

mattdowle closed this as completed Mar 3, 2018

MichaelChirico reopened this Nov 20, 2021
