fread skip doesn't get the header names. #2080

Open
skanskan opened this issue Mar 26, 2017 · 8 comments
Comments

@skanskan

When you use fread with the option skip=, the file is read skipping the first lines.
That's OK, but there is a small problem: the first line contains the header, so you end up with no column names in your data.table.
I work around the problem by using fread twice: first reading only the first line and saving its content, then reading the file again, skipping as desired, and finally renaming the columns.

I think it would be a good idea if fread could optionally read the first line and use it for the column names, no matter the skip value.
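
The two-pass workaround described above can be sketched roughly as follows. This is only an illustration, not code from the original poster: the temporary file, its contents, and the variable names are made up, and it assumes the header really is on the first line of the file.

```r
library(data.table)

# Made-up stand-in file: header on line 1, then four data rows.
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,value", "1,10", "2,20", "3,30", "4,40"), tmp)

# Pass 1: read zero data rows, just to capture the column names.
col_names <- names(fread(file = tmp, nrows = 0L, header = TRUE))

# Pass 2: skip the header and the first two data rows (3 lines in total),
# treat the first line read as data, and apply the saved names.
dt <- fread(file = tmp, skip = 3L, header = FALSE, col.names = col_names)
```

Passing `col.names` in the second call avoids a separate renaming step; calling `setnames(dt, col_names)` afterwards should work just as well.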

@MichaelChirico
Member

This is consistent with other readers (e.g. read.csv/readLines) & in my experience typically the desired behavior -- most of the time I prefer to supply my own column names anyway, and usually when I need to skip in a file, the header is not on the first line, but pushed down several lines.

I suppose an option along the lines of header = 1L might work (but could also break code relying on 1L evaluating to TRUE?).

@mattdowle
Member

Thanks @skanskan. Yes, I know what you mean and agree. A recent change in dev is that the skip= control determines which line the data starts on. Whether column names are present is now determined separately by header=TRUE|FALSE|"auto", with default "auto". The "auto" detection is more advanced now too.
Please check dev 1.10.5 as of now and raise a new issue if it's not as you want, including a small reproducible example too please.
@MichaelChirico If you could check the recent change to skip= too please, since you commented here.

@MichaelChirico
Member

MichaelChirico commented Mar 3, 2018

I think current behavior is good. I notice header = 'auto' may not accomplish what was intended in this thread?

# very nice
fread('# some metadata
# created by
# created date
# column types/YAML
X1,X2,X3,X4
1,2,3,4
5,6,7,8', skip = 4L)
#    X1 X2 X3 X4
# 1:  1  2  3  4
# 2:  5  6  7  8

# not as intended?
fread('X1,X2,X3,X4
gobbledygook
lorum ipsum
spaz typing
1,2,3,4
5,6,7,8', skip = 4L)
#    V1 V2 V3 V4  <- should be X1:4, no?
# 1:  1  2  3  4
# 2:  5  6  7  8

@mattdowle
Member

That output looks as intended; i.e., no attempt made (not now nor in future) to remove junk lines between the column names and the first data row. If column names are present, they must be on the line immediately before where the data rows start (well, other than blank lines, depending on blank.lines.skip). I'm only aware of files having banners above the column names.

@datafj

datafj commented Jun 5, 2020

If you view skipped rows as junk lines, this makes sense. However, if you use skipping rows as a way to save reading time, this does not make sense.

I have a CSV with millions of rows, with a sorted timestamp index in the first column. I first read only that column to locate the rows I need, and then read the full data between row a and row b using skip and nrows.

I'm not saying which way is right or wrong, just presenting a valid use case for keeping the first row as the header row.
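
A rough sketch of that chunked-read pattern with the current fread arguments might look like the following. The file, the column names (`ts`, `x`, `y`), and the timestamp bounds are invented for illustration, and the sketch assumes the header is on line 1 and the first column is sorted.

```r
library(data.table)

# Made-up file: sorted timestamp in the first column.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ts,x,y",
             "100,1,a", "200,2,b", "300,3,c", "400,4,d", "500,5,e"), tmp)

# Step 1: read only the first column to locate the rows of interest.
idx  <- fread(file = tmp, select = 1L, header = TRUE)
rows <- which(idx$ts >= 200 & idx$ts <= 400)  # data-row positions (header not counted)
a <- min(rows)
b <- max(rows)

# Step 2: capture the column names from the header line.
col_names <- names(fread(file = tmp, nrows = 0L, header = TRUE))

# Step 3: data row `a` sits on file line a + 1, so skipping `a` lines
# (the header plus a - 1 data rows) starts the read exactly at row `a`.
chunk <- fread(file = tmp, skip = a, nrows = b - a + 1L,
               header = FALSE, col.names = col_names)
```

One caveat with this approach: column types are detected from the chunk that is read, so they may differ from what a full read of the file would give unless `colClasses` is supplied.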

@jangorecki
Member

@jflycn To implement that reliably, there should be two different arguments: one for skipping junk rows, and another for chunking. I recall @st-pasha recently explained why that matters.
I'm not sure whether there is an FR for that already, but you can always create a new one if you cannot find an existing one.

@st-pasha
Contributor

st-pasha commented Jun 5, 2020

Currently the header argument could be TRUE, FALSE, or "auto". Conceivably, we could extend it to also accept integers, so that for example header=1 would mean "the header is on the 1st line of the file". Similarly, header=3 would mean that the header is on the 3rd line, etc.

This would be independent of skipping, so that you can say header=1 and skip=1000000 to skip the first 1M lines, while still taking the column names from the first line.

This could be taken even further: header=1:2 could be used for files with multi-line headers. I seem to recall there was a request for this not too long ago.
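
Until something like an integer `header=` exists, the single-line case can be approximated with a small wrapper. The function below is purely hypothetical (`fread_header_at` and its arguments are not part of data.table); it only combines the existing `skip`, `nrows`, `header`, and `col.names` arguments, and it does not attempt the multi-line `header=1:2` idea.

```r
library(data.table)

# Hypothetical helper, not a data.table function: take the column names from
# line `header_line`, then read data after skipping the first `skip` lines.
fread_header_at <- function(file, header_line = 1L, skip = header_line, ...) {
  stopifnot(header_line >= 1L, skip >= header_line)
  # Read zero data rows starting at the header line to get the names only.
  col_names <- names(fread(file = file, skip = header_line - 1L,
                           nrows = 0L, header = TRUE))
  # Read the data itself, treating the first line read as data.
  fread(file = file, skip = skip, header = FALSE, col.names = col_names, ...)
}

# Made-up example file: header on line 1, four data rows.
tmp <- tempfile(fileext = ".csv")
writeLines(c("ts,x", "1,10", "2,20", "3,30", "4,40"), tmp)

# Names come from line 1; data starts after skipping 3 lines.
res <- fread_header_at(tmp, header_line = 1L, skip = 3L)
```

Extra arguments such as `nrows` are passed through `...` to the second fread call, so the wrapper can also be used for reading a bounded chunk.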

@trashbirdecology

trashbirdecology commented Nov 20, 2021

@st-pasha, is there another open or resolved issue associated with this? I see it was closed but has no reference to a PR.

I am very much interested in this feature.
