read_csv fails to guess type of integer column #1094

Closed
tmalsburg opened this issue May 14, 2020 · 6 comments

@tmalsburg

tmalsburg commented May 14, 2020

The data has 73 rows and all entries in the last column are integers. Yet read_csv guesses col_double().

Example:

library(readr)
read_csv("nettle_1999_climate.csv") -> nettle

Results in:

Parsed with column specification:
cols(
  Country = col_character(),
  Population = col_double(),
  Area = col_double(),
  MGS = col_double(),
  Langs = col_double()
)

Last line should be Langs = col_integer().

This is using up-to-date readr from CRAN and R version 3.6.3 (2020-02-29).

@jimhester
Collaborator

This is actually by design: readr never guesses that a column is of type integer.

In earlier versions of readr we did guess that columns were of type integer. However, this caused numerous issues in practice when the first X rows were integers and floating point numbers appeared later in the file.
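A minimal sketch of that failure mode, using readr's vector parsers rather than the old file-reading internals (which are not shown in this thread). The vectors `head_rows` and `all_rows` are hypothetical: the first stands in for the rows used for guessing, the second for the full column.

```r
library(readr)

# Hypothetical column: the rows seen during guessing are whole numbers,
# but a fractional value turns up later in the file.
head_rows <- c("1", "2", "3")
all_rows  <- c(head_rows, "4.5")

# Guessing from the first rows only, with integer guessing switched on,
# settles on integer.
guessed_class <- class(parse_guess(head_rows, guess_integer = TRUE))

# Applying that integer guess to the full column cannot represent "4.5":
# the value becomes NA and readr records a parsing problem.
as_int <- suppressWarnings(parse_integer(all_rows))

# Parsing as double keeps every value.
as_dbl <- parse_double(all_rows)
```

Here `guessed_class` is `"integer"`, `as_int[4]` is `NA`, and `as_dbl[4]` is `4.5`.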

Since every 32-bit integer fits exactly in a 64-bit double, guessing double ensures we don't lose information. If you know you want Langs to be integer, you can specify that explicitly, e.g. read_csv("nettle_1999_climate.csv", col_types = list(Langs = "i"))
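As a self-contained illustration of the explicit specification, here is a sketch that does not need the original file. The temporary CSV and its values are made up for the example, and `cols(Langs = col_integer())` is used as an equivalent spelling of the `"i"` shorthand.

```r
library(readr)

# Hypothetical stand-in for nettle_1999_climate.csv; the values are made up.
path <- tempfile(fileext = ".csv")
writeLines(c("Country,Langs", "A,18", "B,42", "C,7"), path)

# Default guessing: whole numbers are read as double.
guessed <- read_csv(path)
class(guessed$Langs)   # "numeric"

# Explicit specification: only Langs is forced to integer,
# all other columns are still guessed.
explicit <- read_csv(path, col_types = cols(Langs = col_integer()))
class(explicit$Langs)  # "integer"
```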

@tmalsburg
Author

Makes sense. However, if newer versions of readr don't guess integer, that is important info that should be added to the documentation. Is it okay if I open a documentation issue?

@jimhester
Collaborator

@tmalsburg
Author

Shouldn't it go into the man page of read_delim et al? I doubt that users will find it in the vignette.

@gregreich

Hi, wouldn't it be most consistent if an option like guess_integer were available for read_csv(), maybe with default value FALSE, just as there is for parse_guess()? If I know and/or trust my file, I could still benefit from automatic integer detection.

@andysouth

I came across this because I too didn't know why integers weren't guessed by read_csv() and I wanted them to be :-)

My use case is that I'm reading in tables with ID columns from a standard format (OMOP) that other code expects to be integer and it causes later join failures if they are double.

Found this solution on Stack Overflow that seems to work for me.
It reads all columns as character and then uses type_convert() (also from readr), which does have a guess_integer argument. Here `file` stands for the path to your CSV:
read_csv(file, col_types = cols(.default = "c")) %>% type_convert(guess_integer = TRUE)

(Although I do feel a little nervous deliberately circumventing a design feature ...)
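A runnable sketch of this workaround, assuming a small made-up table: the temporary file and the `person_id` and `value` columns are hypothetical, chosen only to resemble an ID column next to a measurement column.

```r
library(readr)

# Hypothetical table with an ID column; the values are made up.
path <- tempfile(fileext = ".csv")
writeLines(c("person_id,value", "1,0.5", "2,1.5", "3,2"), path)

# Step 1: read every column as character so nothing is guessed yet.
raw <- read_csv(path, col_types = cols(.default = col_character()))

# Step 2: re-guess the types, this time allowing integer.
converted <- type_convert(raw, guess_integer = TRUE)

class(converted$person_id)  # "integer"
class(converted$value)      # "numeric"
```

Note that `type_convert()` guesses from the whole column already in memory, so a column only becomes integer if every value in it parses as one.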

A little more explicit documentation in the man page of read_csv would be useful :-)
Thanks all.
