fread - if there is one string in a numeric column, guess it represents NA #2100
Comments
A recurring large negative value could optionally be converted to NA as well. This would avoid the need for a mechanism to define different numeric NA values for different columns, e.g. -999 in one column but -9999.9 in another column where 999 is a valid observation value. This is especially plausible if the large negative outlier is the only negative value.
As described above, one string value such as "#N/A" is relatively straightforward to detect automatically: the column would be full of numeric values apart from that one value, which is not a valid numeric. But how do we detect the "obvious" outlier when it is itself numeric, e.g. -9999, or even 99 in a column of numbers in the range [1,40]? Currently the user can pass in such numeric values manually in Was just speaking to Leland Wilkinson in our office about this. Proposal: fread could use its large sample to calculate the min (min1) and max (max1) trivially, but also the 2nd smallest (min2) and 2nd largest (max2) values. This is simple and efficient, and needs no 2nd pass. If one of Views?
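To make the min1/min2/max1/max2 idea concrete, here is a minimal Python sketch of tracking the two smallest and two largest distinct values in a single pass. The function name `two_smallest_and_largest` and the sample data are hypothetical, and this is not fread's implementation. How the four values would then be compared to flag an NA candidate is left open here, since the thread does not settle on a rule.

```python
import math

def two_smallest_and_largest(values):
    """Return (min1, min2, max1, max2) over the distinct values, in one pass.

    Hypothetical helper for illustration only; not part of data.table.
    Repeats of the current extreme are ignored, so a recurring sentinel
    such as -9999 never ends up as both min1 and min2.
    """
    min1 = min2 = math.inf
    max1 = max2 = -math.inf
    for v in values:
        if v < min1:
            min1, min2 = v, min1      # new smallest; old smallest becomes 2nd
        elif min1 < v < min2:
            min2 = v                  # new 2nd smallest
        if v > max1:
            max1, max2 = v, max1      # new largest; old largest becomes 2nd
        elif max2 < v < max1:
            max2 = v                  # new 2nd largest
    return min1, min2, max1, max2

# Made-up sample: values in [1, 40] plus a recurring -9999 sentinel.
sample = [3, 17, -9999, 25, 40, 1, -9999, 12]
min1, min2, max1, max2 = two_smallest_and_largest(sample)
print(min1, min2, max1, max2)  # -9999 1 40 25
```

In this sample the gap between min1 and min2 (10000) dwarfs the spread of the remaining values (1 to 40), which is the kind of signal the proposal would look at.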
There are naturally occurring quantities that are distributed on a log scale. These could be: GDPs of countries, individual incomes, sizes of files on disk, masses of stars, energies of particles in cosmic rays (see Oh-My-God particle), etc. For quantities like these it is expected that the largest value would look like an outlier. Nonetheless, it would be a valid value that should not be altered or removed in any way. Although I agree with Leland that there are systems that encode NAs as 999s (shame on them!), and that this has led real people to make errors in their data analysis, I think it is ultimately a judgement call whether a particular outlier is NA or anything else, and
I'm with @st-pasha here (w.r.t automatically detecting "numeric" As an aside, I also think there are valid (ish) cases for using "numeric" NA -- the example that comes to mind is the Common Core of Data. See here --
Basically, |
Instead of bumping a numeric column to character on seeing the first character value, fread could wait and see whether that was the only character value present in the whole column. If so, it could assume the value represents NA, emit a warning, and keep the column as numeric. This saves the user from having to rerun while passing na.strings.
(Suggested by Pasha not me, a great idea.)
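The "wait and see" behaviour requested here can be sketched in Python as follows. This is only an illustration under stated assumptions, not how fread is implemented: `parse_numeric_column` is a hypothetical name, `None` stands in for NA, and "only character value" is read as "exactly one distinct non-numeric string", however many times it occurs.

```python
import warnings

def parse_numeric_column(fields):
    """Parse a column of raw text fields, deferring the type decision.

    Hypothetical sketch: if every field is numeric, return floats. If exactly
    one distinct non-numeric string occurs, treat it as NA (None) with a
    warning and keep the column numeric. Otherwise fall back to strings.
    """
    numbers = []
    odd_strings = set()
    for field in fields:
        try:
            numbers.append(float(field))
        except ValueError:
            odd_strings.add(field)    # remember it, but do not bump the type yet
            numbers.append(None)      # placeholder in case it turns out to be NA
    if len(odd_strings) > 1:
        return list(fields)           # genuinely a character column
    if odd_strings:
        (token,) = odd_strings
        warnings.warn(f"treating {token!r} as NA in an otherwise numeric column")
    return numbers

print(parse_numeric_column(["1.5", "#N/A", "3", "#N/A"]))  # [1.5, None, 3.0, None]
print(parse_numeric_column(["1", "a", "b"]))               # ['1', 'a', 'b']
```

Deferring the decision to the end of the column costs one extra set of distinct odd strings, and avoids a rerun when the column turns out to be numeric after all.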