When fread bumps the type of a sparse column that is entirely NA in the rows sampled for type detection, the bump appears to destroy the NA values already read, resulting in incorrect null strings.
input<-'"Integer","Numeric","Logical","Character"NA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NA1,1.1,FALSE,"a"NA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NANA,NA,NA,NA'table<- fread(input, verbose=TRUE)
# Input contains a \n (or is ""). Taking this to be text input (not a filename)
# Detected eol as \n only (no \r afterwards), the UNIX and Mac standard.
# Using line 30 to detect sep (the last non blank line in the first 'autostart') ... sep=','
# Found 4 columns
# First row with 4 fields occurs on line 1 (either column names or first row of data)
# All the fields on line 1 are character fields. Treating as the column names.
# Count of eol after first data row: 34
# Subtracted 0 for last eol and any trailing empty lines, leaving 34 data rows
# Type codes: 1111 (first 5 rows)
# Type codes: 1111 (+middle 5 rows)
# Type codes: 1111 (+last 5 rows)
# Type codes: 1111 (after applying colClasses and integer64)
# Type codes: 1111 (after applying drop or select (if supplied)
# Allocating 4 column slots (4 - 0 NULL)
# Bumping column 2 from INT to INT64 on data row 26, field contains '1.1'
# Bumping column 2 from INT64 to REAL on data row 26, field contains '1.1'
# Bumping column 3 from INT to INT64 on data row 26, field contains 'FALSE'
# Bumping column 3 from INT64 to REAL on data row 26, field contains 'FALSE'
# Bumping column 3 from REAL to STR on data row 26, field contains 'FALSE'
# Bumping column 4 from INT to INT64 on data row 26, field contains '"a"'
# Bumping column 4 from INT64 to REAL on data row 26, field contains '"a"'
# Bumping column 4 from REAL to STR on data row 26, field contains '"a"'
# 0.000s ( 3%) Memory map (rerun may be quicker)
# 0.000s ( 5%) sep and header detection
# 0.000s ( 1%) Count rows (wc -l)
# 0.001s ( 46%) Column type detection (first, middle and last 5 rows)
# 0.000s ( 1%) Allocation of 34x4 result (xMB) in RAM
# 0.000s ( 1%) Reading data
# 0.000s ( 0%) Allocation for type bumps (if any), including gc time if triggered
# 0.001s ( 41%) Coercing data already read in type bumps (if any)
# 0.000s ( 0%) Changing na.strings to NA
# 0.002s Total
...
Warning messages:
1: In fread(input, verbose = TRUE) :
Bumped column 3 to type character on data row 26, field contains 'FALSE'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
2: In fread(input, verbose = TRUE) :
Bumped column 4 to type character on data row 26, field contains '"a"'. Coercing previously read values in this column from integer or numeric back to character which may not be lossless; e.g., if '00' and '000' occurred before they will now be just '0', and there may be inconsistencies with treatment of ',,' and ',NA,' too (if they occurred in this column before the bump). If this matters please rerun and set 'colClasses' to 'character' for this column. Please note that column type detection uses the first 5 rows, the middle 5 rows and the last 5 rows, so hopefully this message should be very rare. If reporting to datatable-help, please rerun and include the output from verbose=TRUE.
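The warning above suggests rerunning with 'colClasses' set for the affected columns. A minimal sketch of such a rerun follows. It assumes the input layout implied by the verbose log (25 all-NA rows, the populated row on data row 26, then 8 all-NA rows) and a data.table version whose colClasses accepts "logical". Whether this sidesteps the NA loss on 1.9.2 or 1.9.3 has not been verified in this report. The requireNamespace guard and the names na_row and typed are illustrative only.

```r
# Sketch only: declare the column types up front so fread has no reason to
# bump a column after reading the leading all-NA rows.
if (requireNamespace("data.table", quietly = TRUE)) {
  na_row <- "NA,NA,NA,NA"
  input <- paste(c(
    '"Integer","Numeric","Logical","Character"',
    rep(na_row, 25),      # all-NA rows before the populated row
    '1,1.1,FALSE,"a"',    # data row 26 in the verbose log
    rep(na_row, 8)        # all-NA rows after it (34 data rows in total)
  ), collapse = "\n")

  typed <- data.table::fread(
    input,
    colClasses = c("integer", "numeric", "logical", "character")
  )

  # Inspect the resulting classes and how many real NAs each column holds.
  print(sapply(typed, class))
  print(colSums(is.na(typed)))
}
```

If the NA count for the Character column comes back lower than 33, the NA values are still being lost even without a type bump.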
Please note that the report above was produced on 1.9.2, which is why the default type codes are 1111. A colleague has run the same input on 1.9.3 and confirms that, while the logical-detection portion of the above is fixed there, the character column still behaves the same way.
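To make "destroyed NA values" checkable across versions, it may help to compare the fread result against a reference parse. The sketch below uses base R's read.csv as that reference, which is an assumption about what the correct result looks like rather than something the report states. It also defines a hypothetical helper, na_summary, that counts real NAs versus literal "NA" or empty strings per column. The input is rebuilt from the row counts in the verbose log.

```r
# Rebuild the report's input: header, 25 all-NA rows, the populated row,
# then 8 all-NA rows (34 data rows, as in the verbose log).
na_row <- "NA,NA,NA,NA"
input <- paste(c(
  '"Integer","Numeric","Logical","Character"',
  rep(na_row, 25),
  '1,1.1,FALSE,"a"',
  rep(na_row, 8)
), collapse = "\n")

# Reference parse with base R (assumed to represent the expected result).
ref <- read.csv(text = input, stringsAsFactors = FALSE)

# Hypothetical helper: per column, report the class, the number of real NAs,
# and the number of non-NA cells that hold the literal string "NA" or "".
na_summary <- function(df) {
  data.frame(
    column      = names(df),
    class       = vapply(df, function(x) class(x)[1], character(1)),
    n_na        = vapply(df, function(x) sum(is.na(x)), integer(1)),
    n_na_string = vapply(
      df,
      function(x) sum(!is.na(x) & as.character(x) %in% c("NA", "")),
      integer(1)
    ),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

print(na_summary(ref))
```

Calling na_summary(table) on the fread result from the report and comparing it with na_summary(ref) should show which columns lost their NAs and whether they were replaced by literal strings.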
If there's anything we can do to assist in getting this fixed and the new release with this and the other logical type fixes out, feel free to contact me at adam.kennedy@kaggle.com and let me know.
arunsrinivasan changed the title from "NA values are destroyed in sparse character columns" to "NA values are destroyed in sparse character columns in fread" on Aug 1, 2014.
…, elements in the CSV that are the character ? are correctly coerced to NA in the output data.table, but any column containing a ? becomes a character column.
After removing every line that contains a ? from my data with sed -i '/\?/d' K9.edited.data, the problem is solved: all float columns in the CSV become numeric columns in the data.table.
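The same row filtering can be sketched in R without touching the file on disk. This is only an illustration of the sed step above: the sample data and the names lines, kept, and clean are made up, and base R's read.csv stands in for the reader. Like the sed command, it drops entire rows that contain a ?, so any valid values on those rows are lost as well. Declaring ? as an NA marker through the reader's na.strings argument is the usual non-destructive alternative.

```r
# Made-up sample standing in for K9.edited.data: "?" marks a missing value.
lines <- c(
  "x,y",
  "1.5,2.5",
  "?,3.5",
  "4.5,?",
  "6.5,7.5"
)

# Equivalent of sed '/\?/d': drop every line containing a literal "?".
kept <- lines[!grepl("?", lines, fixed = TRUE)]

# With the "?" rows gone, both columns parse as numeric.
clean <- read.csv(text = paste(kept, collapse = "\n"))
print(sapply(clean, class))
print(clean)
```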
…lumns (colClasses=...), in order to remove a warning encountered when the '?' column value is read. The warning seems to be a bug in fread, documented here: Rdatatable/data.table#737
Confirm this is broken in 1.9.3.