Type inference problem with read_csv #9669

diehl · 2015-03-17T04:46:50Z

I have a large CSV file that contains a single column of integers. When loading the CSV file with read_csv, nearly 2/3rds of the approximately 1.5 million values are loaded as ints while the remaining values are loaded as strings. I see no obvious problem with the file that would lead to this behavior.
The file I used is available here: https://drive.google.com/file/d/0ByZvgdTf0yfAT2dPdHRvc2hLVkU/view?usp=sharing

import pandas as pd
from collections import Counter

test_df = pd.read_csv('test.csv', header=0)
# Tally the Python type of every value in the column.
c = Counter()
for el in test_df['SERIALNO'].values:
    c[type(el)] += 1
print(c)
Counter({<type 'int'>: 952026, <type 'str'>: 524288})

It appears that a contiguous block of rows in the middle of the file was read as strings. Very curious.
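To pin down where such a block starts and ends, the type tally can be paired with the row positions of the string values. The following is only a sketch: the helper name `summarize_types` and the tiny sample column are made up for illustration and stand in for the real `SERIALNO` data.

```python
import pandas as pd


def summarize_types(series):
    """Count values by Python type name and report the row span of the strings.

    Hypothetical helper for illustration; not part of pandas.
    """
    # Count by type *name* so the result can be indexed with plain strings.
    type_counts = series.map(lambda v: type(v).__name__).value_counts()
    # Boolean mask marking the values that were loaded as strings.
    is_str = series.map(lambda v: isinstance(v, str))
    str_rows = series.index[is_str]
    # First and last row label holding a string, or None if there are none.
    span = (int(str_rows[0]), int(str_rows[-1])) if len(str_rows) else None
    return type_counts, span


# Made-up stand-in for the real column: ints with a block of strings in the middle.
sample = pd.Series([1, 2, "3", "4", "5", 6], dtype=object, name="SERIALNO")
counts, span = summarize_types(sample)
print(counts)
print(span)
```

Knowing the first and last affected row makes it easier to open the file at those lines and compare them with their neighbours.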

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.1.0
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None


jreback · 2015-03-17T10:32:30Z

xref is #3866

Something is throwing the inference engine off.
You can try iterating by chunks and seeing where the invalid data is.

After it's read in, you could do .convert_objects(convert_numeric=True). If this works, there might be a deeper issue.
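For readers on newer pandas releases: `convert_objects` was deprecated and later removed, and `pd.to_numeric` covers the numeric conversion. The sketch below assumes a made-up mixed column rather than the file from this issue. With `errors="coerce"`, anything that cannot be parsed becomes NaN, which also points at the offending rows.

```python
import pandas as pd

# Made-up column mixing ints, numeric strings, and one genuinely bad value.
mixed = pd.Series([1, 2, "3", "4", "x5"], dtype=object, name="SERIALNO")

# Values that cannot be parsed as numbers become NaN instead of raising.
converted = pd.to_numeric(mixed, errors="coerce")

# Rows that failed to convert are the ones worth inspecting in the file.
bad_rows = mixed[converted.isna()]

print(converted)
print(bad_rows)
```

If `bad_rows` comes back empty, every string in the column was a clean number and the file content itself is probably not the culprit.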

diehl · 2015-03-17T14:05:09Z

@jreback I just tried the convert_objects call and it works. I've manually inspected the CSV file hunting for hidden characters and have seen nothing.

I found a similar problem on Stack Overflow that suggests this problem has been around for a bit.
http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue

All the evidence so far points to a deeper issue.

jreback · 2015-03-17T19:40:12Z

@diehl the issue you pointed to is really completely different, though if you DID have actual string-likes then I suppose it could be the same (it IS really hard to inspect visually for this kind of thing). That's why I suggested you iterate in read_csv using chunksize=1000 or something, and narrow down which chunk gives this error. If none does, then we can discuss a bug from there.
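A minimal sketch of the chunked approach suggested above, assuming the column is read as text so that each value can be checked individually. The function name `find_bad_chunks` and the inline sample data are hypothetical; on the real file, `source` would be the path to `test.csv` and `chunksize` something like 1000.

```python
import io

import pandas as pd


def find_bad_chunks(source, column, chunksize=1000):
    """List (chunk_number, first_bad_row, bad_values) for chunks with non-numeric data.

    Hypothetical helper for illustration; not part of pandas.
    """
    bad = []
    # Read the column as text so type inference cannot hide the problem.
    reader = pd.read_csv(source, header=0, chunksize=chunksize, dtype={column: str})
    for number, chunk in enumerate(reader):
        values = chunk[column]
        # Anything that is not a parseable number (or is missing) becomes NaN.
        numeric = pd.to_numeric(values, errors="coerce")
        offenders = values[numeric.isna()]
        if not offenders.empty:
            bad.append((number, int(offenders.index[0]), offenders.tolist()))
    return bad


# Made-up data: one malformed value in the fourth data row.
csv_text = "SERIALNO\n" + "\n".join(["10", "11", "12", "1x3", "14", "15"]) + "\n"
print(find_bad_chunks(io.StringIO(csv_text), "SERIALNO", chunksize=2))
```

Row labels keep counting across chunks, so the reported row can be looked up directly in the file (offset by one for the header line).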

diehl · 2015-03-17T23:19:49Z

@jreback False alarm. After more digging I found the issue with the file. Thanks for the feedback.

jreback · 2015-03-17T23:55:36Z

np

jreback added the Bug and IO CSV (read_csv, to_csv) labels Mar 17, 2015

diehl closed this as completed Mar 17, 2015
