Type inference problem with read_csv #9669

Closed
diehl opened this issue Mar 17, 2015 · 5 comments
Labels
Bug IO CSV read_csv, to_csv

Comments

@diehl

diehl commented Mar 17, 2015

I have a large CSV file that contains a single column of integers. When loading the CSV file with read_csv, nearly 2/3rds of the approximately 1.5 million values are loaded as ints while the remaining values are loaded as strings. I see no obvious problem with the file that would lead to this behavior.
The file I used is available here: https://drive.google.com/file/d/0ByZvgdTf0yfAT2dPdHRvc2hLVkU/view?usp=sharing

import pandas as pd
from collections import Counter

test_df = pd.read_csv('test.csv', header=0)

# Tally the Python type of every value in the column
c = Counter()
for el in test_df['SERIALNO'].values:
    c[type(el)] += 1
print(c)
Counter({<type 'int'>: 952026, <type 'str'>: 524288})

It appears that a contiguous block of rows was read as strings: a block in the middle of the file. Very curious.
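To pin down where such a block starts and ends, one option is to collapse the column into runs of same-typed values. This is a minimal sketch using only the standard library; `type_runs` and the sample list are hypothetical names introduced for illustration, not part of pandas.

```python
from itertools import groupby


def type_runs(values):
    """Collapse a sequence into (type_name, start, length) runs of same-typed items."""
    runs = []
    position = 0
    # groupby only merges *adjacent* items with the same key, which is
    # exactly what is needed to find contiguous blocks.
    for type_name, group in groupby(values, key=lambda v: type(v).__name__):
        length = sum(1 for _ in group)
        runs.append((type_name, position, length))
        position += length
    return runs


# Small stand-in for the column: ints, then a block of strings, then ints.
sample = [1, 2, "3", "4", "5", 6]
print(type_runs(sample))
# → [('int', 0, 2), ('str', 2, 3), ('int', 5, 1)]
```

Applied to the real data, something like `type_runs(test_df['SERIALNO'].values)` should report the start offset and length of the string block, which narrows down which rows of the file to inspect.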

pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Darwin
OS-release: 13.4.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.5.0
IPython: 3.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.0
openpyxl: 2.1.0
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None

@jreback
Contributor

jreback commented Mar 17, 2015

xref is #3866

Something is throwing the inference engine off.
You can try iterating by chunks to see where the invalid data is.

After it's read in, you could do .convert_objects(convert_numeric=True). If this works, there might be a deeper issue.
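For readers on newer pandas: `convert_objects` was deprecated in later releases and is not available in current ones. A rough equivalent of the suggestion above, assuming a pandas version that provides `pd.to_numeric`, might look like the sketch below. The `mixed` series is a made-up stand-in for the column in question.

```python
import pandas as pd

# Hypothetical mixed column, mimicking the int/str mix reported in this issue.
mixed = pd.Series([1, 2, "3", "4", "oops"], dtype=object)

# errors="coerce" turns anything that cannot be parsed as a number into NaN
# instead of raising, so the conversion always completes.
converted = pd.to_numeric(mixed, errors="coerce")

# Values that became NaN are the ones that were not numeric text.
unparseable = mixed[converted.isna()]

print(converted.tolist())
print(unparseable.tolist())
```

If `unparseable` comes back empty, every string in the column was numeric text, which is the situation the conversion suggestion is probing for. A non-empty result points at the offending values directly.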

@jreback jreback added Bug IO CSV read_csv, to_csv labels Mar 17, 2015
@diehl
Author

diehl commented Mar 17, 2015

@jreback I just tried the convert_objects call and it works. I've manually inspected the CSV file hunting for hidden characters and have seen nothing.

I found a similar problem on Stack Overflow that suggests this problem has been around for a bit.
http://stackoverflow.com/questions/18471859/pandas-read-csv-dtype-inference-issue

All the evidence so far points to a deeper issue.

@jreback
Contributor

jreback commented Mar 17, 2015

@diehl the issue you pointed to is really completely different, though if you DID have actual string-likes then I suppose it could be the same (it IS really hard to inspect visually for this kind of thing). That's why I suggested you iterate in read_csv using chunksize=1000 or something, and narrow down which chunk gives this error. If none of them does, then we can discuss a bug from there.
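The chunked approach described here can be sketched roughly as follows. The helper name `find_non_integer_rows` and the sample data are assumptions for illustration. The sketch reads every cell as raw text (`dtype=str`) so that type inference cannot hide the problem, then flags any value that is not a plain integer literal.

```python
import io

import pandas as pd


def find_non_integer_rows(source, column, chunksize=1000):
    """Return (row_index, raw_value) pairs whose text is not a plain integer."""
    bad = []
    # dtype=str disables type inference; keep_default_na=False keeps markers
    # such as "N/A" as literal text instead of converting them to NaN.
    reader = pd.read_csv(source, header=0, dtype=str,
                         chunksize=chunksize, keep_default_na=False)
    for chunk in reader:
        is_integer = chunk[column].str.match(r"^[+-]?\d+$").astype(bool)
        # The chunk index continues across chunks, so idx is the row's
        # position in the whole file (not counting the header).
        for idx, raw in chunk.loc[~is_integer, column].items():
            bad.append((int(idx), raw))
    return bad


# Hypothetical stand-in for test.csv with two bad rows.
sample = "SERIALNO\n1\n2\n3x\n4\nN/A\n6\n"
print(find_non_integer_rows(io.StringIO(sample), "SERIALNO", chunksize=2))
```

On the real file, the call would be along the lines of `find_non_integer_rows('test.csv', 'SERIALNO')`. Printing the flagged values with `repr` can help expose characters that are hard to spot when inspecting the file by eye.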

@diehl
Author

diehl commented Mar 17, 2015

@jreback false alarm. after more digging I found the issue with the file. thanks for the feedback.

@diehl diehl closed this as completed Mar 17, 2015
@jreback
Contributor

jreback commented Mar 17, 2015

np
