pandas not loading the csv/tsv completely #23533

Closed
@bicepjai

Description

Code Sample, a copy-pastable example if possible

import pandas as pd
data = pd.read_csv("filename.tsv", sep='\t')
data.shape

Problem description

I have a file for analysis that gets loaded only partially. I cannot share the file. This doesn't happen with all files. Even with error_bad_lines=True, nothing is reported, and there are no warning messages either. Running wc -l filename.tsv gives the exact number of lines in the file, which doesn't match the number of rows reported by pandas. Tableau parses the file properly.
Note: I spent a lot of time trying to understand the behavior and lost a bet that Tableau was the erroneous one :(
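
To make the mismatch concrete, here is a minimal sketch that compares the physical line count with the row counts pandas reports. It rests on an assumption that this report does not confirm: that stray double quotes inside fields make the parser join several lines into one quoted field. The helper name count_rows and the sample data are made up for illustration, since the real file cannot be shared.

```python
import csv
import os
import tempfile

import pandas as pd


def count_rows(path):
    """Return (physical data lines, rows with default quoting, rows with quoting off)."""
    with open(path, "rb") as fh:
        physical = sum(1 for _ in fh) - 1  # subtract the header line
    default_rows = len(pd.read_csv(path, sep="\t"))
    # quoting=csv.QUOTE_NONE tells the parser to treat '"' as an ordinary character
    unquoted_rows = len(pd.read_csv(path, sep="\t", quoting=csv.QUOTE_NONE))
    return physical, default_rows, unquoted_rows


# Made-up sample (assumption): a stray quote on line 2 and another on line 4
sample = 'id\tname\n1\t"alpha\n2\tbeta\n3\tgamma"\n4\tdelta\n'
with tempfile.NamedTemporaryFile("w", suffix=".tsv", delete=False) as tmp:
    tmp.write(sample)

physical, default_rows, unquoted_rows = count_rows(tmp.name)
os.remove(tmp.name)
print(physical, default_rows, unquoted_rows)
```

If the default count is lower than the other two on the real file, quoting is a plausible suspect. If all three agree, the cause lies elsewhere.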

Behaviour

I tried a lot of different options to find out what behavior causes the issue.

After I converted the tsv to csv using

awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' filename.tsv > filename.csv

and then used the following piece of code, I was able to read in the whole file.

data = pd.read_csv("filename.csv", sep='\t')
data.shape

Please note that omitting sep='\t' loaded only part of the file.
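
One thing worth checking when reading the converted file: with sep='\t' on comma-delimited text, each line contains no separator, so every line ends up in a single column. The sketch below illustrates this with made-up data standing in for filename.csv. It only shows how the separator affects the shape and is not a diagnosis of the original file.

```python
import io

import pandas as pd

# Made-up stand-in (assumption) for the converted filename.csv
csv_text = "id,name\n1,alpha\n2,beta\n"

# Tab separator on comma-delimited text: each whole line is one field
as_tab = pd.read_csv(io.StringIO(csv_text), sep="\t")
# Default comma separator: fields are split as expected
as_comma = pd.read_csv(io.StringIO(csv_text))

print(as_tab.shape)
print(as_comma.shape)
```

Comparing data.shape[1] between the two reads shows whether the full row count came with the columns actually split.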

Expected Output

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.8.2
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

Labels: IO CSV (read_csv, to_csv), Needs Info (clarification about behavior needed to assess issue)