Description
Code Sample, a copy-pastable example if possible
import pandas as pd

data = pd.read_csv("filename.tsv", sep='\t')
data.shape
Problem description
I have a file that pandas loads only partially. I cannot share the file, and the problem does not occur with every file. Even with error_bad_lines=True, no error is raised and no warning is printed. Running wc -l filename.tsv gives the exact number of lines in the file, but this does not match the number of rows pandas reports. Tableau parses the file correctly.
Note: I spent a lot of time trying to understand this behavior and lost a bet by claiming Tableau was wrong :(
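The mismatch described above can be checked from Python as well. The following is only a minimal sketch: the helper names count_physical_rows and compare_row_counts are made up for illustration, and it assumes the file has exactly one header line.

```python
import pandas as pd


def count_physical_rows(path, header_lines=1):
    """Count lines in the file, minus the header.

    Similar to `wc -l`, except that a final line without a trailing
    newline is also counted here.
    """
    with open(path, "rb") as fh:
        return sum(1 for _ in fh) - header_lines


def compare_row_counts(path, sep="\t"):
    """Return (physical_rows, parsed_rows) for a delimited text file."""
    parsed_rows = len(pd.read_csv(path, sep=sep))
    return count_physical_rows(path), parsed_rows
```

If the two numbers returned by compare_row_counts("filename.tsv") differ, the parser has merged or dropped some physical lines.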
Behaviour
I tried many different options to find out what triggers the issue.
After converting the TSV to CSV with
awk 'BEGIN { FS="\t"; OFS="," } {$1=$1; print}' filename.tsv > filename.csv
and then running the following code, I was able to read the whole file.
data = pd.read_csv("filename.csv", sep='\t')
data.shape
Please note that omitting sep='\t' when reading the converted file again loaded only part of the file.
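One possible explanation, which cannot be confirmed without the file, is stray double-quote characters: when a field starts with a quote, the parser treats everything up to the next quote as a single field, including newlines, so several physical lines collapse into one row without any error. The sketch below uses made-up data to reproduce that effect and shows that quoting=csv.QUOTE_NONE makes pandas treat quotes as ordinary characters. It illustrates the mechanism only and is not a diagnosis of the file in this report.

```python
import csv
import io

import pandas as pd

# Hypothetical data: two fields begin with a double quote that is
# never closed on the same line.
raw = (
    "id\tname\n"
    "1\talpha\n"
    '2\t"beta\n'
    "3\tgamma\n"
    '4\t"delta\n'
    "5\tepsilon\n"
)

# Default quoting: the text between the two quotes is read as one field,
# so the lines in between are absorbed into a single row.
default = pd.read_csv(io.StringIO(raw), sep="\t")

# QUOTE_NONE: quote characters are kept as literal text, one row per line.
literal = pd.read_csv(io.StringIO(raw), sep="\t", quoting=csv.QUOTE_NONE)

print("default quoting rows:", len(default))
print("QUOTE_NONE rows:", len(literal))
```

If the real file behaves the same way, reading it with quoting=csv.QUOTE_NONE should give a row count that matches wc -l (minus the header).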
Expected Output
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.23.4
pytest: 3.8.2
pip: 18.1
setuptools: 40.4.3
Cython: None
numpy: 1.15.2
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 7.0.1
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: 3.4.4
numexpr: 2.6.8
feather: None
matplotlib: 3.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None