pd.io.parsers.read_csv ignores skiprows when parse_dates is set to a dict #4382

Closed
cancan101 opened this issue Jul 27, 2013 · 3 comments · Fixed by #4969
Labels
Bug IO CSV read_csv, to_csv IO Data IO issues that don't fit into a more specific label

@cancan101
Contributor

For example:

pd.io.parsers.read_csv("http://www.datazoa.com/publish/export.asp?hash=yjPceG6fHL&uid=dzadmin&a=exportcsv",skiprows=range(1,13+1),skipfooter=4,parse_dates={"date":[0]})

has 907 rows.

As does:

pd.io.parsers.read_csv("http://www.datazoa.com/publish/export.asp?hash=yjPceG6fHL&uid=dzadmin&a=exportcsv",skipfooter=4,parse_dates={"date":[0]})

whereas:

pd.io.parsers.read_csv("http://www.datazoa.com/publish/export.asp?hash=yjPceG6fHL&uid=dzadmin&a=exportcsv",skiprows=range(1,13+1),skipfooter=4,parse_dates=[0])

has 894 rows.

I am on pandas v0.11.0.
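
For anyone who cannot reach the URL above, here is a minimal offline sketch of what the working call is expected to do. The data is made up (13 metadata rows after the header, 10 dated rows, 4 footer rows) and only mirrors the shape of the real export; it uses parse_dates=[0], the form that behaves correctly in the report. The difference of 13 rows between the two reads matches the 907 vs. 894 gap above.

```python
import io

import pandas as pd

# Hypothetical stand-in for the remote export: header, 13 metadata rows,
# 10 real data rows, then 4 footer rows.
header = "date,value"
junk = [f"meta{i},0" for i in range(13)]
data = [f"2013-01-{d:02d},{d}" for d in range(1, 11)]
footer = [f"footer{i},0" for i in range(4)]
text = "\n".join([header] + junk + data + footer)

# Skip file lines 1..13 (line 0 is the header) and the last 4 lines.
skipped = pd.read_csv(
    io.StringIO(text),
    skiprows=range(1, 13 + 1),
    skipfooter=4,
    parse_dates=[0],
    engine="python",
)

# Same read without skiprows: the 13 metadata rows stay in the frame.
unskipped = pd.read_csv(io.StringIO(text), skipfooter=4, engine="python")

print(len(skipped), len(unskipped))
```

If skiprows were ignored, both frames would have the same length, which is the symptom reported for the dict form of parse_dates.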

@guyrt
Contributor

guyrt commented Sep 24, 2013

This is a symptom of a bigger problem:

import pandas as pd
from StringIO import StringIO
s = "a,b,c\n" + "\n".join([",".join([str(i), str(i+1), str(i+2)]) for i in xrange(500)])
print pd.read_csv(StringIO(s),skiprows=[200, 202], engine='python')
  <class 'pandas.core.frame.DataFrame'>
  Int64Index: 500 entries, 0 to 499
  Data columns (total 3 columns):
  a    500  non-null values
  b    500  non-null values
  c    500  non-null values
  dtypes: int64(3)

Somehow, skiprows handling got removed from the python engine, except in the code that sniffs for the header. That's why the header isn't getting properly removed. Fix coming.
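
A Python 3 sketch of the same check, for readers on a current install. It assumes a pandas version in which skiprows is honored by the python engine; on such a version the frame should have 498 rows rather than the 500 shown above, because file lines 200 and 202 (counting the header as line 0) are dropped.

```python
import io

import pandas as pd

# Header plus 500 data rows: row i holds i, i+1, i+2.
rows = [",".join([str(i), str(i + 1), str(i + 2)]) for i in range(500)]
s = "a,b,c\n" + "\n".join(rows)

# skiprows takes 0-based file line numbers, so lines 200 and 202
# correspond to the data rows whose "a" values are 199 and 201.
df = pd.read_csv(io.StringIO(s), skiprows=[200, 202], engine="python")

print(len(df))
print(199 in df["a"].values, 201 in df["a"].values)
```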

@cancan101
Contributor Author

I assume the issue exists in both the python and non-python engines?

@guyrt
Contributor

guyrt commented Sep 24, 2013

Just python. However, since skipfooter is defined in the example, read_csv silently fails over to the python engine.

This is a prime example of why I don't like to have silent failover to an unanticipated code path.
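
One way to avoid depending on the failover is to name the engine explicitly. A minimal sketch with made-up data; the assumption here is that the caller wants the python engine because a footer has to be dropped, so the code path is chosen on purpose rather than by fallback.

```python
import io

import pandas as pd

# Hypothetical CSV with 3 data rows followed by 2 footer lines.
text = "a,b\n1,2\n3,4\n5,6\nfooter one,x\nfooter two,y"

# Requesting the python engine explicitly makes the chosen code path
# visible at the call site instead of relying on a fallback.
df = pd.read_csv(io.StringIO(text), skipfooter=2, engine="python")

print(len(df))
```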
