
BUG: Corrects stopping logic when nrows argument is supplied (#7626) #14747

Closed

jeffcarey wants to merge 2 commits into pandas-dev:master from jeffcarey:fix/7626

Conversation

@jeffcarey (Contributor) commented Nov 25, 2016

Conditions for error:

  1. There are different "shapes" within a tabular data file, i.e. different numbers of columns.
  2. A "narrower" set of columns is followed by a "wider" (more columns) one, and the narrower set is laid out such that the end of a 262144-byte block occurs within it.

Issue summary:
The C engine for parsing files reads in 262144 bytes at a time. Previously, the "start_lines" variable in tokenizer.c/tokenize_bytes() was set incorrectly to the first line in that chunk, rather than the overall first row requested. This lead to incorrect logic on when to stop reading when nrows is supplied by the user. This always happened but only caused a crash when a wider set of columns followed in the file. In other cases, extra rows were read in but then harmlessly discarded.

This pull request changes the comparison to always use the first requested row, so that exactly nrows rows are parsed when the argument is supplied.
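
To make the failure concrete, here is a minimal sketch of a triggering input, assuming only the 262144-byte chunking described above; the column names, cell contents, and row counts are invented for illustration and are not taken from the original report.

from io import StringIO

import pandas as pd

# Ten narrow columns, with enough rows that the end of a 262144-byte
# chunk falls inside the narrow section.
narrow_header = '\t'.join('COL_%d' % i for i in range(10)) + '\n'
narrow_row = '\t'.join('somedata_%d' % i for i in range(10)) + '\n'

# A wider, 15-column section follows the narrow one.
wide_header = '\t'.join('COL_%d' % i for i in range(15)) + '\n'
wide_row = '\t'.join('somedata_%d' % i for i in range(15)) + '\n'

# Roughly 115 bytes per narrow row times 5000 rows comfortably exceeds
# a single 262144-byte chunk.
text = narrow_header + narrow_row * 5000 + wide_header + wide_row * 2

# Before the fix this could fail even though nrows stops well inside the
# narrow section; with the fix it parses exactly 1010 rows.
df = pd.read_csv(StringIO(text), sep='\t', engine='c', nrows=1010)
print(df.shape)  # (1010, 10)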

@jreback (Contributor) commented Nov 25, 2016

can i create a small test that replicates the issue and fails on master (and passes here)?

@jeffcarey (Contributor, Author) commented:

Are you asking me to create the test or asking if it's OK for you to? If the latter, sure.

It's hard for me to say how large the file needs to be to trigger the crash, but the bug can be reproduced at will using the file and row numbers jzwinck posted on the issue: https://gist.githubusercontent.com/jzwinck/838882fbc07f7c3a53992696ef364f66

@jreback (Contributor) commented Nov 25, 2016

@jeffcarey I am asking you to. The point is to have a regression test so future changes don't break this.

@jreback (Contributor) commented Nov 25, 2016

The issue / example is a last resort. We much prefer a code-based test (you can simply generate the appropriate data in memory; the size doesn't really matter). I just don't want to include external files like this. But if it's impossible / hard, then you would need to include this test file.
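
For reference, here is a sketch of the kind of self-contained, in-memory regression test being asked for — no external file needed. The names and sizes are illustrative (they mirror the snippets quoted later in this thread) rather than the exact test that was merged.

from io import StringIO

import pandas as pd

def test_read_nrows_large():
    # GH 7626: nrows must stop parsing even when the input is larger than
    # the C engine's 262144-byte read buffer and wider rows follow.
    header_narrow = '\t'.join('COL_HEADER_%d' % i for i in range(10)) + '\n'
    data_narrow = '\t'.join('somedatasomedata2_%d' % i for i in range(10)) + '\n'
    header_wide = '\t'.join('COL_HEADER_%d' % i for i in range(15)) + '\n'
    data_wide = '\t'.join('somedatasomedata2_%d' % i for i in range(15)) + '\n'

    # 2000 narrow rows at roughly 200 bytes each pushes well past 262144 bytes.
    test_input = (header_narrow + data_narrow * 2000 +
                  header_wide + data_wide * 2)

    df = pd.read_csv(StringIO(test_input), sep='\t', nrows=1010)
    assert df.shape == (1010, 10)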

@jreback added the Bug and IO CSV (read_csv, to_csv) labels on Nov 25, 2016
@codecov-io commented Nov 26, 2016

Current coverage is 85.27% (diff: 100%)

No coverage report found for master at 725453d.

Powered by Codecov. Last update 725453d...e9c5bee

@jeffcarey (Contributor, Author) commented:

@jreback I've created a test case that does not rely on any external files (including the example file). Where do you think the best place for it is in the test suite? I figured somewhere in the frame directory, since it relates to creating a DataFrame, but I don't see anything about input. Let me know if you think it fits best into something that already exists or whether it deserves its own file.

@jreback (Contributor) commented Nov 26, 2016

@jeffcarey force-pushed the fix/7626 branch 2 times, most recently from 29a887c to f4c3c13 on November 26, 2016 19:19
@jeffcarey (Contributor, Author) commented:

@jreback I've added a test and updated the comment here about the root cause. Please take a look.

@jeffcarey (Contributor, Author) commented:

@jreback Anything else needed here? I've fixed my test so that it complies with linting; all checks are green.

test_input = (header_narrow + data_narrow * 1050 +
              header_wide + data_wide * 2)

df = self.read_table(StringIO(test_input), nrows=1010)
@jreback (Contributor) commented on the diff:

use self.read_csv here.

I think this test should be in c_parser_only as well.

@jreback (Contributor) commented Dec 1, 2016

Please add a whatsnew note in 0.19.2.

Commits added:

- BUG: Corrects stopping logic when nrows argument is supplied (pandas-dev#7626)
- Fixed code formatting
- Added test to C Parser Only suite, added whatsnew entry
@jeffcarey (Contributor, Author) commented:

Requested changes made, all green again. Please review.

@@ -427,6 +427,23 @@ def test_read_nrows(self):
        with tm.assertRaisesRegexp(ValueError, msg):
            self.read_csv(StringIO(self.data1), nrows='foo')

    def test_read_nrows_large(self):
        # GH-7626 - Read only nrows of data in for large inputs (>262144b)
        header_narrow = '\t'.join(['COL_HEADER_' + str(i)
@jreback (Contributor) commented on the diff:

this is duplicating the above test in c_parser_only

@jreback (Contributor) left a review:

minor change - ping on green

@jreback added this to the 0.19.2 milestone on Dec 5, 2016
@jreback (Contributor) commented Dec 5, 2016

@jeffcarey lgtm. ping on green.

@jeffcarey (Contributor, Author) commented:

@jreback All green

@jreback closed this in 4378f82 on Dec 6, 2016
@jreback (Contributor) commented Dec 6, 2016

thanks @jeffcarey

@jeffcarey deleted the fix/7626 branch on December 8, 2016
@jorisvandenbossche pushed a commit that referenced this pull request on Dec 15, 2016:
closes #7626

Subsets of tabular files with different "shapes" will now load when a valid skiprows/nrows is given as an argument.

Author: Jeff Carey <jeff.carey@gmail.com>

Closes #14747 from jeffcarey/fix/7626 and squashes the following commits:

cac1bac [Jeff Carey] Removed duplicative test
6f1965a [Jeff Carey] BUG: Corrects stopping logic when nrows argument is supplied (Fixes #7626)

(cherry picked from commit 4378f82)

 Conflicts:
	pandas/io/tests/parser/c_parser_only.py
Labels: Bug, IO CSV (read_csv, to_csv)
Projects: None yet

Successfully merging this pull request may close these issues:

nrows limit fails reading well formed csv files from Australian electricity market data (#7626)

3 participants