BUG: Corrects stopping logic when nrows argument is supplied (#7626) #14747
Conversation
can i create a small test that replicates the issue and fails on master (and passes here)?
Are you asking me to create the test or asking if it's OK for you to? If the latter, sure. It's hard for me to say how large the file needs to be to create a crash, but the bug can be reproduced at will using the file and row numbers jzwinck posted on the issue: https://gist.githubusercontent.com/jzwinck/838882fbc07f7c3a53992696ef364f66
@jeffcarey I am asking you to. The point is to have a regression test so future changes don't break this.
The issue / example is a last resort. We much prefer to have a code-based test (you can simply generate the appropriate data in memory; the size doesn't really matter). I just don't want to include external files like this. But if it's impossible / hard, then you would need to include this test file.
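A minimal sketch of what "generate the appropriate data in memory" could look like for this bug: a narrow block of rows large enough to straddle the C parser's 262144-byte chunk boundary, followed by a wider block. The column counts, cell contents, and row counts below are illustrative choices of mine, not necessarily those used in the final test.

```python
from io import StringIO

import pandas as pd

ncols_narrow, ncols_wide = 10, 15
# Cell chosen long enough that ~1050 narrow rows exceed one 262144-byte chunk
cell = "somedatasomedata0123456789"

header_narrow = "\t".join("COL_HEADER_" + str(i) for i in range(ncols_narrow)) + "\n"
data_narrow = "\t".join([cell] * ncols_narrow) + "\n"
header_wide = "\t".join("COL_HEADER_" + str(i) for i in range(ncols_wide)) + "\n"
data_wide = "\t".join([cell] * ncols_wide) + "\n"

# ~283 KB of narrow rows, so the chunk boundary falls inside the narrow block,
# then two wider rows that previously triggered the crash
test_input = header_narrow + data_narrow * 1050 + header_wide + data_wide * 2

# With the fix, only the requested 1010 rows are parsed and the wider
# trailing rows are never reached
df = pd.read_csv(StringIO(test_input), sep="\t", nrows=1010)
print(df.shape)  # (1010, 10)
```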
Current coverage is 85.27% (diff: 100%)
@jreback I've created a test case that does not rely on any external files, including the example. Where do you think the best place for it is in the test suite? I figured somewhere in the frame directory since it relates to creating a DataFrame, but I don't see anything about input. Let me know if you think it fits best into something that already exists or if it deserves to be in its own file.
https://github.com/pandas-dev/pandas/tree/master/pandas/io/tests/parser has extensive tests for the parser
(force-pushed from 29a887c to f4c3c13)
@jreback I've added a test and updated the comment here about the root cause. Please take a look.
@jreback Anything else needed here? I've fixed my test so that it complies with linting, all checks are green.
test_input = (header_narrow + data_narrow * 1050 +
              header_wide + data_wide * 2)

df = self.read_table(StringIO(test_input), nrows=1010)
use self.read_csv here.
I think this test should be in c_parser_only as well.
pls add a whatsnew note in 0.19.2.
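For context, pandas whatsnew notes are single-bullet reStructuredText entries under the release's bug-fix section. The wording below is my guess at what such an entry could look like, not the actual changelog text that was merged:

```rst
Bug Fixes
~~~~~~~~~

- Bug in ``pd.read_csv()`` in which the C engine did not respect ``nrows``
  for large input files, which could crash when wider rows followed the
  requested range (:issue:`7626`)
```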
…andas-dev#7626) Fixed code formatting. Added test to C Parser Only suite, added whatsnew entry.
Requested changes made, all green again. Please review.
@@ -427,6 +427,23 @@ def test_read_nrows(self):
        with tm.assertRaisesRegexp(ValueError, msg):
            self.read_csv(StringIO(self.data1), nrows='foo')

    def test_read_nrows_large(self):
        # GH-7626 - Read only nrows of data in for large inputs (>262144b)
        header_narrow = '\t'.join(['COL_HEADER_' + str(i)
this is duplicating the above test in c_parser_only
minor change - ping on green
@jeffcarey lgtm. ping on green.
@jreback All green
thanks @jeffcarey
closes #7626

Subsets of tabular files with different "shapes" will now load when a valid skiprows/nrows is given as an argument.

Conditions for error:
1) There are different "shapes" within a tabular data file, i.e. different numbers of columns.
2) A "narrower" set of columns is followed by a "wider" (more columns) one, and the narrower set is laid out such that the end of a 262144-byte block occurs within it.

Issue summary: The C engine for parsing files reads in 262144 bytes at a time. Previously, the "start_lines" variable in tokenizer.c/tokenize_bytes() was set incorrectly to the first line in that chunk, rather than the overall first row requested. This led to incorrect logic on when to stop reading when nrows is supplied by the user. This always happened but only caused a crash when a wider set of columns followed in the file. In other cases, extra rows were read in but then harmlessly discarded.

This pull request always uses the first requested row for comparisons, so only nrows rows will be parsed when supplied.

Author: Jeff Carey <jeff.carey@gmail.com>

Closes #14747 from jeffcarey/fix/7626 and squashes the following commits:

cac1bac [Jeff Carey] Removed duplicative test
6f1965a [Jeff Carey] BUG: Corrects stopping logic when nrows argument is supplied (Fixes #7626)

(cherry picked from commit 4378f82)

Conflicts:
    pandas/io/tests/parser/c_parser_only.py
git diff upstream/master | flake8 --diff
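The chunked-reading bug described in the issue summary can be modeled in a few lines of Python. This is a toy model of the stopping logic, not the actual tokenizer.c code: the parser consumes input in fixed-size chunks, and the bug is measuring progress from the current chunk's first line ("start_lines") instead of from the overall first requested row.

```python
def parse(lines, nrows, chunk_size, buggy):
    """Toy chunked parser: return the rows parsed before stopping."""
    parsed = []
    first_requested = 0  # overall first row requested
    pos = 0
    while pos < len(lines):
        chunk = lines[pos:pos + chunk_size]
        # The faulty variable: the buggy version resets it per chunk
        start_lines = pos if buggy else first_requested
        for i, line in enumerate(chunk):
            # Stop once nrows lines have been taken since start_lines
            if (pos + i) - start_lines >= nrows:
                return parsed
            parsed.append(line)
        pos += chunk_size
    return parsed

rows = ["row%d" % i for i in range(10)]
print(len(parse(rows, nrows=4, chunk_size=3, buggy=False)))  # 4
print(len(parse(rows, nrows=4, chunk_size=3, buggy=True)))   # 10: never stops early
```

In the buggy variant each new chunk restarts the count, so the stop condition is never met and every row is read; in the real parser those extra rows were usually discarded, but crashed when a wider set of columns followed.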