
BUG: Corrects stopping logic when nrows argument is supplied (#7626) #14747

Closed

jeffcarey wants to merge 2 commits into pandas-dev:master from jeffcarey:fix/7626

Conversation

@jeffcarey (Contributor) commented Nov 25, 2016

Conditions for error:

  1. There are different "shapes" within a tabular data file, i.e. different numbers of columns.
  2. A "narrower" set of columns is followed by a "wider" (more columns) one, and the narrower set is laid out such that the end of a 262144-byte block occurs within it.

Issue summary:
The C engine for parsing files reads in 262144 bytes at a time. Previously, the "start_lines" variable in tokenizer.c/tokenize_bytes() was set incorrectly to the first line in that chunk, rather than the overall first row requested. This lead to incorrect logic on when to stop reading when nrows is supplied by the user. This always happened but only caused a crash when a wider set of columns followed in the file. In other cases, extra rows were read in but then harmlessly discarded.

This pull request changes the comparison to always use the first requested row, so that exactly nrows rows are parsed when the argument is supplied.
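
To make the failure concrete, here is a minimal sketch of a triggering input, assuming only the 262144-byte chunking described above; the column names, cell contents, and row counts are invented for illustration and are not taken from the original report.

from io import StringIO

import pandas as pd

# Ten narrow columns, with enough rows that the end of a 262144-byte
# chunk falls inside the narrow section.
narrow_header = '\t'.join('COL_%d' % i for i in range(10)) + '\n'
narrow_row = '\t'.join('somedata_%d' % i for i in range(10)) + '\n'

# A wider, 15-column section follows the narrow one.
wide_header = '\t'.join('COL_%d' % i for i in range(15)) + '\n'
wide_row = '\t'.join('somedata_%d' % i for i in range(15)) + '\n'

# Roughly 115 bytes per narrow row times 5000 rows comfortably exceeds
# a single 262144-byte chunk.
text = narrow_header + narrow_row * 5000 + wide_header + wide_row * 2

# Before the fix this could fail even though nrows stops well inside the
# narrow section; with the fix it parses exactly 1010 rows.
df = pd.read_csv(StringIO(text), sep='\t', engine='c', nrows=1010)
print(df.shape)  # (1010, 10)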

@jreback (Contributor) commented Nov 25, 2016

can i create a small test that replicates the issue and fails on master (and passes here)?

@jeffcarey (Contributor, Author) commented:

Are you asking me to create the test or asking if it's OK for you to? If the latter, sure.

It's hard for me to say how large the file needs to be to trigger the crash, but the bug can be reproduced at will using the file and row numbers jzwinck posted on the issue: https://gist.githubusercontent.com/jzwinck/838882fbc07f7c3a53992696ef364f66

@jreback (Contributor) commented Nov 25, 2016

@jeffcarey I am asking you to. The point is to have a regression test so future changes don't break this.

@jreback (Contributor) commented Nov 25, 2016

The issue / example is a last resort. We much prefer a code-based test (you can simply generate the appropriate data in memory; the size doesn't really matter). I just don't want to include external files like this. But if it's impossible / hard, then you would need to include this test file.
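
For reference, here is a sketch of the kind of self-contained, in-memory regression test being asked for — no external file needed. The names and sizes are illustrative (they mirror the snippets quoted later in this thread) rather than the exact test that was merged.

from io import StringIO

import pandas as pd

def test_read_nrows_large():
    # GH 7626: nrows must stop parsing even when the input is larger than
    # the C engine's 262144-byte read buffer and wider rows follow.
    header_narrow = '\t'.join('COL_HEADER_%d' % i for i in range(10)) + '\n'
    data_narrow = '\t'.join('somedatasomedata2_%d' % i for i in range(10)) + '\n'
    header_wide = '\t'.join('COL_HEADER_%d' % i for i in range(15)) + '\n'
    data_wide = '\t'.join('somedatasomedata2_%d' % i for i in range(15)) + '\n'

    # 2000 narrow rows at roughly 200 bytes each pushes well past 262144 bytes.
    test_input = (header_narrow + data_narrow * 2000 +
                  header_wide + data_wide * 2)

    df = pd.read_csv(StringIO(test_input), sep='\t', nrows=1010)
    assert df.shape == (1010, 10)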

@jreback added the Bug and IO CSV (read_csv, to_csv) labels on Nov 25, 2016
@codecov-io commented Nov 26, 2016

Current coverage is 85.27% (diff: 100%)

No coverage report found for master at 725453d.

Powered by Codecov. Last update 725453d...e9c5bee

@jeffcarey (Contributor, Author) commented:

@jreback I've created a test case that does not rely on any external files (including the example file). Where do you think the best place for it is in the test suite? I figured somewhere in the frame directory, since it relates to creating a DataFrame, but I don't see anything about input. Let me know if you think it fits best into something that already exists or whether it deserves its own file.

@jreback (Contributor) commented Nov 26, 2016

@jeffcarey force-pushed the fix/7626 branch 2 times, most recently from 29a887c to f4c3c13 on November 26, 2016 19:19
@jeffcarey (Contributor, Author) commented:

@jreback I've added a test and updated the comment here about the root cause. Please take a look.

@jeffcarey (Contributor, Author) commented:

@jreback Anything else needed here? I've fixed my test so that it complies with linting; all checks are green.

test_input = (header_narrow + data_narrow * 1050 +
              header_wide + data_wide * 2)

df = self.read_table(StringIO(test_input), nrows=1010)
@jreback (Contributor) commented on the diff:

use self.read_csv here.

I think this test should be in c_parser_only as well.

@jreback (Contributor) commented Dec 1, 2016

Please add a whatsnew note in 0.19.2.

Commits added:

- BUG: Corrects stopping logic when nrows argument is supplied (pandas-dev#7626)
- Fixed code formatting
- Added test to C Parser Only suite, added whatsnew entry
@jeffcarey (Contributor, Author) commented:

Requested changes made, all green again. Please review.

@@ -427,6 +427,23 @@ def test_read_nrows(self):
        with tm.assertRaisesRegexp(ValueError, msg):
            self.read_csv(StringIO(self.data1), nrows='foo')

    def test_read_nrows_large(self):
        # GH-7626 - Read only nrows of data in for large inputs (>262144b)
        header_narrow = '\t'.join(['COL_HEADER_' + str(i)
@jreback (Contributor) commented on the diff:

this is duplicating the above test in c_parser_only

@jreback (Contributor) left a review:

minor change - ping on green

@jreback added this to the 0.19.2 milestone on Dec 5, 2016
@jreback (Contributor) commented Dec 5, 2016

@jeffcarey lgtm. ping on green.

@jeffcarey (Contributor, Author) commented:

@jreback All green

@jreback closed this in 4378f82 on Dec 6, 2016
@jreback (Contributor) commented Dec 6, 2016

thanks @jeffcarey

@jeffcarey deleted the fix/7626 branch on December 8, 2016
@jorisvandenbossche pushed a commit that referenced this pull request on Dec 15, 2016:
closes #7626

Subsets of tabular files with different "shapes" will now load when a valid skiprows/nrows is given as an argument.

Author: Jeff Carey <jeff.carey@gmail.com>

Closes #14747 from jeffcarey/fix/7626 and squashes the following commits:

cac1bac [Jeff Carey] Removed duplicative test
6f1965a [Jeff Carey] BUG: Corrects stopping logic when nrows argument is supplied (Fixes #7626)

(cherry picked from commit 4378f82)

 Conflicts:
	pandas/io/tests/parser/c_parser_only.py
Labels: Bug, IO CSV (read_csv, to_csv)
Projects: None yet

Successfully merging this pull request may close these issues:

nrows limit fails reading well formed csv files from Australian electricity market data (#7626)

3 participants