
read_csv newline fix #10023


Merged: 1 commit merged into pandas-dev:master from jblackburne:read_csv-newline-chunk on May 8, 2015

Conversation

jblackburne (Contributor)

Slight change to the logic in the tokenize_delimited() and tokenize_delim_customterm() functions of the C parser.

Fixes #10022.

I believe the new logic is correct, but perhaps someone with more familiarity can double-check.
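
For context, here is a rough Python sketch of the state-machine behaviour this kind of fix targets: when a chunk of input begins with the line terminator while the tokenizer is at the start of a record, the terminator should be treated as a blank line to skip, not as the end of a one-field empty record (which is what shows up downstream as a row full of NaNs). The names and structure below are illustrative assumptions, not the actual C code in tokenize_delimited() or tokenize_delim_customterm().

```python
# Illustrative toy tokenizer, not the pandas C parser.
START_OF_RECORD, IN_FIELD = 0, 1

def tokenize(chunks, delim=","):
    """Split delimited text into rows, carrying parser state across chunks."""
    rows, record, field = [], [], []
    state = START_OF_RECORD
    for chunk in chunks:                 # input arrives in arbitrary chunks
        for ch in chunk:
            if state == START_OF_RECORD:
                if ch == "\n":
                    # A terminator seen before any field data is a blank line,
                    # even when it is the very first byte of a new chunk.
                    # Treating it instead as a completed record would emit a
                    # single empty field, i.e. a spurious all-NaN row.
                    continue
                state = IN_FIELD
            if ch == "\n":               # end of record
                record.append("".join(field))
                rows.append(record)
                record, field = [], []
                state = START_OF_RECORD
            elif ch == delim:            # end of field
                record.append("".join(field))
                field = []
            else:
                field.append(ch)
    return rows

# The second chunk starts with the terminator followed by whitespace.
print(tokenize(["a,b\n", "\n c,d\n"]))   # [['a', 'b'], [' c', 'd']]
```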

jreback (Contributor) commented Apr 29, 2015

You need a test, as it's not really clear what you are fixing.

jblackburne (Contributor, Author)

Ok, test forthcoming.

jblackburne (Contributor, Author)

There is a small self-contained test in the comments on issue #10022. Would it be desirable to make it into a unit test? It takes about a second to run on my machine.
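
The script in the #10022 comments is not copied here; the sketch below is only a guess at its general shape (the number of rows, the field values, and the leading whitespace are assumptions). The idea is to read enough data for the parser to cross its internal chunk boundaries, both in one shot and via chunksize, and confirm that no spurious all-NaN rows appear.

```python
from io import StringIO

import pandas as pd

# Hypothetical data: many lines that begin with whitespace, so some internal
# parser chunk is likely to start with a newline followed by a space.
data = "a,b\n" + " 1,2\n" * 200000

whole = pd.read_csv(StringIO(data), engine="c")
chunked = pd.concat(pd.read_csv(StringIO(data), engine="c", chunksize=50000))

assert len(whole) == len(chunked) == 200000
# No rows consisting entirely of NaN should be present in either result.
assert not whole.isnull().all(axis=1).any()
assert not chunked.isnull().all(axis=1).any()
```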

jreback (Contributor) commented Apr 30, 2015

Yep. The idea would be to add the test: it fails on master, and after your fix everything passes.

jreback added the "IO CSV" (read_csv, to_csv) label on Apr 30, 2015
jblackburne (Contributor, Author)

Unit test added. Any further comments?

@@ -359,6 +359,11 @@ def test_empty_field_eof(self):
names=list('abcd'), engine='c')
assert_frame_equal(df, c)

def test_chunk_begins_with_newline_whitespace(self):
data = '\n hello\nworld\n'
A reviewer (Contributor) commented on this diff line:
add the issue number as a comment here
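
A stand-alone sketch of how the test might read once the issue number is added; the merged test actually lives in the shared parser test class and uses its helpers, so the exact assertion style and expected values here are assumptions.

```python
from io import StringIO

import pandas as pd

def test_chunk_begins_with_newline_whitespace():
    # GH 10022
    data = '\n hello\nworld\n'
    result = pd.read_csv(StringIO(data), header=None)
    # The leading blank line is skipped; no spurious NaN row is inserted.
    assert len(result) == 2
```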

jreback (Contributor) commented May 7, 2015

Even though the fix is only in the C parser, IIRC this should work with the Python parser as well? Hence pls move the tests to test_parser.py, which runs the tests against all parsers.
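
The point of test_parser.py is that the same test body runs against every engine. A minimal stand-alone illustration of that idea (the real suite uses shared test-class mixins rather than this explicit loop):

```python
from io import StringIO

import pandas as pd

data = "\n hello\nworld\n"
for engine in ("c", "python"):
    df = pd.read_csv(StringIO(data), header=None, engine=engine)
    # Both engines should skip the leading blank line and parse two rows.
    assert len(df) == 2, engine
```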

jreback added this to the 0.17.0 milestone on May 7, 2015
jreback (Contributor) commented May 7, 2015

@jblackburne looks good.

  • pls add a release note in v0.16.1
  • pls squash
  • ping when green; detailed git instructions are linked if you need them

jreback moved this from the 0.17.0 milestone to the 0.16.1 milestone on May 7, 2015
jblackburne (Contributor, Author)

Squash all into a single commit?

shoyer (Member) commented May 7, 2015

@jblackburne Yes, that's what @jreback is asking for

Squashed commit message:

…that start with newline.

Changed a condition in tokenize_delim_customterm to account for data chunks that start with the terminator.

Added a unit test that fails on master and passes on this branch.

Moved the new unit test in order to test all parser engines. Added the GH issue number.

Added a release note.

jblackburne force-pushed the read_csv-newline-chunk branch from 8f37413 to e693c3a on May 7, 2015.
jreback (Contributor) commented May 7, 2015

pls take a look

cc @evanpw
cc @mdmueller
cc @selasley

evanpw (Contributor) commented May 7, 2015

The logic looks right to me.

jreback (Contributor) commented May 8, 2015

@jblackburne pls ping when this is green.

jblackburne (Contributor, Author)

Ok, green.

jreback added a commit that referenced this pull request on May 8, 2015.
jreback merged commit 2840bea into pandas-dev:master on May 8, 2015.
jblackburne deleted the read_csv-newline-chunk branch on May 9, 2015.
Labels: IO CSV (read_csv, to_csv)
Closes: read_csv(engine='c') can insert spurious rows full of NaNs (#10022)