ENH/BUG: ignore line comments in CSV files GH2685 #4505

Closed
wants to merge 2 commits into from

holocronweaver

closes #2685

I have added the ability for both the C and Python CSV parsers to ignore commented lines (i.e., lines beginning with a comment character). Currently the C parser preserves commented lines as empty lines (all NaN), while the Python parser ignores them altogether.

In addition, I fixed a small related problem with the CSV format sniffer in the Python parser.

I plan to finish up this work by ignoring empty lines as per #4466.
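To illustrate the behavior described above (lines beginning with a comment character are dropped before parsing), here is a minimal sketch that uses only the standard library. It is not the pandas implementation from this PR; the helper name read_rows and the sample data are assumptions made for illustration.

```python
import csv
import io


def read_rows(text, comment="#"):
    """Parse CSV text, dropping lines that start with the comment character.

    Hypothetical helper for illustration only; not part of pandas.
    """
    lines = (
        line for line in io.StringIO(text)
        if not line.startswith(comment)
    )
    # csv.reader accepts any iterable of strings, so the filtered
    # generator can be passed straight through.
    return list(csv.reader(lines))


sample = "# generated by an instrument\na,b\n1,2\n# mid-file note\n3,4\n"
rows = read_rows(sample)
print(rows)
```

One caveat of filtering before parsing: a quoted field that spans several lines could have a continuation line starting with the comment character, and this sketch would drop it. A real parser has to make that decision with knowledge of the quoting state.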

@@ -1282,9 +1280,8 @@ class MyDialect(csv.Dialect):

 sniff_sep = True

-if sep is not None:
+if (sep is not None) and (dia.quotechar is not None):
Member

No need for parens here: "is not" binds tighter than "and".
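The precedence point can be checked mechanically. The sketch below compares the syntax trees of the parenthesized and unparenthesized forms of the condition from the diff; the helper name same_parse is an assumption made for illustration.

```python
import ast


def same_parse(expr_a, expr_b):
    """Return True when two expressions parse to the same syntax tree."""
    tree_a = ast.dump(ast.parse(expr_a, mode="eval"))
    tree_b = ast.dump(ast.parse(expr_b, mode="eval"))
    return tree_a == tree_b


with_parens = "(sep is not None) and (dia.quotechar is not None)"
without_parens = "sep is not None and dia.quotechar is not None"

# Parentheses leave no trace in the tree when they match the default
# grouping, so the two spellings compare equal.
print(same_parse(with_parens, without_parens))

# Grouping the other way produces a different tree.
print(same_parse(without_parens, "sep is not (None and dia.quotechar) is not None"))
```

The parentheses are therefore purely a readability choice here, which is what the review comment is getting at.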

@cpcloud
Member

cpcloud commented Aug 8, 2013

Can you add a test and release notes? thx!

@holocronweaver
Author

Sorry, I am new to pandas dev. I am guessing a unit test for commented lines in a CSV file is what you have in mind?

@cpcloud
Member

cpcloud commented Aug 8, 2013

Yep!

@holocronweaver
Author

Where should the test be created? There does not seem to be a particular file for parsers. Maybe test_frame since parsers return frames?

@cpcloud
Member

cpcloud commented Aug 9, 2013

check out pandas/tests/test_parsers.py

@holocronweaver
Author

Sorry, missed the tests folder in pandas/io.

Having trouble setting up the test to expect different output for C and Python parsers. The tests seem to lock the parser engine and ignore the engine parameter in read_csv, causing my test to fail. The Python parser omits empty lines, while the C parser does not. In #4466 I propose making the behavior the same by making the C parser follow the Python parser behavior. Perhaps I should just go ahead and implement my suggestion? Or is there a method to query the current engine from within a test?

@jreback
Contributor

jreback commented Aug 12, 2013

It normally goes through 3 different versions of the parser if you put your test in ParserTests: Python parsing, C parsing, and I think low-memory C parsing. You can put a test in, say, PythonParsing if you only want it to run on that one. Probably best to put it in the main test class if you want similar behavior in all parsers.

If you step through it, it calls read_csv (and not pd.read_csv), which sets the engine depending on the iteration. In any event, I think you can set engine='python' when you call read_csv to specify it locally.
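The layout described above, with shared tests that run once per engine and engine-specific tests in their own class, can be sketched with plain unittest. This is not the actual pandas test module: the parse function, the engine names "fast" and "python", and the class names are all hypothetical stand-ins.

```python
import csv
import io
import unittest


def parse(text, engine):
    """Hypothetical stand-in for a reader with selectable engines."""
    if engine == "fast":
        return [line.split(",") for line in text.splitlines()]
    if engine == "python":
        return list(csv.reader(io.StringIO(text)))
    raise ValueError(engine)


class SharedParserTests:
    """Tests every engine must pass; subclasses choose the engine."""

    engine = None

    def read(self, text):
        # Tests call self.read, so the engine is fixed by the subclass.
        return parse(text, engine=self.engine)

    def test_simple_rows(self):
        self.assertEqual(self.read("a,b\n1,2\n"), [["a", "b"], ["1", "2"]])


class TestFastEngine(SharedParserTests, unittest.TestCase):
    engine = "fast"


class TestPythonEngine(SharedParserTests, unittest.TestCase):
    engine = "python"

    def test_quoted_comma(self):
        # Engine-specific test: only this stand-in engine handles quoting.
        self.assertEqual(self.read('"a,b",c\n'), [["a,b", "c"]])


loader = unittest.defaultTestLoader
suite = unittest.TestSuite()
for case in (TestFastEngine, TestPythonEngine):
    suite.addTests(loader.loadTestsFromTestCase(case))
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.testsRun, result.wasSuccessful())
```

Because the shared class does not inherit from unittest.TestCase, it is never collected on its own; each test in it runs once for every concrete subclass.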

@jreback
Contributor

jreback commented Aug 23, 2013

@holocronweaver how's this coming along?

@holocronweaver
Author

Almost done, though temporarily delayed due to work. I will try to get this finished up tomorrow if possible. Worst case would be next weekend.

@jreback
Contributor

jreback commented Aug 23, 2013

gr8
ping when ready

@jreback
Contributor

jreback commented Sep 20, 2013

@holocronweaver how's this coming along?

@holocronweaver
Author

@jreback It is basically done, but I need time to test and debug. I am currently finishing a GSoC project which ends next week, so I will have a bit of free time again and will try to push this as soon as I get a chance.

@jreback
Contributor

jreback commented Sep 21, 2013

@holocronweaver perfect...pls ping back when it's ready for a look

@jreback
Contributor

jreback commented Oct 2, 2013

@holocronweaver how's this coming?

@jreback
Contributor

jreback commented Oct 7, 2013

@holocronweaver ping!

@jreback
Contributor

jreback commented Oct 11, 2013

@holocronweaver going to be able to rebase this in the next couple of days?

@holocronweaver
Author

@jreback Sorry, have been very busy at work. Will be at least another week, though I will try to get it done sooner. Apologies again for the long delay.

@jreback
Contributor

jreback commented Oct 14, 2013

@holocronweaver ok...let us know

@jreback jreback closed this Jan 3, 2014
@jreback jreback reopened this Jan 3, 2014
@jreback
Contributor

jreback commented Jan 3, 2014

@holocronweaver care to rebase this?

@holocronweaver
Author

@jreback Sure, when I get back from holiday travels.

@jreback
Contributor

jreback commented Feb 16, 2014

@holocronweaver progress on this?

@holocronweaver
Author

@jreback No, but it is on my TODO list. Crunch time is preventing anything extracurricular.

@jreback jreback modified the milestones: 0.15.0, 0.14.0 Feb 26, 2014
@jreback
Contributor

jreback commented Mar 9, 2014

@holocronweaver update?

@jreback
Contributor

jreback commented Apr 5, 2014

@holocronweaver update on this?

@jreback
Contributor

jreback commented Jun 16, 2014

closing in favor of #7470

@jreback jreback closed this Jun 16, 2014
@jreback
Contributor

jreback commented Dec 28, 2014

see here: https://github.com/pydata/pandas/pull/7470/files

try skip_blank_lines=False (this restores the original behavior)

@amanshei

Thanks!!


Labels: IO CSV (read_csv, to_csv); IO Data (IO issues that don't fit into a more specific label)
Development

Successfully merging this pull request may close these issues.

ENH: Option for reading files with a variable number of comment lines at start
4 participants