
ENH: ignoring comment lines and empty lines in CSV files #7470

Closed

Conversation

mdmueller
Contributor

closes #4466

@jreback jreback added this to the 0.15.0 milestone Jun 16, 2014
@jreback
Contributor

jreback commented Jun 16, 2014

@mdmueller thanks for taking this on! Let me know if you need any help.

We could put this in 0.14.1 if you can do this in the next 2 weeks.

@mdmueller
Contributor Author

@jreback - No problem! I should be done with this pretty soon. Right now I'm just getting test failures with the Python parsing engine, so I'll add this functionality to PythonParser.

@jreback
Contributor

jreback commented Jun 16, 2014

  • Does this cover partial whitespace at the beginning (e.g. say "\t\t \n")?
    (obviously when sep is not whitespace).
  • I see that you cover a comment char at a non-zero location;
    also check for more than 1 char of the comment char (I think it's really problematic to allow it to be a regex).
  • Please run a perf check (and post if anything looks odd).
  • Release note in v0.14.1.txt (note that I changed the above from the solving PR to the actual issue; use these 2 issues in the release notes).

thanks!

@jreback jreback modified the milestones: 0.14.1, 0.15.0 Jun 16, 2014
@jreback
Contributor

jreback commented Jun 16, 2014

will this close #4623 ?

@mdmueller
Contributor Author

This will close #4623 according to current output:

In [3]: s1 = '# notes\na,b,c\n# more notes\n1,2,3'

In [4]: s2 = 'a,b,c\n# more notes\n1,2,3'

In [5]: pd.read_csv(StringIO(s1), comment='#')
Out[5]: 
   a  b  c
0  1  2  3

In [6]: pd.read_csv(StringIO(s2), comment='#')
Out[6]: 
   a  b  c
0  1  2  3

In [7]: pd.read_csv(StringIO(s1), comment='#', header=None)
Out[7]: 
   0  1  2
0  a  b  c
1  1  2  3

I'll check out the "\t\t \n" case -- I think this would have to be addressed after tokenization in the Cython code, unless there's a better way. I'm a little confused about what you mean by checking for >1 char of the comment char though, doesn't the C parser disallow regex comments anyway?

@jreback
Contributor

jreback commented Jun 17, 2014

What I mean is you should disallow comment='comment', i.e. a comment string with len > 1 (I think supporting this would be very tricky and isn't necessary); just raise a ValueError.

For the \t\t case I'm just trying to think of edge cases (I agree it's probably handled, but I'm not sure there are tests for something like this, e.g. whitespace preceding a line terminator and nothing else).
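The length-1 rule above can be sketched in plain Python. This is a hypothetical mirror of the check, not the actual pandas code; only the error message is taken from the C parser's real ValueError shown later in this thread:

```python
def validate_comment(comment):
    """Reject multi-character comment markers (hypothetical sketch).

    Only a single character (or None) is accepted; the message mirrors
    the ValueError the C parser raises for e.g. comment='##'.
    """
    if comment is not None and len(comment) != 1:
        raise ValueError("Only length-1 comment characters supported")
    return comment
```

So `validate_comment('#')` passes the marker through, while `validate_comment('##')` raises, matching the behavior demonstrated in the traceback below.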

@mdmueller
Contributor Author

Unless I'm missing something, it looks like that's the current functionality already:

In [9]: pd.read_csv(StringIO(s1), comment='##', header=None)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-455c49b9113b> in <module>()
----> 1 pd.read_csv(StringIO(s1), comment='##', header=None)

/home/michael/git/pandas/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    441                     infer_datetime_format=infer_datetime_format)
    442 
--> 443         return _read(filepath_or_buffer, kwds)
    444 
    445     parser_f.__name__ = name

/home/michael/git/pandas/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    226 
    227     # Create the parser.
--> 228     parser = TextFileReader(filepath_or_buffer, **kwds)
    229 
    230     if nrows is not None:

/home/michael/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    531             self.options['has_index_names'] = kwds['has_index_names']
    532 
--> 533         self._make_engine(self.engine)
    534 
    535     def _get_options_with_defaults(self, engine):

/home/michael/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
    668     def _make_engine(self, engine='c'):
    669         if engine == 'c':
--> 670             self._engine = CParserWrapper(self.f, **self.options)
    671         else:
    672             if engine == 'python':

/home/michael/git/pandas/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1030         kwds['allow_leading_cols'] = self.index_col is not False
   1031 
-> 1032         self._reader = _parser.TextReader(src, **kwds)
   1033 
   1034         # XXX

/home/michael/git/pandas/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3607)()

ValueError: Only length-1 comment characters supported

The tricky thing with the whitespace is cases like '\t\t ,\t' or '\t a', where the delimiter or non-whitespace shows up eventually. I guess I could backtrack in the tokenization algorithm the way \r line terminators are handled; would that be preferable to post-processing?

@jreback
Contributor

jreback commented Jun 17, 2014

@mdmueller if there is a test for multi-char comments then good.

I think handling the back-tracking similarly to \r is good (we don't want to slow down the general case).

@mdmueller
Contributor Author

@jreback - I'm almost done, but I found a test failure involving a file that ends with a bunch of tab-filled lines using read_table (which defaults to sep='\t'). Would the correct behavior be not to ignore these lines, i.e. to count any line containing the delimiter?

@jreback
Contributor

jreback commented Jun 18, 2014

I think if you have the delimiter then the line counts.
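The rule settled on here can be sketched in plain Python. `line_is_skippable` is a hypothetical helper for illustration, not the actual tokenizer code:

```python
def line_is_skippable(line, delimiter, comment=None):
    """Sketch of the skipping rule discussed above (hypothetical helper).

    A raw line is dropped only if it is a full-line comment, or if it is
    empty/whitespace-only AND contains no delimiter; any line containing
    the delimiter counts as a (possibly all-empty) data row.
    """
    stripped = line.rstrip("\r\n")
    if comment is not None and stripped.lstrip().startswith(comment):
        return True          # full-line comment: skip
    if delimiter in stripped:
        return False         # delimiter present: the line counts as a row
    return stripped.strip() == ""  # blank or whitespace-only: skip
```

Under this rule, '\t\t \n' is skipped when sep=',' but kept as a row of empty fields when sep='\t', which matches the tab-filled-lines case above.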

@mdmueller
Contributor Author

@jreback - I think I'm done here, but let me know if there are any problems.

@@ -100,7 +100,12 @@ Enhancements



- Changed the behavior of file parsing so that line comments are ignored instead
Contributor

Say it more like: the read_csv/read_table file parsers now ignore comments provided with comment=... (say something about it being a single character), and ignore starting at the comment character.

@jreback
Contributor

jreback commented Jun 18, 2014

Please run a perf check (test_perf.sh -b master -t HEAD) to see if anything > 1.2 shows up (if it does, please post it and investigate).

Please squash this down to a small number of commits.

otherwise looks good!

@mdmueller
Contributor Author

There were two with a ratio >1.2 but both are unrelated to parsing. Here is the bottom of the results:

dtype_infer_int64                            |   1.0494 |   0.9537 |   1.1003 |
stats_rank_average                           |  22.0106 |  19.8573 |   1.1084 |
dtype_infer_int32                            |   0.3977 |   0.3580 |   1.1108 |
dti_reset_index                              |   0.2803 |   0.2510 |   1.1168 |
dtype_infer_float64                          |   0.9977 |   0.8846 |   1.1278 |
indexing_dataframe_boolean_rows              |   0.2737 |   0.2414 |   1.1340 |
frame_ctor_dtindex_YearBeginx1               |   0.7073 |   0.6123 |   1.1551 |
groupby_multi_different_numpy_functions      |   9.1350 |   7.7774 |   1.1746 |
merge_2intkey_nosort                         |  12.1953 |   9.9773 |   1.2223 |
groupby_simple_compress_timing               |  22.3273 |  15.4157 |   1.4484 |

I assume the slower performance is due to previous commits. I'll squash the commits on this PR now.

@jreback
Contributor

jreback commented Jun 18, 2014

perf looks fine

@jreback
Contributor

jreback commented Jun 18, 2014

you have some failing tests

@mdmueller
Contributor Author

Looks like it was the same error, I fixed it in the last commit so now I'm waiting on Travis to make sure.

@mdmueller
Contributor Author

Hm, it's fixed now except for an error in test_round_trip_exception_ under test_json, which calls read_csv('https://raw.github.com/hayd/lahman2012/master/csvs/Teams.csv'). I tried it locally and it works fine, not sure what that's about. I'll check it out.

@jreback
Contributor

jreback commented Jun 22, 2014

@mdmueller how's this coming along?

@mdmueller
Contributor Author

@jreback - I tried running Travis on my repo with these changes and it worked except for a timeout. Not sure what's up, is there any way we can restart Travis on this PR?

@jreback
Contributor

jreback commented Jun 22, 2014

Restarted.

Do the docstring / docs need updating? Maybe add a small section on comments (if it's not obvious from the docstring).

@mdmueller
Contributor Author

Updated -- Travis seems to be stuck on a timeout.

@jreback
Contributor

jreback commented Jun 23, 2014

Push this up again:

git commit -C HEAD --amend

then force push. It will force a new Travis build.

@mdmueller
Contributor Author

Travis is failing for some odd reason that seems unrelated to recent changes. @jreback do you know what this could be?

@jreback
Contributor

jreback commented Sep 13, 2014

Those are segfaults.
Check and see if it works locally.

@mdmueller
Contributor Author

It works locally for me

@jreback
Contributor

jreback commented Sep 13, 2014

So a way to debug is to have nosetests output each test (to see where it is failing).

add the -v argument in ci/script.py (only temporary of course)

@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from 65e7572 to 99f162f Compare September 13, 2014 15:00
@mdmueller
Contributor Author

Ah, must've forgotten to recompile extensions. The problem was that I forgot to change the parameter name in parser.pyx, I'll amend this.

@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from 99f162f to e2f02d7 Compare September 13, 2014 18:59
Ignoring line comments and empty lines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the ``comment`` parameter is specified, then completely commented lines will
be ignored. By default, completely blank lines will be ignored as well:
Contributor

This is an API change (that blank lines will be skipped). Maybe make a warning here that this is new in 0.15?

@mdmueller
Contributor Author

OK, added the API change notice.

@@ -147,6 +147,11 @@ API changes

ewma(s, com=3., min_periods=2)

- Made both the C-based and Python file readers ignore empty lines in input as well as
Member

Maybe mention explicitly that it is read_csv (and read_table, I suppose)? That is clearer for users, I think.

Contributor Author

Amended to add this in

@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from 1d95b5c to cee026d Compare September 17, 2014 19:13
@jreback
Contributor

jreback commented Sep 19, 2014

Please rebase and squash:

commit 0e9d792fc9d5159179efd810a1092671dbbef3b1
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 17 14:49:31 2014 -0400

    Added warnings about API changes

commit 06472c21000b489841cc8e486ceddf05fd87a1c5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 22:36:06 2014 -0400

    Changed parameter name to skip_blank_lines

commit afd3be30b4afcae0d9bc6278237aab6a4c9e7eb8
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 21:50:08 2014 -0400

    Minor doc changes

commit b47876e074f5f683a9a51768e480e24d9d3249ab
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 19:26:22 2014 -0400

    Extended blank line skipping to custom line terminated/whitespace delimited reading

commit 3f4a20a831b1bc0ca29779b315dc72d78ad2301e
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 11:35:17 2014 -0400

    Changed around io docs section

commit 223e17ecdcbe377cc69fd962221e03412f5e54d3
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Tue Sep 9 23:13:37 2014 -0400

    Turned empty line skipping into a keyword parameter feature

commit dcd31ca6bd0849eab87ea1c3c5441c8630ca3a35
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 3 21:35:09 2014 -0400

    Squashed commit of the following:

    commit 9aea77954681c2f7d1336d94366221222d186c2b
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Tue Aug 26 22:43:21 2014 -0400

        Fixed header/skiprows combination issue

    commit 1975affea3bf0bd6f1769a79e4b0c7fde17962df
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 25 19:35:24 2014 -0400

        Added warning/notes about functionality change in docs, removed HTML changes

    commit 693c820092d9f17f9101074d29c2d7d53fa5a8ae
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 25 15:38:41 2014 -0400

        Fixed problem with HTML reading and infinite loop in PythonParser __init__

    commit 2a0a4babac7a5e53279eaa8281d0a51406caeb27
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Mon Jun 23 08:37:33 2014 -0400

        Updated docs with new read_csv functionality, removed unreachable code

    commit 19b5811e8d78c4e618e19ff5768aa2cfff041620
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 18 21:43:47 2014 -0400

        Fixed error in empty/whitespace removal function

    commit 3fd11a822cc0bee123d68240c62627da11ee88c2
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 18 18:48:08 2014 -0400

        Squashed commit of the following:

        commit 60a1cd1bc1042a9959ae75ff006052c433d98825
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 18:40:17 2014 -0400

            Fixed error with string/numerical types

        commit 7fe1bcf75466ea2b19d947aff0769c9f03bc23f5
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 17:47:56 2014 -0400

            release notes

        commit 835e490c8d3a3a96aeb6a6c3846217d36469656b
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 17:15:17 2014 -0400

            Release note

        commit 25cee3167b81b9c81e969629cd83968c6736a94f
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 16:56:44 2014 -0400

            Fixed whitespace issue, made C parser check for delimiters in whitespace lines

        commit 593495eb15162833de78d2da65f377fa977ad225
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 15:41:52 2014 -0400

            Added new functionality to Python reader

        commit 8a8325ed883034f176c929b41fe6fad16420e9b5
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Tue Jun 17 19:52:41 2014 -0400

            Adjusted tokenizer to ignore whitespace-only lines, fixed tests

        commit 3ea2eed22884a63a6e8dec1b795acdf29b030949
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Mon Jun 16 12:36:14 2014 -0400

            Moved tests to C parsing suite, corrected multi-index test

        commit d5540311ca44992148932ae27e16fc4d02a2a018
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Mon Jun 16 12:35:46 2014 -0400

            Changed empty file handling so that a ValueError is raised as expected

        commit 03a4c3d27c18052f04bd7cb862d289eabbc773ba
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Sun Jun 15 23:07:17 2014 -0400

            Wrote tests for empty lines and comment lines

        commit 01db817e97fc8ee0da85cc17603578b56d294b1b
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Sun Jun 15 23:02:04 2014 -0400

            Modified C tokenizer so that comments and empty lines are ignored
@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from cee026d to e4bcb5c Compare September 19, 2014 15:36
@mdmueller
Contributor Author

@jreback - rebased and squashed

data rather than the first line of the file.
ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
denotes the first line of data rather than the first line of the file.
- ``skip_blank_lines``: whether to skip over blank lines rather than interpreting
Contributor

put the default here (True)
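The documented default can be illustrated with a small stdlib-only sketch. `split_rows` is a hypothetical simplification of the keyword's semantics; the real readers parse kept blank lines into all-NaN rows rather than returning raw strings:

```python
def split_rows(text, skip_blank_lines=True):
    """Illustrate the skip_blank_lines keyword (simplified sketch).

    With the default skip_blank_lines=True, blank/whitespace-only lines
    are dropped before parsing; with False they are kept (pandas would
    turn each into an all-NaN row).
    """
    rows = text.splitlines()
    if skip_blank_lines:
        rows = [r for r in rows if r.strip()]
    return rows
```

For example, `split_rows("a,b,c\n\n1,2,3\n")` drops the empty middle line, while passing `skip_blank_lines=False` preserves it.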

@jreback
Contributor

jreback commented Sep 19, 2014

@mdmueller a couple of minor documentation changes. Ping when green.

@mdmueller
Contributor Author

@jreback - Green

@jreback
Contributor

jreback commented Sep 19, 2014

Side issue: I happen to have an old version of Cython installed on one of my machines (0.17.1). It didn't compile parser.pyx correctly, but the newer ones do. Odd. So I'm going to change the min version (to 0.19.1), which is still pretty old.

@mdmueller
Contributor Author

Interesting, do you know what the compile error was?

@jreback
Contributor

jreback commented Sep 19, 2014

merged via 31c2558

@jreback jreback closed this Sep 19, 2014
@jreback
Contributor

jreback commented Sep 19, 2014

It broke on the float_precision pointer-to-function (it was a formatting error). 0.19.1 works fine, though.

@jorisvandenbossche
Member

@amanshei Did you get it sorted out?

@amanshei

I did, thank you!

Labels
Enhancement IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Ignoring empty lines in files
6 participants