
ENH: ignoring comment lines and empty lines in CSV files #7470

Closed

Conversation

mdmueller
Contributor

closes #4466

@jreback jreback added this to the 0.15.0 milestone Jun 16, 2014
@jreback
Contributor

jreback commented Jun 16, 2014

@mdmueller thanks for taking this on! Let me know if you need any help.

We could put this in 0.14.1 if you can do this in the next 2 weeks.

@mdmueller
Contributor Author

@jreback - No problem! I should be done with this pretty soon. Right now I'm just getting test failures with the Python parsing engine, so I'll add this functionality to PythonParser.

@jreback
Contributor

jreback commented Jun 16, 2014

  • Does this cover partial whitespace at the beginning (e.g. say "\t\t \n")?
    (obviously when sep is not whitespace).
  • I see that you cover a comment char at a non-zero location;
    also check for more than 1 char of the comment char (I think it's really problematic to allow it to be a regex).
  • Please run a perf check (and post if anything looks odd).
  • Release note in v0.14.1.txt (note that I changed the above from the solving PR to the actual issue; use these 2 issues in the release notes).

thanks!

@jreback jreback modified the milestones: 0.14.1, 0.15.0 Jun 16, 2014
@jreback
Contributor

jreback commented Jun 16, 2014

will this close #4623 ?

@mdmueller
Contributor Author

This will close #4623 according to current output:

In [3]: s1 = '# notes\na,b,c\n# more notes\n1,2,3'

In [4]: s2 = 'a,b,c\n# more notes\n1,2,3'

In [5]: pd.read_csv(StringIO(s1), comment='#')
Out[5]: 
   a  b  c
0  1  2  3

In [6]: pd.read_csv(StringIO(s2), comment='#')
Out[6]: 
   a  b  c
0  1  2  3

In [7]: pd.read_csv(StringIO(s1), comment='#', header=None)
Out[7]: 
   0  1  2
0  a  b  c
1  1  2  3

I'll check out the "\t\t \n" case -- I think this would have to be addressed after tokenization in the Cython code, unless there's a better way. I'm a little confused about what you mean by checking for >1 char of the comment char though, doesn't the C parser disallow regex comments anyway?

@jreback
Contributor

jreback commented Jun 17, 2014

What I mean is you should disallow comment='comment', i.e. a comment string with len > 1 (I think supporting this would be very tricky and isn't necessary); just raise a ValueError.

For the \t\t case I'm just trying to think of edge cases (I agree it's probably handled, but I'm not sure there are tests for something like this, e.g. whitespace preceding a line terminator and nothing else).
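The length-1 rule above can be sketched in plain Python. This is a hypothetical mirror of the check, not the actual pandas code; only the error message is taken from the C parser's real ValueError shown later in this thread:

```python
def validate_comment(comment):
    """Reject multi-character comment markers (hypothetical sketch).

    Only a single character (or None) is accepted; the message mirrors
    the ValueError the C parser raises for e.g. comment='##'.
    """
    if comment is not None and len(comment) != 1:
        raise ValueError("Only length-1 comment characters supported")
    return comment
```

So `validate_comment('#')` passes the marker through, while `validate_comment('##')` raises, matching the behavior demonstrated in the traceback below.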

@mdmueller
Contributor Author

Unless I'm missing something, it looks like that's the current functionality already:

In [9]: pd.read_csv(StringIO(s1), comment='##', header=None)
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-9-455c49b9113b> in <module>()
----> 1 pd.read_csv(StringIO(s1), comment='##', header=None)

/home/michael/git/pandas/pandas/io/parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, na_fvalues, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze, mangle_dupe_cols, tupleize_cols, infer_datetime_format)
    441                     infer_datetime_format=infer_datetime_format)
    442 
--> 443         return _read(filepath_or_buffer, kwds)
    444 
    445     parser_f.__name__ = name

/home/michael/git/pandas/pandas/io/parsers.pyc in _read(filepath_or_buffer, kwds)
    226 
    227     # Create the parser.
--> 228     parser = TextFileReader(filepath_or_buffer, **kwds)
    229 
    230     if nrows is not None:

/home/michael/git/pandas/pandas/io/parsers.pyc in __init__(self, f, engine, **kwds)
    531             self.options['has_index_names'] = kwds['has_index_names']
    532 
--> 533         self._make_engine(self.engine)
    534 
    535     def _get_options_with_defaults(self, engine):

/home/michael/git/pandas/pandas/io/parsers.pyc in _make_engine(self, engine)
    668     def _make_engine(self, engine='c'):
    669         if engine == 'c':
--> 670             self._engine = CParserWrapper(self.f, **self.options)
    671         else:
    672             if engine == 'python':

/home/michael/git/pandas/pandas/io/parsers.pyc in __init__(self, src, **kwds)
   1030         kwds['allow_leading_cols'] = self.index_col is not False
   1031 
-> 1032         self._reader = _parser.TextReader(src, **kwds)
   1033 
   1034         # XXX

/home/michael/git/pandas/pandas/parser.so in pandas.parser.TextReader.__cinit__ (pandas/parser.c:3607)()

ValueError: Only length-1 comment characters supported

The tricky thing with the whitespace is cases like '\t\t ,\t' or '\t a', where the delimiter or non-whitespace shows up eventually. I guess I could backtrack in the tokenization algorithm the way \r line terminators are handled; would that be preferable to post-processing?

@jreback
Contributor

jreback commented Jun 17, 2014

@mdmueller if there is a test for multi-char comments then good.

I think handling the back-tracking similarly to \r is good (we don't want to slow down the general case).

@mdmueller
Contributor Author

@jreback - I'm almost done, but I found a test failure involving a file that ends with a bunch of tab-filled lines using read_table (which defaults to sep='\t'). Would the correct behavior be not to ignore these lines, i.e. to count any line containing the delimiter?

@jreback
Contributor

jreback commented Jun 18, 2014

I think if you have the delimiter then the line counts.
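The rule settled on here can be sketched in plain Python. `line_is_skippable` is a hypothetical helper for illustration, not the actual tokenizer code:

```python
def line_is_skippable(line, delimiter, comment=None):
    """Sketch of the skipping rule discussed above (hypothetical helper).

    A raw line is dropped only if it is a full-line comment, or if it is
    empty/whitespace-only AND contains no delimiter; any line containing
    the delimiter counts as a (possibly all-empty) data row.
    """
    stripped = line.rstrip("\r\n")
    if comment is not None and stripped.lstrip().startswith(comment):
        return True          # full-line comment: skip
    if delimiter in stripped:
        return False         # delimiter present: the line counts as a row
    return stripped.strip() == ""  # blank or whitespace-only: skip
```

Under this rule, '\t\t \n' is skipped when sep=',' but kept as a row of empty fields when sep='\t', which matches the tab-filled-lines case above.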

@mdmueller
Contributor Author

@jreback - I think I'm done here, but let me know if there are any problems.

@@ -100,7 +100,12 @@ Enhancements



- Changed the behavior of file parsing so that line comments are ignored instead
Contributor

Say it more like: the read_csv/read_table file parsers now ignore comments provided with comment=... (say something about it being a single character), and ignore starting at the comment character.

@jreback
Contributor

jreback commented Jun 18, 2014

Please run a perf check (test_perf.sh -b master -t HEAD) to see if anything > 1.2 shows up (if it does, please post it and investigate).

Please squash this down to a small number of commits.

otherwise looks good!

@mdmueller
Contributor Author

There were two with a ratio >1.2 but both are unrelated to parsing. Here is the bottom of the results:

dtype_infer_int64                            |   1.0494 |   0.9537 |   1.1003 |
stats_rank_average                           |  22.0106 |  19.8573 |   1.1084 |
dtype_infer_int32                            |   0.3977 |   0.3580 |   1.1108 |
dti_reset_index                              |   0.2803 |   0.2510 |   1.1168 |
dtype_infer_float64                          |   0.9977 |   0.8846 |   1.1278 |
indexing_dataframe_boolean_rows              |   0.2737 |   0.2414 |   1.1340 |
frame_ctor_dtindex_YearBeginx1               |   0.7073 |   0.6123 |   1.1551 |
groupby_multi_different_numpy_functions      |   9.1350 |   7.7774 |   1.1746 |
merge_2intkey_nosort                         |  12.1953 |   9.9773 |   1.2223 |
groupby_simple_compress_timing               |  22.3273 |  15.4157 |   1.4484 |

I assume the slower performance is due to previous commits. I'll squash the commits on this PR now.

@jreback
Contributor

jreback commented Jun 18, 2014

perf looks fine

@jreback
Contributor

jreback commented Jun 18, 2014

you have some failing tests

@mdmueller
Contributor Author

Looks like it was the same error, I fixed it in the last commit so now I'm waiting on Travis to make sure.

@mdmueller
Contributor Author

Hm, it's fixed now except for an error in test_round_trip_exception_ under test_json, which calls read_csv('https://raw.github.com/hayd/lahman2012/master/csvs/Teams.csv'). I tried it locally and it works fine, not sure what that's about. I'll check it out.

@jreback
Contributor

jreback commented Jun 22, 2014

@mdmueller how's this coming along?

@mdmueller
Contributor Author

@jreback - I tried running Travis on my repo with these changes and it worked except for a timeout. Not sure what's up, is there any way we can restart Travis on this PR?

@jreback
Contributor

jreback commented Jun 22, 2014

Restarted.

Do the docstring / docs need updating? Maybe add a small section on comments (if it's not obvious from the docstring).

@mdmueller
Contributor Author

Updated -- Travis seems to be stuck on a timeout.

@jreback
Contributor

jreback commented Jun 23, 2014

Push this up again:

git commit -C HEAD --amend

then force push. It will force a new Travis build.

@mdmueller
Contributor Author

Travis is failing for some odd reason that seems unrelated to recent changes. @jreback do you know what this could be?

@jreback
Contributor

jreback commented Sep 13, 2014

Those are segfaults.
Check and see if it works locally.

@mdmueller
Contributor Author

It works locally for me

@jreback
Contributor

jreback commented Sep 13, 2014

So a way to debug is to have nosetests output each test (to see where it is failing).

add the -v argument in ci/script.py (only temporary of course)

@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from 65e7572 to 99f162f Compare September 13, 2014 15:00
@mdmueller
Contributor Author

Ah, must've forgotten to recompile extensions. The problem was that I forgot to change the parameter name in parser.pyx, I'll amend this.

@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from 99f162f to e2f02d7 Compare September 13, 2014 18:59
Ignoring line comments and empty lines
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
If the ``comment`` parameter is specified, then completely commented lines will
be ignored. By default, completely blank lines will be ignored as well:
Contributor

This is an API change (that blank lines will be skipped). Maybe make a warning here that this is new in 0.15?

@mdmueller
Contributor Author

OK, added the API change notice.

@@ -147,6 +147,11 @@ API changes

ewma(s, com=3., min_periods=2)

- Made both the C-based and Python file readers ignore empty lines in input as well as
Member

Maybe mention explicitly that it is read_csv (and read_table, I suppose)? That is clearer for users, I think.

Contributor Author

Amended to add this in

@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from 1d95b5c to cee026d Compare September 17, 2014 19:13
@jreback
Contributor

jreback commented Sep 19, 2014

Please rebase and squash:

commit 0e9d792fc9d5159179efd810a1092671dbbef3b1
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 17 14:49:31 2014 -0400

    Added warnings about API changes

commit 06472c21000b489841cc8e486ceddf05fd87a1c5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 22:36:06 2014 -0400

    Changed parameter name to skip_blank_lines

commit afd3be30b4afcae0d9bc6278237aab6a4c9e7eb8
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 21:50:08 2014 -0400

    Minor doc changes

commit b47876e074f5f683a9a51768e480e24d9d3249ab
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 19:26:22 2014 -0400

    Extended blank line skipping to custom line terminated/whitespace delimited reading

commit 3f4a20a831b1bc0ca29779b315dc72d78ad2301e
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 11:35:17 2014 -0400

    Changed around io docs section

commit 223e17ecdcbe377cc69fd962221e03412f5e54d3
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Tue Sep 9 23:13:37 2014 -0400

    Turned empty line skipping into a keyword parameter feature

commit dcd31ca6bd0849eab87ea1c3c5441c8630ca3a35
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 3 21:35:09 2014 -0400

    Squashed commit of the following:

    commit 9aea77954681c2f7d1336d94366221222d186c2b
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Tue Aug 26 22:43:21 2014 -0400

        Fixed header/skiprows combination issue

    commit 1975affea3bf0bd6f1769a79e4b0c7fde17962df
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 25 19:35:24 2014 -0400

        Added warning/notes about functionality change in docs, removed HTML changes

    commit 693c820092d9f17f9101074d29c2d7d53fa5a8ae
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 25 15:38:41 2014 -0400

        Fixed problem with HTML reading and infinite loop in PythonParser __init__

    commit 2a0a4babac7a5e53279eaa8281d0a51406caeb27
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Mon Jun 23 08:37:33 2014 -0400

        Updated docs with new read_csv functionality, removed unreachable code

    commit 19b5811e8d78c4e618e19ff5768aa2cfff041620
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 18 21:43:47 2014 -0400

        Fixed error in empty/whitespace removal function

    commit 3fd11a822cc0bee123d68240c62627da11ee88c2
    Author: Michael Mueller <michaeldmueller7@gmail.com>
    Date:   Wed Jun 18 18:48:08 2014 -0400

        Squashed commit of the following:

        commit 60a1cd1bc1042a9959ae75ff006052c433d98825
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 18:40:17 2014 -0400

            Fixed error with string/numerical types

        commit 7fe1bcf75466ea2b19d947aff0769c9f03bc23f5
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 17:47:56 2014 -0400

            release notes

        commit 835e490c8d3a3a96aeb6a6c3846217d36469656b
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 17:15:17 2014 -0400

            Release note

        commit 25cee3167b81b9c81e969629cd83968c6736a94f
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 16:56:44 2014 -0400

            Fixed whitespace issue, made C parser check for delimiters in whitespace lines

        commit 593495eb15162833de78d2da65f377fa977ad225
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Wed Jun 18 15:41:52 2014 -0400

            Added new functionality to Python reader

        commit 8a8325ed883034f176c929b41fe6fad16420e9b5
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Tue Jun 17 19:52:41 2014 -0400

            Adjusted tokenizer to ignore whitespace-only lines, fixed tests

        commit 3ea2eed22884a63a6e8dec1b795acdf29b030949
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Mon Jun 16 12:36:14 2014 -0400

            Moved tests to C parsing suite, corrected multi-index test

        commit d5540311ca44992148932ae27e16fc4d02a2a018
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Mon Jun 16 12:35:46 2014 -0400

            Changed empty file handling so that a ValueError is raised as expected

        commit 03a4c3d27c18052f04bd7cb862d289eabbc773ba
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Sun Jun 15 23:07:17 2014 -0400

            Wrote tests for empty lines and comment lines

        commit 01db817e97fc8ee0da85cc17603578b56d294b1b
        Author: Michael Mueller <michaeldmueller7@gmail.com>
        Date:   Sun Jun 15 23:02:04 2014 -0400

            Modified C tokenizer so that comments and empty lines are ignored
@mdmueller mdmueller force-pushed the ignore-comments-emptylines branch from cee026d to e4bcb5c Compare September 19, 2014 15:36
@mdmueller
Contributor Author

@jreback - rebased and squashed

data rather than the first line of the file.
ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
denotes the first line of data rather than the first line of the file.
- ``skip_blank_lines``: whether to skip over blank lines rather than interpreting
Contributor

put the default here (True)
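The documented default can be illustrated with a small stdlib-only sketch. `split_rows` is a hypothetical simplification of the keyword's semantics; the real readers parse kept blank lines into all-NaN rows rather than returning raw strings:

```python
def split_rows(text, skip_blank_lines=True):
    """Illustrate the skip_blank_lines keyword (simplified sketch).

    With the default skip_blank_lines=True, blank/whitespace-only lines
    are dropped before parsing; with False they are kept (pandas would
    turn each into an all-NaN row).
    """
    rows = text.splitlines()
    if skip_blank_lines:
        rows = [r for r in rows if r.strip()]
    return rows
```

For example, `split_rows("a,b,c\n\n1,2,3\n")` drops the empty middle line, while passing `skip_blank_lines=False` preserves it.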

@jreback
Contributor

jreback commented Sep 19, 2014

@mdmueller a couple of minor documentation changes. Ping when green.

@mdmueller
Contributor Author

@jreback - Green

@jreback
Contributor

jreback commented Sep 19, 2014

Side issue: I happen to have an old version of Cython installed on one of my machines (0.17.1). It didn't compile parser.pyx correctly, but the newer ones do. Odd. So I'm going to change the min version (to 0.19.1), which is still pretty old.

@mdmueller
Contributor Author

Interesting, do you know what the compile error was?

@jreback
Contributor

jreback commented Sep 19, 2014

merged via 31c2558

@jreback jreback closed this Sep 19, 2014
@jreback
Contributor

jreback commented Sep 19, 2014

It broke on the float_precision pointer-to-function (it was a formatting error). 0.19.1 works fine, though.

@jorisvandenbossche
Member

@amanshei Did you get it sorted out?

@amanshei

I did, thank you!

Labels
Enhancement IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Ignoring empty lines in files
6 participants