-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH: ignoring comment lines and empty lines in CSV files #7470
Conversation
@AmrAS1 thanks for taking this on! lmk if you need any help we could put this in 0.14.1 if you can do this in next 2 weeks |
@jreback - No problem! I should be done with this pretty soon. Right now I'm just getting test failures with the Python parsing engine, so I'll add this functionality to |
thanks! |
will this close #4623 ? |
This will close #4623 according to current output:
I'll check out the "\t\t \n" case -- I think this would have to be addressed after tokenization in the Cython code, unless there's a better way. I'm a little confused about what you mean by checking for >1 char of the comment char though, doesn't the C parser disallow regex comments anyway? |
what I mean is you should disallow for the |
Unless I'm missing something, it looks like that's the current functionality already:
The tricky thing with the whitespace is for cases like '\t\t ,\t' or '\t a' where the delimiter or non-whitespace show up eventually. I guess I could backtrack in the tokenization algorithm like how \r line terminators are handled, would that be preferable to post-processing? |
@AmrAS1 if their is a test for multi-char comments then good I think handling the back-tracking similar to \r is good (don't want to slow up the general case). |
@jreback - I'm almost done, but I found a test failure involving a file which ends with a bunch of tab-filled lines using |
I think if u have the delimiter then the line counts |
@jreback - I think I'm done here, but let me know if there are any problems. |
@@ -100,7 +100,12 @@ Enhancements | |||
|
|||
|
|||
|
|||
- Changed the behavior of file parsing so that line comments are ignored instead |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
say it more like: read_csv/read_table
file parsers now ignore comments provided with the comment=....
(say something about single character), and ignore starting at the comment
pls run a perf check ( pls squash this down to a small number of commits otherwise looks good! |
There were two with a ratio >1.2 but both are unrelated to parsing. Here is the bottom of the results:
I assume the slower performance is due to previous commits. I'll squash the commits on this PR now. |
perf looks fine |
you have some failing tests |
Looks like it was the same error, I fixed it in the last commit so now I'm waiting on Travis to make sure. |
Hm, it's fixed now except for an error in |
@AmrAS1 hows' this coming along? |
@jreback - I tried running Travis on my repo with these changes and it worked except for a timeout. Not sure what's up, is there any way we can restart Travis on this PR? |
restarted do the doc-string / docs need updating |
Updated -- Travis seems to be stuck on a timeout. |
push this up again
it will force a new travis build |
those are seg faults |
It works locally for me |
so a way to debug is I have nosetests output each test (to see where it is failing) add the -v argument in ci/script.py (only temporary of course) |
65e7572
to
99f162f
Compare
Ah, must've forgotten to recompile extensions. The problem was that I forgot to change the parameter name in parser.pyx, I'll amend this. |
99f162f
to
e2f02d7
Compare
Ignoring line comments and empty lines | ||
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
If the ``comment`` parameter is specified, then completely commented lines will | ||
be ignored. By default, completely blank lines will be ignored as well: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is an API change (that blank lines will be skipped). Maybe make a warning here that this is new in 0.15?
Ok added API change notice |
@@ -147,6 +147,11 @@ API changes | |||
|
|||
ewma(s, com=3., min_periods=2) | |||
|
|||
- Made both the C-based and Python file readers ignore empty lines in input as well as |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe mention explicitely that it is read_csv
(and read_table
, I suppose)? That is more clear for users I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Amended to add this in
1d95b5c
to
cee026d
Compare
pls rebase and squash |
commit 0e9d792fc9d5159179efd810a1092671dbbef3b1 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Sep 17 14:49:31 2014 -0400 Added warnings about API changes commit 06472c21000b489841cc8e486ceddf05fd87a1c5 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Fri Sep 12 22:36:06 2014 -0400 Changed parameter name to skip_blank_lines commit afd3be30b4afcae0d9bc6278237aab6a4c9e7eb8 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Fri Sep 12 21:50:08 2014 -0400 Minor doc changes commit b47876e074f5f683a9a51768e480e24d9d3249ab Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Fri Sep 12 19:26:22 2014 -0400 Extended blank line skipping to custom line terminated/whitespace delimited reading commit 3f4a20a831b1bc0ca29779b315dc72d78ad2301e Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Fri Sep 12 11:35:17 2014 -0400 Changed around io docs section commit 223e17ecdcbe377cc69fd962221e03412f5e54d3 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Tue Sep 9 23:13:37 2014 -0400 Turned empty line skipping into a keyword parameter feature commit dcd31ca6bd0849eab87ea1c3c5441c8630ca3a35 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Sep 3 21:35:09 2014 -0400 Squashed commit of the following: commit 9aea77954681c2f7d1336d94366221222d186c2b Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Tue Aug 26 22:43:21 2014 -0400 Fixed header/skiprows combination issue commit 1975affea3bf0bd6f1769a79e4b0c7fde17962df Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 25 19:35:24 2014 -0400 Added warning/notes about functionality change in docs, removed HTML changes commit 693c820092d9f17f9101074d29c2d7d53fa5a8ae Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 25 15:38:41 2014 -0400 Fixed problem with HTML reading and infinite loop in PythonParser __init__ commit 2a0a4babac7a5e53279eaa8281d0a51406caeb27 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Mon Jun 23 08:37:33 2014 -0400 Updated docs with new read_csv functionality, removed unreachable code commit 19b5811e8d78c4e618e19ff5768aa2cfff041620 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 18 21:43:47 2014 -0400 Fixed error in empty/whitespace removal function commit 3fd11a822cc0bee123d68240c62627da11ee88c2 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 18 18:48:08 2014 -0400 Squashed commit of the following: commit 60a1cd1bc1042a9959ae75ff006052c433d98825 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 18 18:40:17 2014 -0400 Fixed error with string/numerical types commit 7fe1bcf75466ea2b19d947aff0769c9f03bc23f5 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 18 17:47:56 2014 -0400 release notes commit 835e490c8d3a3a96aeb6a6c3846217d36469656b Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 18 17:15:17 2014 -0400 Release note commit 25cee3167b81b9c81e969629cd83968c6736a94f Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 18 16:56:44 2014 -0400 Fixed whitespace issue, made C parser check for delimiters in whitespace lines commit 593495eb15162833de78d2da65f377fa977ad225 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Wed Jun 18 15:41:52 2014 -0400 Added new functionality to Python reader commit 8a8325ed883034f176c929b41fe6fad16420e9b5 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Tue Jun 17 19:52:41 2014 -0400 Adjusted tokenizer to ignore whitespace-only lines, fixed tests commit 3ea2eed22884a63a6e8dec1b795acdf29b030949 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Mon Jun 16 12:36:14 2014 -0400 Moved tests to C parsing suite, corrected multi-index test commit d5540311ca44992148932ae27e16fc4d02a2a018 Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Mon Jun 16 12:35:46 2014 -0400 Changed empty file handling so that a ValueError is raised as expected commit 03a4c3d27c18052f04bd7cb862d289eabbc773ba Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Sun Jun 15 23:07:17 2014 -0400 Wrote tests for empty lines and comment lines commit 01db817e97fc8ee0da85cc17603578b56d294b1b Author: Michael Mueller <michaeldmueller7@gmail.com> Date: Sun Jun 15 23:02:04 2014 -0400 Modified C tokenizer so that comments and empty lines are ignored
cee026d
to
e4bcb5c
Compare
@jreback - rebased and squashed |
data rather than the first line of the file. | ||
ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0 | ||
denotes the first line of data rather than the first line of the file. | ||
- ``skip_blank_lines``: whether to skip over blank lines rather than interpreting |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
put the default here (True)
@AmrAS1 couple of minor documenting changes. ping when when green. |
@jreback - Green |
side issue I happen to have an old version of cython installed on 1 of my machines (0.17.1). didn't compile the parser.pyx correctly, but the newer ones do. odd - so going to change the min version (to 0.19.1) which is still pretty old |
Interesting, do you know what the compile error was? |
merged via 31c2558 |
I broke on the float_precision pointer-to-func (was a formatting error). 0.19.1 works fine though |
@amanshei Did you got it sorted out? |
I did, thank you! |
closes #4466