Commit e4bcb5c

Squashed commit of the following:
commit 0e9d792fc9d5159179efd810a1092671dbbef3b1
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Sep 17 14:49:31 2014 -0400

    Added warnings about API changes

commit 06472c21000b489841cc8e486ceddf05fd87a1c5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 22:36:06 2014 -0400

    Changed parameter name to skip_blank_lines

commit afd3be30b4afcae0d9bc6278237aab6a4c9e7eb8
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 21:50:08 2014 -0400

    Minor doc changes

commit b47876e074f5f683a9a51768e480e24d9d3249ab
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 19:26:22 2014 -0400

    Extended blank line skipping to custom line terminated/whitespace delimited reading

commit 3f4a20a831b1bc0ca29779b315dc72d78ad2301e
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Fri Sep 12 11:35:17 2014 -0400

    Changed around io docs section

commit 223e17ecdcbe377cc69fd962221e03412f5e54d3
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Tue Sep 9 23:13:37 2014 -0400

    Turned empty line skipping into a keyword parameter feature

commit dcd31ca6bd0849eab87ea1c3c5441c8630ca3a35
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Sep 3 21:35:09 2014 -0400

    Squashed commit of the following:

commit 9aea77954681c2f7d1336d94366221222d186c2b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Tue Aug 26 22:43:21 2014 -0400

    Fixed header/skiprows combination issue

commit 1975affea3bf0bd6f1769a79e4b0c7fde17962df
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 25 19:35:24 2014 -0400

    Added warning/notes about functionality change in docs, removed HTML changes

commit 693c820092d9f17f9101074d29c2d7d53fa5a8ae
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 25 15:38:41 2014 -0400

    Fixed problem with HTML reading and infinite loop in PythonParser __init__

commit 2a0a4babac7a5e53279eaa8281d0a51406caeb27
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Mon Jun 23 08:37:33 2014 -0400

    Updated docs with new read_csv functionality, removed unreachable code

commit 19b5811e8d78c4e618e19ff5768aa2cfff041620
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 21:43:47 2014 -0400

    Fixed error in empty/whitespace removal function

commit 3fd11a822cc0bee123d68240c62627da11ee88c2
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 18:48:08 2014 -0400

    Squashed commit of the following:

commit 60a1cd1bc1042a9959ae75ff006052c433d98825
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 18:40:17 2014 -0400

    Fixed error with string/numerical types

commit 7fe1bcf75466ea2b19d947aff0769c9f03bc23f5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 17:47:56 2014 -0400

    release notes

commit 835e490c8d3a3a96aeb6a6c3846217d36469656b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 17:15:17 2014 -0400

    Release note

commit 25cee3167b81b9c81e969629cd83968c6736a94f
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 16:56:44 2014 -0400

    Fixed whitespace issue, made C parser check for delimiters in whitespace lines

commit 593495eb15162833de78d2da65f377fa977ad225
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Wed Jun 18 15:41:52 2014 -0400

    Added new functionality to Python reader

commit 8a8325ed883034f176c929b41fe6fad16420e9b5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Tue Jun 17 19:52:41 2014 -0400

    Adjusted tokenizer to ignore whitespace-only lines, fixed tests

commit 3ea2eed22884a63a6e8dec1b795acdf29b030949
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Mon Jun 16 12:36:14 2014 -0400

    Moved tests to C parsing suite, corrected multi-index test

commit d5540311ca44992148932ae27e16fc4d02a2a018
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Mon Jun 16 12:35:46 2014 -0400

    Changed empty file handling so that a ValueError is raised as expected

commit 03a4c3d27c18052f04bd7cb862d289eabbc773ba
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Sun Jun 15 23:07:17 2014 -0400

    Wrote tests for empty lines and comment lines

commit 01db817e97fc8ee0da85cc17603578b56d294b1b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date: Sun Jun 15 23:02:04 2014 -0400

    Modified C tokenizer so that comments and empty lines are ignored

1 parent 0714c02 commit e4bcb5c


7 files changed

+330
-56
lines changed

doc/source/io.rst

+49 -24
@@ -100,8 +100,10 @@ They can take a number of arguments:
     a list of integers that specify row locations for a multi-index on the columns
     E.g. [0,1,3]. Intervening rows that are not specified will be
     skipped (e.g. 2 in this example are skipped). Note that this parameter
-    ignores commented lines, so header=0 denotes the first line of
-    data rather than the first line of the file.
+    ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
+    denotes the first line of data rather than the first line of the file.
+  - ``skip_blank_lines``: whether to skip over blank lines rather than interpreting
+    them as NaN values
   - ``skiprows``: A collection of numbers for rows in the file to skip. Can
     also be an integer to skip the first ``n`` rows
   - ``index_col``: column number, column name, or list of column numbers/names,
@@ -149,7 +151,7 @@ They can take a number of arguments:
   - ``escapechar`` : string, to specify how to escape quoted data
   - ``comment``: Indicates remainder of line should not be parsed. If found at the
     beginning of a line, the line will be ignored altogether. This parameter
-    must be a single character. Also, fully commented lines
+    must be a single character. Like empty lines, fully commented lines
     are ignored by the parameter `header` but not by `skiprows`. For example,
     if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
     result in '1,2,3' being treated as the header.
@@ -261,27 +263,6 @@ after a delimiter:
    print(data)
    pd.read_csv(StringIO(data), skipinitialspace=True)
 
-Moreover, ``read_csv`` ignores any completely commented lines:
-
-.. ipython:: python
-
-   data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
-   print(data)
-   pd.read_csv(StringIO(data), comment='#')
-
-.. note::
-
-   The presence of ignored lines might create ambiguities involving line numbers;
-   the parameter ``header`` uses row numbers (ignoring commented
-   lines), while ``skiprows`` uses line numbers (including commented lines):
-
-   .. ipython:: python
-
-      data = '#comment\na,b,c\nA,B,C\n1,2,3'
-      pd.read_csv(StringIO(data), comment='#', header=1)
-      data = 'A,B,C\n#comment\na,b,c\n1,2,3'
-      pd.read_csv(StringIO(data), comment='#', skiprows=2)
-
 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
 integer dtype without altering the contents, it will do so. Any non-numeric
@@ -358,6 +339,50 @@ file, either using the column names or position numbers:
    pd.read_csv(StringIO(data), usecols=['b', 'd'])
    pd.read_csv(StringIO(data), usecols=[0, 2, 3])
 
+.. _io.skiplines:
+
+Ignoring line comments and empty lines
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If the ``comment`` parameter is specified, then completely commented lines will
+be ignored. By default, completely blank lines will be ignored as well. Both of
+these are API changes introduced in version 0.15.
+
+.. ipython:: python
+
+   data = '\na,b,c\n \n# commented line\n1,2,3\n\n4,5,6'
+   print(data)
+   pd.read_csv(StringIO(data), comment='#')
+
+If ``skip_blank_lines=False``, then ``read_csv`` will not ignore blank lines:
+
+.. ipython:: python
+
+   data = 'a,b,c\n\n1,2,3\n\n\n4,5,6'
+   pd.read_csv(StringIO(data), skip_blank_lines=False)
+
+.. warning::
+
+   The presence of ignored lines might create ambiguities involving line numbers;
+   the parameter ``header`` uses row numbers (ignoring commented/empty
+   lines), while ``skiprows`` uses line numbers (including commented/empty lines):
+
+   .. ipython:: python
+
+      data = '#comment\na,b,c\nA,B,C\n1,2,3'
+      pd.read_csv(StringIO(data), comment='#', header=1)
+      data = 'A,B,C\n#comment\na,b,c\n1,2,3'
+      pd.read_csv(StringIO(data), comment='#', skiprows=2)
+
+   If both ``header`` and ``skiprows`` are specified, ``header`` will be
+   relative to the end of ``skiprows``. For example:
+
+.. ipython:: python
+
+   data = '# empty\n# second empty line\n# third empty' \
+          'line\nX,Y,Z\n1,2,3\nA,B,C\n1,2.,4.\n5.,NaN,10.0'
+   print(data)
+   pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
+
 .. _io.unicode:
 
 Dealing with Unicode Data
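
The warning added in this file is easier to follow with the resulting column labels spelled out. The snippet below is a minimal sketch that reuses the two data strings from the diff; it assumes a pandas release that includes this change (0.15 or later) and takes `StringIO` from the standard library `io` module, which the documentation itself does not show.

```python
from io import StringIO

import pandas as pd

# `header` counts rows after the commented line has been dropped,
# so header=1 selects 'A,B,C' (the second remaining row).
data = '#comment\na,b,c\nA,B,C\n1,2,3'
by_header = pd.read_csv(StringIO(data), comment='#', header=1)

# `skiprows` counts physical lines, including the commented one,
# so skiprows=2 skips 'A,B,C' and '#comment' and starts at 'a,b,c'.
data = 'A,B,C\n#comment\na,b,c\n1,2,3'
by_skiprows = pd.read_csv(StringIO(data), comment='#', skiprows=2)

print(list(by_header.columns))    # ['A', 'B', 'C']
print(list(by_skiprows.columns))  # ['a', 'b', 'c']
```

In both calls the single remaining data row is `1,2,3`; only the way the header row is located differs.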

doc/source/v0.15.0.txt

+5 -2
@@ -153,6 +153,11 @@ API changes
 
    ewma(s, com=3., min_periods=2)
 
+- Made both the C-based and Python engines for `read_csv` and `read_table` ignore empty lines in input as well as
+  whitespace-filled lines, as long as `sep` is not whitespace. This is an API change
+  that can be controlled by the keyword parameter `skip_blank_lines`.
+  (:issue:`4466`)
+
 - :func:`ewmstd`, :func:`ewmvol`, :func:`ewmvar`, :func:`ewmcov`, and :func:`ewmcorr`
   now have an optional ``adjust`` argument, just like :func:`ewma` does,
   affecting how the weights are calculated.
@@ -678,8 +683,6 @@ Enhancements
 
 
 
-
-
 - ``tz_localize`` now accepts the ``ambiguous`` keyword which allows for passing an array of bools
   indicating whether the date belongs in DST or not, 'NaT' for setting transition times to NaT,
   'infer' for inferring DST/non-DST, and 'raise' (default) for an AmbiguousTimeError to be raised (:issue:`7943`).
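
The API change described in the release note above can be checked with a short script. This is a sketch rather than part of the commit: it assumes a pandas release with `skip_blank_lines` (0.15 or later) and a non-whitespace separator (the default `,`), and the sample string is made up for illustration.

```python
from io import StringIO

import pandas as pd

# One empty line and one whitespace-only line sit between the data rows.
data = 'a,b,c\n\n1,2,3\n   \n4,5,6'

# Default (skip_blank_lines=True): both filler lines are dropped.
skipped = pd.read_csv(StringIO(data))

# skip_blank_lines=False: the filler lines are parsed as rows instead.
kept = pd.read_csv(StringIO(data), skip_blank_lines=False)

print(skipped)
print(len(skipped), len(kept))
```

With the default setting only the two data rows remain, while `skip_blank_lines=False` yields additional rows for the filler lines.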

pandas/io/parsers.py

+42 -13
@@ -65,8 +65,8 @@ class ParserWarning(Warning):
     a list of integers that specify row locations for a multi-index on the
     columns E.g. [0,1,3]. Intervening rows that are not specified will be
     skipped (e.g. 2 in this example are skipped). Note that this parameter
-    ignores commented lines, so header=0 denotes the first line of
-    data rather than the first line of the file.
+    ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
+    denotes the first line of data rather than the first line of the file.
 skiprows : list-like or integer
     Line numbers to skip (0-indexed) or number of lines to skip (int)
     at the start of the file
@@ -110,10 +110,11 @@ class ParserWarning(Warning):
 comment : str, default None
     Indicates remainder of line should not be parsed. If found at the
     beginning of a line, the line will be ignored altogether. This parameter
-    must be a single character. Also, fully commented lines
-    are ignored by the parameter `header` but not by `skiprows`. For example,
-    if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
-    result in '1,2,3' being treated as the header.
+    must be a single character. Like empty lines (as long as ``skip_blank_lines=True``),
+    fully commented lines are ignored by the parameter `header`
+    but not by `skiprows`. For example, if comment='#', parsing
+    '#empty\n1,2,3\na,b,c' with `header=0` will result in '1,2,3' being
+    treated as the header.
 decimal : str, default '.'
     Character to recognize as decimal point. E.g. use ',' for European data
 nrows : int, default None
@@ -160,6 +161,8 @@ class ParserWarning(Warning):
 infer_datetime_format : boolean, default False
     If True and parse_dates is enabled for a column, attempt to infer
     the datetime format to speed up the processing
+skip_blank_lines : boolean, default True
+    If True, skip over blank lines rather than interpreting as NaN values
 
 Returns
 -------
@@ -288,6 +291,7 @@ def _read(filepath_or_buffer, kwds):
     'mangle_dupe_cols': True,
     'tupleize_cols': False,
     'infer_datetime_format': False,
+    'skip_blank_lines': True
 }
 

@@ -378,7 +382,8 @@ def parser_f(filepath_or_buffer,
                  squeeze=False,
                  mangle_dupe_cols=True,
                  tupleize_cols=False,
-                 infer_datetime_format=False):
+                 infer_datetime_format=False,
+                 skip_blank_lines=True):
 
         # Alias sep -> delimiter.
         if delimiter is None:
@@ -449,7 +454,8 @@ def parser_f(filepath_or_buffer,
                     buffer_lines=buffer_lines,
                     mangle_dupe_cols=mangle_dupe_cols,
                     tupleize_cols=tupleize_cols,
-                    infer_datetime_format=infer_datetime_format)
+                    infer_datetime_format=infer_datetime_format,
+                    skip_blank_lines=skip_blank_lines)
 
         return _read(filepath_or_buffer, kwds)
455461

@@ -1338,6 +1344,7 @@ def __init__(self, f, **kwds):
         self.quoting = kwds['quoting']
         self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
         self.usecols = kwds['usecols']
+        self.skip_blank_lines = kwds['skip_blank_lines']
 
         self.names_passed = kwds['names'] or None

@@ -1393,6 +1400,7 @@ def __init__(self, f, **kwds):
 
         # needs to be cleaned/refactored
         # multiple date column thing turning into a real spaghetti factory
+
         if not self._has_complex_date_col:
             (index_names,
              self.orig_names, self.columns) = self._get_index_name(self.columns)
@@ -1590,6 +1598,7 @@ def _infer_columns(self):
 
                 while self.line_pos <= hr:
                     line = self._next_line()
+
                 unnamed_count = 0
                 this_columns = []
                 for i, c in enumerate(line):
@@ -1727,25 +1736,35 @@ def _next_line(self):
                     line = self._check_comments([self.data[self.pos]])[0]
                     self.pos += 1
                     # either uncommented or blank to begin with
-                    if self._empty(self.data[self.pos - 1]) or line:
+                    if not self.skip_blank_lines and (self._empty(self.data[
+                            self.pos - 1]) or line):
                         break
+                    elif self.skip_blank_lines:
+                        ret = self._check_empty([line])
+                        if ret:
+                            line = ret[0]
+                            break
                 except IndexError:
                     raise StopIteration
         else:
             while self.pos in self.skiprows:
-                next(self.data)
                 self.pos += 1
+                next(self.data)
 
             while True:
                 orig_line = next(self.data)
                 line = self._check_comments([orig_line])[0]
                 self.pos += 1
-                if self._empty(orig_line) or line:
+                if not self.skip_blank_lines and (self._empty(orig_line) or line):
                     break
+                elif self.skip_blank_lines:
+                    ret = self._check_empty([line])
+                    if ret:
+                        line = ret[0]
+                        break
 
         self.line_pos += 1
         self.buf.append(line)
-
         return line
 
     def _check_comments(self, lines):
def _check_comments(self, lines):
@@ -1766,6 +1785,15 @@ def _check_comments(self, lines):
             ret.append(rl)
         return ret
 
+    def _check_empty(self, lines):
+        ret = []
+        for l in lines:
+            # Remove empty lines and lines with only one whitespace value
+            if len(l) > 1 or len(l) == 1 and (not isinstance(l[0],
+                    compat.string_types) or l[0].strip()):
+                ret.append(l)
+        return ret
+
     def _check_thousands(self, lines):
         if self.thousands is None:
             return lines
@@ -1901,7 +1929,6 @@ def _get_lines(self, rows=None):
 
         # already fetched some number
         if rows is not None:
-
             # we already have the lines in the buffer
             if len(self.buf) >= rows:
                 new_rows, self.buf = self.buf[:rows], self.buf[rows:]
@@ -1966,6 +1993,8 @@ def _get_lines(self, rows=None):
                 lines = lines[:-self.skip_footer]
 
         lines = self._check_comments(lines)
+        if self.skip_blank_lines:
+            lines = self._check_empty(lines)
         return self._check_thousands(lines)
 
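
The filter that `_check_empty` applies to already-tokenized rows can be tried in isolation. The function below is a standalone sketch, not the method from the commit: the name `drop_blank_rows` is hypothetical, and the built-in `str` stands in for `compat.string_types`, which assumes Python 3.

```python
def drop_blank_rows(rows):
    """Keep only rows that carry data, following the test in `_check_empty`."""
    kept = []
    for row in rows:
        # A row survives if it has more than one field, or exactly one field
        # that is either a non-string value or a string with visible content.
        # Assumption: `str` replaces `compat.string_types` from the diff.
        if len(row) > 1 or (len(row) == 1 and
                            (not isinstance(row[0], str) or row[0].strip())):
            kept.append(row)
    return kept


rows = [['a', 'b'], [], [''], ['  '], [0], ['x']]
result = drop_blank_rows(rows)
print(result)  # [['a', 'b'], [0], ['x']]
```

Rows with no fields, or with a single empty or whitespace-only string, are dropped. A single non-string field such as `0` is kept because the `isinstance` check short-circuits before `.strip()` is called.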
