Commit 31c2558

mdmueller authored and jreback committed
Squashed commit of the following:
commit 0e9d792fc9d5159179efd810a1092671dbbef3b1
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 17 14:49:31 2014 -0400

    Added warnings about API changes

commit 06472c21000b489841cc8e486ceddf05fd87a1c5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 22:36:06 2014 -0400

    Changed parameter name to skip_blank_lines

commit afd3be30b4afcae0d9bc6278237aab6a4c9e7eb8
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 21:50:08 2014 -0400

    Minor doc changes

commit b47876e074f5f683a9a51768e480e24d9d3249ab
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 19:26:22 2014 -0400

    Extended blank line skipping to custom line terminated/whitespace delimited reading

commit 3f4a20a831b1bc0ca29779b315dc72d78ad2301e
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Fri Sep 12 11:35:17 2014 -0400

    Changed around io docs section

commit 223e17ecdcbe377cc69fd962221e03412f5e54d3
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Tue Sep 9 23:13:37 2014 -0400

    Turned empty line skipping into a keyword parameter feature

commit dcd31ca6bd0849eab87ea1c3c5441c8630ca3a35
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Sep 3 21:35:09 2014 -0400

    Squashed commit of the following:

commit 9aea77954681c2f7d1336d94366221222d186c2b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Tue Aug 26 22:43:21 2014 -0400

    Fixed header/skiprows combination issue

commit 1975affea3bf0bd6f1769a79e4b0c7fde17962df
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 25 19:35:24 2014 -0400

    Added warning/notes about functionality change in docs, removed HTML changes

commit 693c820092d9f17f9101074d29c2d7d53fa5a8ae
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 25 15:38:41 2014 -0400

    Fixed problem with HTML reading and infinite loop in PythonParser __init__

commit 2a0a4babac7a5e53279eaa8281d0a51406caeb27
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Mon Jun 23 08:37:33 2014 -0400

    Updated docs with new read_csv functionality, removed unreachable code

commit 19b5811e8d78c4e618e19ff5768aa2cfff041620
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 18 21:43:47 2014 -0400

    Fixed error in empty/whitespace removal function

commit 3fd11a822cc0bee123d68240c62627da11ee88c2
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 18 18:48:08 2014 -0400

    Squashed commit of the following:

commit 60a1cd1bc1042a9959ae75ff006052c433d98825
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 18 18:40:17 2014 -0400

    Fixed error with string/numerical types

commit 7fe1bcf75466ea2b19d947aff0769c9f03bc23f5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 18 17:47:56 2014 -0400

    release notes

commit 835e490c8d3a3a96aeb6a6c3846217d36469656b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 18 17:15:17 2014 -0400

    Release note

commit 25cee3167b81b9c81e969629cd83968c6736a94f
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 18 16:56:44 2014 -0400

    Fixed whitespace issue, made C parser check for delimiters in whitespace lines

commit 593495eb15162833de78d2da65f377fa977ad225
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Wed Jun 18 15:41:52 2014 -0400

    Added new functionality to Python reader

commit 8a8325ed883034f176c929b41fe6fad16420e9b5
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Tue Jun 17 19:52:41 2014 -0400

    Adjusted tokenizer to ignore whitespace-only lines, fixed tests

commit 3ea2eed22884a63a6e8dec1b795acdf29b030949
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Mon Jun 16 12:36:14 2014 -0400

    Moved tests to C parsing suite, corrected multi-index test

commit d5540311ca44992148932ae27e16fc4d02a2a018
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Mon Jun 16 12:35:46 2014 -0400

    Changed empty file handling so that a ValueError is raised as expected

commit 03a4c3d27c18052f04bd7cb862d289eabbc773ba
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Sun Jun 15 23:07:17 2014 -0400

    Wrote tests for empty lines and comment lines

commit 01db817e97fc8ee0da85cc17603578b56d294b1b
Author: Michael Mueller <michaeldmueller7@gmail.com>
Date:   Sun Jun 15 23:02:04 2014 -0400

    Modified C tokenizer so that comments and empty lines are ignored

1 parent 89cf72b commit 31c2558

8 files changed, +332 -58 lines changed

doc/source/install.rst

+1 -1
@@ -276,7 +276,7 @@ Optional Dependencies
 ~~~~~~~~~~~~~~~~~~~~~
 
 * `Cython <http://www.cython.org>`__: Only necessary to build development
-  version. Version 0.17.1 or higher.
+  version. Version 0.19.1 or higher.
 * `SciPy <http://www.scipy.org>`__: miscellaneous statistical functions
 * `PyTables <http://www.pytables.org>`__: necessary for HDF5-based storage. Version 3.0.0 or higher required.
 * `SQLAlchemy <http://www.sqlalchemy.org>`__: for SQL database support. Version 0.8.1 or higher recommended.

doc/source/io.rst

+49 -24
@@ -100,8 +100,10 @@ They can take a number of arguments:
     a list of integers that specify row locations for a multi-index on the columns
     E.g. [0,1,3]. Intervening rows that are not specified will be
     skipped (e.g. 2 in this example are skipped). Note that this parameter
-    ignores commented lines, so header=0 denotes the first line of
-    data rather than the first line of the file.
+    ignores commented lines and empty lines if ``skip_blank_lines=True`` (the default),
+    so header=0 denotes the first line of data rather than the first line of the file.
+  - ``skip_blank_lines``: whether to skip over blank lines rather than interpreting
+    them as NaN values
   - ``skiprows``: A collection of numbers for rows in the file to skip. Can
     also be an integer to skip the first ``n`` rows
   - ``index_col``: column number, column name, or list of column numbers/names,
@@ -149,7 +151,7 @@ They can take a number of arguments:
   - ``escapechar`` : string, to specify how to escape quoted data
   - ``comment``: Indicates remainder of line should not be parsed. If found at the
     beginning of a line, the line will be ignored altogether. This parameter
-    must be a single character. Also, fully commented lines
+    must be a single character. Like empty lines, fully commented lines
     are ignored by the parameter `header` but not by `skiprows`. For example,
     if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
     result in '1,2,3' being treated as the header.
@@ -266,27 +268,6 @@ after a delimiter:
    print(data)
    pd.read_csv(StringIO(data), skipinitialspace=True)
 
-Moreover, ``read_csv`` ignores any completely commented lines:
-
-.. ipython:: python
-
-   data = 'a,b,c\n# commented line\n1,2,3\n#another comment\n4,5,6'
-   print(data)
-   pd.read_csv(StringIO(data), comment='#')
-
-.. note::
-
-   The presence of ignored lines might create ambiguities involving line numbers;
-   the parameter ``header`` uses row numbers (ignoring commented
-   lines), while ``skiprows`` uses line numbers (including commented lines):
-
-.. ipython:: python
-
-   data = '#comment\na,b,c\nA,B,C\n1,2,3'
-   pd.read_csv(StringIO(data), comment='#', header=1)
-   data = 'A,B,C\n#comment\na,b,c\n1,2,3'
-   pd.read_csv(StringIO(data), comment='#', skiprows=2)
-
 The parsers make every attempt to "do the right thing" and not be very
 fragile. Type inference is a pretty big deal. So if a column can be coerced to
 integer dtype without altering the contents, it will do so. Any non-numeric
@@ -363,6 +344,50 @@ file, either using the column names or position numbers:
    pd.read_csv(StringIO(data), usecols=['b', 'd'])
    pd.read_csv(StringIO(data), usecols=[0, 2, 3])
 
+.. _io.skiplines:
+
+Ignoring line comments and empty lines
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+If the ``comment`` parameter is specified, then completely commented lines will
+be ignored. By default, completely blank lines will be ignored as well. Both of
+these are API changes introduced in version 0.15.
+
+.. ipython:: python
+
+   data = '\na,b,c\n \n# commented line\n1,2,3\n\n4,5,6'
+   print(data)
+   pd.read_csv(StringIO(data), comment='#')
+
+If ``skip_blank_lines=False``, then ``read_csv`` will not ignore blank lines:
+
+.. ipython:: python
+
+   data = 'a,b,c\n\n1,2,3\n\n\n4,5,6'
+   pd.read_csv(StringIO(data), skip_blank_lines=False)
+
+.. warning::
+
+   The presence of ignored lines might create ambiguities involving line numbers;
+   the parameter ``header`` uses row numbers (ignoring commented/empty
+   lines), while ``skiprows`` uses line numbers (including commented/empty lines):
+
+.. ipython:: python
+
+   data = '#comment\na,b,c\nA,B,C\n1,2,3'
+   pd.read_csv(StringIO(data), comment='#', header=1)
+   data = 'A,B,C\n#comment\na,b,c\n1,2,3'
+   pd.read_csv(StringIO(data), comment='#', skiprows=2)
+
+If both ``header`` and ``skiprows`` are specified, ``header`` will be
+relative to the end of ``skiprows``. For example:
+
+.. ipython:: python
+
+   data = '# empty\n# second empty line\n# third empty' \
+          'line\nX,Y,Z\n1,2,3\nA,B,C\n1,2.,4.\n5.,NaN,10.0'
+   print(data)
+   pd.read_csv(StringIO(data), comment='#', skiprows=4, header=1)
+
 .. _io.unicode:
 
 Dealing with Unicode Data
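
Note on the io.rst change above: the numbering rule in the new warning (``header`` counts rows after comment and blank lines are dropped, ``skiprows`` counts physical lines, and ``header`` is applied after ``skiprows``) can be sketched in plain Python. This is a simplified model of the documented behaviour only, not the pandas parser; the function name `pick_header` is hypothetical, and it assumes a "fully commented line" is one that starts with the comment character.

```python
def pick_header(text, header=0, skiprows=0, comment=None, skip_blank_lines=True):
    """Return the physical line that this simplified model treats as the header.

    Assumption: models only the numbering rules described in the docs;
    it does not tokenize fields or call pandas.
    """
    # skiprows counts physical lines, so comment and blank lines are included.
    remaining = text.split('\n')[skiprows:]

    # header counts only the lines that survive comment/blank filtering.
    kept = []
    for line in remaining:
        if comment is not None and line.startswith(comment):
            continue
        if skip_blank_lines and not line.strip():
            continue
        kept.append(line)
    return kept[header]


# The three cases from the documentation hunk above.
print(pick_header('#comment\na,b,c\nA,B,C\n1,2,3', comment='#', header=1))
print(pick_header('A,B,C\n#comment\na,b,c\n1,2,3', comment='#', skiprows=2))
print(pick_header('# empty\n# second empty line\n# third empty'
                  'line\nX,Y,Z\n1,2,3\nA,B,C\n1,2.,4.\n5.,NaN,10.0',
                  comment='#', skiprows=4, header=1))
```

In the third call, ``skiprows=4`` drops the three comment lines and ``X,Y,Z``, so ``header=1`` is counted from ``1,2,3`` and lands on ``A,B,C``.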

doc/source/v0.15.0.txt

+5 -2
@@ -153,6 +153,11 @@ API changes
 
      ewma(s, com=3., min_periods=2)
 
+- Made both the C-based and Python engines for `read_csv` and `read_table` ignore empty lines in input as well as
+  whitespace-filled lines, as long as `sep` is not whitespace. This is an API change
+  that can be controlled by the keyword parameter `skip_blank_lines`.
+  (:issue:`4466`, see :ref:`skiplines <_io.skiplines>`)
+
 - :func:`ewmstd`, :func:`ewmvol`, :func:`ewmvar`, :func:`ewmcov`, and :func:`ewmcorr`
   now have an optional ``adjust`` argument, just like :func:`ewma` does,
   affecting how the weights are calculated.
@@ -680,8 +685,6 @@ Enhancements
 
 
 
-
-
 - ``tz_localize`` now accepts the ``ambiguous`` keyword which allows for passing an array of bools
   indicating whether the date belongs in DST or not, 'NaT' for setting transition times to NaT,
   'infer' for inferring DST/non-DST, and 'raise' (default) for an AmbiguousTimeError to be raised (:issue:`7943`).

pandas/io/parsers.py

+42 -13
@@ -65,8 +65,8 @@ class ParserWarning(Warning):
     a list of integers that specify row locations for a multi-index on the
     columns E.g. [0,1,3]. Intervening rows that are not specified will be
     skipped (e.g. 2 in this example are skipped). Note that this parameter
-    ignores commented lines, so header=0 denotes the first line of
-    data rather than the first line of the file.
+    ignores commented lines and empty lines if ``skip_blank_lines=True``, so header=0
+    denotes the first line of data rather than the first line of the file.
 skiprows : list-like or integer
     Line numbers to skip (0-indexed) or number of lines to skip (int)
     at the start of the file
@@ -110,10 +110,11 @@ class ParserWarning(Warning):
 comment : str, default None
     Indicates remainder of line should not be parsed. If found at the
     beginning of a line, the line will be ignored altogether. This parameter
-    must be a single character. Also, fully commented lines
-    are ignored by the parameter `header` but not by `skiprows`. For example,
-    if comment='#', parsing '#empty\n1,2,3\na,b,c' with `header=0` will
-    result in '1,2,3' being treated as the header.
+    must be a single character. Like empty lines (as long as ``skip_blank_lines=True``),
+    fully commented lines are ignored by the parameter `header`
+    but not by `skiprows`. For example, if comment='#', parsing
+    '#empty\n1,2,3\na,b,c' with `header=0` will result in '1,2,3' being
+    treated as the header.
 decimal : str, default '.'
     Character to recognize as decimal point. E.g. use ',' for European data
 nrows : int, default None
@@ -160,6 +161,8 @@ class ParserWarning(Warning):
 infer_datetime_format : boolean, default False
     If True and parse_dates is enabled for a column, attempt to infer
     the datetime format to speed up the processing
+skip_blank_lines : boolean, default True
+    If True, skip over blank lines rather than interpreting as NaN values
 
 Returns
 -------
@@ -288,6 +291,7 @@ def _read(filepath_or_buffer, kwds):
     'mangle_dupe_cols': True,
     'tupleize_cols': False,
     'infer_datetime_format': False,
+    'skip_blank_lines': True
 }
 
 
@@ -380,7 +384,8 @@ def parser_f(filepath_or_buffer,
                  squeeze=False,
                  mangle_dupe_cols=True,
                  tupleize_cols=False,
-                 infer_datetime_format=False):
+                 infer_datetime_format=False,
+                 skip_blank_lines=True):
 
         # Alias sep -> delimiter.
         if delimiter is None:
@@ -452,7 +457,8 @@ def parser_f(filepath_or_buffer,
                     buffer_lines=buffer_lines,
                     mangle_dupe_cols=mangle_dupe_cols,
                     tupleize_cols=tupleize_cols,
-                    infer_datetime_format=infer_datetime_format)
+                    infer_datetime_format=infer_datetime_format,
+                    skip_blank_lines=skip_blank_lines)
 
         return _read(filepath_or_buffer, kwds)
 
@@ -1346,6 +1352,7 @@ def __init__(self, f, **kwds):
         self.quoting = kwds['quoting']
         self.mangle_dupe_cols = kwds.get('mangle_dupe_cols', True)
         self.usecols = kwds['usecols']
+        self.skip_blank_lines = kwds['skip_blank_lines']
 
         self.names_passed = kwds['names'] or None
 
@@ -1401,6 +1408,7 @@ def __init__(self, f, **kwds):
 
         # needs to be cleaned/refactored
         # multiple date column thing turning into a real spaghetti factory
+
         if not self._has_complex_date_col:
             (index_names,
              self.orig_names, self.columns) = self._get_index_name(self.columns)
@@ -1598,6 +1606,7 @@ def _infer_columns(self):
 
                 while self.line_pos <= hr:
                     line = self._next_line()
+
                 unnamed_count = 0
                 this_columns = []
                 for i, c in enumerate(line):
@@ -1735,25 +1744,35 @@ def _next_line(self):
                     line = self._check_comments([self.data[self.pos]])[0]
                     self.pos += 1
                     # either uncommented or blank to begin with
-                    if self._empty(self.data[self.pos - 1]) or line:
+                    if not self.skip_blank_lines and (self._empty(self.data[
+                            self.pos - 1]) or line):
                         break
+                    elif self.skip_blank_lines:
+                        ret = self._check_empty([line])
+                        if ret:
+                            line = ret[0]
+                            break
             except IndexError:
                 raise StopIteration
         else:
             while self.pos in self.skiprows:
-                next(self.data)
                 self.pos += 1
+                next(self.data)
 
             while True:
                 orig_line = next(self.data)
                 line = self._check_comments([orig_line])[0]
                 self.pos += 1
-                if self._empty(orig_line) or line:
+                if not self.skip_blank_lines and (self._empty(orig_line) or line):
                     break
+                elif self.skip_blank_lines:
+                    ret = self._check_empty([line])
+                    if ret:
+                        line = ret[0]
+                        break
 
         self.line_pos += 1
         self.buf.append(line)
-
         return line
 
     def _check_comments(self, lines):
@@ -1774,6 +1793,15 @@ def _check_comments(self, lines):
             ret.append(rl)
         return ret
 
+    def _check_empty(self, lines):
+        ret = []
+        for l in lines:
+            # Remove empty lines and lines with only one whitespace value
+            if len(l) > 1 or len(l) == 1 and (not isinstance(l[0],
+                    compat.string_types) or l[0].strip()):
+                ret.append(l)
+        return ret
+
     def _check_thousands(self, lines):
         if self.thousands is None:
             return lines
@@ -1909,7 +1937,6 @@ def _get_lines(self, rows=None):
 
         # already fetched some number
         if rows is not None:
-
             # we already have the lines in the buffer
             if len(self.buf) >= rows:
                 new_rows, self.buf = self.buf[:rows], self.buf[rows:]
@@ -1974,6 +2001,8 @@ def _get_lines(self, rows=None):
             lines = lines[:-self.skip_footer]
 
         lines = self._check_comments(lines)
+        if self.skip_blank_lines:
+            lines = self._check_empty(lines)
         return self._check_thousands(lines)
 
 