BUG: Delegate more of Excel parsing to CSV #23544

Merged
merged 1 commit into from
Nov 11, 2018
29 changes: 28 additions & 1 deletion doc/source/io.rst
@@ -2861,7 +2861,13 @@ to be parsed.

   read_excel('path_to_file.xls', 'Sheet1', usecols=2)

If `usecols` is a list of integers, then it is assumed to be the file column
You can also specify a comma-delimited set of Excel columns and ranges as a string:

.. code-block:: python

   read_excel('path_to_file.xls', 'Sheet1', usecols='A,C:E')

If ``usecols`` is a list of integers, then it is assumed to be the file column
indices to be parsed.

.. code-block:: python
@@ -2870,6 +2876,27 @@ indices to be parsed.

Element order is ignored, so ``usecols=[0, 1]`` is the same as ``[1, 0]``.

.. versionadded:: 0.24

If ``usecols`` is a list of strings, it is assumed that each string corresponds
to a column name provided either by the user in ``names`` or inferred from the
document header row(s). Those strings define which columns will be parsed:

.. code-block:: python

   read_excel('path_to_file.xls', 'Sheet1', usecols=['foo', 'bar'])

Element order is ignored, so ``usecols=['baz', 'joe']`` is the same as ``['joe', 'baz']``.

.. versionadded:: 0.24

If ``usecols`` is callable, the callable function will be evaluated against
the column names, returning names where the callable function evaluates to ``True``.

.. code-block:: python

   read_excel('path_to_file.xls', 'Sheet1', usecols=lambda x: x.isalpha())
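
The selection rules documented above can be sketched in plain Python. This is only an illustration of the described semantics, not the pandas implementation: `select_columns` and the sample `header` are hypothetical names introduced here, and the sketch assumes the header row has already been read into a list of strings.

```python
def select_columns(header, usecols):
    """Return the positions in `header` that `usecols` selects.

    Hypothetical helper for illustration only; not part of pandas.
    """
    if usecols is None:
        # No filter: every column is kept.
        return list(range(len(header)))
    if callable(usecols):
        # Keep each column whose name makes the callable return True.
        return [i for i, name in enumerate(header) if usecols(name)]
    wanted = set(usecols)
    if all(isinstance(c, int) for c in wanted):
        # Integers are column positions; their order in `usecols` is ignored.
        return [i for i in range(len(header)) if i in wanted]
    # Strings are column names; their order in `usecols` is ignored too.
    return [i for i, name in enumerate(header) if name in wanted]


header = ["foo", "bar", "baz2"]
print(select_columns(header, ["bar", "foo"]))         # [0, 1]
print(select_columns(header, [1, 0]))                 # [0, 1]
print(select_columns(header, lambda x: x.isalpha()))  # [0, 1]
```

In every case the result follows the order of the columns in the file, which is why ``usecols=['baz', 'joe']`` and ``['joe', 'baz']`` are described as equivalent.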

Parsing Dates
+++++++++++++

3 changes: 3 additions & 0 deletions doc/source/whatsnew/v0.24.0.txt
@@ -238,6 +238,7 @@ Other Enhancements
- Added :meth:`Interval.overlaps`, :meth:`IntervalArray.overlaps`, and :meth:`IntervalIndex.overlaps` for determining overlaps between interval-like objects (:issue:`21998`)
- :func:`~DataFrame.to_parquet` now supports writing a ``DataFrame`` as a directory of parquet files partitioned by a subset of the columns when ``engine = 'pyarrow'`` (:issue:`23283`)
- :meth:`Timestamp.tz_localize`, :meth:`DatetimeIndex.tz_localize`, and :meth:`Series.tz_localize` have gained the ``nonexistent`` argument for alternative handling of nonexistent times. See :ref:`timeseries.timezone_nonexsistent` (:issue:`8917`)
- :meth:`read_excel()` now accepts ``usecols`` as a list of column names or callable (:issue:`18273`)

.. _whatsnew_0240.api_breaking:

@@ -1299,6 +1300,8 @@ Notice how we now instead output ``np.nan`` itself instead of a stringified form
- Bug in :meth:`HDFStore.append` when appending a :class:`DataFrame` with an empty string column and ``min_itemsize`` < 8 (:issue:`12242`)
- Bug in :meth:`read_csv()` in which :class:`MultiIndex` index names were being improperly handled in the cases when they were not provided (:issue:`23484`)
- Bug in :meth:`read_html()` in which the error message was not displaying the valid flavors when an invalid one was provided (:issue:`23549`)
- Bug in :meth:`read_excel()` in which ``index_col=None`` was not being respected and index columns were parsed anyway (:issue:`20480`)
- Bug in :meth:`read_excel()` in which ``usecols`` was not being validated for proper column names when passed in as a string (:issue:`20480`)

Plotting
^^^^^^^^
194 changes: 127 additions & 67 deletions pandas/io/excel.py
@@ -17,8 +17,7 @@
import pandas._libs.json as json
import pandas.compat as compat
from pandas.compat import (
OrderedDict, add_metaclass, lrange, map, range, reduce, string_types, u,
zip)
OrderedDict, add_metaclass, lrange, map, range, string_types, u, zip)
from pandas.errors import EmptyDataError
from pandas.util._decorators import Appender, deprecate_kwarg

@@ -93,13 +92,22 @@
.. deprecated:: 0.21.0
Pass in `usecols` instead.

usecols : int or list, default None
* If None then parse all columns,
* If int then indicates last column to be parsed
* If list of ints then indicates list of column numbers to be parsed
* If string then indicates comma separated list of Excel column letters and
column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
usecols : int, str, list-like, or callable, default None
* If None, then parse all columns,
* If int, then indicates last column to be parsed
* If string, then indicates comma separated list of Excel column letters
and column ranges (e.g. "A:E" or "A,C,E:F"). Ranges are inclusive of
both sides.
* If list of ints, then indicates list of column numbers to be parsed.
* If list of strings, then indicates list of column names to be parsed.

.. versionadded:: 0.24.0

* If callable, then evaluate each column name against it and parse the
column if the callable returns ``True``.

.. versionadded:: 0.24.0

squeeze : boolean, default False
If the parsed data only contains one column then return a Series
dtype : Type name or dict of column -> type, default None
@@ -466,39 +474,6 @@ def parse(self,
convert_float=convert_float,
**kwds)

    def _should_parse(self, i, usecols):

        def _range2cols(areas):
            """
            Convert comma separated list of column names and column ranges to a
            list of 0-based column indexes.

            >>> _range2cols('A:E')
            [0, 1, 2, 3, 4]
            >>> _range2cols('A,C,Z:AB')
            [0, 2, 25, 26, 27]
            """
            def _excel2num(x):
                "Convert Excel column name like 'AB' to 0-based column index"
                return reduce(lambda s, a: s * 26 + ord(a) - ord('A') + 1,
                              x.upper().strip(), 0) - 1

            cols = []
            for rng in areas.split(','):
                if ':' in rng:
                    rng = rng.split(':')
                    cols += lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1)
                else:
                    cols.append(_excel2num(rng))
            return cols

        if isinstance(usecols, int):
            return i <= usecols
        elif isinstance(usecols, compat.string_types):
            return i in _range2cols(usecols)
        else:
            return i in usecols

def _parse_excel(self,
sheet_name=0,
header=0,
@@ -527,10 +502,6 @@ def _parse_excel(self,
raise NotImplementedError("chunksize keyword of read_excel "
"is not implemented")

if parse_dates is True and index_col is None:
warnings.warn("The 'parse_dates=True' keyword of read_excel was "
"provided without an 'index_col' keyword value.")

import xlrd
from xlrd import (xldate, XL_CELL_DATE,
XL_CELL_ERROR, XL_CELL_BOOLEAN,
@@ -620,17 +591,13 @@ def _parse_cell(cell_contents, cell_typ):
sheet = self.book.sheet_by_index(asheetname)

data = []
should_parse = {}
usecols = _maybe_convert_usecols(usecols)

for i in range(sheet.nrows):
row = []
for j, (value, typ) in enumerate(zip(sheet.row_values(i),
sheet.row_types(i))):
if usecols is not None and j not in should_parse:
should_parse[j] = self._should_parse(j, usecols)

if usecols is None or should_parse[j]:
row.append(_parse_cell(value, typ))
row.append(_parse_cell(value, typ))
data.append(row)

if sheet.nrows == 0:
@@ -642,31 +609,30 @@ def _parse_cell(cell_contents, cell_typ):

# forward fill and pull out names for MultiIndex column
header_names = None
if header is not None:
if is_list_like(header):
header_names = []
control_row = [True] * len(data[0])
for row in header:
if is_integer(skiprows):
row += skiprows

data[row], control_row = _fill_mi_header(
data[row], control_row)
header_name, data[row] = _pop_header_name(
data[row], index_col)
header_names.append(header_name)
else:
data[header] = _trim_excel_header(data[header])
if header is not None and is_list_like(header):
header_names = []
control_row = [True] * len(data[0])

for row in header:
if is_integer(skiprows):
row += skiprows

data[row], control_row = _fill_mi_header(
data[row], control_row)
header_name, _ = _pop_header_name(
data[row], index_col)
header_names.append(header_name)

if is_list_like(index_col):
# forward fill values for MultiIndex index
# Forward fill values for MultiIndex index.
if not is_list_like(header):
offset = 1 + header
else:
offset = 1 + max(header)

for col in index_col:
last = data[offset][col]

for row in range(offset + 1, len(data)):
if data[row][col] == '' or data[row][col] is None:
data[row][col] = last
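
The forward fill for MultiIndex index columns shown in this hunk can be tried on its own. The sketch below is a standalone approximation: `forward_fill_index` and the sample rows are hypothetical, and the `else` branch that updates `last` is an assumption, since the rest of the loop is elided from the diff.

```python
def forward_fill_index(data, index_col, offset):
    """Fill blank cells in the index columns with the last value seen above.

    Hypothetical standalone helper; `data` is a list of row lists and
    `offset` is the first data row after the header row(s).
    """
    for col in index_col:
        last = data[offset][col]

        for row in range(offset + 1, len(data)):
            if data[row][col] == '' or data[row][col] is None:
                data[row][col] = last
            else:
                # Assumed: a non-blank cell becomes the new fill value.
                last = data[row][col]
    return data


rows = [
    ["k1", "k2", "val"],  # header row, so offset = 1 + header = 1
    ["a", "x", 1],
    ["", "y", 2],
    [None, "", 3],
    ["b", "z", 4],
]
filled = forward_fill_index(rows, index_col=[0, 1], offset=1)
print(filled[1:])
# [['a', 'x', 1], ['a', 'y', 2], ['a', 'y', 3], ['b', 'z', 4]]
```

Blank cells of this kind typically come from merged index cells in the spreadsheet, where only the first row of each group carries the label.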
@@ -693,11 +659,14 @@ def _parse_cell(cell_contents, cell_typ):
thousands=thousands,
comment=comment,
skipfooter=skipfooter,
usecols=usecols,
**kwds)

output[asheetname] = parser.read(nrows=nrows)

if names is not None:
output[asheetname].columns = names

if not squeeze or isinstance(output[asheetname], DataFrame):
output[asheetname].columns = output[
asheetname].columns.set_names(header_names)
@@ -726,6 +695,97 @@ def __exit__(self, exc_type, exc_value, traceback):
self.close()


def _excel2num(x):
    """
    Convert Excel column name like 'AB' to 0-based column index.

    Parameters
    ----------
    x : str
        The Excel column name to convert to a 0-based column index.

    Returns
    -------
    num : int
        The column index corresponding to the name.

    Raises
    ------
    ValueError
        Part of the Excel column name was invalid.
    """
    index = 0

    for c in x.upper().strip():
        cp = ord(c)

        if cp < ord("A") or cp > ord("Z"):
            raise ValueError("Invalid column name: {x}".format(x=x))

        index = index * 26 + cp - ord("A") + 1

    return index - 1
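
To see what the added validation changes, the removed `reduce`-based one-liner and the new loop can be compared side by side. The sketch below copies both bodies from this diff under the hypothetical names `excel2num_old` and `excel2num_new`, and imports `reduce` from `functools` rather than `pandas.compat` so that it runs on its own.

```python
from functools import reduce


def excel2num_old(x):
    # Body of the removed helper: characters are never validated.
    return reduce(lambda s, a: s * 26 + ord(a) - ord("A") + 1,
                  x.upper().strip(), 0) - 1


def excel2num_new(x):
    # Body of the new helper: anything outside A-Z raises ValueError.
    index = 0
    for c in x.upper().strip():
        cp = ord(c)
        if cp < ord("A") or cp > ord("Z"):
            raise ValueError("Invalid column name: {x}".format(x=x))
        index = index * 26 + cp - ord("A") + 1
    return index - 1


print(excel2num_new("A"), excel2num_new("Z"), excel2num_new("AB"))  # 0 25 27
print(excel2num_old("A1"))  # 10: a meaningless index, silently accepted

try:
    excel2num_new("A1")
except ValueError as err:
    print(err)  # Invalid column name: A1
```

The column name is read as a bijective base-26 number (`A` = 1 through `Z` = 26), and the final `- 1` converts it to a 0-based index.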


def _range2cols(areas):
    """
    Convert comma separated list of column names and ranges to indices.

    Parameters
    ----------
    areas : str
        A string containing a sequence of column ranges (or areas).

    Returns
    -------
    cols : list
        A list of 0-based column indices.

    Examples
    --------
    >>> _range2cols('A:E')
    [0, 1, 2, 3, 4]
    >>> _range2cols('A,C,Z:AB')
    [0, 2, 25, 26, 27]
    """
    cols = []

    for rng in areas.split(","):
        if ":" in rng:
            rng = rng.split(":")
            cols.extend(lrange(_excel2num(rng[0]), _excel2num(rng[1]) + 1))
        else:
            cols.append(_excel2num(rng))

    return cols


def _maybe_convert_usecols(usecols):
    """
    Convert `usecols` into a compatible format for parsing in `parsers.py`.

    Parameters
    ----------
    usecols : object
        The use-columns object to potentially convert.

    Returns
    -------
    converted : object
        The compatible format of `usecols`.
    """
    if usecols is None:
        return usecols

    if is_integer(usecols):
        return lrange(usecols + 1)

    if isinstance(usecols, compat.string_types):
        return _range2cols(usecols)

    return usecols
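
The normalization performed by `_range2cols` and `_maybe_convert_usecols` can be exercised without pandas. The following is a Python 3 approximation under stated assumptions: the function names drop the leading underscore to mark them as stand-ins, `lrange` is replaced by `list(range(...))`, and `isinstance` checks against `int` and `str` stand in for `is_integer` and `compat.string_types`.

```python
def excel2num(x):
    # Compact copy of the validating column-letter conversion.
    index = 0
    for c in x.upper().strip():
        if not "A" <= c <= "Z":
            raise ValueError("Invalid column name: {x}".format(x=x))
        index = index * 26 + ord(c) - ord("A") + 1
    return index - 1


def range2cols(areas):
    # Expand 'A,C:E'-style strings into 0-based indices; ranges are inclusive.
    cols = []
    for rng in areas.split(","):
        if ":" in rng:
            parts = rng.split(":")
            cols.extend(range(excel2num(parts[0]), excel2num(parts[1]) + 1))
        else:
            cols.append(excel2num(rng))
    return cols


def maybe_convert_usecols(usecols):
    if usecols is None:
        return usecols
    if isinstance(usecols, int):  # stand-in for pandas' is_integer
        return list(range(usecols + 1))
    if isinstance(usecols, str):  # stand-in for compat.string_types
        return range2cols(usecols)
    # Lists and callables are passed through unchanged.
    return usecols


print(maybe_convert_usecols(2))               # [0, 1, 2]
print(maybe_convert_usecols("A,C:E"))         # [0, 2, 3, 4]
print(maybe_convert_usecols(["foo", "bar"]))  # ['foo', 'bar']
```

Only the Excel-specific forms (an integer meaning "last column" and a letter string) are rewritten; the other forms are left as they are, consistent with `usecols=usecols` now being forwarded to the parser elsewhere in this diff.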


def _validate_freeze_panes(freeze_panes):
if freeze_panes is not None:
if (