Backport PR #45146 on branch 1.4.x (ENH: Allow callable for on_bad_lines in read_csv when engine="python") #45264

Merged
32 changes: 28 additions & 4 deletions doc/source/user_guide/io.rst
@@ -1305,14 +1305,38 @@ You can elect to skip bad lines:
     0  1  2   3
     1  8  9  10

+Or pass a callable function to handle the bad line if ``engine="python"``.
+The bad line will be a list of strings that was split by the ``sep``:
+
+.. code-block:: ipython
+
+    In [29]: external_list = []
+
+    In [30]: def bad_lines_func(line):
+        ...:     external_list.append(line)
+        ...:     return line[-3:]
+
+    In [31]: pd.read_csv(StringIO(data), on_bad_lines=bad_lines_func, engine="python")
+    Out[31]:
+       a  b   c
+    0  1  2   3
+    1  5  6   7
+    2  8  9  10
+
+    In [32]: external_list
+    Out[32]: [['4', '5', '6', '7']]
+
+.. versionadded:: 1.4.0
+
+
 You can also use the ``usecols`` parameter to eliminate extraneous column
 data that appear in some lines but not others:

 .. code-block:: ipython

-    In [30]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])
+    In [33]: pd.read_csv(StringIO(data), usecols=[0, 1, 2])

-    Out[30]:
+    Out[33]:
        a  b   c
     0  1  2   3
     1  4  5   6
@@ -1324,9 +1348,9 @@ fields are filled with ``NaN``.

 .. code-block:: ipython

-    In [31]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])
+    In [34]: pd.read_csv(StringIO(data), names=['a', 'b', 'c', 'd'])

-    Out[31]:
+    Out[34]:
        a  b   c    d
     0  1  2   3  NaN
     1  4  5   6    7
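The callable form documented in the `io.rst` hunk above can be exercised with a short script. This is only a usage sketch: it assumes pandas 1.4.0 or later is installed, and the names `rejected` and `keep_first_three` are hypothetical helpers that do not appear in the PR.

```python
from io import StringIO

import pandas as pd

# Same shape as the guide's example: the second data row has one field too many.
data = "a,b,c\n1,2,3\n4,5,6,7\n8,9,10\n"

rejected = []  # hypothetical: collects every bad line handed to the callable


def keep_first_three(bad_line):
    """Record the bad line (a list of strings), then truncate it to three fields."""
    rejected.append(bad_line)
    return bad_line[:3]


# The callable is only accepted together with engine="python".
kept = pd.read_csv(StringIO(data), on_bad_lines=keep_first_three, engine="python")

# Returning None from the callable drops the bad line instead of repairing it.
dropped = pd.read_csv(
    StringIO(data), on_bad_lines=lambda bad_line: None, engine="python"
)
```

Here `kept` should retain all three rows, with the bad row truncated, while `dropped` should contain only the two well-formed rows.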
1 change: 1 addition & 0 deletions doc/source/whatsnew/v1.4.0.rst
@@ -208,6 +208,7 @@ Other enhancements
 - :meth:`Series.str.split` now supports a ``regex`` argument that explicitly specifies whether the pattern is a regular expression. Default is ``None`` (:issue:`43563`, :issue:`32835`, :issue:`25549`)
 - :meth:`DataFrame.dropna` now accepts a single label as ``subset`` along with array-like (:issue:`41021`)
 - Added :meth:`DataFrameGroupBy.value_counts` (:issue:`43564`)
+- :func:`read_csv` now accepts a ``callable`` function in ``on_bad_lines`` when ``engine="python"`` for custom handling of bad lines (:issue:`5686`)
 - :class:`ExcelWriter` argument ``if_sheet_exists="overlay"`` option added (:issue:`40231`)
 - :meth:`read_excel` now accepts a ``decimal`` argument that allow the user to specify the decimal point when parsing string columns to numeric (:issue:`14403`)
 - :meth:`.GroupBy.mean`, :meth:`.GroupBy.std`, :meth:`.GroupBy.var`, :meth:`.GroupBy.sum` now supports `Numba <http://numba.pydata.org/>`_ execution with the ``engine`` keyword (:issue:`43731`, :issue:`44862`, :issue:`44939`)
6 changes: 5 additions & 1 deletion pandas/io/parsers/python_parser.py
@@ -990,7 +990,11 @@ def _rows_to_cols(self, content: list[list[Scalar]]) -> list[np.ndarray]:
                 actual_len = len(l)

                 if actual_len > col_len:
-                    if (
+                    if callable(self.on_bad_lines):
+                        new_l = self.on_bad_lines(l)
+                        if new_l is not None:
+                            content.append(new_l)
+                    elif (
                         self.on_bad_lines == self.BadLineHandleMethod.ERROR
                         or self.on_bad_lines == self.BadLineHandleMethod.WARN
                     ):
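The branch added to `_rows_to_cols` can be read as the toy model below. It is a simplified sketch, not the pandas implementation: `filter_long_rows` is a hypothetical name, the strings `"error"`, `"warn"` and `"skip"` stand in for the `BadLineHandleMethod` members, and the error mode raises immediately instead of collecting bad lines first.

```python
from typing import Callable, List, Optional, Union

Row = List[str]
Handler = Union[str, Callable[[Row], Optional[Row]]]


def filter_long_rows(rows: List[Row], col_len: int, on_bad_lines: Handler) -> List[Row]:
    """Toy model of how rows with too many fields are dispatched."""
    content: List[Row] = []
    for row in rows:
        if len(row) > col_len:
            if callable(on_bad_lines):
                new_row = on_bad_lines(row)
                # None means "drop this line"; any other value replaces it.
                # As in the hunk, the replacement is not length-checked here.
                if new_row is not None:
                    content.append(new_row)
            elif on_bad_lines == "error":
                # Simplification: the real parser records the line and raises later.
                raise ValueError(f"Expected {col_len} fields, saw {len(row)}")
            # "warn" and "skip" both drop the row in this simplified model.
        else:
            content.append(row)
    return content


rows = [["1", "2"], ["2", "3", "4", "5", "6"], ["3", "4"]]
truncated = filter_long_rows(rows, 2, lambda r: r[:2])
skipped = filter_long_rows(rows, 2, lambda r: None)
```

Because the replacement row is appended without validation, a callable that returns too many fields is only caught further downstream, which is consistent with the `ParserWarning` case described in the docstring.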
23 changes: 20 additions & 3 deletions pandas/io/parsers/readers.py
@@ -9,6 +9,7 @@
 from textwrap import fill
 from typing import (
     Any,
+    Callable,
     NamedTuple,
 )
 import warnings
@@ -354,7 +355,7 @@
     .. deprecated:: 1.3.0
        The ``on_bad_lines`` parameter should be used instead to specify behavior upon
        encountering a bad line instead.
-on_bad_lines : {{'error', 'warn', 'skip'}}, default 'error'
+on_bad_lines : {{'error', 'warn', 'skip'}} or callable, default 'error'
     Specifies what to do upon encountering a bad line (a line with too many fields).
     Allowed values are :

@@ -364,6 +365,16 @@

     .. versionadded:: 1.3.0

+        - callable, function with signature
+          ``(bad_line: list[str]) -> list[str] | None`` that will process a single
+          bad line. ``bad_line`` is a list of strings split by the ``sep``.
+          If the function returns ``None``, the bad line will be ignored.
+          If the function returns a new list of strings with more elements than
+          expected, a ``ParserWarning`` will be emitted while dropping extra elements.
+          Only supported when ``engine="python"``
+
+    .. versionadded:: 1.4.0
+
 delim_whitespace : bool, default False
     Specifies whether or not whitespace (e.g. ``' '`` or ``'\t'``) will be
     used as the sep. Equivalent to setting ``sep='\\s+'``. If this option
@@ -1367,7 +1378,7 @@ def _refine_defaults_read(
     sep: str | object,
     error_bad_lines: bool | None,
     warn_bad_lines: bool | None,
-    on_bad_lines: str | None,
+    on_bad_lines: str | Callable | None,
     names: ArrayLike | None | object,
     prefix: str | None | object,
     defaults: dict[str, Any],
@@ -1399,7 +1410,7 @@ def _refine_defaults_read(
         Whether to error on a bad line or not.
     warn_bad_lines : str or None
         Whether to warn on a bad line or not.
-    on_bad_lines : str or None
+    on_bad_lines : str, callable or None
         An option for handling bad lines or a sentinel value(None).
     names : array-like, optional
         List of column names to use. If the file contains a header row,
@@ -1503,6 +1514,12 @@ def _refine_defaults_read(
             kwds["on_bad_lines"] = ParserBase.BadLineHandleMethod.WARN
         elif on_bad_lines == "skip":
             kwds["on_bad_lines"] = ParserBase.BadLineHandleMethod.SKIP
+        elif callable(on_bad_lines):
+            if engine != "python":
+                raise ValueError(
+                    "on_bad_line can only be a callable function if engine='python'"
+                )
+            kwds["on_bad_lines"] = on_bad_lines
         else:
             raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
     else:
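The validation added to `_refine_defaults_read` amounts to an engine check in front of the callable. The sketch below isolates that check under stated assumptions: `resolve_on_bad_lines` and `VALID_MODES` are hypothetical names, and plain strings are returned where pandas stores `BadLineHandleMethod` members. The error messages are copied from the hunk.

```python
from typing import Callable, Union

VALID_MODES = ("error", "warn", "skip")  # stand-ins for BadLineHandleMethod members


def resolve_on_bad_lines(on_bad_lines: Union[str, Callable], engine: str):
    """Hypothetical stand-in for the on_bad_lines branch shown in the hunk above."""
    if isinstance(on_bad_lines, str) and on_bad_lines in VALID_MODES:
        return on_bad_lines
    if callable(on_bad_lines):
        # A callable is only accepted by the Python engine.
        if engine != "python":
            raise ValueError(
                "on_bad_line can only be a callable function if engine='python'"
            )
        return on_bad_lines
    raise ValueError(f"Argument {on_bad_lines} is invalid for on_bad_lines")
```

Rejecting the callable at argument-validation time means that a caller using another engine gets a `ValueError` up front, before any parsing starts.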
131 changes: 130 additions & 1 deletion pandas/tests/io/parser/test_python_parser_only.py
@@ -4,6 +4,7 @@
 these tests out of this module as soon as the C parser can accept further
 arguments when parsing.
 """
+from __future__ import annotations

 import csv
 from io import (
@@ -13,7 +14,10 @@

 import pytest

-from pandas.errors import ParserError
+from pandas.errors import (
+    ParserError,
+    ParserWarning,
+)

 from pandas import (
     DataFrame,
@@ -329,3 +333,128 @@ def readline(self):
             return self.data

     parser.read_csv(NoNextBuffer("a\n1"))
+
+
+@pytest.mark.parametrize("bad_line_func", [lambda x: ["2", "3"], lambda x: x[:2]])
+def test_on_bad_lines_callable(python_parser_only, bad_line_func):
+    # GH 5686
+    parser = python_parser_only
+    data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+    bad_sio = StringIO(data)
+    result = parser.read_csv(bad_sio, on_bad_lines=bad_line_func)
+    expected = DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
+    tm.assert_frame_equal(result, expected)
+
+
+def test_on_bad_lines_callable_write_to_external_list(python_parser_only):
+    # GH 5686
+    parser = python_parser_only
+    data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+    bad_sio = StringIO(data)
+    lst = []
+
+    def bad_line_func(bad_line: list[str]) -> list[str]:
+        lst.append(bad_line)
+        return ["2", "3"]
+
+    result = parser.read_csv(bad_sio, on_bad_lines=bad_line_func)
+    expected = DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
+    tm.assert_frame_equal(result, expected)
+    assert lst == [["2", "3", "4", "5", "6"]]
+
+
+@pytest.mark.parametrize("bad_line_func", [lambda x: ["foo", "bar"], lambda x: x[:2]])
+@pytest.mark.parametrize("sep", [",", "111"])
+def test_on_bad_lines_callable_iterator_true(python_parser_only, bad_line_func, sep):
+    # GH 5686
+    # iterator=True has a separate code path than iterator=False
+    parser = python_parser_only
+    data = f"""
+0{sep}1
+hi{sep}there
+foo{sep}bar{sep}baz
+good{sep}bye
+"""
+    bad_sio = StringIO(data)
+    result_iter = parser.read_csv(
+        bad_sio, on_bad_lines=bad_line_func, chunksize=1, iterator=True, sep=sep
+    )
+    expecteds = [
+        {"0": "hi", "1": "there"},
+        {"0": "foo", "1": "bar"},
+        {"0": "good", "1": "bye"},
+    ]
+    for i, (result, expected) in enumerate(zip(result_iter, expecteds)):
+        expected = DataFrame(expected, index=range(i, i + 1))
+        tm.assert_frame_equal(result, expected)
+
+
+def test_on_bad_lines_callable_dont_swallow_errors(python_parser_only):
+    # GH 5686
+    parser = python_parser_only
+    data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+    bad_sio = StringIO(data)
+    msg = "This function is buggy."
+
+    def bad_line_func(bad_line):
+        raise ValueError(msg)
+
+    with pytest.raises(ValueError, match=msg):
+        parser.read_csv(bad_sio, on_bad_lines=bad_line_func)
+
+
+def test_on_bad_lines_callable_not_expected_length(python_parser_only):
+    # GH 5686
+    parser = python_parser_only
+    data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+    bad_sio = StringIO(data)
+
+    with tm.assert_produces_warning(ParserWarning, match="Length of header or names"):
+        result = parser.read_csv(bad_sio, on_bad_lines=lambda x: x)
+    expected = DataFrame({"a": [1, 2, 3], "b": [2, 3, 4]})
+    tm.assert_frame_equal(result, expected)
+
+
+def test_on_bad_lines_callable_returns_none(python_parser_only):
+    # GH 5686
+    parser = python_parser_only
+    data = """a,b
+1,2
+2,3,4,5,6
+3,4
+"""
+    bad_sio = StringIO(data)
+
+    result = parser.read_csv(bad_sio, on_bad_lines=lambda x: None)
+    expected = DataFrame({"a": [1, 3], "b": [2, 4]})
+    tm.assert_frame_equal(result, expected)
+
+
+def test_on_bad_lines_index_col_inferred(python_parser_only):
+    # GH 5686
+    parser = python_parser_only
+    data = """a,b
+1,2,3
+4,5,6
+"""
+    bad_sio = StringIO(data)
+
+    result = parser.read_csv(bad_sio, on_bad_lines=lambda x: ["99", "99"])
+    expected = DataFrame({"a": [2, 5], "b": [3, 6]}, index=[1, 4])
+    tm.assert_frame_equal(result, expected)
12 changes: 12 additions & 0 deletions pandas/tests/io/parser/test_unsupported.py
@@ -149,3 +149,15 @@ def test_pyarrow_engine(self):
                 kwargs[default] = "warn"
             with pytest.raises(ValueError, match=msg):
                 read_csv(StringIO(data), engine="pyarrow", **kwargs)
+
+    def test_on_bad_lines_callable_python_only(self, all_parsers):
+        # GH 5686
+        sio = StringIO("a,b\n1,2")
+        bad_lines_func = lambda x: x
+        parser = all_parsers
+        if all_parsers.engine != "python":
+            msg = "on_bad_line can only be a callable function if engine='python'"
+            with pytest.raises(ValueError, match=msg):
+                parser.read_csv(sio, on_bad_lines=bad_lines_func)
+        else:
+            parser.read_csv(sio, on_bad_lines=bad_lines_func)