ARROW-2432: [Python] Fix Pandas decimal type conversion with None values #1878

Closed
9 changes: 5 additions & 4 deletions cpp/src/arrow/python/decimal.cc
@@ -184,14 +184,15 @@ Status DecimalMetadata::Update(int32_t suggested_precision, int32_t suggested_sc
}

Status DecimalMetadata::Update(PyObject* object) {
DCHECK(PyDecimal_Check(object)) << "Object is not a Python Decimal";
bool is_decimal = PyDecimal_Check(object);
Member:

I don't think it's ok to do this in an optimized build. DecimalMetadata expects you to pass a decimal object. @cpcloud may confirm.

Member Author:

This isn't strictly necessary because I added a check before calling Update, but it does prevent a segfault if for some reason it's called with non-Decimal objects - which is not nice to get. If it hurts an optimization though, I can remove it.

Member:

Right now we are doing the check twice in optimized builds, which is not nice IMHO. DecimalMetadata::Update is a private API so it's up to the caller to provide appropriate input.

Member Author:

So you mean remove PyDecimal_Check altogether? This is only called when the type is not specified by the user, and then yes, it will end up doing two passes over the objects, checking both times whether they are decimal. It might be possible to do fewer checks on the second pass if we keep a list of which ones are decimal objects, but I'm not sure that would be worth it.

Member:

Fair enough, we can optimize later if we find it too slow. The conversion itself is very slow anyway :-)

DCHECK(is_decimal) << "Object is not a Python Decimal";

if (ARROW_PREDICT_FALSE(PyDecimal_ISNAN(object))) {
if (ARROW_PREDICT_FALSE(!is_decimal || PyDecimal_ISNAN(object))) {
return Status::OK();
}

int32_t precision;
int32_t scale;
int32_t precision = 0;
int32_t scale = 0;
RETURN_NOT_OK(InferDecimalPrecisionAndScale(object, &precision, &scale));
return Update(precision, scale);
}
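The precision and scale that Update feeds into the running maximum come from InferDecimalPrecisionAndScale. A rough pure-Python sketch of that inference, using `decimal.Decimal.as_tuple()` - the helper name and the exact handling of leading zeros are illustrative, not Arrow's actual C++ algorithm:

```python
from decimal import Decimal

def infer_precision_and_scale(value):
    # Illustrative sketch: scale is the number of fractional digits,
    # precision is at least the number of significant digits.
    _sign, digits, exponent = value.as_tuple()
    scale = -exponent if exponent < 0 else 0
    precision = max(len(digits), scale)
    return precision, scale

print(infer_precision_and_scale(Decimal('3.14')))  # (3, 2)
```

This matches the inferred type asserted in the new test below, where `Decimal('3.14')` round-trips as `decimal128(3, 2)`.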
25 changes: 12 additions & 13 deletions cpp/src/arrow/python/numpy_to_arrow.cc
@@ -743,7 +743,9 @@ Status NumPyConverter::ConvertDecimals() {

if (type_ == NULLPTR) {
for (PyObject* object : objects) {
RETURN_NOT_OK(max_decimal_metadata.Update(object));
if (!internal::PandasObjectIsNull(object)) {
Member:

Do we care about accepting other NULL-like objects such as float('nan')? Otherwise object != Py_None is a much faster check.

Member Author:

I'm not sure - is it possible to get NaNs from operations on Decimals? Or is that something the user might mix in somehow?

Contributor:

Python decimal objects can be NaN, unfortunately:

>>> import decimal
>>> decimal.Decimal('nan')
Decimal('NaN')

Member Author:

Seems like it could be NaN also:

In [5]: s1 = pd.Series([Decimal('1.0'), Decimal('2.0')])

In [6]: s2 = pd.Series([Decimal('2.0'), None])

In [7]: s1 / s2
Out[7]: 
0    0.5
1    NaN
dtype: object
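As the snippets above show, an object column can contain None, Decimal('NaN'), and float('nan') side by side, which is why a plain `object != Py_None` check isn't enough. A stdlib sketch of the null-like cases covered between PandasObjectIsNull and PyDecimal_ISNAN on the C++ side (`is_null_like` is a hypothetical helper name):

```python
from decimal import Decimal

def is_null_like(obj):
    # Hypothetical helper mirroring the None/NaN cases discussed above.
    if obj is None:
        return True
    if isinstance(obj, Decimal):
        return obj.is_nan()
    if isinstance(obj, float):
        return obj != obj  # NaN is the only float unequal to itself
    return False

print([is_null_like(x) for x in
       [Decimal('1.0'), Decimal('NaN'), None, float('nan')]])
# [False, True, True, True]
```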

RETURN_NOT_OK(max_decimal_metadata.Update(object));
}
}

type_ =
Member:

By the way, what happens here if all items are None? Do we have a test for that?

Member Author:

I'll add that

Member Author:

done

@@ -758,22 +760,19 @@ Status NumPyConverter::ConvertDecimals() {
for (PyObject* object : objects) {
const int is_decimal = PyObject_IsInstance(object, decimal_type_.obj());

if (ARROW_PREDICT_FALSE(is_decimal == 0)) {
if (is_decimal == 1) {
Decimal128 value;
RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, &value));
RETURN_NOT_OK(builder.Append(value));
} else if (is_decimal == 0 && internal::PandasObjectIsNull(object)) {
Member:

Same question as above: do we care about other NULL-like values than simply None?

RETURN_NOT_OK(builder.AppendNull());
} else {
// PyObject_IsInstance could error and set an exception
RETURN_IF_PYERROR();
std::stringstream ss;
ss << "Error converting from Python objects to Decimal: ";
RETURN_NOT_OK(InvalidConversion(object, "decimal.Decimal", &ss));
return Status::Invalid(ss.str());
} else if (ARROW_PREDICT_FALSE(is_decimal == -1)) {
DCHECK_NE(PyErr_Occurred(), nullptr);
RETURN_IF_PYERROR();
}

if (internal::PandasObjectIsNull(object)) {
RETURN_NOT_OK(builder.AppendNull());
} else {
Decimal128 value;
RETURN_NOT_OK(internal::DecimalFromPythonDecimal(object, decimal_type, &value));
RETURN_NOT_OK(builder.Append(value));
}
}
return PushBuilderResult(&builder);
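The restructured loop in ConvertDecimals first classifies each object and only then appends. Its branching can be sketched in Python (`classify` is an illustrative name, and `None` stands in for the broader PandasObjectIsNull check):

```python
from decimal import Decimal

def classify(obj):
    # Illustrative sketch of the per-object branching in ConvertDecimals.
    if isinstance(obj, Decimal):
        return 'append'       # DecimalFromPythonDecimal + builder.Append
    if obj is None:           # stand-in for internal::PandasObjectIsNull
        return 'append_null'  # builder.AppendNull
    # anything else is a conversion error (the PyObject_IsInstance == -1
    # case maps to propagating the pending Python exception)
    raise TypeError('Error converting from Python objects to Decimal: %r' % (obj,))

print([classify(x) for x in [Decimal('1.5'), None]])
# ['append', 'append_null']
```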
32 changes: 23 additions & 9 deletions python/pyarrow/tests/test_convert_pandas.py
@@ -80,9 +80,15 @@ def _check_pandas_roundtrip(df, expected=None, nthreads=1,
else False))


def _check_series_roundtrip(s, type_=None):
def _check_series_roundtrip(s, type_=None, expected_pa_type=None):
arr = pa.array(s, from_pandas=True, type=type_)

if type_ is not None and expected_pa_type is None:
expected_pa_type = type_

if expected_pa_type is not None:
assert arr.type == expected_pa_type

result = pd.Series(arr.to_pandas(), name=s.name)
if patypes.is_timestamp(arr.type) and arr.type.tz is not None:
result = (result.dt.tz_localize('utc')
@@ -1149,19 +1155,15 @@ def test_fixed_size_bytes_does_not_accept_varying_lengths(self):

def test_variable_size_bytes(self):
s = pd.Series([b'123', b'', b'a', None])
arr = pa.Array.from_pandas(s, type=pa.binary())
assert arr.type == pa.binary()
_check_series_roundtrip(s, type_=pa.binary())

def test_binary_from_bytearray(self):
s = pd.Series([bytearray(b'123'), bytearray(b''), bytearray(b'a')])
s = pd.Series([bytearray(b'123'), bytearray(b''), bytearray(b'a'),
None])
# Explicitly set type
arr = pa.Array.from_pandas(s, type=pa.binary())
assert arr.type == pa.binary()
# Infer type from bytearrays
arr = pa.Array.from_pandas(s)
assert arr.type == pa.binary()
_check_series_roundtrip(s, type_=pa.binary())
# Infer type from bytearrays
_check_series_roundtrip(s, expected_pa_type=pa.binary())

def test_table_empty_str(self):
values = ['', '', '', '', '']
@@ -1326,6 +1328,18 @@ def test_decimal_with_different_precisions(self):
expected = [decimal.Decimal('0.01000'), decimal.Decimal('0.00100')]
assert array.to_pylist() == expected

def test_decimal_with_None_explicit_type(self):
series = pd.Series([decimal.Decimal('3.14'), None])
_check_series_roundtrip(series, type_=pa.decimal128(12, 5))

# Test that having all None values still produces decimal array
series = pd.Series([None] * 2)
_check_series_roundtrip(series, type_=pa.decimal128(12, 5))

def test_decimal_with_None_infer_type(self):
series = pd.Series([decimal.Decimal('3.14'), None])
_check_series_roundtrip(series, expected_pa_type=pa.decimal128(3, 2))


class TestListTypes(object):
"""