ARROW-3428: [Python] Fix from_pandas conversion from float to bool #2698

BryanCutler · 2018-10-03T23:36:19Z

When from_pandas converts data to boolean, each value is read as a uint8_t and then checked. When the values are floating-point numbers, only part of each value's bits are checked, which can cause incorrect results.
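To make the failure mode concrete, here is a minimal sketch in plain Python. It is not the code in numpy_to_arrow.cc; the helper names are hypothetical, and it assumes little-endian float64 input where only the first byte of each value is inspected.

```python
import struct


def first_byte_truthiness(value: float) -> bool:
    """Mimic the suspected bug: inspect only the first byte of a float64."""
    raw = struct.pack("<d", value)  # little-endian IEEE 754 double, 8 bytes
    return raw[0] != 0              # the other 7 bytes are never examined


def full_value_truthiness(value: float) -> bool:
    """Reference behaviour: any nonzero float is True."""
    return value != 0.0


for v in (0.0, 1.0, 2.0, 0.1):
    print(v, first_byte_truthiness(v), full_value_truthiness(v))
```

Under these assumptions, 1.0 is stored as eight bytes whose first byte is zero, so the byte-wise check reports False even though the value is nonzero.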

BryanCutler · 2018-10-03T23:36:46Z

cpp/src/arrow/python/numpy_to_arrow.cc

The problem seems to be reading values as uint8_t

BryanCutler · 2018-10-03T23:38:04Z

cpp/src/arrow/python/numpy_to_arrow.cc

Removing this to allow the casting kernel to do the conversion makes my test pass, but fails lots of others, so this is not the right fix.

BryanCutler · 2018-10-03T23:41:03Z

@wesm and @pitrou, I got as far as figuring out the problem, but haven't been able to come up with a good fix. I thought I could use the casting kernel to do the conversion, but it causes other problems, so maybe that is not the right approach. Any suggestions?

pitrou · 2018-10-04T11:39:22Z

> I thought I could use the casting kernel to do the conversion, but it causes other problems so maybe that is not the right approach

What are the other problems exactly? This does sound like the right approach to me.

BryanCutler · 2018-10-04T16:59:56Z

Thanks @pitrou, it leads to a number of other test failures, but I'm not sure why. If this seems like the right approach, I'll look into it further.

BryanCutler · 2018-10-05T19:18:45Z

> What are the other problems exactly? This does sound like the right approach to me.

OK, the issue was that the NumPy bool data, which uses 1 byte per value, needs to be converted to a new bitmap before any casting is done. This also fixes conversion from NumPy bools to other numeric types. @pitrou and @wesm, please take a look when you can, thanks!
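As a rough illustration of that conversion step, the sketch below packs one-byte booleans into a bitmap using least-significant-bit-first ordering, which is the bit order Arrow uses for boolean data. The function name is hypothetical and this is a straightforward per-bit loop, not the code from this PR.

```python
def bytes_to_bitmap(values: bytes) -> bytes:
    """Pack one-byte booleans into an LSB-first bitmap (1 bit per value)."""
    bitmap = bytearray((len(values) + 7) // 8)  # round up to whole bytes
    for i, v in enumerate(values):
        if v != 0:
            # Value i lives in byte i // 8, at bit position i % 8.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


packed = bytes_to_bitmap(bytes([1, 0, 1, 1, 0, 0, 0, 0, 1]))
print(packed.hex())
```

Nine input bytes become two bitmap bytes; a cast that expects bit-packed input would misread the original one-byte-per-value buffer.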

BryanCutler · 2018-10-05T19:20:20Z

cpp/src/arrow/python/numpy_to_arrow.cc

It doesn't really make sense to convert from bool to date, but it is better to reuse the same code to prevent any weird errors, just in case.

BryanCutler · 2018-10-05T19:21:34Z

python/pyarrow/tests/test_convert_pandas.py

I think this shouldn't raise an error if the type was specified; I can look at that in another PR.

codecov-io · 2018-10-05T20:02:27Z

Codecov Report

Merging #2698 into master will increase coverage by 1.03%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master    #2698      +/-   ##
==========================================
+ Coverage   87.48%   88.52%   +1.03%     
==========================================
  Files         402      341      -61     
  Lines       61401    57625    -3776     
==========================================
- Hits        53718    51012    -2706     
+ Misses       7609     6613     -996     
+ Partials       74        0      -74

Impacted Files	Coverage Δ
cpp/src/arrow/python/type_traits.h	`83.33% <ø> (ø)`	⬆️
python/pyarrow/tests/test_convert_pandas.py	`95.06% <100%> (+0.07%)`	⬆️
cpp/src/arrow/compute/compute-test.cc	`99.37% <100%> (ø)`	⬆️
cpp/src/arrow/python/numpy_to_arrow.cc	`94.06% <100%> (+0.52%)`	⬆️
rust/src/record_batch.rs
go/arrow/datatype_nested.go
rust/src/util/bit_util.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/internal/bitutil/bitutil.go
... and 55 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 96affdc...4e493cd.

pitrou

Thanks for the changes! Just two comments below.

pitrou · 2018-10-10T10:38:32Z

cpp/src/arrow/python/numpy_to_arrow.cc

Add a comment or docstring here?

pitrou · 2018-10-10T10:40:49Z

cpp/src/arrow/python/numpy_to_arrow.cc

This loop would probably be faster if replaced with GenerateBits (see bit-util.h).

Yes, let's definitely do that

Ok, I'll give that a shot
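For readers unfamiliar with the idea, here is a loose Python rendering of a generator-driven bit writer in the spirit of the GenerateBits helper mentioned above. The signature and names are assumptions made for illustration, not the actual API in bit-util.h. The point is that each output byte is accumulated in a local variable and stored once, instead of doing a read-modify-write on the bitmap for every bit.

```python
from typing import Callable


def generate_bits(bitmap: bytearray, start_offset: int, length: int,
                  g: Callable[[], bool]) -> None:
    """Write `length` bits from g() into `bitmap`, LSB first, from bit `start_offset`.

    Hypothetical sketch: bits after the written range in the last byte are cleared.
    """
    if length == 0:
        return
    byte_index = start_offset // 8
    bit_mask = 1 << (start_offset % 8)
    current = bitmap[byte_index] & (bit_mask - 1)  # keep bits before the start
    for _ in range(length):
        if g():
            current |= bit_mask
        bit_mask <<= 1
        if bit_mask == 0x100:           # byte is full: store it once and move on
            bitmap[byte_index] = current
            byte_index += 1
            bit_mask = 1
            current = 0
    if bit_mask != 1:                   # flush a partially filled final byte
        bitmap[byte_index] = current


source = iter([True, False, True, True, False, False, False, False, True])
out = bytearray(2)
generate_bits(out, 0, 9, lambda: next(source))
print(out.hex())
```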

wesm · 2018-10-10T15:21:32Z

cpp/src/arrow/python/numpy_to_arrow.cc

This is a bit odd. Seems like we should be using a templated approach here

Yeah I thought about that but the input type isn't parameterized until the casting operation is called and this conversion needs to happen before then. Also, I think bool is the only case where we need a conversion like this since we are going from 1 byte to a bitmask, so it seems best to do a simple check. What do you think?

IMHO, we can start with a simple check and refine later if needed.

wesm · 2018-10-10T15:21:51Z

cpp/src/arrow/python/numpy_to_arrow.cc

Yes, let's definitely do that

wesm · 2018-10-10T15:23:00Z

cpp/src/arrow/python/numpy_to_arrow.cc

Seems like a dtype check did not happen. Why was this code path being hit silently? I can take a closer look too, but I'm curious if you know.

The output array is visited and there is no check for the dtype except in the specConvertData specialized for date types. In the case of a boolean output type, it assumed the NumPy data was uint8_t. In the other cases, the dtype is sent to CastBuffer, where the input and output types are parameterized, but it expects the buffer to be in Arrow layout, so bools need to be converted to a bitmask beforehand.

BryanCutler · 2018-10-15T04:39:08Z

cpp/src/arrow/python/numpy_to_arrow.cc

I used the unrolled version, just wondering if there is really any reason to use the other?

Not unless the generate function is heavy (it is inlined several times), which isn't the case here.

BryanCutler · 2018-10-15T04:43:03Z

cpp/src/arrow/python/numpy_to_arrow.cc

I'm not too sure about the details of this. Is it possible for a boolean array to be strided?

All Numpy arrays can be strided, yes.

OK, thanks. This is handled by the Ndarray1DIndexer, right?

Yes (see operator[]).
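As a sketch of what strided access means here: element i of a strided 1-D array sits at byte offset i * stride rather than at offset i. The class below is a hypothetical stand-in written for illustration, not the Ndarray1DIndexer implementation.

```python
class StridedIndexer:
    """Hypothetical 1-D view over raw bytes with a fixed byte stride."""

    def __init__(self, buffer: bytes, length: int, stride: int, offset: int = 0):
        self.buffer = buffer
        self.length = length
        self.stride = stride
        self.offset = offset

    def __len__(self) -> int:
        return self.length

    def __getitem__(self, i: int) -> int:
        if not 0 <= i < self.length:
            raise IndexError(i)
        # Element i is `stride` bytes after element i - 1.
        return self.buffer[self.offset + i * self.stride]


# Booleans interleaved with unrelated bytes (9), as in a column of a 2-D array.
raw = bytes([1, 9, 0, 9, 1, 9])
view = StridedIndexer(raw, length=3, stride=2)
print([view[i] for i in range(len(view))])
```

Reading the same buffer as if it were contiguous would pick up the interleaved bytes, which is why the indexer has to apply the stride on every access.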

BryanCutler · 2018-10-15T04:44:20Z

Thanks for the review @pitrou and @wesm ! I had a couple of questions, if you could please take another look, thanks!

wesm · 2018-10-15T13:44:34Z

Will review again -- would like to approve this before it is merged

wesm · 2018-11-11T22:30:57Z

Sorry to be delayed on this. I will rebase and review, then merge

BryanCutler · 2018-11-13T05:43:24Z

> Sorry to be delayed on this. I will rebase and review, then merge

No problem, thanks @wesm !

BryanCutler · 2018-12-03T22:39:39Z

@wesm do you think you will have time to look at this before 0.12.0 is cut? It fixes a data corruption problem that is very easy for Spark users to inadvertently cause, so it would be great to get in this release if possible.

wesm · 2018-12-03T22:42:35Z

I'm planning to look at it and will get back to you.

convert numpy bool to arrow bools before cast, add from bool tests; call PrepareInputData in date conversion; using GenerateBits for better performance

wesm

+1. This looks fine after the rebase. Thank you @BryanCutler!

BryanCutler · 2019-01-10T17:40:38Z

Thanks @wesm and @pitrou for reviewing!

BryanCutler mentioned this pull request Oct 3, 2018

[SPARK-25461][PySpark][SQL] Add document for mismatch between return type of Pandas.Series and return type of pandas udf apache/spark#22610

Closed

added test with fix that passes, but fails other tests

f3d4726

wesm force-pushed the python-from_pandas-float-to-bool-ARROW-3428 branch from 8cd543d to f3d4726 Compare January 4, 2019 21:27

wesm approved these changes Jan 10, 2019

wesm closed this in 2b361fb Jan 10, 2019

asfimport mentioned this pull request Jan 10, 2019

[Python] from_pandas gives incorrect results when converting floating point to bool #19753

Closed
