ARROW-3428: [Python] Fix from_pandas conversion from float to bool #2698
The problem seems to be reading values as uint8_t
Removing this to allow the casting kernel to do the conversion makes my test pass, but fails lots of others, so this is not the right fix.
What are the other problems exactly? This does sound like the right approach to me.
Thanks @pitrou, it leads to a number of other test failures, but I'm not sure why. If this seems like the right approach I'll look into it further.
Ok, the issue was that the NumPy bool data, which is 1 byte per value, needs to be converted to a new bitmap before any casting is done. This also fixes converting from NumPy bools to other numeric types. @pitrou and @wesm please take a look when you can, thanks!
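To illustrate the layout difference described above, here is a minimal Python sketch (not the C++ code from this PR). It assumes NumPy 1.17 or later for the `bitorder` argument and uses `np.packbits` only as a stand-in for the conversion from one byte per boolean to a least-significant-bit-first bitmap like the one Arrow uses.

```python
import numpy as np

# NumPy stores one boolean per byte.
flags = np.array([True, False, True, True, False, False, False, False, True, True])
assert flags.itemsize == 1

# Pack eight booleans into each byte, least-significant bit first
# (bitorder="little" matches the bit ordering of an Arrow bitmap).
bitmap = np.packbits(flags, bitorder="little")

# Unpacking the first len(flags) bits recovers the original values.
roundtrip = np.unpackbits(bitmap, bitorder="little")[: len(flags)].astype(bool)

print(bitmap.tolist())     # two bytes hold all ten values
print(roundtrip.tolist())
```

Ten one-byte booleans shrink to two bitmap bytes, which is why a kernel expecting the packed layout cannot be handed the NumPy buffer directly.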
It doesn't really make sense to convert from bool to date, but it's better to reuse the same code to prevent any weird errors, just in case.
I think this shouldn't raise an error if the type was specified. I can look at that in another PR.
Codecov Report
@@ Coverage Diff @@
## master #2698 +/- ##
==========================================
+ Coverage 87.48% 88.52% +1.03%
==========================================
Files 402 341 -61
Lines 61401 57625 -3776
==========================================
- Hits 53718 51012 -2706
+ Misses 7609 6613 -996
+ Partials 74 0 -74
pitrou
left a comment
Thanks for the changes! Just two comments below.
Add a comment or docstring here?
Sure.
Done.
This loop would probably be faster if replaced with GenerateBits (see bit-util.h).
Yes, let's definitely do that
Ok, I'll give that a shot
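As a rough picture of what a generator-driven bit packer does, here is a hypothetical Python analog. The name `generate_bits` and its signature are assumptions made for this sketch and do not reproduce the C++ helper in bit-util.h. A callback is asked for one boolean at a time and the results are written LSB-first into a byte buffer.

```python
def generate_bits(length, next_bit):
    """Pack `length` booleans produced by `next_bit()` into LSB-first bytes.

    Hypothetical helper for illustration only.
    """
    out = bytearray((length + 7) // 8)  # round up to whole bytes
    for i in range(length):
        if next_bit():
            out[i // 8] |= 1 << (i % 8)
    return bytes(out)


# Treat any non-zero source value as True, as a bool conversion would.
values = iter([1, 0, 0, 1, 1, 0, 0, 0, 1])
packed = generate_bits(9, lambda: next(values) != 0)
print(list(packed))
```

A compiled version can gain speed by building each output byte in a register and storing it once, rather than updating memory for every bit as this loop does.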
This is a bit odd. Seems like we should be using a templated approach here
Yeah I thought about that but the input type isn't parameterized until the casting operation is called and this conversion needs to happen before then. Also, I think bool is the only case where we need a conversion like this since we are going from 1 byte to a bitmask, so it seems best to do a simple check. What do you think?
IMHO, we can start with a simple check and refine later if needed.
Yes, let's definitely do that
Seems like a dtype check did not happen. Why was this code path being hit silently? I can take a closer look too, but I'm curious if you know.
The output array is visited and there is no check for the dtype except in the specConvertData specialized for date types. In the case of a boolean output type, it assumed the NumPy data was uint8_t. In the other cases, the dtype is sent to the CastBuffer, where the input and output types are parameterized, but it expects the buffer to be in Arrow layout, so bools need to be converted to a bitmask beforehand.
I used the unrolled version, just wondering if there is really any reason to use the other?
Not unless the generate function is heavy (it is inlined several times), which isn't the case here.
I'm not too sure about the details of this. Is it possible for a boolean array to be strided?
All Numpy arrays can be strided, yes.
Ok, thanks. This is handled by the Ndarray1DIndexer right?
Yes (see operator[]).
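For readers unfamiliar with strides, the following small Python sketch shows a strided boolean view and how element `i` is located at byte offset `i * stride` in the underlying buffer. It only illustrates the NumPy side. The manual indexing loop is an assumption about what a strided indexer has to do, not a copy of the Ndarray1DIndexer code.

```python
import numpy as np

base = np.array([True, False, False, True, True, False], dtype=np.bool_)
every_other = base[::2]  # a view over the same memory, no copy

stride_bytes = every_other.strides[0]            # bytes between consecutive elements
is_contiguous = every_other.flags["C_CONTIGUOUS"]

# Read the view by hand: element i lives at byte offset i * stride_bytes.
raw = base.tobytes()
manual = [raw[i * stride_bytes] != 0 for i in range(len(every_other))]

print(stride_bytes, is_contiguous)
print(manual, every_other.tolist())
```

Code that walks such an array one byte at a time, ignoring the stride, would read the skipped elements too, so any bool-to-bitmap conversion has to index through the stride.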
Will review again -- would like to approve this before it is merged
Sorry to be delayed on this. I will rebase and review, then merge
No problem, thanks @wesm!
@wesm do you think you will have time to look at this before 0.12.0 is cut? It fixes a data corruption problem that is very easy for Spark users to inadvertently cause, so it would be great to get in this release if possible.
I'm planning to look at it, will revert back |
convert numpy bool to arrow bools before cast, add from bool tests
call PrepareInputData in date conversion
using GenerateBits for better performance
8cd543d to f3d4726
wesm
left a comment
+1. This looks fine after the rebase. Thank you @BryanCutler!
When `from_pandas` converts data to boolean, the values are read into a `uint8_t` and then checked. When the values are floating point numbers, not all bits are checked, which can cause incorrect results.
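The effect can be reproduced in spirit with plain NumPy. This is a hedged illustration rather than the actual Arrow code path: it assumes little-endian 8-byte floats (forced here with the `"<f8"` dtype) and assumes that only the first byte of each value is inspected.

```python
import numpy as np

# Explicit little-endian float64 so the byte layout is the same on any machine.
values = np.array([1.0, 0.0, 2.5], dtype="<f8")

# Look only at the first byte of each 8-byte float, as a uint8_t read would.
first_bytes = values.view(np.uint8).reshape(-1, 8)[:, 0]
misread = first_bytes != 0

# Compare the whole value against zero instead.
expected = values != 0

print(misread.tolist())   # every value looks False
print(expected.tolist())  # the non-zero floats are True
```

For these values the low-order byte is zero even when the float is not, so a one-byte check reports `False` for `1.0` and `2.5`.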