Skip to content

Conversation

@BryanCutler
Copy link
Member

When from_pandas converts data to boolean, the values are read into a uint8_t and then checked. When the values are floating point numbers, not all bits are checked which can cause incorrect results.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem seems to be reading values as uint8_t

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this to allow the casting kernel to do the conversion makes my test pass, but fails lots of others, so this is not the right fix.

@BryanCutler
Copy link
Member Author

@wesm and @pitrou , I got as far to figure out the problem, but haven't been able to come up with a good fix. I thought I could use the casting kernel to do the conversion, but it causes other problems so maybe that is not the right approach. Any suggestions?

@pitrou
Copy link
Member

pitrou commented Oct 4, 2018

I thought I could use the casting kernel to do the conversion, but it causes other problems so maybe that is not the right approach

What are the other problems exactly? This does sound like the right approach to me.

@BryanCutler
Copy link
Member Author

thanks @pitrou , it leads to a number of other tests failures, but I'm not sure why. If this seems like the right approach I'll look into it further.

@BryanCutler
Copy link
Member Author

What are the other problems exactly? This does sound like the right approach to me.

Ok, the issue was the numpy bool data, which is 1-byte, needs to be converted to a new bitmap before any casting is done. This also fixes when converting from Numpy bools to other numbers. @pitrou and @wesm please take a look when you can, thanks!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it doesn't really make sense to convert from bool to date, but better reuse the same code to prevent any weird errors, just in case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this shouldn't raise an error if the type was specified, I can look at that in another pr

@codecov-io
Copy link

Codecov Report

Merging #2698 into master will increase coverage by 1.03%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #2698      +/-   ##
==========================================
+ Coverage   87.48%   88.52%   +1.03%     
==========================================
  Files         402      341      -61     
  Lines       61401    57625    -3776     
==========================================
- Hits        53718    51012    -2706     
+ Misses       7609     6613     -996     
+ Partials       74        0      -74
Impacted Files Coverage Δ
cpp/src/arrow/python/type_traits.h 83.33% <ø> (ø) ⬆️
python/pyarrow/tests/test_convert_pandas.py 95.06% <100%> (+0.07%) ⬆️
cpp/src/arrow/compute/compute-test.cc 99.37% <100%> (ø) ⬆️
cpp/src/arrow/python/numpy_to_arrow.cc 94.06% <100%> (+0.52%) ⬆️
rust/src/record_batch.rs
go/arrow/datatype_nested.go
rust/src/util/bit_util.rs
go/arrow/math/uint64_amd64.go
go/arrow/internal/testing/tools/bool.go
go/arrow/internal/bitutil/bitutil.go
... and 55 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 96affdc...4e493cd. Read the comment docs.

Copy link
Member

@pitrou pitrou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the changes! Just two comments below.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a comment or docstring here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop would probably be faster if replaced with GenerateBits (see bit-util.h).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's definitely do that

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, I'll give that a shot

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a bit odd. Seems like we should be using a templated approach here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I thought about that but the input type isn't parameterized until the casting operation is called and this conversion needs to happen before then. Also, I think bool is the only case where we need a conversion like this since we are going from 1 byte to a bitmask, so it seems best to do a simple check. What do you think?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO, we can start with a simple check and refine later if needed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, let's definitely do that

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like a dtype check did not happen, why was this code path hitting silently? I can take a closer look too but curious if you know

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The output array is visited and there is no check for the dtype except in the specConvertData specialized for date types. In the case of a boolean output type, it assumed the numpy data was unit8_t. In the other cases, the dtype is sent to the CastBuffer where the input and output types are parameterized, but it expects the buffer to be in Arrow layout, so for bools it needs to be converted to a bitmask before.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I used the unrolled version, just wondering if there is really any reason to use the other?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not unless the generate function is heavy (it is inlined several times), which isn't the case here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not too sure the details of this - is it possible for a boolean array to be strided?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All Numpy arrays can be strided, yes.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, thanks. This is handled by the Ndarray1DIndexer right?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes (see operator[]).

@BryanCutler
Copy link
Member Author

Thanks for the review @pitrou and @wesm ! I had a couple of questions, if you could please take another look, thanks!

@wesm
Copy link
Member

wesm commented Oct 15, 2018

Will review again -- would like to approve this before it is merged

@wesm
Copy link
Member

wesm commented Nov 11, 2018

Sorry to be delayed on this. I will rebase and review, then merge

@BryanCutler
Copy link
Member Author

Sorry to be delayed on this. I will rebase and review, then merge

No problem, thanks @wesm !

@BryanCutler
Copy link
Member Author

@wesm do you think you will have time to look at this before 0.12.0 is cut? It fixes a data corruption problem that is very easy for Spark users to inadvertently cause, so it would be great to get in this release if possible.

@wesm
Copy link
Member

wesm commented Dec 3, 2018

I'm planning to look at it, will revert back

convert numpy bool to arrow bools before cast, add from bool tests

call PrepareInputData in date conversion

using GenerateBits for better performance
@wesm wesm force-pushed the python-from_pandas-float-to-bool-ARROW-3428 branch from 8cd543d to f3d4726 Compare January 4, 2019 21:27
Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. This looks fine after the rebase. Thank you @BryanCutler!

@wesm wesm closed this in 2b361fb Jan 10, 2019
@BryanCutler
Copy link
Member Author

Thanks @wesm and @pitrou for reviewing!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants