-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-1491: [C++] Add casting implementations from strings to numbers or boolean #1387
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
a13c5e0 to
491e5eb
Compare
491e5eb to
a12eb38
Compare
a12eb38 to
8a6470c
Compare
xhochy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, I would like to see one minor comment why we need the separation for INT8.
| std::function<out_type(const std::string&)> cast_func; | ||
| if (output->type->id() == Type::INT8 || output->type->id() == Type::UINT8) { | ||
| cast_func = [](const std::string& s) { | ||
| return boost::numeric_cast<out_type>(boost::lexical_cast<int>(s)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a comment why this special case is needed?
|
|
||
| #include <boost/algorithm/string.hpp> | ||
| #include <boost/lexical_cast.hpp> | ||
| #include <boost/numeric/conversion/cast.hpp> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to not rely on Boost for this, e.g. are there some alternatives in the STL or that we can access otherwise? I will review the rest in more detail later
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems that boost::numeric_cast and boost::lexical_cast are not replaceable by STL.
STL has std::to_string, but it does not support small size ints.
http://en.cppreference.com/w/cpp/string/basic_string/to_string
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't it be ok in the of small size ints just to upcast them? This should not affect performance as it's a small temporary.
| return boost::numeric_cast<out_type>(boost::lexical_cast<int>(s)); | ||
| }; | ||
| } else { | ||
| cast_func = [](const std::string& s) { return boost::lexical_cast<out_type>(s); }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think C++11 Lambdas actually incur more overhead than an inlined function. We should instead introduce an auxiliary numeric cast functor that does this switch at compile-time (resulting in an inlined function in the inner loop for all possible types) rather than runtime
| if (input_array.null_count() > 0) { | ||
| std::stringstream ss; | ||
| ss << "Failed to cast NA into " << output->type->ToString(); | ||
| ctx->SetStatus(Status(StatusCode::SerializationError, ss.str())); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the input has nulls, then the output should have nulls in the same locations
| ss << "Failed to cast NA into " << output->type->ToString(); | ||
| ctx->SetStatus(Status(StatusCode::SerializationError, ss.str())); | ||
| return; | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the input has nulls, then the output should have nulls in the same locations (like the other cast functions)
| TEST_F(TestCast, StringToNumber) { | ||
| CastOptions options; | ||
|
|
||
| vector<bool> is_valid = {true, true, true, true, true}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you modify the unit tests to propagate nulls?
| CheckCase<StringType, std::string, FloatType, float>(utf8(), v_float, is_valid, | ||
| float32(), e_float, options); | ||
| CheckCase<StringType, std::string, DoubleType, double>(utf8(), v_float, is_valid, | ||
| float64(), e_double, options); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you test with a non-zero offset (e.g. foo->Slice(2))?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wesm It seems that the sliced pattern is already tested in CheckCase method.
https://github.com/Licht-T/arrow/blob/master/cpp/src/arrow/compute/compute-test.cc#L123
|
@Licht-T I will do a bit of work on this patch tomorrow or Friday for further review |
|
Thanks @wesm! I was busy but now I am okay. Would you mind if I help? |
|
Sure please go ahead |
|
@wesm Now, all fixed. |
|
I will review again when I can |
|
This PR looks good besides the dependency on Boost. Probably we need this to get it working but in the longterm, we should get rid of it again. |
| typename std::enable_if<std::is_arithmetic<T>::value && !std::is_same<T, int8_t>::value && | ||
| !std::is_same<T, uint8_t>::value, | ||
| T>::type | ||
| castStringToNumeric(const std::string& s) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalize this function.
| template <typename T> | ||
| typename std::enable_if<std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value, | ||
| T>::type | ||
| castStringToNumeric(const std::string& s) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Capitalize.
|
|
||
| auto out_data = GetMutableValues<out_type>(output, 1); | ||
|
|
||
| std::function<out_type(const std::string&)> cast_func; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this variable used anywhere? It looks like you might've replaced it with the castStringToNumeric function.
|
|
||
| try { | ||
| *out_data++ = castStringToNumeric<out_type>(s); | ||
| } catch (...) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a specific exception that can be caught here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm concerned about propagating the actual error message instead of just saying "Cast from X to Y failed".
|
I'm taking over this PR, will put up a new one based on this one. |
The implementation for numbers uses C++ `istringstream`. This makes casting a bit lenient (it will probably accept whitespace). This is a rewrite of #1387 Author: Antoine Pitrou <antoine@python.org> Closes #2362 from pitrou/ARROW-1491-cast-string-to-number and squashes the following commits: c7db1b0 <Antoine Pitrou> Use trait "enable_if_number" 5a9c9a0 <Antoine Pitrou> Use `istringstream` for locale-independent parsing c84aac8 <Antoine Pitrou> ARROW-1491: Add casting from strings to numbers and booleans
|
Superseded by #2362 |
This closes ARROW-1491.