@Licht-T
Contributor

@Licht-T Licht-T commented Dec 4, 2017

This closes ARROW-1491.

@Licht-T force-pushed the feature-string-to-number-and-bool branch from a13c5e0 to 491e5eb on December 4, 2017 14:07
@Licht-T force-pushed the feature-string-to-number-and-bool branch from 491e5eb to a12eb38 on December 4, 2017 14:55
@Licht-T force-pushed the feature-string-to-number-and-bool branch from a12eb38 to 8a6470c on December 4, 2017 16:16
Member

@xhochy xhochy left a comment

The code looks good. I would like to see a brief comment explaining why INT8 needs separate handling.

std::function<out_type(const std::string&)> cast_func;
if (output->type->id() == Type::INT8 || output->type->id() == Type::UINT8) {
  cast_func = [](const std::string& s) {
    return boost::numeric_cast<out_type>(boost::lexical_cast<int>(s));
Member

Can you add a comment explaining why this special case is needed?
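For readers following the thread, a likely reason for the special case is that `int8_t` and `uint8_t` are usually typedefs of character types, so text-to-number conversions read a single character instead of a number (the Boost.LexicalCast documentation describes this behavior for `lexical_cast<int8_t>`). The sketch below illustrates the same effect with standard streams only. The helper names `NaiveParse` and `WidenedParse` are hypothetical and are not part of this PR, and the behavior shown assumes a platform where `int8_t` is `signed char`.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical helper: extracts directly into T. When T is an 8-bit
// character type, the stream reads one character, not a number.
template <typename T>
T NaiveParse(const std::string& s) {
  std::istringstream iss(s);
  T value{};
  iss >> value;
  return value;
}

// Hypothetical helper: extracts into int first, then narrows to T.
// No range check is done here; this only illustrates the difference.
template <typename T>
T WidenedParse(const std::string& s) {
  std::istringstream iss(s);
  int wide = 0;
  iss >> wide;
  return static_cast<T>(wide);
}
```

With these helpers, `NaiveParse<std::int8_t>("7")` yields the character code of `'7'`, while `WidenedParse<std::int8_t>("7")` yields the number 7.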


#include <boost/algorithm/string.hpp>
#include <boost/lexical_cast.hpp>
#include <boost/numeric/conversion/cast.hpp>
Member

Is it possible to avoid relying on Boost for this? For example, are there alternatives in the STL, or something else we can already access? I will review the rest in more detail later.

Contributor Author

It seems that boost::numeric_cast and boost::lexical_cast cannot be replaced by the STL.
The STL has std::to_string, but it does not support small integer types.
http://en.cppreference.com/w/cpp/string/basic_string/to_string

Member

Wouldn't it be OK, in the case of small integer types, just to upcast them? This should not affect performance, as it's a small temporary.
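The upcasting idea can be sketched with the standard library alone: parse into a wide integer with `std::stoll`, check the range, then narrow. This is a minimal illustration, not the code from this PR. The name `ParseSmallInt` is hypothetical, and the helper is only meant for integer types narrower than `long long`. Note that `std::stoll` skips leading whitespace.

```cpp
#include <cstddef>
#include <cstdint>
#include <limits>
#include <stdexcept>
#include <string>

// Hypothetical helper (not part of the PR): parse via long long, then
// range-check before narrowing to the small integer type T.
template <typename T>
T ParseSmallInt(const std::string& s) {
  std::size_t pos = 0;
  // std::stoll throws std::invalid_argument or std::out_of_range on failure.
  const long long wide = std::stoll(s, &pos);
  if (pos != s.size()) {
    throw std::invalid_argument("trailing characters in: " + s);
  }
  if (wide < std::numeric_limits<T>::min() ||
      wide > std::numeric_limits<T>::max()) {
    throw std::out_of_range("value out of range: " + s);
  }
  return static_cast<T>(wide);
}
```

The wide temporary lives only for the duration of one call, which matches the point above that the upcast should not matter for performance.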

    return boost::numeric_cast<out_type>(boost::lexical_cast<int>(s));
  };
} else {
  cast_func = [](const std::string& s) { return boost::lexical_cast<out_type>(s); };
Member

I think C++11 lambdas actually incur more overhead than an inlined function. We should instead introduce an auxiliary numeric cast functor that makes this switch at compile time (resulting in an inlined function in the inner loop for all possible types) rather than at runtime.
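One way to read this suggestion is a functor template whose 8-bit variant is selected by partial specialization, so the inner loop calls a concrete, inlinable `operator()` instead of going through a `std::function` (whose type erasure is the usual source of the call overhead). The following is a sketch under that assumption. `StringToNumber` and `CastAll` are hypothetical names, and standard streams stand in for the Boost calls used in the PR.

```cpp
#include <cstdint>
#include <limits>
#include <sstream>
#include <stdexcept>
#include <string>
#include <type_traits>
#include <vector>

// Primary template: parse directly into T.
template <typename T, typename Enable = void>
struct StringToNumber {
  T operator()(const std::string& s) const {
    std::istringstream iss(s);
    T value{};
    if (!(iss >> value)) throw std::invalid_argument("cannot parse: " + s);
    return value;
  }
};

// 8-bit specialization, selected at compile time: parse as int, then narrow.
template <typename T>
struct StringToNumber<
    T, typename std::enable_if<std::is_same<T, std::int8_t>::value ||
                               std::is_same<T, std::uint8_t>::value>::type> {
  T operator()(const std::string& s) const {
    const int wide = StringToNumber<int>()(s);
    if (wide < std::numeric_limits<T>::min() ||
        wide > std::numeric_limits<T>::max()) {
      throw std::out_of_range("value out of range: " + s);
    }
    return static_cast<T>(wide);
  }
};

// The loop uses a concrete functor type, so no runtime branch on the
// output type and no std::function indirection is involved.
template <typename T>
std::vector<T> CastAll(const std::vector<std::string>& in) {
  StringToNumber<T> cast;
  std::vector<T> out;
  out.reserve(in.size());
  for (const auto& s : in) out.push_back(cast(s));
  return out;
}
```

For example, `CastAll<std::int8_t>({"1", "-2"})` uses the widened path, while `CastAll<double>({"1.5"})` uses the primary template.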

if (input_array.null_count() > 0) {
  std::stringstream ss;
  ss << "Failed to cast NA into " << output->type->ToString();
  ctx->SetStatus(Status(StatusCode::SerializationError, ss.str()));
Member

If the input has nulls, then the output should have nulls in the same locations.

  ss << "Failed to cast NA into " << output->type->ToString();
  ctx->SetStatus(Status(StatusCode::SerializationError, ss.str()));
  return;
}
Member

If the input has nulls, then the output should have nulls in the same locations (like the other cast functions).
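The null-propagation behavior requested here can be illustrated without any Arrow types. The sketch below uses deliberately simplified, hypothetical column structs (`StringColumn`, `Int32Column`) with a plain `std::vector<bool>` validity mask. It does not use Arrow's actual array or bitmap APIs.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical, simplified columns: values plus a per-slot validity mask.
struct StringColumn {
  std::vector<std::string> values;
  std::vector<bool> is_valid;
};

struct Int32Column {
  std::vector<std::int32_t> values;
  std::vector<bool> is_valid;
};

// Cast strings to int32, keeping nulls in the same positions instead of
// failing when the input contains nulls.
Int32Column CastPropagatingNulls(const StringColumn& in) {
  Int32Column out;
  out.values.resize(in.values.size(), 0);
  out.is_valid = in.is_valid;  // validity is copied unchanged
  for (std::size_t i = 0; i < in.values.size(); ++i) {
    if (!in.is_valid[i]) continue;  // null slot: keep placeholder, skip parsing
    out.values[i] = std::stoi(in.values[i]);
  }
  return out;
}
```

Because null slots are never parsed, whatever bytes sit behind a null entry cannot cause a spurious cast failure.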

TEST_F(TestCast, StringToNumber) {
  CastOptions options;

  vector<bool> is_valid = {true, true, true, true, true};
Member

Can you modify the unit tests to propagate nulls?

  CheckCase<StringType, std::string, FloatType, float>(utf8(), v_float, is_valid,
                                                       float32(), e_float, options);
  CheckCase<StringType, std::string, DoubleType, double>(utf8(), v_float, is_valid,
                                                         float64(), e_double, options);
Member

Can you test with a non-zero offset (e.g. foo->Slice(2))?

Contributor Author

@wesm It seems that the sliced case is already tested in the CheckCase method.
https://github.com/Licht-T/arrow/blob/master/cpp/src/arrow/compute/compute-test.cc#L123

@wesm
Member

wesm commented Dec 20, 2017

@Licht-T I will do a bit of work on this patch tomorrow or Friday for further review.

@Licht-T
Contributor Author

Licht-T commented Dec 25, 2017

Thanks @wesm! I was busy but now I am okay. Would you mind if I help?

@wesm
Member

wesm commented Dec 25, 2017

Sure, please go ahead.

@Licht-T
Contributor Author

Licht-T commented Jan 10, 2018

@wesm Everything is fixed now.

@wesm
Member

wesm commented Jan 10, 2018

I will review again when I can

@xhochy
Member

xhochy commented Jan 14, 2018

This PR looks good apart from the dependency on Boost. We probably need it to get this working, but in the long term we should get rid of it again.

typename std::enable_if<std::is_arithmetic<T>::value && !std::is_same<T, int8_t>::value &&
                            !std::is_same<T, uint8_t>::value,
                        T>::type
castStringToNumeric(const std::string& s) {
Contributor

Capitalize this function.

template <typename T>
typename std::enable_if<std::is_same<T, int8_t>::value || std::is_same<T, uint8_t>::value,
                        T>::type
castStringToNumeric(const std::string& s) {
Contributor

Capitalize.


auto out_data = GetMutableValues<out_type>(output, 1);

std::function<out_type(const std::string&)> cast_func;
Contributor

Is this variable used anywhere? It looks like you might've replaced it with the castStringToNumeric function.


try {
  *out_data++ = castStringToNumeric<out_type>(s);
} catch (...) {
Contributor

Is there a specific exception that can be caught here?

Contributor

I'm concerned about propagating the actual error message instead of just saying "Cast from X to Y failed".
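To make the two points above concrete: with the Boost calls shown earlier in this PR, the likely candidates for specific exception types would be `boost::bad_lexical_cast` and `boost::numeric::bad_numeric_cast`. The sketch below shows the same pattern with the standard library, where `std::stoi` throws `std::invalid_argument` or `std::out_of_range`, and each case produces a more specific message than a generic "cast failed". `TryParseInt` and its message wording are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: returns an empty string on success, otherwise a
// message that says why the cast failed.
std::string TryParseInt(const std::string& s, int* out) {
  try {
    std::size_t pos = 0;
    const int value = std::stoi(s, &pos);
    if (pos != s.size()) {
      return "Cast failed: trailing characters in '" + s + "'";
    }
    *out = value;
    return "";
  } catch (const std::invalid_argument&) {
    // No leading digits could be converted.
    return "Cast failed: '" + s + "' is not a number";
  } catch (const std::out_of_range&) {
    // The text is numeric but does not fit into int.
    return "Cast failed: '" + s + "' is out of range for int";
  }
}
```

Catching the specific types also means that unrelated exceptions, such as `std::bad_alloc`, are not silently swallowed by a `catch (...)`.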

@cpcloud
Contributor

cpcloud commented Feb 16, 2018

I'm taking over this PR, will put up a new one based on this one.

wesm pushed a commit that referenced this pull request Aug 6, 2018
The implementation for numbers uses C++ `istringstream`. This makes casting a bit lenient (it will probably accept whitespace).

This is a rewrite of #1387

Author: Antoine Pitrou <antoine@python.org>

Closes #2362 from pitrou/ARROW-1491-cast-string-to-number and squashes the following commits:

c7db1b0 <Antoine Pitrou> Use trait "enable_if_number"
5a9c9a0 <Antoine Pitrou> Use `istringstream` for locale-independent parsing
c84aac8 <Antoine Pitrou> ARROW-1491:  Add casting from strings to numbers and booleans
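
The commit message above notes that parsing with `istringstream` is somewhat lenient about whitespace. A minimal sketch of that style of parsing follows. It is an illustration under assumptions, not the merged implementation: `ParseWithStream` is a hypothetical name, and imbuing the classic "C" locale is one plausible way to get locale-independent behavior.

```cpp
#include <locale>
#include <sstream>
#include <string>

// Hypothetical helper: parse a number with a stream that uses the classic
// ("C") locale, so the result does not depend on the global locale.
template <typename T>
bool ParseWithStream(const std::string& s, T* out) {
  std::istringstream iss(s);
  iss.imbue(std::locale::classic());
  iss >> *out;  // operator>> skips leading whitespace by default
  // Require that the whole input was consumed, so "12abc" is rejected.
  return !iss.fail() && iss.eof();
}
```

With this helper, `" 42"` is accepted because leading whitespace is skipped, which is the kind of leniency the commit message refers to, while `"12abc"` and `"42 "` are rejected by the end-of-input check.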
@wesm
Member

wesm commented Aug 6, 2018

Superseded by #2362

4 participants