Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce classification analysis runner. #701

Merged
merged 5 commits into from
Oct 4, 2019

Conversation

przemekwitek
Copy link
Contributor

@przemekwitek przemekwitek commented Sep 27, 2019

Make CDataFrameBoostedTreeRunner class abstract and derive two subclasses from it:

  1. CDataFrameRegressionRunner
  2. CDataFrameClassificationRunner

Relates elastic/elasticsearch#46735

@droberts195
Copy link
Contributor

I have labelled this >non-issue on the assumption that it will be release-noted via the Java side PR.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. I left some minor comments.

One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments). It would also be good to add a unit test to CDataFrameAnalyzerTest for this behaviour. You can access the data frame CDataFrameAnalyzer creates to do this.

@@ -173,6 +173,9 @@ class API_EXPORT CDataFrameAnalysisSpecification {
//! 2. disk is used (only one partition needs to be loaded to main memory)
void estimateMemoryUsage(CMemoryUsageEstimationResultJsonWriter& writer) const;

//! Fills in categorical field names for which empty value should be treated as missing.
void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My inclination would be to pass in the actual field names and return a vector bool which is an indicator function for which columns to treat empty as missing. This will make the check much more efficient, i.e.

Suggested change
void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const;
TBoolVec columnsForWhichEmptyIsMissing(const TStrVec& fieldNames) const;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

private:
using TBoostedTreeUPtr = std::unique_ptr<maths::CBoostedTree>;
using TBoostedTreeFactoryUPtr = std::unique_ptr<maths::CBoostedTreeFactory>;
using TDataSearcherUPtr = CDataFrameAnalysisSpecification::TDataSearcherUPtr;
using TMemoryEstimator = std::function<void(std::int64_t)>;

protected:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we move before private typedefs as per coding standards

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -373,13 +384,38 @@ void CDataFrameAnalyzer::writeProgress(int progress,
writer.flush();
}

void mapToVector(const TStrSizeUMap& map, TStrVec& vector) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we move these to local namespace at the start of this file, i.e. these should have local linkage.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

std::size_t index = entry.second;
if (index >= vector.size()) {
HANDLE_FATAL(<< "Index out of bounds: " << index);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

move unsafe access in to else in case handle fatal doesn't abort.

Suggested change
}
} else {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

void mapsToVectors(const TStrSizeUMapVec& maps, TStrVecVec& vectors) {
assert(vectors.empty());
vectors.resize(maps.size());
for (std::size_t i = 0; i < maps.size(); i++) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

coding standards

Suggested change
for (std::size_t i = 0; i < maps.size(); i++) {
for (std::size_t i = 0; i < maps.size(); ++i) {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For size_t it makes no difference. However, for a user defined type the compiler is forced to call a different function which would usually create a temporary. Since it is just as easy to type ++i as i++ we just mandate that we always use ++i if the return type isn't used to save having to think about it.

const TStrVecVec&,
TRowRef row,
core::CRapidJsonConcurrentLineWriter& writer) const {
const auto& tree{boostedTree()};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As before

Suggested change
const auto& tree{boostedTree()};
const auto& tree = this->boostedTree();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -104,6 +104,7 @@ class API_EXPORT CDataFrameAnalyzer {
TDataFrameAnalysisSpecificationUPtr m_AnalysisSpecification;
TStrVec m_CategoricalFieldNames;
TStrSizeUMapVec m_CategoricalFieldValues;
std::set<std::string> m_EmptyAsMissing;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
std::set<std::string> m_EmptyAsMissing;
TBoolVec m_EmptyIsMissing;

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 60 to 63
TStrVec emptyAsMissing;
m_AnalysisSpecification->columnsForWhichEmptyIsMissing(emptyAsMissing);
m_EmptyAsMissing =
std::set<std::string>(emptyAsMissing.begin(), emptyAsMissing.end());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As per my suggestion I'd move this into CDataFrameAnalyzer::captureFieldNames

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*columns = fieldToValue(isCategorical[i],
m_CategoricalFieldValues[i], fieldValues[i]);
*columns = fieldToValue(isCategorical[i], m_CategoricalFieldValues[i],
m_EmptyAsMissing.count(m_FieldNames[i]) > 0,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
m_EmptyAsMissing.count(m_FieldNames[i]) > 0,
m_EmptyIsMissing[i],

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

const CDataFrameBoostedTreeRunner::TBoostedTreeUPtr&
CDataFrameBoostedTreeRunner::boostedTree() const {
if (m_BoostedTree == nullptr) {
HANDLE_FATAL(<< "Internal error: boosted tree object missing. Please report this error.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we assert here let's return by constant reference, i.e. const maths::CBoostedTree&.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor Author

@przemekwitek przemekwitek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks pretty good. I left some minor comments.

One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments).

Done.

It would also be good to add a unit test to CDataFrameAnalyzerTest for this behaviour. You can access the data frame CDataFrameAnalyzer creates to do this.

WIP

private:
using TBoostedTreeUPtr = std::unique_ptr<maths::CBoostedTree>;
using TBoostedTreeFactoryUPtr = std::unique_ptr<maths::CBoostedTreeFactory>;
using TDataSearcherUPtr = CDataFrameAnalysisSpecification::TDataSearcherUPtr;
using TMemoryEstimator = std::function<void(std::int64_t)>;

protected:
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -104,6 +104,7 @@ class API_EXPORT CDataFrameAnalyzer {
TDataFrameAnalysisSpecificationUPtr m_AnalysisSpecification;
TStrVec m_CategoricalFieldNames;
TStrSizeUMapVec m_CategoricalFieldValues;
std::set<std::string> m_EmptyAsMissing;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Comment on lines 60 to 63
TStrVec emptyAsMissing;
m_AnalysisSpecification->columnsForWhichEmptyIsMissing(emptyAsMissing);
m_EmptyAsMissing =
std::set<std::string>(emptyAsMissing.begin(), emptyAsMissing.end());
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

void mapsToVectors(const TStrSizeUMapVec& maps, TStrVecVec& vectors) {
assert(vectors.empty());
vectors.resize(maps.size());
for (std::size_t i = 0; i < maps.size(); i++) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)

}

const CDataFrameAnalysisConfigReader CDataFrameClassificationRunner::getParameterReader() {
CDataFrameAnalysisConfigReader theReader =
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -173,6 +173,9 @@ class API_EXPORT CDataFrameAnalysisSpecification {
//! 2. disk is used (only one partition needs to be loaded to main memory)
void estimateMemoryUsage(CMemoryUsageEstimationResultJsonWriter& writer) const;

//! Fills in categorical field names for which empty value should be treated as missing.
void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

*columns = fieldToValue(isCategorical[i],
m_CategoricalFieldValues[i], fieldValues[i]);
*columns = fieldToValue(isCategorical[i], m_CategoricalFieldValues[i],
m_EmptyAsMissing.count(m_FieldNames[i]) > 0,
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

std::size_t index = entry.second;
if (index >= vector.size()) {
HANDLE_FATAL(<< "Index out of bounds: " << index);
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -373,13 +384,38 @@ void CDataFrameAnalyzer::writeProgress(int progress,
writer.flush();
}

void mapToVector(const TStrSizeUMap& map, TStrVec& vector) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

const CDataFrameBoostedTreeRunner::TBoostedTreeUPtr&
CDataFrameBoostedTreeRunner::boostedTree() const {
if (m_BoostedTree == nullptr) {
HANDLE_FATAL(<< "Internal error: boosted tree object missing. Please report this error.");
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -1018,6 +1021,66 @@ void CDataFrameAnalyzerTest::testCategoricalFields() {
}
}

void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() {
void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@przemekwitek
Copy link
Contributor Author

WIP

I've added unit tests. @tveasey, please take another look.

Copy link
Contributor

@tveasey tveasey left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great! Left a couple more minor change requests, but happy for this to go in without any further reviews so I'll go ahead and approve.

@@ -1018,6 +1021,66 @@ void CDataFrameAnalyzerTest::testCategoricalFields() {
}
}

void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() {
void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

return [expected](double actual) { return expected == actual; };
};

auto nan = []() { return [](double actual) { return isnan(actual); }; };
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we make this maths::CDataFrameUtils::isMissing? This should match whatever value we use to indicate missing.

Suggested change
auto nan = []() { return [](double actual) { return isnan(actual); }; };
auto missing = []() { return [](double actual) { return maths::CDataFrameUtils::isMissing(actual); }; };

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

assertRow(8, {eq(3.0), eq(1.0), eq(8.0), eq(8.0), eq(0.0)}, rows[8]);
assertRow(9, {eq(2.0), eq(2.0), eq(9.0), eq(9.0), eq(1.0)}, rows[9]);
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 good test

@@ -18,6 +18,8 @@ class CDataFrameAnalysisRunnerTest : public CppUnit::TestFixture {
void testEstimateMemoryUsage_10();
void testEstimateMemoryUsage_100();
void testEstimateMemoryUsage_1000();
void testColumnsForWhichEmptyIsMissing_Classification();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's stick to camel case we use this in 99.9% of all code (we should fix the other non-standard names in this suite as well, but you don't have to make that change)

Suggested change
void testColumnsForWhichEmptyIsMissing_Classification();
void testColumnsForWhichEmptyIsMissingClassification();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@@ -18,6 +18,8 @@ class CDataFrameAnalysisRunnerTest : public CppUnit::TestFixture {
void testEstimateMemoryUsage_10();
void testEstimateMemoryUsage_100();
void testEstimateMemoryUsage_1000();
void testColumnsForWhichEmptyIsMissing_Classification();
void testColumnsForWhichEmptyIsMissing_Regression();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void testColumnsForWhichEmptyIsMissing_Regression();
void testColumnsForWhichEmptyIsMissingRegression();

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

bool(emptyAsMissing[3]));
}

void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Classification() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Classification() {
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingClassification() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

testColumnsForWhichEmptyIsMissing("classification", true);
}

void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Regression() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Regression() {
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingRegression() {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@przemekwitek
Copy link
Contributor Author

run elasticsearch-ci

@przemekwitek przemekwitek merged commit 0a95ace into elastic:master Oct 4, 2019
@przemekwitek przemekwitek deleted the classification branch October 4, 2019 05:05
przemekwitek added a commit to przemekwitek/ml-cpp that referenced this pull request Oct 4, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants