-
Notifications
You must be signed in to change notification settings - Fork 62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce classification analysis runner. #701
Conversation
885a693
to
cff6d69
Compare
13381dd
to
d6181ba
Compare
I have labelled this |
d6181ba
to
88e02b4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. I left some minor comments.
One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments). It would also be good to add a unit test to CDataFrameAnalyzerTest
for this behaviour. You can access the data frame CDataFrameAnalyzer
creates to do this.
@@ -173,6 +173,9 @@ class API_EXPORT CDataFrameAnalysisSpecification { | |||
//! 2. disk is used (only one partition needs to be loaded to main memory) | |||
void estimateMemoryUsage(CMemoryUsageEstimationResultJsonWriter& writer) const; | |||
|
|||
//! Fills in categorical field names for which empty value should be treated as missing. | |||
void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My inclination would be to pass in the actual field names and return a vector bool which is an indicator function for which columns to treat empty as missing. This will make the check much more efficient, i.e.
void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const; | |
TBoolVec columnsForWhichEmptyIsMissing(const TStrVec& fieldNames) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
private: | ||
using TBoostedTreeUPtr = std::unique_ptr<maths::CBoostedTree>; | ||
using TBoostedTreeFactoryUPtr = std::unique_ptr<maths::CBoostedTreeFactory>; | ||
using TDataSearcherUPtr = CDataFrameAnalysisSpecification::TDataSearcherUPtr; | ||
using TMemoryEstimator = std::function<void(std::int64_t)>; | ||
|
||
protected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: can we move before private typedefs as per coding standards
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
@@ -373,13 +384,38 @@ void CDataFrameAnalyzer::writeProgress(int progress, | |||
writer.flush(); | |||
} | |||
|
|||
void mapToVector(const TStrSizeUMap& map, TStrVec& vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move these to local namespace at the start of this file, i.e. these should have local linkage.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
std::size_t index = entry.second; | ||
if (index >= vector.size()) { | ||
HANDLE_FATAL(<< "Index out of bounds: " << index); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move unsafe access in to else in case handle fatal doesn't abort.
} | |
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
void mapsToVectors(const TStrSizeUMapVec& maps, TStrVecVec& vectors) { | ||
assert(vectors.empty()); | ||
vectors.resize(maps.size()); | ||
for (std::size_t i = 0; i < maps.size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
coding standards
for (std::size_t i = 0; i < maps.size(); i++) { | |
for (std::size_t i = 0; i < maps.size(); ++i) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For size_t it makes no difference. However, for a user defined type the compiler is forced to call a different function which would usually create a temporary. Since it is just as easy to type ++i
as i++
we just mandate that we always use ++i
if the return type isn't used to save having to think about it.
const TStrVecVec&, | ||
TRowRef row, | ||
core::CRapidJsonConcurrentLineWriter& writer) const { | ||
const auto& tree{boostedTree()}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As before
const auto& tree{boostedTree()}; | |
const auto& tree = this->boostedTree(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
include/api/CDataFrameAnalyzer.h
Outdated
@@ -104,6 +104,7 @@ class API_EXPORT CDataFrameAnalyzer { | |||
TDataFrameAnalysisSpecificationUPtr m_AnalysisSpecification; | |||
TStrVec m_CategoricalFieldNames; | |||
TStrSizeUMapVec m_CategoricalFieldValues; | |||
std::set<std::string> m_EmptyAsMissing; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
std::set<std::string> m_EmptyAsMissing; | |
TBoolVec m_EmptyIsMissing; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
TStrVec emptyAsMissing; | ||
m_AnalysisSpecification->columnsForWhichEmptyIsMissing(emptyAsMissing); | ||
m_EmptyAsMissing = | ||
std::set<std::string>(emptyAsMissing.begin(), emptyAsMissing.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As per my suggestion I'd move this into CDataFrameAnalyzer::captureFieldNames
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
*columns = fieldToValue(isCategorical[i], | ||
m_CategoricalFieldValues[i], fieldValues[i]); | ||
*columns = fieldToValue(isCategorical[i], m_CategoricalFieldValues[i], | ||
m_EmptyAsMissing.count(m_FieldNames[i]) > 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
m_EmptyAsMissing.count(m_FieldNames[i]) > 0, | |
m_EmptyIsMissing[i], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
const CDataFrameBoostedTreeRunner::TBoostedTreeUPtr& | ||
CDataFrameBoostedTreeRunner::boostedTree() const { | ||
if (m_BoostedTree == nullptr) { | ||
HANDLE_FATAL(<< "Internal error: boosted tree object missing. Please report this error."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we assert here let's return by constant reference, i.e. const maths::CBoostedTree&.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks pretty good. I left some minor comments.
One thing I think is it would be nice to avoid the string comparisons you do reading every row. Basically, you can do this once if you look up the position of the dependent variable in the feature name strings vector (see comments).
Done.
It would also be good to add a unit test to
CDataFrameAnalyzerTest
for this behaviour. You can access the data frameCDataFrameAnalyzer
creates to do this.
WIP
private: | ||
using TBoostedTreeUPtr = std::unique_ptr<maths::CBoostedTree>; | ||
using TBoostedTreeFactoryUPtr = std::unique_ptr<maths::CBoostedTreeFactory>; | ||
using TDataSearcherUPtr = CDataFrameAnalysisSpecification::TDataSearcherUPtr; | ||
using TMemoryEstimator = std::function<void(std::int64_t)>; | ||
|
||
protected: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
include/api/CDataFrameAnalyzer.h
Outdated
@@ -104,6 +104,7 @@ class API_EXPORT CDataFrameAnalyzer { | |||
TDataFrameAnalysisSpecificationUPtr m_AnalysisSpecification; | |||
TStrVec m_CategoricalFieldNames; | |||
TStrSizeUMapVec m_CategoricalFieldValues; | |||
std::set<std::string> m_EmptyAsMissing; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
TStrVec emptyAsMissing; | ||
m_AnalysisSpecification->columnsForWhichEmptyIsMissing(emptyAsMissing); | ||
m_EmptyAsMissing = | ||
std::set<std::string>(emptyAsMissing.begin(), emptyAsMissing.end()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
void mapsToVectors(const TStrSizeUMapVec& maps, TStrVecVec& vectors) { | ||
assert(vectors.empty()); | ||
vectors.resize(maps.size()); | ||
for (std::size_t i = 0; i < maps.size(); i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
BTW, I thought this kind of optimization is done automatically by the compiler these days.
But of course I'll comply with the standards :)
} | ||
|
||
const CDataFrameAnalysisConfigReader CDataFrameClassificationRunner::getParameterReader() { | ||
CDataFrameAnalysisConfigReader theReader = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -173,6 +173,9 @@ class API_EXPORT CDataFrameAnalysisSpecification { | |||
//! 2. disk is used (only one partition needs to be loaded to main memory) | |||
void estimateMemoryUsage(CMemoryUsageEstimationResultJsonWriter& writer) const; | |||
|
|||
//! Fills in categorical field names for which empty value should be treated as missing. | |||
void columnsForWhichEmptyIsMissing(TStrVec& fieldNames) const; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
*columns = fieldToValue(isCategorical[i], | ||
m_CategoricalFieldValues[i], fieldValues[i]); | ||
*columns = fieldToValue(isCategorical[i], m_CategoricalFieldValues[i], | ||
m_EmptyAsMissing.count(m_FieldNames[i]) > 0, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
std::size_t index = entry.second; | ||
if (index >= vector.size()) { | ||
HANDLE_FATAL(<< "Index out of bounds: " << index); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
lib/api/CDataFrameAnalyzer.cc
Outdated
@@ -373,13 +384,38 @@ void CDataFrameAnalyzer::writeProgress(int progress, | |||
writer.flush(); | |||
} | |||
|
|||
void mapToVector(const TStrSizeUMap& map, TStrVec& vector) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
} | ||
|
||
const CDataFrameBoostedTreeRunner::TBoostedTreeUPtr& | ||
CDataFrameBoostedTreeRunner::boostedTree() const { | ||
if (m_BoostedTree == nullptr) { | ||
HANDLE_FATAL(<< "Internal error: boosted tree object missing. Please report this error."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
2f0265e
to
dd86f4e
Compare
@@ -1018,6 +1021,66 @@ void CDataFrameAnalyzerTest::testCategoricalFields() { | |||
} | |||
} | |||
|
|||
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() { | |
void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
I've added unit tests. @tveasey, please take another look. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! Left a couple more minor change requests, but happy for this to go in without any further reviews so I'll go ahead and approve.
@@ -1018,6 +1021,66 @@ void CDataFrameAnalyzerTest::testCategoricalFields() { | |||
} | |||
} | |||
|
|||
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void CDataFrameAnalyzerTest::testCategoricalFields_EmptyAsMissing() { | |
void CDataFrameAnalyzerTest::testCategoricalFieldsEmptyAsMissing() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
return [expected](double actual) { return expected == actual; }; | ||
}; | ||
|
||
auto nan = []() { return [](double actual) { return isnan(actual); }; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we make this maths::CDataFrameUtils::isMissing? This should match whatever value we use to indicate missing.
auto nan = []() { return [](double actual) { return isnan(actual); }; }; | |
auto missing = []() { return [](double actual) { return maths::CDataFrameUtils::isMissing(actual); }; }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
assertRow(8, {eq(3.0), eq(1.0), eq(8.0), eq(8.0), eq(0.0)}, rows[8]); | ||
assertRow(9, {eq(2.0), eq(2.0), eq(9.0), eq(9.0), eq(1.0)}, rows[9]); | ||
}); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 good test
@@ -18,6 +18,8 @@ class CDataFrameAnalysisRunnerTest : public CppUnit::TestFixture { | |||
void testEstimateMemoryUsage_10(); | |||
void testEstimateMemoryUsage_100(); | |||
void testEstimateMemoryUsage_1000(); | |||
void testColumnsForWhichEmptyIsMissing_Classification(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's stick to camel case we use this in 99.9% of all code (we should fix the other non-standard names in this suite as well, but you don't have to make that change)
void testColumnsForWhichEmptyIsMissing_Classification(); | |
void testColumnsForWhichEmptyIsMissingClassification(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@@ -18,6 +18,8 @@ class CDataFrameAnalysisRunnerTest : public CppUnit::TestFixture { | |||
void testEstimateMemoryUsage_10(); | |||
void testEstimateMemoryUsage_100(); | |||
void testEstimateMemoryUsage_1000(); | |||
void testColumnsForWhichEmptyIsMissing_Classification(); | |||
void testColumnsForWhichEmptyIsMissing_Regression(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void testColumnsForWhichEmptyIsMissing_Regression(); | |
void testColumnsForWhichEmptyIsMissingRegression(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
bool(emptyAsMissing[3])); | ||
} | ||
|
||
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Classification() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Classification() { | |
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingClassification() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
testColumnsForWhichEmptyIsMissing("classification", true); | ||
} | ||
|
||
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Regression() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissing_Regression() { | |
void CDataFrameAnalysisRunnerTest::testColumnsForWhichEmptyIsMissingRegression() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
77d998e
to
7e30490
Compare
run elasticsearch-ci |
…sses from it: CDataFrameRegressionRunner and CDataFrameClassificationRunner.
7e30490
to
97ae2ae
Compare
Make CDataFrameBoostedTreeRunner class abstract and derive two subclasses from it:
Relates elastic/elasticsearch#46735