From ba64d512a619df3643a0007ea16beeeb1a308c04 Mon Sep 17 00:00:00 2001
From: =?UTF-8?q?Istv=C3=A1n=20Zolt=C3=A1n=20Szab=C3=B3?=
Date: Fri, 17 Jan 2020 09:12:36 +0100
Subject: [PATCH] [DOCS] Documents imbalanced class sizes and their effect on classification (#806)

Co-Authored-By: Tom Veasey
---
 .../ml/df-analytics/dfa-classification.asciidoc |  8 ++++++--
 .../dfanalytics-limitations.asciidoc            | 17 +++++++++++++++++
 2 files changed, 23 insertions(+), 2 deletions(-)

diff --git a/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc b/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
index db4fe25b9..e3ee26623 100644
--- a/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
+++ b/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -49,7 +49,11 @@ means that you need to supply a labeled training dataset that has some
 {feature-vars} and a {depvar}. The {classification} algorithm learns the
 relationships between the features and the {depvar}. Once you’ve trained the
 model on your training dataset, you can reuse the knowledge that the model has
-learned about the relationships between the data points to classify new data.
+learned about the relationships between the data points to classify new data. 
+Your training dataset should be approximately balanced which means the number of
+data points belonging to the various classes should not be widely different,
+otherwise the {classanalysis} may not provide the best predictions. Read
+<> to learn more.
 
 
 [discrete]
@@ -80,4 +84,4 @@ seen before, you must set aside a proportion of the training dataset for
 testing. This split of the dataset is the testing dataset. Once the model has
 been trained, you can let the model predict the value of the data points it has
 never seen before and compare the prediction to the actual value by using the
-evaluate {dfanalytics} API.
\ No newline at end of file
+evaluate {dfanalytics} API.
diff --git a/docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc b/docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc
index 393746920..d1cdf5dd0 100644
--- a/docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc
+++ b/docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc
@@ -104,6 +104,23 @@ where included fields contain an array are also ignored.
 Documents in the destination index that don't contain a results field are not
 included in the {classanalysis}.
 
+[float]
+[[dfa-classification-imbalanced-classes]]
+=== Imbalanced class sizes affect {classification} performance
+
+If your training data is very imbalanced, then {classanalysis} may not provide
+good predictions. Training tries to mitigate the effects of imbalanced
+training data by maximizing the minimum recall of any class. For imbalanced
+training data, this means that it will try to balance the proportion of values
+it correctly labels in the minority and majority class. However, this process
+can result in a slight degradation of the overall accuracy. Try to avoid highly
+imbalanced situations. We recommend having at least 50 examples of each class
+and a ratio of no more than 10 to 1 for the majority to minority class labels in
+the training data. If your training dataset is very imbalanced, consider
+downsampling the majority class, upsampling the minority class or – if it's
+possible – gather more data. Consider investigating methods to change data that
+fit your use case the most.
+
 [float]
 [[dfa-inference-nested-limitation]]
 === Deeply nested objects affect {infer} performance
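
The thresholds recommended in the new limitations section (at least 50 examples of each class, and a majority-to-minority ratio of no more than 10 to 1) can be checked before training. The following is a minimal Python sketch under stated assumptions, not part of this patch or of the Elastic Stack: `check_balance` and `downsample_majority` are hypothetical helper names, the data is assumed to fit in memory, and random downsampling is only one of the options the text mentions.

```python
import random
from collections import Counter

# Thresholds taken from the recommendation in the limitations text.
MIN_EXAMPLES_PER_CLASS = 50
MAX_CLASS_RATIO = 10  # majority : minority


def check_balance(labels):
    """Return a list of warnings for labels that violate the recommendations."""
    counts = Counter(labels)
    if not counts:
        return []
    warnings = []
    for cls, n in counts.items():
        if n < MIN_EXAMPLES_PER_CLASS:
            warnings.append(f"class {cls!r} has only {n} examples")
    largest, smallest = max(counts.values()), min(counts.values())
    if largest > MAX_CLASS_RATIO * smallest:
        warnings.append(
            f"class ratio {largest}:{smallest} exceeds {MAX_CLASS_RATIO}:1"
        )
    return warnings


def downsample_majority(rows, label_of, seed=0):
    """Randomly drop rows so that no class is larger than
    MAX_CLASS_RATIO times the smallest class (hypothetical helper)."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    if not by_class:
        return []
    cap = MAX_CLASS_RATIO * min(len(members) for members in by_class.values())
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    kept = []
    for members in by_class.values():
        if len(members) > cap:
            members = rng.sample(members, cap)
        kept.extend(members)
    return kept
```

Downsampling discards data, so it only helps when the minority class already meets the 50-example minimum; otherwise gathering more data, as the text suggests, is the remaining option.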