[DOCS] Documents imbalanced class sizes and their effect on classification (#806)

szabosteve · tveasey · szabosteve · commit ba64d512a619 · 2020-01-17T09:13:27.000+01:00
Co-Authored-By: Tom Veasey <tveasey@users.noreply.github.com>
diff --git a/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc b/docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -49,7 +49,11 @@ means that you need to supply a labeled training dataset that has some
 {feature-vars} and a {depvar}. The {classification} algorithm learns the 
 relationships between the features and the {depvar}. Once you’ve trained the 
 model on your training dataset, you can reuse the knowledge that the model has 
-learned about the relationships between the data points to classify new data.
+learned about the relationships between the data points to classify new data. 
+Your training dataset should be approximately balanced, which means that the 
+number of data points in each class should not differ widely. Otherwise, the 
+{classanalysis} may not provide the best predictions. Read 
+<<dfa-classification-imbalanced-classes>> to learn more.
 
 
 [discrete]
@@ -80,4 +84,4 @@ seen before, you must set aside a proportion of the training dataset for
 testing. This split of the dataset is the testing dataset. Once the model has 
 been trained, you can let the model predict the value of the data points it has 
 never seen before and compare the prediction to the actual value by using the 
-evaluate {dfanalytics} API.
+evaluate {dfanalytics} API.
diff --git a/docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc b/docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc
@@ -104,6 +104,23 @@ where included fields contain an array are also ignored. Documents in the
 destination index that don't contain a results field are not included in the 
 {classanalysis}.
 
+[float]
+[[dfa-classification-imbalanced-classes]]
+=== Imbalanced class sizes affect {classification} performance
+
+If your training data is very imbalanced, {classanalysis} may not provide 
+good predictions. Training tries to mitigate the effects of imbalanced 
+training data by maximizing the minimum recall of any class. For imbalanced 
+training data, this means that it tries to balance the proportion of values 
+it correctly labels in the minority and majority classes. However, this 
+process can slightly degrade the overall accuracy. Try to avoid highly 
+imbalanced situations. We recommend having at least 50 examples of each class 
+and a majority-to-minority class ratio of no more than 10 to 1 in the training 
+data. If your training dataset is very imbalanced, consider downsampling the 
+majority class, upsampling the minority class, or, if possible, gathering more 
+data. Investigate which of these methods for changing the data best fits your 
+use case.
+
 [float]
 [[dfa-inference-nested-limitation]]
 === Deeply nested objects affect {infer} performance
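
Note on the imbalanced class sizes text added in this change: the two rules of thumb it gives (at least 50 examples of each class, and a majority-to-minority ratio of no more than 10 to 1) can be checked before indexing the training data. The Python sketch below is only an illustration under assumptions. The helper names `check_class_balance` and `downsample_majority` are hypothetical and are not part of any Elastic API, and random downsampling is just one of the options the text mentions.

```python
from collections import Counter
import random


def check_class_balance(labels, min_per_class=50, max_ratio=10.0):
    """Hypothetical helper: return (ok, counts) for the two rules of thumb,
    at least min_per_class examples of each class and a majority-to-minority
    ratio of at most max_ratio."""
    counts = Counter(labels)
    smallest = min(counts.values())
    largest = max(counts.values())
    ok = smallest >= min_per_class and largest <= max_ratio * smallest
    return ok, dict(counts)


def downsample_majority(rows, label_of, max_ratio=10.0, seed=0):
    """Hypothetical helper: randomly drop rows from over-represented classes
    so that no class has more than max_ratio times the rows of the smallest
    class. Smaller classes are kept unchanged."""
    rng = random.Random(seed)
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    cap = int(max_ratio * min(len(group) for group in by_class.values()))
    sampled = []
    for group in by_class.values():
        sampled.extend(group if len(group) <= cap else rng.sample(group, cap))
    return sampled


# Made-up example data: 60 minority rows and 3000 majority rows (50 to 1).
rows = [{"label": "fraud"}] * 60 + [{"label": "ok"}] * 3000
ok_before, _ = check_class_balance([r["label"] for r in rows])
balanced = downsample_majority(rows, lambda r: r["label"])
ok_after, counts_after = check_class_balance([r["label"] for r in balanced])
print(ok_before, ok_after, counts_after)
```

Downsampling discards majority-class data, so whether it is preferable to upsampling or gathering more data depends on the use case, as the added documentation says.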