Commit ba64d51

szabosteve and tveasey committed
[DOCS] Documents imbalanced class sizes and their effect on classification (#806)
Co-Authored-By: Tom Veasey <tveasey@users.noreply.github.com>
1 parent e75bfe4 commit ba64d51

2 files changed, +23 -2 lines changed

docs/en/stack/ml/df-analytics/dfa-classification.asciidoc

Lines changed: 6 additions & 2 deletions
@@ -49,7 +49,11 @@ means that you need to supply a labeled training dataset that has some
 {feature-vars} and a {depvar}. The {classification} algorithm learns the
 relationships between the features and the {depvar}. Once you’ve trained the
 model on your training dataset, you can reuse the knowledge that the model has
-learned about the relationships between the data points to classify new data.
+learned about the relationships between the data points to classify new data.
+Your training dataset should be approximately balanced, which means the number
+of data points belonging to each class should not differ widely. Otherwise,
+the {classanalysis} may not provide the best predictions. Read
+<<dfa-classification-imbalanced-classes>> to learn more.
 
 
 [discrete]
@@ -80,4 +84,4 @@ seen before, you must set aside a proportion of the training dataset for
 testing. This split of the dataset is the testing dataset. Once the model has
 been trained, you can let the model predict the value of the data points it has
 never seen before and compare the prediction to the actual value by using the
-evaluate {dfanalytics} API.
+evaluate {dfanalytics} API.
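
The "approximately balanced" guidance added above can be checked before training. The following is a minimal sketch, not part of this commit: it assumes the class labels have already been exported into a Python list, and it uses the thresholds from the linked limitations section (at least 50 examples per class, at most a 10 to 1 majority-to-minority ratio). The function name `check_class_balance` and the sample labels are hypothetical.

```python
from collections import Counter

# Thresholds taken from the limitations section; adjust for your use case.
MIN_EXAMPLES_PER_CLASS = 50
MAX_IMBALANCE_RATIO = 10.0  # majority count divided by minority count


def check_class_balance(labels):
    """Return (counts, problems) for an iterable of class labels.

    `problems` is a list of human-readable warnings; it is empty when the
    labels satisfy both recommendations.
    """
    counts = Counter(labels)
    if not counts:
        return counts, ["no labels supplied"]

    problems = []
    for cls, n in sorted(counts.items(), key=lambda item: str(item[0])):
        if n < MIN_EXAMPLES_PER_CLASS:
            problems.append(f"class {cls!r} has only {n} examples")

    ratio = max(counts.values()) / min(counts.values())
    if ratio > MAX_IMBALANCE_RATIO:
        problems.append(f"majority/minority ratio is {ratio:.1f} to 1")
    return counts, problems


# Hypothetical label column: 950 examples of one class, 40 of the other.
labels = ["ok"] * 950 + ["fraud"] * 40
counts, problems = check_class_balance(labels)
for problem in problems:
    print(problem)
```

With these sample labels both recommendations are violated, so two warnings are printed. A dataset with, say, 100 and 60 examples passes both checks and returns an empty list.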

docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc

Lines changed: 17 additions & 0 deletions
@@ -104,6 +104,23 @@ where included fields contain an array are also ignored. Documents in the
 destination index that don't contain a results field are not included in the
 {classanalysis}.
 
+[float]
+[[dfa-classification-imbalanced-classes]]
+=== Imbalanced class sizes affect {classification} performance
+
+If your training data is very imbalanced, {classanalysis} may not provide good
+predictions. Training tries to mitigate the effects of imbalanced training data
+by maximizing the minimum recall of any class. For imbalanced training data,
+this means that it tries to balance the proportion of values it correctly
+labels in the minority and majority classes. However, this process can result
+in a slight degradation of the overall accuracy. Try to avoid highly imbalanced
+situations. We recommend having at least 50 examples of each class and a ratio
+of no more than 10 to 1 between the majority and minority class labels in the
+training data. If your training dataset is very imbalanced, consider
+downsampling the majority class, upsampling the minority class, or, if
+possible, gathering more data. Investigate which of these methods best fits
+your use case.
+
 [float]
 [[dfa-inference-nested-limitation]]
 === Deeply nested objects affect {infer} performance
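
The trade-off described in the new "Imbalanced class sizes" section, a higher minimum recall at the cost of slightly lower overall accuracy, can be seen in a small worked example. This sketch only illustrates the two metrics on made-up predictions. It does not reproduce how training actually optimizes them, and the helper names and the two hypothetical models are assumptions.

```python
def per_class_recall(actual, predicted):
    """Recall per class: correctly labeled examples / all examples of that class."""
    totals, correct = {}, {}
    for a, p in zip(actual, predicted):
        totals[a] = totals.get(a, 0) + 1
        if a == p:
            correct[a] = correct.get(a, 0) + 1
    return {cls: correct.get(cls, 0) / totals[cls] for cls in totals}


def accuracy(actual, predicted):
    """Fraction of all examples that were labeled correctly."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Made-up test set with a 9 to 1 imbalance.
actual = ["major"] * 90 + ["minor"] * 10

# Hypothetical model A: always predicts the majority class.
pred_a = ["major"] * 100

# Hypothetical model B: gets 80 of 90 majority and 8 of 10 minority examples right.
pred_b = ["major"] * 80 + ["minor"] * 10 + ["minor"] * 8 + ["major"] * 2

for name, pred in (("A", pred_a), ("B", pred_b)):
    recalls = per_class_recall(actual, pred)
    print(name, "accuracy:", accuracy(actual, pred),
          "minimum recall:", min(recalls.values()))
```

Model A reaches 90% accuracy but never finds the minority class, so its minimum recall is 0. Model B gives up two points of accuracy (88%) and raises the minimum recall to 0.8, which is the kind of balance the section describes.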

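Of the remedies the new section lists, downsampling the majority class is the simplest to sketch. The example below is an assumption-laden illustration, not something this commit ships: it presumes the training documents are available as a list of Python dictionaries, and the function name `downsample_majority`, the `label_key` parameter, and the sample rows are all hypothetical. It caps every class at ten times the size of the smallest class, matching the recommended 10 to 1 ratio.

```python
import random
from collections import defaultdict


def downsample_majority(rows, label_key, max_ratio=10.0, seed=0):
    """Randomly drop rows from oversized classes.

    After sampling, no class has more than `max_ratio` times as many rows
    as the smallest class. Rows of smaller classes are kept unchanged.
    """
    if not rows:
        return []

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[label_key]].append(row)

    smallest = min(len(group) for group in by_class.values())
    cap = int(smallest * max_ratio)

    sampled = []
    for group in by_class.values():
        if len(group) > cap:
            group = rng.sample(group, cap)  # sample without replacement
        sampled.extend(group)

    rng.shuffle(sampled)  # avoid leaving the rows grouped by class
    return sampled


# Hypothetical training rows: 1000 "ok" and 50 "fraud" documents (20 to 1).
rows = [{"id": i, "label": "ok"} for i in range(1000)]
rows += [{"id": 1000 + i, "label": "fraud"} for i in range(50)]

balanced = downsample_majority(rows, label_key="label")
```

Downsampling discards data, so it suits cases where the majority class is plentiful. When the minority class has fewer than the recommended 50 examples, gathering more data or upsampling is likely the better fit.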