Commit

[DOCS] Documents imbalanced class sizes and their effect on classification (#806)

Co-Authored-By: Tom Veasey <tveasey@users.noreply.github.com>
szabosteve and tveasey committed Jan 17, 2020
1 parent e75bfe4 commit ba64d51
Showing 2 changed files with 23 additions and 2 deletions.
8 changes: 6 additions & 2 deletions docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -49,7 +49,11 @@ means that you need to supply a labeled training dataset that has some
{feature-vars} and a {depvar}. The {classification} algorithm learns the
relationships between the features and the {depvar}. Once you’ve trained the
model on your training dataset, you can reuse the knowledge that the model has
learned about the relationships between the data points to classify new data.
Your training dataset should be approximately balanced, which means that the
number of data points in each class should not differ widely; otherwise, the
{classanalysis} may not provide the best predictions. Read
<<dfa-classification-imbalanced-classes>> to learn more.


[discrete]
@@ -80,4 +84,4 @@ seen before, you must set aside a proportion of the training dataset for
testing. This split of the dataset is the testing dataset. Once the model has
been trained, you can let the model predict the value of the data points it has
never seen before and compare the prediction to the actual value by using the
evaluate {dfanalytics} API.
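
To illustrate what comparing predictions to actual values can reveal, here is a minimal sketch in plain Python. It does not call the evaluate {dfanalytics} API; the function names `per_class_recall` and `accuracy` and the sample labels are hypothetical, chosen only for this example. It shows how overall accuracy can look good on a testing dataset while one class is never predicted correctly.

```python
from collections import Counter


def per_class_recall(actual, predicted):
    """Recall per class: correctly predicted points of a class / all points of that class."""
    totals = Counter(actual)
    correct = Counter(a for a, p in zip(actual, predicted) if a == p)
    return {cls: correct[cls] / totals[cls] for cls in totals}


def accuracy(actual, predicted):
    """Fraction of all data points whose predicted label matches the actual label."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical testing dataset: 95 "ok" points and 5 "fraud" points,
# scored by a model that always predicts the majority class.
actual = ["ok"] * 95 + ["fraud"] * 5
predicted = ["ok"] * 100

print(accuracy(actual, predicted))          # 0.95
print(per_class_recall(actual, predicted))  # {'ok': 1.0, 'fraud': 0.0}
```

Looking at recall for each class separately, rather than accuracy alone, makes this kind of failure visible.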
17 changes: 17 additions & 0 deletions docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc
@@ -104,6 +104,23 @@ where included fields contain an array are also ignored. Documents in the
destination index that don't contain a results field are not included in the
{classanalysis}.

[float]
[[dfa-classification-imbalanced-classes]]
=== Imbalanced class sizes affect {classification} performance

If your training data is very imbalanced, {classanalysis} may not provide
good predictions. Training tries to mitigate the effects of imbalanced
training data by maximizing the minimum recall of any class. For imbalanced
training data, this means that it tries to balance the proportion of values
it correctly labels in the minority and majority classes. However, this process
can slightly degrade the overall accuracy. Try to avoid highly
imbalanced situations. We recommend having at least 50 examples of each class
and a ratio of no more than 10 to 1 for the majority to minority class labels in
the training data. If your training dataset is very imbalanced, consider
downsampling the majority class, upsampling the minority class, or, if
possible, gathering more data. Investigate which of these methods best fits
your use case.
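
The recommendations above (at least 50 examples per class, and a majority to minority ratio of no more than 10 to 1) can be checked before training. The following is a minimal sketch in plain Python, not part of the {stack}; the helper names `check_balance` and `downsample_majority` are hypothetical, and random downsampling is only one of the options mentioned above.

```python
import random
from collections import Counter

MIN_EXAMPLES_PER_CLASS = 50  # recommendation from the text above
MAX_CLASS_RATIO = 10         # majority : minority, from the text above


def check_balance(labels):
    """Return (counts, problems) for an iterable of class labels."""
    counts = Counter(labels)
    if not counts:
        return counts, ["no labels supplied"]
    problems = []
    for cls, n in counts.items():
        if n < MIN_EXAMPLES_PER_CLASS:
            problems.append(f"class {cls!r} has only {n} examples")
    largest, smallest = max(counts.values()), min(counts.values())
    if largest > MAX_CLASS_RATIO * smallest:
        problems.append(f"majority/minority ratio is {largest / smallest:.1f} to 1")
    return counts, problems


def downsample_majority(rows, label_of, max_ratio=MAX_CLASS_RATIO, seed=0):
    """Randomly drop rows so that no class exceeds max_ratio times the smallest class."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    cap = max_ratio * min(len(group) for group in by_class.values())
    kept = []
    for group in by_class.values():
        kept.extend(group if len(group) <= cap else rng.sample(group, cap))
    return kept
```

For example, `downsample_majority(rows, lambda row: row["label"])` would cap each class at ten times the size of the smallest one, assuming each row is a dictionary with a `label` key. Downsampling discards data, so gathering more minority examples is preferable when it is feasible.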

[float]
[[dfa-inference-nested-limitation]]
=== Deeply nested objects affect {infer} performance