8 changes: 6 additions & 2 deletions docs/en/stack/ml/df-analytics/dfa-classification.asciidoc
@@ -49,7 +49,11 @@ means that you need to supply a labeled training dataset that has some
{feature-vars} and a {depvar}. The {classification} algorithm learns the
relationships between the features and the {depvar}. Once you’ve trained the
model on your training dataset, you can reuse the knowledge that the model has
learned about the relationships between the data points to classify new data.
Your training dataset should be approximately balanced, which means that the
number of data points belonging to the various classes should not differ
widely; otherwise, the {classanalysis} may not provide the best predictions.
Read <<dfa-classification-imbalanced-classes>> to learn more.
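
For illustration, here is a minimal sketch of how such a training job might be
created with the create {dfanalytics} job API. The job ID, index names, the
`FlightDelay` {depvar}, and the `training_percent` value are placeholders
chosen for this example, not values taken from this documentation.

[source,console]
----
// Sketch only: job ID, index names, and field names below are hypothetical.
PUT _ml/data_frame/analytics/flights-delay-classification
{
  "source": {
    "index": "flights-training-data"
  },
  "dest": {
    "index": "flights-delay-classification-results"
  },
  "analysis": {
    "classification": {
      "dependent_variable": "FlightDelay",
      "training_percent": 80
    }
  }
}
----

The `training_percent` setting controls how much of the eligible data is used
for training; the remaining documents can serve as the testing dataset
described below.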


[discrete]
@@ -80,4 +84,4 @@ seen before, you must set aside a proportion of the training dataset for
testing. This split of the dataset is the testing dataset. Once the model has
been trained, you can let the model predict the value of the data points it has
never seen before and compare the prediction to the actual value by using the
evaluate {dfanalytics} API.
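
As a sketch of that evaluation step, the request below shows one way to call
the evaluate {dfanalytics} API against the destination index. The index name,
the field names, and the chosen metric are assumptions made for this example.

[source,console]
----
// Sketch only: index, field names, and metric choice are assumed for illustration.
POST _ml/data_frame/_evaluate
{
  "index": "flights-delay-classification-results",
  "evaluation": {
    "classification": {
      "actual_field": "FlightDelay",
      "predicted_field": "ml.FlightDelay_prediction",
      "metrics": {
        "multiclass_confusion_matrix": {}
      }
    }
  }
}
----
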
17 changes: 17 additions & 0 deletions docs/en/stack/ml/df-analytics/dfanalytics-limitations.asciidoc
@@ -104,6 +104,23 @@ where included fields contain an array are also ignored. Documents in the
destination index that don't contain a results field are not included in the
{classanalysis}.

[float]
[[dfa-classification-imbalanced-classes]]
=== Imbalanced class sizes affect {classification} performance

If your training data is very imbalanced, the {classanalysis} may not provide
good predictions. Training tries to mitigate the effects of imbalanced training
data by maximizing the minimum recall of any class. For imbalanced training
data, this means that the process tries to balance the proportion of values it
correctly labels in the minority and majority classes. However, this process
can result in a slight degradation of the overall accuracy. Try to avoid highly
imbalanced situations. We recommend having at least 50 examples of each class
and a ratio of no more than 10 to 1 for the majority to minority class labels
in the training data. If your training dataset is very imbalanced, consider
downsampling the majority class, upsampling the minority class, or, if
possible, gathering more data. Investigate which of these approaches fits your
use case best.
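
Before you start a {classanalysis}, you can check how imbalanced your training
data is by counting the documents per class label, for example with a terms
aggregation as in the sketch below. The index and field names are hypothetical.

[source,console]
----
// Sketch only: "my-training-index" and "my-label" are hypothetical names.
GET my-training-index/_search
{
  "size": 0,
  "aggs": {
    "class_counts": {
      "terms": {
        "field": "my-label"
      }
    }
  }
}
----

If the largest class bucket is more than roughly ten times the size of the
smallest one, or if any class has fewer than about 50 documents, consider
rebalancing the data as described above.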

[float]
[[dfa-inference-nested-limitation]]
=== Deeply nested objects affect {infer} performance