Added functionality to balance sample size of each level of the target/label variable [version 1.0.0] #460
base: master
Conversation
…n function (in commonMachineLearningClassification.R) called .balance_dataset. 2) Added conditional activation of .balance_dataset in .mlClassificationReadData, depending on options[[balanceLabels]]. 3) Added the dependency in .mlClassificationDependencies.
…dependency for .mlPlotDataSplit
Before looking at this functionality: this seems like something that should be implemented with a checkbox rather than a radio button. Then you could just add a checkbox in the data split preferences saying "balance labels" (or similar) and would not need an entirely new section, which takes up a lot of space in the interface. Also, you currently do undersampling (i.e., removing cases of the majority classes); how about oversampling (i.e., duplicating cases of the minority classes)?
I've moved the option to the data split section, as you advised. For oversampling, should it just be sampling with replacement then? I could add an option for choosing between under- and oversampling; oversampling would definitely be the better choice if the original dataset is small to begin with and the smallest class is small as well. Also, I noticed that in the edited analysis, when I enter only a target variable without any predictors, the table returns an error. This does not affect the analysis (the error goes away once a predictor is entered), and it does not occur in the main Logistic/Multinomial Regression analysis, so I don't know exactly how it arises. The error also disappears when the balancing functionality is turned off, so it must be related to my new function .mlBalanceDataset. This can be seen in the video below (around the middle of the video).
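The under- vs. oversampling choice discussed above can be sketched as follows. This is a hypothetical illustration, not the PR's actual .mlBalanceDataset; the function name balanceDataset and its arguments are made up for the example.

```r
# Hypothetical sketch: balance a data frame on a target column by
# under- or oversampling each level to a common size.
balanceDataset <- function(data, target, method = c("undersample", "oversample")) {
  method <- match.arg(method)
  counts <- table(data[[target]])
  # Undersample down to the smallest class; oversample up to the largest.
  n <- if (method == "undersample") min(counts) else max(counts)
  idx <- unlist(lapply(names(counts), function(lvl) {
    rows <- which(data[[target]] == lvl)
    # Oversampling duplicates rows (sampling with replacement);
    # undersampling drops rows (sampling without replacement).
    sample(rows, n, replace = (method == "oversample"))
  }))
  data[idx, , drop = FALSE]
}

# Usage on an artificially imbalanced subset of iris
# (50 setosa, 10 versicolor, 5 virginica):
imbalanced <- iris[c(1:50, 51:60, 101:105), ]
table(balanceDataset(imbalanced, "Species", "undersample")$Species)
```

Undersampling shrinks every level to the minority count (5 per class here), which is simple but discards data; oversampling keeps all observations at the cost of duplicates, which is why it matters whether the validation data are balanced too.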
…erate section, because data split is shared in regression analyses, which do not require balancing targets (target is continuous)
I removed the option from the data split section again, because DataSplit.qml is also used by the regression analyses, where balancing the target does not make sense (it is continuous). I've put it in a separate new section instead; do you have any idea how to handle this?
I would put it back in the data split section but hide it in the regression analyses: create an alias for the visible property and set it to false in the regression QML files.
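The alias approach suggested above could look roughly like this. All names here (balanceVisible, balanceCheck, the component types) are illustrative guesses, not the actual JASP QML code:

```qml
// Hypothetical sketch of DataSplit.qml: expose an alias so each analysis
// can decide whether the balancing checkbox is shown.
Section {
    // Alias forwards to the checkbox's own visible property.
    property alias balanceVisible: balanceCheck.visible

    CheckBox {
        id: balanceCheck
        name: "balanceLabels"
        label: qsTr("Balance labels")
    }
}

// In a regression analysis QML file, the shared component is reused
// with the checkbox hidden:
// DataSplit { balanceVisible: false }
```

Because the alias defaults to the checkbox's own value, classification analyses need no change; only the regression QML files set it to false.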
…operty, and set it to 'false' for all Regression analyses
Done! It should now only be visible in the classification analyses.
…feature for analyses requiring validation data, pending decision whether we should balance validation data as well
Previously, the function balanced the complete dataset before the data split. Now it only balances the training data and keeps the test data unchanged. I've temporarily hidden the option in the analyses that require validation data, since I'm not sure whether we should balance the validation set as well. Intuitively it makes sense to me not to balance the validation set, since duplicated observations could be a problem and make the model look overoptimistic. What do you think?
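The "balance training data only" order of operations described above can be sketched as below. This assumes a hypothetical helper balanceDataset(data, target) that resamples each level of the target to equal size; none of these names are from the actual PR.

```r
# Hypothetical sketch: split first, then balance only the training rows,
# so the test set keeps the original class distribution and contains no
# duplicated observations.
set.seed(1)  # reproducible split (illustrative only)
trainIdx <- sample(nrow(data), size = floor(0.8 * nrow(data)))

# Balance the training portion only (balanceDataset is a stand-in for
# a function like the PR's .mlBalanceDataset).
train <- balanceDataset(data[trainIdx, , drop = FALSE], target = "label")

# Test data are left untouched: no duplicates can leak into evaluation.
test <- data[-trainIdx, , drop = FALSE]
```

Balancing after the split also avoids a subtler leak under oversampling: if duplicates are created before splitting, copies of the same observation can land in both the training and the test set.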
Refers to issue #457
I've added a section in the QML interface to select whether the sample sizes of the levels of the selected target variable should be balanced. I've also added the corresponding functionality in the R scripts. Because it uses the common classification functions, this functionality should be available in all Machine Learning Classification analyses.
I've added this functionality very early in the analysis, namely in .mlClassificationReadData in commonMachineLearningClassification. The subfunction itself is called .mlBalanceDataset in the same script.
Dependencies for the analysis option named "balanceLabels" are added in .mlClassificationDependencies, as well as in the specific function that produces the Data Split plot (.mlPlotDataSplit) in commonMachineLearningRegression.
The latter is needed because the Data Split plot is a default plot of the analysis, and its dependencies do not use the common dependencies in .mlClassificationDependencies. I did not check whether the same holds for other non-default plots and tables, but all main tables of the analysis (which use .mlClassificationDependencies) correctly reflect changes in the balanceLabels option.
This is my first time contributing to a JASP module, so please advise on any further steps I should take!