Added functionality to balance sample size of each level of the target/label variable [version 1.0.0] #460
base: master
Conversation
…n function (in commonMachineLearningClassification.R) called .balance_dataset. 2) Added conditional activation of .balance_dataset in .mlClassificationReadData, depending on options[[balanceLabels]]. 3) Added the dependency in .mlClassificationDependencies.
…dependency for .mlPlotDataSplit
Before looking at this functionality: this seems like something that should be implemented with a checkbox rather than a radio button. Then you could just add a checkbox in the data split preferences saying "balance labels" (or similar) and would not need an entirely new section, which takes up a lot of space in the interface. Also, you currently do undersampling (i.e., removing cases of the majority classes); how about oversampling (i.e., duplicating cases of the minority classes)?
I've moved the option to the data split section, as you advised. For oversampling, should it just be sampling with replacement then? I could add an option for choosing between under- and oversampling; oversampling would definitely be the better choice if the original dataset is small to begin with and the smallest class is small as well. Also, I noticed that in the edited analysis, when I enter only a target variable without any predictors, the table returns an error. This does not affect the analysis (the error goes away once a predictor is entered), and it does not occur in the main Logistic/Multinomial Regression analysis, so I don't know exactly how it arises. The error also disappears when the balancing functionality is turned off, so it must be related to my new function .mlBalanceDataset. This can be seen in the video below (around the middle of the video).
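The under- vs. oversampling choice discussed above can be sketched as follows. This is a hypothetical illustration, not the PR's actual .mlBalanceDataset; the function name balanceDataset and its arguments are made up for the example.

```r
# Hypothetical sketch: balance a data frame on a target column by
# under- or oversampling each level to a common size.
balanceDataset <- function(data, target, method = c("undersample", "oversample")) {
  method <- match.arg(method)
  counts <- table(data[[target]])
  # Undersample down to the smallest class; oversample up to the largest.
  n <- if (method == "undersample") min(counts) else max(counts)
  idx <- unlist(lapply(names(counts), function(lvl) {
    rows <- which(data[[target]] == lvl)
    # Oversampling duplicates rows (sampling with replacement);
    # undersampling drops rows (sampling without replacement).
    sample(rows, n, replace = (method == "oversample"))
  }))
  data[idx, , drop = FALSE]
}

# Usage on an artificially imbalanced subset of iris
# (50 setosa, 10 versicolor, 5 virginica):
imbalanced <- iris[c(1:50, 51:60, 101:105), ]
table(balanceDataset(imbalanced, "Species", "undersample")$Species)
```

Undersampling shrinks every level to the minority count (5 per class here), which is simple but discards data; oversampling keeps all observations at the cost of duplicates, which is why it matters whether the validation data are balanced too.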
…erate section, because data split is shared in regression analyses, which do not require balancing targets (target is continuous)
I removed the option from the data split section again, because DataSplit.qml is also used by the regression analyses, where balancing the target does not make sense (it is continuous). I've put it in a separate new section instead; do you have any idea how to handle this?
I would put it back in the data split section but hide it in the regression analyses: create an alias for the visible property and set it to false in the regression QML files.
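The alias approach suggested above could look roughly like this. All names here (balanceVisible, balanceCheck, the component types) are illustrative guesses, not the actual JASP QML code:

```qml
// Hypothetical sketch of DataSplit.qml: expose an alias so each analysis
// can decide whether the balancing checkbox is shown.
Section {
    // Alias forwards to the checkbox's own visible property.
    property alias balanceVisible: balanceCheck.visible

    CheckBox {
        id: balanceCheck
        name: "balanceLabels"
        label: qsTr("Balance labels")
    }
}

// In a regression analysis QML file, the shared component is reused
// with the checkbox hidden:
// DataSplit { balanceVisible: false }
```

Because the alias defaults to the checkbox's own value, classification analyses need no change; only the regression QML files set it to false.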
…operty, and set it to 'false' for all Regression analyses
Done! It should now only be visible in the classification analyses.
…feature for analyses requiring validation data, pending decision whether we should balance validation data as well
Previously, the function balanced the complete dataset before the data split. Now it only balances the training data and keeps the test data unchanged. I've temporarily hidden the option in the analyses that require validation data, since I'm not sure whether we should balance the validation set as well. Intuitively it makes sense to me not to balance the validation set, since duplicated observations could be a problem and make the model look overoptimistic. What do you think?
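The "balance training data only" order of operations described above can be sketched as below. This assumes a hypothetical helper balanceDataset(data, target) that resamples each level of the target to equal size; none of these names are from the actual PR.

```r
# Hypothetical sketch: split first, then balance only the training rows,
# so the test set keeps the original class distribution and contains no
# duplicated observations.
set.seed(1)  # reproducible split (illustrative only)
trainIdx <- sample(nrow(data), size = floor(0.8 * nrow(data)))

# Balance the training portion only (balanceDataset is a stand-in for
# a function like the PR's .mlBalanceDataset).
train <- balanceDataset(data[trainIdx, , drop = FALSE], target = "label")

# Test data are left untouched: no duplicates can leak into evaluation.
test <- data[-trainIdx, , drop = FALSE]
```

Balancing after the split also avoids a subtler leak under oversampling: if duplicates are created before splitting, copies of the same observation can land in both the training and the test set.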
Refers to issue #457
I've added a section in the QML interface to select whether the sample sizes of the levels of the selected target variable should be balanced. I've also added the corresponding functionality in the R scripts. Because it uses the common classification functions, this functionality should be available in all Machine Learning Classification analyses.
I've added this functionality very early in the analysis, namely in .mlClassificationReadData in commonMachineLearningClassification. The subfunction itself is called .mlBalanceDataset in the same script.
Dependencies for the analysis option named "balanceLabels" are added in .mlClassificationDependencies, as well as in the specific function that produces the Data Split plot (.mlPlotDataSplit) in commonMachineLearningRegression.
The latter is needed because the Data Split plot is a default plot of the analysis, and its dependencies do not use the common dependencies in .mlClassificationDependencies. I did not check whether the same holds for other non-default plots and tables, but all main tables of the analysis (which use .mlClassificationDependencies) correctly reflect changes in the balanceLabels option.
This is my first time contributing to a JASP module, so please advise on any further steps I should take!