Handling NaN Values in TPR Calculation for Highly Unbalanced Datasets #643

Open
HyunjunA opened this issue Mar 8, 2024 · 2 comments


HyunjunA commented Mar 8, 2024

The machine learning backend breaks down on datasets with extreme class imbalance, producing NaN values in the TPR calculation.
I am currently addressing all issues related to this problem in the machine learning backend.

HyunjunA self-assigned this Mar 8, 2024
HyunjunA added this to the v0.21 milestone Mar 8, 2024

HyunjunA commented Mar 9, 2024

The current ML backend defaults to 10 cross-validation folds (cv = 10), which can cause problems for some datasets. Specifically, with the CSICU team's dataset, cv = 10, and any of the stratified cross-validation methods, the issue persists. As a temporary workaround, I've used 'nanmean' to calculate the mean accuracy. A permanent solution requires logic that sets cv appropriately for edge cases.
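
To illustrate the workaround, here is a minimal sketch of how a fold without positive samples can yield a NaN TPR and how NumPy's `nanmean` skips it when averaging. The helper `true_positive_rate` and the per-fold counts are hypothetical and are not taken from the Aliro backend.

```python
import numpy as np


def true_positive_rate(tp, fn):
    """Hypothetical helper: TPR = TP / (TP + FN), NaN when a fold has no positives."""
    positives = tp + fn
    return tp / positives if positives else float("nan")


# Made-up (TP, FN) counts per fold; two folds contain no positive samples.
fold_counts = [(2, 0), (1, 1), (0, 0), (1, 0), (0, 0)]
fold_tpr = np.array([true_positive_rate(tp, fn) for tp, fn in fold_counts])

plain_mean = np.mean(fold_tpr)      # NaN propagates into the average
robust_mean = np.nanmean(fold_tpr)  # NaN folds are ignored
n_valid = int(np.count_nonzero(~np.isnan(fold_tpr)))  # folds that contributed
```

Note that `nanmean` averages over fewer folds than were run, so reporting `n_valid` alongside the mean keeps the result interpretable.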


HyunjunA commented Mar 12, 2024

The following code adjusts the number of splits based on the smallest class size within a given dataset. Stratified cross-validation in the machine learning backend can place at least one instance of every class in each fold only if the number of splits does not exceed the size of the smallest class, so the function caps the number of splits at that size. This keeps all classes represented across folds.

However, for the highly imbalanced and small dataset provided by Debbie, we have opted for a temporary solution: using the nanmean function so that NaN values from individual folds do not propagate into the averaged result.

We will discuss integrating the code below into Aliro, as well as better solutions, in the future.

def decision_rule_fold_cv_based_on_classes(each_class):
    """
    Choose the number of cross-validation folds from the class distribution.

    Parameters
    ----------
    each_class : dict
        A dictionary where keys are the classes and the values are the number of samples per class.

    Returns
    -------
    n_folds : int
        Number of folds, capped at the size of the smallest class (default maximum 10) and never fewer than 2.
    """
    # Size of the smallest class. Stratified folds can each hold at least one
    # instance of every class only if n_folds does not exceed this count.
    min_class_count = min(each_class.values())

    # Start from the default of 10 folds and cap it at the smallest class size.
    n_folds = min(10, min_class_count)

    # Cross-validation needs at least 2 folds. Note: if the smallest class has
    # a single sample, 2 folds exceeds its size and one fold will lack that class.
    n_folds = max(n_folds, 2)

    return n_folds
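
As a possible direction for the permanent solution, the sketch below applies the same rule directly to a label list and fails loudly when the smallest class has fewer than 2 samples, instead of silently returning 2 folds. The name `choose_n_splits`, its `default_splits` parameter, and the example labels are assumptions for illustration, not existing Aliro code.

```python
from collections import Counter


def choose_n_splits(labels, default_splits=10):
    """Hypothetical helper: pick a fold count from raw class labels."""
    class_counts = Counter(labels)
    if not class_counts:
        raise ValueError("labels is empty")

    smallest = min(class_counts.values())
    if smallest < 2:
        # With a single sample, no choice of 2 or more folds can keep the
        # class in every fold, so surface the problem to the caller.
        raise ValueError("smallest class has fewer than 2 samples")

    # Cap the default at the smallest class size.
    return min(default_splits, smallest)


# Made-up imbalanced labels: 95 negatives, 5 positives.
imbalanced = ["neg"] * 95 + ["pos"] * 5
n_splits = choose_n_splits(imbalanced)

# Balanced labels keep the default of 10 folds.
balanced_splits = choose_n_splits(["a"] * 50 + ["b"] * 50)
```

The returned value could then be passed as `n_splits` to a stratified splitter such as scikit-learn's `StratifiedKFold`. Whether to raise or fall back to another evaluation scheme for single-sample classes is a design choice left open here.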
