
Discrepancy in Model Performance Using HuggingFace Pipeline Utility #134

Open
penguinwang96825 opened this issue Jun 25, 2024 · 5 comments
Labels: reproduction (Cannot reproduce the result)

@penguinwang96825 commented Jun 25, 2024

Hi, I'm attempting to reproduce the performance metrics of the AST models using HuggingFace's pipeline utility, but I'm getting different results. Below is the Python code I used for testing:

import torch
import datasets
import numpy as np
from tqdm.auto import tqdm
from transformers import pipeline
from transformers.pipelines.pt_utils import KeyDataset

# Audio-classification pipeline for the checkpoint under test
pipe = pipeline('audio-classification', model='MIT/ast-finetuned-audioset-16-16-0.442')

# Balanced AudioSet evaluation split
dataset = datasets.load_dataset('confit/audioset', 'balanced', split='test')
classes = dataset.features["label"].feature.names
id2label = {idx: row for idx, row in enumerate(classes)}
label2id = {row: idx for idx, row in enumerate(classes)}

# Collect the predicted scores for all 527 classes for each clip
y_scores = []
for out in tqdm(pipe(KeyDataset(dataset, 'file'), top_k=527)):
    score = torch.zeros(len(classes))  # placeholder: one score per class
    for item in out:
        score[label2id[item['label']]] = item['score']
    y_scores.append(score)
y_scores = torch.vstack(y_scores)
print(y_scores)
print(y_scores.shape)

# Build the multi-hot ground-truth matrix (samples x classes)
y_true = []
for example in tqdm(dataset, total=len(dataset)):
    one_hot = np.zeros((1, len(classes)), dtype=float)
    one_hot[0, example['label']] = 1
    y_true.append(torch.from_numpy(one_hot))
y_true = torch.cat(y_true, dim=0)
print(y_true)
print(y_true.shape)

# Metric helpers are defined below
map_score = mean_average_precision(y_true.numpy(), y_scores.numpy())
print("Mean Average Precision:", map_score)

mean_auc = mean_auc_roc(y_true.numpy(), y_scores.numpy())
print("Mean AUC-ROC:", mean_auc)

The helper functions for the metric calculations are implemented as follows:

import numpy as np
from sklearn.metrics import average_precision_score, roc_auc_score

def mean_average_precision(y_true, y_scores):
    """
    Calculate the mean average precision (mAP) for multilabel classification.

    Args:
    y_true (np.array): A binary matrix (samples x labels) of ground truth labels.
    y_scores (np.array): A matrix (samples x labels) of predicted scores.

    Returns:
    float: mean average precision score
    """
    # Number of classes
    n_classes = y_true.shape[1]

    # List to store average precision for each class
    ap_scores = []

    # Calculate average precision for each class
    for i in range(n_classes):
        ap = average_precision_score(y_true[:, i], y_scores[:, i])
        ap_scores.append(ap)

    # Calculate mean of average precision scores
    mAP = np.mean(ap_scores)
    return mAP


def mean_auc_roc(y_true, y_scores):
    """
    Calculate the mean AUC-ROC for multilabel classification.

    Args:
    y_true (np.array): A binary matrix (samples x labels) of ground truth labels.
    y_scores (np.array): A matrix (samples x labels) of predicted scores.

    Returns:
    float: mean AUC-ROC score
    """
    # Number of classes
    n_classes = y_true.shape[1]

    # List to store AUC-ROC for each class
    auc_scores = []

    # Calculate AUC-ROC for each class
    for i in range(n_classes):
        # Ensure there is more than one class to avoid sklearn ValueError
        if len(np.unique(y_true[:, i])) > 1:
            auc = roc_auc_score(y_true[:, i], y_scores[:, i])
            auc_scores.append(auc)
        else:
            # Handle the case where a class has only one class present in y_true
            # Typically handled by assigning an AUC of 0.5 (random guessing score)
            # or by not including this class in the average calculation
            auc_scores.append(0.5)

    # Calculate mean of AUC-ROC scores
    mean_auc = np.mean(auc_scores)
    return mean_auc
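
As a sanity check, the same macro averages can presumably be computed directly with sklearn's built-in averaging (assuming every class has at least one positive and one negative example in the split, otherwise roc_auc_score raises a ValueError); the helper name macro_metrics below is just illustrative:

from sklearn.metrics import average_precision_score, roc_auc_score

def macro_metrics(y_true, y_scores):
    # average='macro' averages the per-class scores, matching the loops above
    mAP = average_precision_score(y_true, y_scores, average='macro')
    mean_auc = roc_auc_score(y_true, y_scores, average='macro')
    return mAP, mean_auc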

The recorded performance metrics were:

Checkpoint                                  mAP     AUC-ROC
MIT/ast-finetuned-audioset-16-16-0.442      0.4040  0.9671
MIT/ast-finetuned-audioset-10-10-0.4593     0.4256  0.9737

These results do not align closely with the expected performance. Could you help me identify any potential issues with my approach or provide guidance on achieving the expected performance levels?

@YuanGongND (Owner)

Hi there,

I am not the one who did the HF port, but using the eval pipeline in this repo you should be able to reproduce the exact result.

Quick question: where is your eval data from?

-Yuan

YuanGongND added the reproduction (Cannot reproduce the result) label on Jun 30, 2024
@penguinwang96825 (Author) commented Jun 30, 2024

Hi, thanks for the prompt reply. I also noticed that the number of parameters differs between the checkpoint in this repo and the one on the HuggingFace Hub. FYI, I downloaded AudioSet from this repo.

@YuanGongND (Owner)

This data does not have a problem; if you search the issues, there are people who have successfully reproduced the result with this version.

The problem is likely in your eval pipeline. Which normalization (i.e., mean and std) did you use for eval? You should use the same normalization as our training.

Why not try our eval pipeline?

-Yuan

@penguinwang96825 (Author)

I believe the HuggingFace FeatureExtractor uses the default normalisation settings; you can check it here: the mean is -4.2677393 and the std is 4.5689974. The thing is, I want to ensure everything is HuggingFace compatible. This compatibility simplifies model evaluation, enables easier experimentation, and facilitates collaboration within the machine learning community.
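
For reference, a minimal way to confirm which normalization the HF pipeline actually applies is to inspect its feature extractor; this is a sketch, assuming the checkpoint above and the attribute names exposed by transformers' ASTFeatureExtractor:

from transformers import pipeline

pipe = pipeline('audio-classification', model='MIT/ast-finetuned-audioset-16-16-0.442')
fe = pipe.feature_extractor

# These should match the stats used at training time in this repo
# (mean -4.2677393 and std 4.5689974 for AudioSet, as noted above)
print(fe.do_normalize, fe.mean, fe.std)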

@YuanGongND (Owner)

I understand, and I believe HF can reach the same performance; it is probably just a minor thing. I just do not have time to debug it, as I am managing multiple repos.

How about this:

https://colab.research.google.com/github/YuanGongND/ast/blob/master/colab/AST_Inference_Demo.ipynb

This is a Colab notebook for inference using our pipeline. It should take only minimal effort to revise it to evaluate all your samples, after which you will see a mAP from our eval pipeline. You can also record the logits of each sample and then compare them with the HF ones.

You can even start from a single sample, check whether our Colab logits and your HF logits are close enough, and debug from that point.

-Yuan
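
For that single-sample comparison, the HF side could look roughly like the sketch below (assumptions: a hypothetical 16 kHz clip at sample.wav, torchaudio for loading, and a colab_logits.pt dump produced from the Colab for the same file):

import torch
import torchaudio
from transformers import AutoFeatureExtractor, AutoModelForAudioClassification

ckpt = 'MIT/ast-finetuned-audioset-16-16-0.442'
feature_extractor = AutoFeatureExtractor.from_pretrained(ckpt)
model = AutoModelForAudioClassification.from_pretrained(ckpt).eval()

# Load one clip, convert to mono, and resample to the 16 kHz the extractor expects
waveform, sr = torchaudio.load('sample.wav')  # hypothetical path
waveform = torchaudio.functional.resample(waveform, sr, 16000).mean(dim=0)

inputs = feature_extractor(waveform.numpy(), sampling_rate=16000, return_tensors='pt')
with torch.no_grad():
    hf_logits = model(**inputs).logits.squeeze(0)  # shape: (527,)

# Compare against the logits recorded from the repo's Colab for the same file
# colab_logits = torch.load('colab_logits.pt')  # hypothetical dump
# print((hf_logits - colab_logits).abs().max())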
