Inconsistency in number of classes in EC/GO downstream datasets #50

Open
klemens-floege opened this issue Jul 31, 2024 · 1 comment

klemens-floege commented Jul 31, 2024

Dear all,

I believe I have found some major flaws in the EC/GO downstream datasets you linked on your Google Drive (https://drive.google.com/drive/folders/11dNGqPYfLE3M-Mbh4U7IQpuHxJpuRr4g).

In the SaProt codebase, the SaProtAnnotationModel class specifies the number of classes in these datasets as: label2num = {"EC": 585, "GO_BP": 1943, "GO_MF": 489, "GO_CC": 320}. However, when I inspect the EC dataset, for example, I find only 366 distinct classes in the training set, 263 in the test set, and 287 in the validation set. Similar discrepancies appear in all three GO datasets. This looks like an ill-posed classification problem to me, and I would appreciate some clarification.

Thank you very much for taking the time to look into this.

PS: Here is the simple Pandas code I used for the analysis.
```python
import pandas as pd

# ec_*_path are the paths to the EC split CSV files from the Google Drive folder
df_test = pd.read_csv(ec_test_path)
df_train = pd.read_csv(ec_train_path)
df_valid = pd.read_csv(ec_valid_path)

# Number of distinct values in the 'class' column of each split
df_train['class'].nunique()  # 366
df_test['class'].nunique()   # 263
df_valid['class'].nunique()  # 287

# Convert 'class' columns to sets
train_classes = set(df_train['class'])
valid_classes = set(df_valid['class'])
test_classes = set(df_test['class'])

# Find the pairwise intersections of the sets
intersection_train_val = train_classes.intersection(valid_classes)
intersection_train_test = train_classes.intersection(test_classes)
intersection_val_test = valid_classes.intersection(test_classes)

len(intersection_train_val)   # 287
len(intersection_train_test)  # 262
len(intersection_val_test)    # 207
```

LTEnjoy (Contributor) commented Jul 31, 2024

Hi, thank you for your interest in our work!

Could you explain how you define a "distinct class"? The EC and GO tasks are multiple binary classification tasks: each protein is mapped to multiple labels, one per function, and each label is 0 or 1 to indicate whether the protein has that function. For instance, the number "585" for the EC task means that each protein has 585 binary labels, such as 0 1 0 ... 1 0 0. A 1 at a given position indicates that the protein has the corresponding function.
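
To make the distinction concrete, here is a minimal sketch of the multi-label encoding described above, using toy data. The protein names, label indices, and the label count of 8 are all hypothetical and do not come from the EC/GO files. It also shows one way two counts can diverge: the number of distinct label combinations is not the same as the number of individual labels. Whether this explains the counts reported earlier depends on what the `class` column actually stores, which is not shown in this thread.

```python
import numpy as np

NUM_LABELS = 8  # toy size (assumption); label2num gives 585 for the EC task


def to_multi_hot(label_indices, num_labels=NUM_LABELS):
    """Return a 0/1 vector with a 1 at each index in label_indices."""
    vec = np.zeros(num_labels, dtype=np.int64)
    vec[list(label_indices)] = 1
    return vec


# Hypothetical proteins, each annotated with a set of label indices.
proteins = {
    "protein_a": [1, 4],
    "protein_b": [4],
    "protein_c": [1, 4],
    "protein_d": [0, 6, 7],
}

# One row per protein, one binary column per label.
targets = np.stack([to_multi_hot(ix) for ix in proteins.values()])

# Distinct label combinations across proteins (rows that differ).
n_combinations = len({tuple(row) for row in targets})

# Individual labels that are set for at least one protein.
n_active_labels = int((targets.sum(axis=0) > 0).sum())

print(targets.shape)    # (4, 8)
print(n_combinations)   # 3
print(n_active_labels)  # 5
```

In this toy example the model output size stays at 8 regardless of either count: labels 2, 3, and 5 never occur, yet they still occupy positions in every target vector.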
