Inconsistency in number of classes in EC/GO downstream datasets #50
Comments
Hi, thank you for your interest in our work! Could you explain how you define a "distinct class"? The EC and GO tasks are multi-label tasks made up of many binary classification problems: each protein is mapped to multiple labels, one per function, and each label is 0 or 1 to indicate whether the protein has that function. For instance, the number "585" for the EC task means a protein has 585 binary labels such as |
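To make the multi-label setup concrete, here is a minimal sketch of how a target with 585 binary labels could be built. The helper `multi_hot` and the example label indices are hypothetical and chosen only for illustration; only the number 585 comes from `label2num["EC"]` quoted in this issue, and the actual SaProt data loading may differ.

```python
NUM_EC_LABELS = 585  # taken from label2num["EC"] quoted in this issue


def multi_hot(active_labels, num_labels=NUM_EC_LABELS):
    """Return a 0/1 list with a 1 at every active label index.

    Hypothetical helper for illustration; not part of the SaProt codebase.
    """
    vec = [0] * num_labels
    for idx in active_labels:
        if not 0 <= idx < num_labels:
            raise ValueError(f"label index {idx} is outside [0, {num_labels})")
        vec[idx] = 1
    return vec


# A hypothetical protein annotated with three functions (indices are made up).
target = multi_hot([4, 17, 300])
print(len(target), sum(target))  # 585 3
```

Under this reading, the model always predicts 585 outputs per protein, regardless of how many of those labels happen to be active in a given split.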
Dear all,
I believe I have found some major flaws in the EC/GO downstream datasets you linked on your Google Drive (https://drive.google.com/drive/folders/11dNGqPYfLE3M-Mbh4U7IQpuHxJpuRr4g).
In the SaProt codebase, the SaProtAnnotationModel class specifies the number of classes in these datasets as: label2num = {"EC": 585, "GO_BP": 1943, "GO_MF": 489, "GO_CC": 320}. However, when I inspect the EC dataset, for example, I find only 366 distinct classes in the training set, 263 in the test set, and 287 in the validation set. Similar discrepancies appear in all three GO datasets. This looks like an ill-posed classification problem to me, and I would appreciate some clarification.
Thank you very much for taking the time to look into this.
PS: Here is the simple pandas code I used for the analysis.
```python
import pandas as pd

# ec_train_path, ec_valid_path and ec_test_path point to the EC split CSV files
df_test = pd.read_csv(ec_test_path)
df_train = pd.read_csv(ec_train_path)
df_valid = pd.read_csv(ec_valid_path)

# Number of distinct values in the 'class' column of each split
df_train['class'].nunique()  # 366
df_test['class'].nunique()   # 263
df_valid['class'].nunique()  # 287

# Convert 'class' columns to sets
train_classes = set(df_train['class'])
valid_classes = set(df_valid['class'])
test_classes = set(df_test['class'])

# Find the pairwise intersections of the sets
intersection_train_val = train_classes.intersection(valid_classes)
intersection_train_test = train_classes.intersection(test_classes)
intersection_val_test = valid_classes.intersection(test_classes)

len(intersection_train_val)   # 287
len(intersection_train_test)  # 262
len(intersection_val_test)    # 207
```
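If each row really carries several labels, as the maintainers' reply suggests, then counting distinct values of a column is not the same as counting the distinct labels used in a split. The sketch below shows one way to count labels instead. It assumes, hypothetically, that each row has already been parsed into a list of label indices; the real column format in the EC/GO files may differ, and the toy rows are made up.

```python
def distinct_labels(rows):
    """Return the set of label indices that appear in any row of a split."""
    seen = set()
    for labels in rows:
        seen.update(labels)
    return seen


# Toy splits (hypothetical data, not taken from the real EC files).
train_rows = [[0, 3], [3], [1, 2, 3]]
test_rows = [[3], [5]]

train_labels = distinct_labels(train_rows)
test_labels = distinct_labels(test_rows)

print(len(train_labels))                   # 4
print(sorted(test_labels - train_labels))  # [5]
```

Labels that never appear in a split would simply be all-zero columns of that split's targets, so a per-split count can be lower than the label-space size given in label2num.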