Accuracy difference between train and val datasets and paper #28

zoltanfarkasgis commented Jul 4, 2023

Out of curiosity, we took the 0.25 datasets (train and val) and ran them through the ResNet34 model and weights trained by the authors.
This 0.25 dataset is the one produced by the face-alignment step in predict.py, so we assume it is equivalent to the dataset used and referred to in the paper for the 7 race classes.
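For reference, this is roughly how we scored each aligned image (a minimal sketch, not our exact script; the checkpoint name, the 18-way head layout of 7 race + 2 gender + 9 age logits, and the preprocessing follow our reading of predict.py):

```python
import torch
import torchvision
from torchvision import transforms
from PIL import Image

# Stock torchvision ResNet34 with the fc head replaced by an 18-way output
# (7 race + 2 gender + 9 age logits), then loaded with the published weights.
model = torchvision.models.resnet34()
model.fc = torch.nn.Linear(model.fc.in_features, 18)
model.load_state_dict(torch.load('res34_fair_align_multi_7_20190809.pt',
                                 map_location='cpu'))
model.eval()

# Standard ImageNet preprocessing, as used in predict.py.
trans = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def predict_race(path: str) -> int:
    """Argmax over the 7 race logits for one aligned face image."""
    img = trans(Image.open(path).convert('RGB')).unsqueeze(0)
    with torch.no_grad():
        logits = model(img)
    return int(logits[0, :7].argmax())  # first 7 outputs are the race classes
```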

Interestingly, the accuracy (the match between the labels in the original CSV and the classes predicted by the published model) differs between the train and val (test) datasets, and both are lower than the figures presented in the paper:

Train:

  • full set: 85% (73386 / 86744)
  • service_test == True: 84% (33794 / 40252)

Val:

  • full set: 78% (8511 / 10954)
  • service_test == True: 77% (3955 / 5162)

BTW, as previously noted by fellow commenters, the filter service_test == True defines a subset in which the labels are balanced in terms of race and gender. We therefore calculated the metrics both for the full set and for this subset; the sketch below shows how.
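
For concreteness, the per-split numbers were computed along these lines (a minimal sketch; the column names follow the published label CSVs, the race-class order is our reading of predict.py, and predict_race is the helper from the snippet above):

```python
import pandas as pd

# Race classes in the order we read them from predict.py's race logits
# (worth double-checking against the repo).
RACE_LABELS = ['White', 'Black', 'Latino_Hispanic', 'East Asian',
               'Southeast Asian', 'Indian', 'Middle Eastern']

def accuracy(frame: pd.DataFrame) -> str:
    """Share of rows where the CSV race label matches the predicted class."""
    hits = int((frame['race'] == frame['predicted_race']).sum())
    return f'{hits / len(frame):.0%} ({hits} / {len(frame)})'

# Columns in the published label files: file, age, gender, race, service_test.
df = pd.read_csv('fairface_label_val.csv')

# predict_race() comes from the previous sketch; here it is applied to the
# aligned counterpart of each listed file (path handling omitted).
df['predicted_race'] = [RACE_LABELS[predict_race(f)] for f in df['file']]

print('full set:            ', accuracy(df))
print('service_test == True:', accuracy(df[df['service_test']]))
```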

We would have expected higher percentages, consistent across the two splits.

  • The paper presents comparison tables in which the accuracy of the model is 94%, which is not supported by these findings.
  • The drop in validation accuracy might suggest that the ResNet34 model (a simple one, with only the fc head replaced) was trained on this dataset but evidently does not perform at the level of the published results.
  • BTW, a deeper look at the images shows that some are very challenging (low resolution, profile view, back of the head, low light, etc.). What this means for the quality and balance of the dataset (level, consistency, distribution, etc.) needs further consideration. For example, if lower-quality images (with "quality" still to be defined) occur in a higher proportion within one class than within the others, the dataset can be out of balance regardless of what the label distribution suggests, because in that case some labels carry less, or confusing, information concentrated in a specific class. We did not perform such an analysis; we only flag the potential issue here (a crude probe is sketched after this list).
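
As an illustration only, such a probe could use mean brightness as a stand-in for the "low light" case above and compare its distribution across the race classes (the detected_faces/ layout and the 0.35 threshold are assumptions made for the sketch, not a proposed quality metric):

```python
import numpy as np
import pandas as pd
from PIL import Image

def brightness(path: str) -> float:
    """Mean grayscale intensity in [0, 1]."""
    return float(np.asarray(Image.open(path).convert('L'),
                            dtype=np.float32).mean()) / 255.0

df = pd.read_csv('fairface_label_train.csv')
df['aligned_path'] = 'detected_faces/' + df['file']  # hypothetical layout

df['brightness'] = df['aligned_path'].map(brightness)
dark_share = (df.assign(dark=df['brightness'] < 0.35)  # arbitrary threshold
                .groupby('race')['dark'].mean())
print(dark_share.sort_values(ascending=False))  # very uneven shares would hint at imbalance
```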

Please feel free to correct any inaccuracy or misinterpretation above or provide an explanation.
