Hi!
I love your dataset and I think it is very helpful.
However, I found that it contains quite a few invalid and duplicated images.
The invalid images I found are all in the bird and frog classes, and they look like:
Here are my results:
Class: airplane
Number of images: 9000
Number of duplicates in class airplane: 121
Class: truck
Number of images: 9000
Number of duplicates in class truck: 26
Class: bird
Number of images: 9000
Number of duplicates in class bird: 24
Class: automobile
Number of images: 9000
Number of duplicates in class automobile: 12
Class: horse
Number of images: 9000
Number of duplicates in class horse: 80
Class: cat
Number of images: 9000
Number of duplicates in class cat: 27
Class: deer
Number of images: 9000
Number of duplicates in class deer: 139
Class: frog
Number of images: 9000
Number of duplicates in class frog: 319
Class: ship
Number of images: 9000
Number of duplicates in class ship: 22
Class: dog
Number of images: 9000
Number of duplicates in class dog: 25
Here is my code:
import hashlib
import os
import os.path as osp
from imageio import imread
from PIL import Image
import matplotlib.pyplot as plt
from matplotlib.pyplot import imshow
import matplotlib.gridspec as gridspec
import time
import numpy as np
# Reference: https://medium.com/@urvisoni/removing-duplicate-images-through-python-23c5fdc7479e
def file_hash(filepath):
    # MD5 of the raw file bytes; byte-identical files produce the same digest.
    with open(filepath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

duplicates = []
num_duplicate = 0
# Shared across all classes, so a file is also flagged if it matches one seen in an earlier class.
hash_keys = set()
root = "/data/CINIC10/train"
file_list = os.listdir(root)
for classname in file_list:
    print(f"Class: {classname}")
    class_dir = osp.join(root, classname)
    class_list = os.listdir(class_dir)
    print(f"Number of images: {len(class_list)}")
    for filename in class_list:
        filename = osp.join(class_dir, filename)
        if osp.isfile(filename):
            filehash = file_hash(filename)
            if filehash not in hash_keys:
                hash_keys.add(filehash)
            else:
                duplicates.append((classname, filename))
        else:
            print(f"{filename} not a file")
            break
    # Duplicates found in this class = total so far minus the total before this class.
    print(f"Number of duplicates in class {classname}: {len(duplicates) - num_duplicate}")
    num_duplicate = len(duplicates)
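The counts alone do not show which files duplicate each other. Grouping paths by hash makes each duplicate set inspectable. This is only a minimal sketch, assuming the same directory layout and the same byte-level MD5 comparison as the script above; `group_duplicates` is a hypothetical helper name, and the chunked read is just a convenience, not something the dataset requires.

```python
import hashlib
import os
from collections import defaultdict


def file_hash(filepath, chunk_size=1 << 20):
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(filepath, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_duplicates(root):
    """Map each content hash to the files sharing it (groups of two or more only)."""
    groups = defaultdict(list)
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            groups[file_hash(path)].append(path)
    return {h: sorted(paths) for h, paths in groups.items() if len(paths) > 1}


if __name__ == "__main__":
    # Same path as in the script above; adjust to your local copy.
    for digest, paths in group_duplicates("/data/CINIC10/train").items():
        print(digest, *paths, sep="\n  ")
```

Like the original script, this only catches byte-identical files; images that look the same but were re-encoded would not be grouped together.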