Duplicated and invalid images #5

Open
zhanyuanucb opened this issue Jun 23, 2020 · 0 comments

Hi!
I love your dataset and I think it is very helpful.
However, I found quite a few invalid and duplicated images.
The invalid images I found are all in the bird and frog classes, and they look like this:
[image: invalid]

Here are my results:

Class: airplane
Number of images: 9000
Number of duplicates in class airplane: 121
Class: truck
Number of images: 9000
Number of duplicates in class truck: 26
Class: bird
Number of images: 9000
Number of duplicates in class bird: 24
Class: automobile
Number of images: 9000
Number of duplicates in class automobile: 12
Class: horse
Number of images: 9000
Number of duplicates in class horse: 80
Class: cat
Number of images: 9000
Number of duplicates in class cat: 27
Class: deer
Number of images: 9000
Number of duplicates in class deer: 139
Class: frog
Number of images: 9000
Number of duplicates in class frog: 319
Class: ship
Number of images: 9000
Number of duplicates in class ship: 22
Class: dog
Number of images: 9000
Number of duplicates in class dog: 25
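
To put these counts in perspective, they can be totaled across classes. This is only a small sketch: the numbers are copied from the output above, and `duplicates_per_class` is an illustrative name rather than anything from the script.

```python
# Per-class duplicate counts, copied from the output above.
duplicates_per_class = {
    "airplane": 121, "truck": 26, "bird": 24, "automobile": 12, "horse": 80,
    "cat": 27, "deer": 139, "frog": 319, "ship": 22, "dog": 25,
}
images_per_class = 9000  # every class reported 9000 images

total_duplicates = sum(duplicates_per_class.values())
total_images = images_per_class * len(duplicates_per_class)
worst_class = max(duplicates_per_class, key=duplicates_per_class.get)

print(f"{total_duplicates} of {total_images} files "
      f"({total_duplicates / total_images:.2%}) are flagged as duplicates")
print(f"Most affected class: {worst_class}")
```

That comes to 795 flagged files out of 90000, or about 0.88% of the train split, with frog the most affected class.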

Here is my code:

import hashlib
import os
import os.path as osp


# Reference: https://medium.com/@urvisoni/removing-duplicate-images-through-python-23c5fdc7479e
def file_hash(filepath):
    with open(filepath, 'rb') as f:
        return hashlib.md5(f.read()).hexdigest()

# hash_keys is shared across classes, so a file whose bytes match an image
# in an earlier-scanned class is also counted as a duplicate.
duplicates = []      # (classname, filepath) for every repeated file
num_duplicate = 0    # len(duplicates) before the current class was scanned
hash_keys = set()
root = "/data/CINIC10/train"
for classname in os.listdir(root):
    print(f"Class: {classname}")
    class_dir = osp.join(root, classname)
    class_list = os.listdir(class_dir)
    print(f"Number of images: {len(class_list)}")
    for filename in class_list:
        filepath = osp.join(class_dir, filename)
        if not osp.isfile(filepath):
            # Skip non-files instead of abandoning the rest of the class.
            print(f"{filepath} not a file")
            continue
        filehash = file_hash(filepath)
        if filehash in hash_keys:
            duplicates.append((classname, filepath))
        else:
            hash_keys.add(filehash)
    print(f"Number of duplicates in class {classname}: {len(duplicates) - num_duplicate}")
    num_duplicate = len(duplicates)
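
One caveat about the counts above: because `hash_keys` is shared across all classes, a file whose bytes match an image in an earlier-scanned class is attributed to the later class. The following is a minimal sketch, not part of the original script, that groups paths by hash so within-class and cross-class duplicates can be told apart. `group_by_hash` and `split_duplicates` are hypothetical helper names, and the `root/<class>/<file>` layout is assumed to match the script above.

```python
import hashlib
import os
import os.path as osp
from collections import defaultdict


def file_hash(filepath):
    """MD5 of the raw file bytes (detects byte-identical files only)."""
    with open(filepath, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def group_by_hash(root):
    """Map each hash to the list of (classname, path) pairs that share it."""
    groups = defaultdict(list)
    for classname in sorted(os.listdir(root)):
        class_dir = osp.join(root, classname)
        if not osp.isdir(class_dir):
            continue
        for filename in sorted(os.listdir(class_dir)):
            path = osp.join(class_dir, filename)
            if osp.isfile(path):
                groups[file_hash(path)].append((classname, path))
    return groups


def split_duplicates(groups):
    """Return (within_class, cross_class) lists of duplicate groups."""
    within, cross = [], []
    for members in groups.values():
        if len(members) < 2:
            continue  # unique file, nothing to report
        classes = {classname for classname, _ in members}
        (within if len(classes) == 1 else cross).append(members)
    return within, cross
```

Calling `split_duplicates(group_by_hash("/data/CINIC10/train"))` would return the two lists of groups. Each group holds every path that shares one hash, which also makes it easy to see which file a duplicate is a copy of. Since the hash covers raw bytes, images that look identical but were encoded differently will not be grouped.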