Can you please add the Stanford dog dataset? #4504

dgrnd4 · 2022-06-15T15:39:35Z

Adding a Dataset

Name: Stanford dog dataset
Description: The dataset contains 20,580 images across 120 classes (dog breeds). You can find the dataset here: http://vision.stanford.edu/aditya86/ImageNetDogs/
Paper: http://vision.stanford.edu/aditya86/ImageNetDogs/
Data: link to the Github repository or current dataset location
Motivation: The dataset was built using images and annotations from ImageNet for the task of fine-grained image categorization. It is useful for fine-grained classification.

Instructions to add a new dataset can be found here.

julien-c · 2022-06-15T15:46:07Z

would you like to give it a try, @dgrnd4? (maybe with the help of the dataset author?)

dgrnd4 · 2022-06-15T15:49:27Z

@julien-c I am sorry, but I have no idea how it works: can I add the dataset myself by following the "instructions to add a new dataset"?
Can I add a dataset even if it's not mine? (It's public at the link I included in the post.)

mariosasko · 2022-06-16T10:47:26Z

Hi! The ADD NEW DATASET instructions are indeed the best place to start. It's also perfectly fine to add a dataset if it's public, even if it's not yours. Let me know if you need some additional pointers.

khushmeeet · 2022-07-03T07:33:45Z

If no one is working on this, I could take this up!

dgrnd4 · 2022-07-03T08:32:26Z

@khushmeeet this is the link where I added the dataset already. If you can I would ask you to do this:

The dataset is all in the TRAINING SET: can you please divide it into Training, Test and Validation sets? If you can, for each class take 80% for the Training set, 10% for Test and 10% for Validation.
The images have different sizes: can you please resize all the images to 224,224,3? Pay attention to the last dimension "3" as well, because some images have 4 channels!

Thank you!!
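
The per-class 80/10/10 split requested above can be sketched in plain Python. This is only an illustration, not code from the thread: the function name `stratified_split_indices`, its parameters and the fixed seed are hypothetical, and it works on a list of labels rather than on the dataset itself.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, train_frac=0.8, val_frac=0.1, seed=0):
    """Return row indices for train/validation/test, split class by class.

    Hypothetical helper: each class is shuffled and divided separately,
    so every split keeps roughly the same class balance.
    """
    rng = random.Random(seed)

    # Group row indices by their class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    splits = {"train": [], "validation": [], "test": []}
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = int(len(indices) * train_frac)
        n_val = int(len(indices) * val_frac)
        splits["train"] += indices[:n_train]
        splits["validation"] += indices[n_train:n_train + n_val]
        # Whatever is left (about 10%, plus rounding remainders) goes to test.
        splits["test"] += indices[n_train + n_val:]
    return splits
```

With the `datasets` library, the resulting index lists could then be passed to `Dataset.select`, for example `dataset["train"].select(splits["train"])`.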

mariosasko · 2022-07-04T11:45:38Z

Hi @khushmeeet! Thanks for the interest. You can self-assign the issue by commenting #self-assign on it.

Also, I think we can skip @dgrnd4's steps as we try to avoid any custom processing on top of raw data. One can later copy the script and override _post_process in it to perform such processing on the generated dataset.

khushmeeet · 2022-07-04T20:21:01Z

Thanks @mariosasko

@dgrnd4 As the dataset is already on the Hub and preprocessing is not recommended, I am not sure there is any other task to do. However, I can't seem to find the relevant .py files for this dataset in the GitHub repo.

dgrnd4 · 2022-07-06T09:39:01Z

@khushmeeet @mariosasko The point is that the images must be processed and must all have the same size in order to be used for things like training.

mariosasko · 2022-07-06T11:41:46Z

@dgrnd4 Yes, but this can be done after loading (map to resize images and train_test_split to create extra splits)

@khushmeeet The linked version is implemented as a no-code dataset and is generated directly from the ZIP archive, but our "GitHub" datasets (these are datasets without a user/org namespace on the Hub) need a generation script, and you can find one here. datasets started as a fork of TFDS, so we share similar script structure, which makes it trivial to adapt it.

dgrnd4 · 2022-07-06T18:55:36Z

@mariosasko The point is that I would like to use something like this:
x_train, x_test = train_test_split(dataset, test_size=0.1)

to get 90% Train and 10% Test, and then something like this to also get a Validation set (10% of the whole dataset):

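from sklearn.model_selection import train_test_split

train_ratio = 0.80
validation_ratio = 0.10
test_ratio = 0.10

# dataX holds the images and dataY the labels
# First split: 80% train, 20% held out
x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)
# Second split: divide the held-out 20% evenly into validation and test
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio))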

However, the structure of the data is:

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 20580
    })
})

So how do I extract the images and labels?

EDIT --> Splitting the dataset into Train, Test and Validation:

from datasets import DatasetDict

# Shuffle first, in case the rows are ordered by class; otherwise the test and
# validation slices taken from the end would cover only the last few classes.
# The 'label' column is renamed to 'labels', as in the original snippet.
full = dataset['train'].shuffle(seed=42).rename_column('label', 'labels')

n_total = len(full)                         # 20580
n_test = n_total // 10                      # 10%  --> 2058
n_validation = n_total // 10                # 10%  --> 2058
n_train = n_total - n_test - n_validation   # 80%  --> 16464

# The three splits are disjoint: train no longer contains the test/validation rows.
dataset_ = DatasetDict({
    "train": full.select(range(n_train)),
    "test": full.select(range(n_train, n_train + n_test)),
    "validation": full.select(range(n_train + n_test, n_total)),
})

khushmeeet · 2022-07-09T00:08:52Z

#self-assign

khushmeeet · 2022-07-09T04:47:47Z

I have raised a PR for adding the stanford-dog dataset. I have not added any data preprocessing code; only the dataset generation script is there. Let me know if any changes are required, or if anything should be added to the README.

zutarich · 2023-10-09T06:53:08Z

Is this issue still open? I am new to open source and would like to take this one as my start.

mariosasko · 2023-10-18T18:55:22Z

@zutarich This issue should have been closed since the dataset in question is available on the Hub here.

AlanBlanchet · 2024-12-09T15:37:28Z

I didn't know about this issue until now, but I added my version of the dataset to the Hub with the bboxes:
https://huggingface.co/datasets/Alanox/stanford-dogs

Although I could have made it cleaner by building the splits from the .txt files and converting it to the COCO format.
There is a stanford-dogs.py file if you want to help add this missing metadata.
Hope this helps

dgrnd4 added the "dataset request" (Requesting to add a new dataset) label Jun 15, 2022

mariosasko added the "good first issue" (Good for newcomers) label Jun 30, 2022

github-actions bot assigned khushmeeet Jul 9, 2022

khushmeeet mentioned this issue Jul 9, 2022

Add stanford dog dataset #4664

Closed

mariosasko closed this as completed Oct 18, 2023
