Can you please add the Stanford dog dataset? #4504

Closed
dgrnd4 opened this issue Jun 15, 2022 · 16 comments
Labels: dataset request (Requesting to add a new dataset), good first issue (Good for newcomers)

Comments

@dgrnd4

dgrnd4 commented Jun 15, 2022

Adding a Dataset

Instructions to add a new dataset can be found here.

dgrnd4 added the "dataset request" label on Jun 15, 2022
@julien-c
Member

would you like to give it a try, @dgrnd4? (maybe with the help of the dataset author?)

@dgrnd4
Author

dgrnd4 commented Jun 15, 2022

@julien-c I am sorry, but I have no idea how it works: can I add the dataset myself by following the "instructions to add a new dataset"?
Can I add a dataset even if it's not mine? (It's public at the link I included in the post.)

@mariosasko
Collaborator

Hi! The ADD NEW DATASET instructions are indeed the best place to start. It's also perfectly fine to add a dataset if it's public, even if it's not yours. Let me know if you need some additional pointers.

mariosasko added the "good first issue" label on Jun 30, 2022
@khushmeeet
Contributor

If no one is working on this, I could take this up!

@dgrnd4
Author

dgrnd4 commented Jul 3, 2022

@khushmeeet This is the link where I already added the dataset. If you can, I would ask you to do the following:

  1. The dataset is all in the training set: can you please divide it into training, test and validation sets? If you can, for each class, take 80% for the training set, 10% for the test set and 10% for the validation set.
  2. The images have different sizes: can you please resize all of them to 224,224,3? Pay attention to the last dimension, "3", because for some images the last dimension is 4!

Thank you!!

@mariosasko
Collaborator

Hi @khushmeeet! Thanks for the interest. You can self-assign the issue by commenting #self-assign on it.

Also, I think we can skip @dgrnd4's steps as we try to avoid any custom processing on top of raw data. One can later copy the script and override _post_process in it to perform such processing on the generated dataset.

@khushmeeet
Contributor

Thanks @mariosasko

@dgrnd4 Since the dataset is already on the Hub and preprocessing is not recommended, I am not sure there is anything else to do. However, I can't seem to find the relevant .py files for this dataset in the GitHub repo.

@dgrnd4
Author

dgrnd4 commented Jul 6, 2022

@khushmeeet @mariosasko The point is that the images must be processed and must all have the same size before they can be used for things such as training.

@mariosasko
Collaborator

@dgrnd4 Yes, but this can be done after loading (map to resize the images and train_test_split to create the extra splits).
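
A minimal sketch of that suggestion, assuming the object passed in is the single train split of the dataset loaded with the datasets library and that its image column holds PIL images. The function names, the 224x224 target size and the seed are illustrative choices, not part of the library:

```python
def to_rgb_224(example):
    """Convert one example's image to 3-channel RGB and resize it to 224x224."""
    # convert("RGB") drops the alpha channel of 4-channel (RGBA) images.
    example["image"] = example["image"].convert("RGB").resize((224, 224))
    return example


def split_80_10_10(train_split, seed=42):
    """Split one datasets.Dataset into train/validation/test (80/10/10)."""
    # First cut: keep 80% for training, hold out 20%.
    first = train_split.train_test_split(test_size=0.2, seed=seed)
    # Second cut: halve the held-out 20% into validation and test.
    second = first["test"].train_test_split(test_size=0.5, seed=seed)
    return {
        "train": first["train"],
        "validation": second["train"],
        "test": second["test"],
    }


# Usage, assuming `dataset` is the DatasetDict loaded from the Hub:
# splits = split_80_10_10(dataset["train"])
# splits = {name: split.map(to_rgb_224) for name, split in splits.items()}
```

Both steps run on the loaded dataset, so the raw data on the Hub stays untouched.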

@khushmeeet The linked version is implemented as a no-code dataset and is generated directly from the ZIP archive, but our "GitHub" datasets (these are datasets without a user/org namespace on the Hub) need a generation script, and you can find one here. datasets started as a fork of TFDS, so we share similar script structure, which makes it trivial to adapt it.

@dgrnd4
Author

dgrnd4 commented Jul 6, 2022

@mariosasko If I use something like this:
x_train, x_test = train_test_split(dataset, test_size=0.1)

I get 90% train and 10% test. To also get a validation set (10% of the whole dataset), I would do:

train_ratio = 0.80
validation_ratio = 0.10
test_ratio = 0.10

x_train, x_test, y_train, y_test = train_test_split(dataX, dataY, test_size=1 - train_ratio)
x_val, x_test, y_val, y_test = train_test_split(x_test, y_test, test_size=test_ratio/(test_ratio + validation_ratio)) 
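
As a quick check of the arithmetic in the two calls above: holding out 1 - 0.80 = 0.20 of the data and then giving test_ratio / (test_ratio + validation_ratio) = 0.5 of that holdout to the test set yields an 80/10/10 split. The helper below is a hypothetical illustration (it is not part of scikit-learn) that computes the expected sizes with plain rounding; scikit-learn's own rounding may differ by one sample when the sizes do not divide evenly.

```python
def split_sizes(n_samples, train_ratio=0.80, validation_ratio=0.10, test_ratio=0.10):
    """Return (n_train, n_validation, n_test) for the two-stage split."""
    # Stage 1: hold out everything that is not training data.
    n_holdout = round(n_samples * (1 - train_ratio))
    # Stage 2: share of the holdout that goes to the test set.
    test_share = test_ratio / (test_ratio + validation_ratio)
    n_test = round(n_holdout * test_share)
    n_validation = n_holdout - n_test
    n_train = n_samples - n_holdout
    return n_train, n_validation, n_test


print(split_sizes(20580))  # (16464, 2058, 2058)
```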

However, the structure of the data is:

DatasetDict({
    train: Dataset({
        features: ['image', 'label'],
        num_rows: 20580
    })
})

So how do I extract the images and labels?

EDIT --> Splitting the dataset into train, test and validation:

import datasets
from datasets import Dataset

# `dataset` is the DatasetDict shown above, with a single "train" split.
n_rows = len(dataset['train'])                           # 20580
percentage_division_test = n_rows * 10 // 100            # 10%  --> 2058
percentage_division_validation = n_rows * 20 // 100      # 20%  --> 4116 (test + validation)

def take(start, stop):
    """Copy rows [start, stop) of the train split into a new Dataset."""
    rows = dataset['train'][start:stop]
    return Dataset.from_dict({'image': rows['image'], 'labels': rows['label']})

# Rows are taken in their stored order (no shuffling). The train slice stops
# where the test slice begins, so the three splits do not overlap.
dataset_ = datasets.DatasetDict({
    "train": take(0, n_rows - percentage_division_validation),                                 # first 80%
    "test": take(n_rows - percentage_division_validation, n_rows - percentage_division_test),  # next 10%
    "validation": take(n_rows - percentage_division_test, n_rows),                             # last 10%
})
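
The slices above take rows in their stored order, so if the rows happen to be grouped by class, the test and validation sets would only contain the last classes. A hedged sketch of the per-class 80/10/10 split requested earlier in the thread, using only the standard library; the function name and defaults are hypothetical, and it works on a plain list of labels rather than on the dataset itself:

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, validation_ratio=0.10, test_ratio=0.10, seed=0):
    """Return row indices for train/validation/test, split class by class."""
    rng = random.Random(seed)

    # Group the row indices by their class label.
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    splits = {"train": [], "validation": [], "test": []}
    for indices in by_class.values():
        # Shuffle within the class, then cut it into three parts.
        rng.shuffle(indices)
        n_test = int(len(indices) * test_ratio)
        n_validation = int(len(indices) * validation_ratio)
        splits["test"] += indices[:n_test]
        splits["validation"] += indices[n_test:n_test + n_validation]
        splits["train"] += indices[n_test + n_validation:]
    return splits
```

With the datasets library, each index list could then be applied with dataset["train"].select(indices) to build the corresponding split.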

@dgrnd4

This comment was marked as resolved.

@khushmeeet
Contributor

#self-assign

@khushmeeet
Contributor

I have raised a PR to add the stanford-dog dataset. I have not added any data preprocessing code; only the dataset generation script is there. Let me know if any changes are required, or if anything should be added to the README.

@zutarich

zutarich commented Oct 9, 2023

Is this issue still open? I am new to open source and would like to take this one as my start.

@mariosasko
Collaborator

@zutarich This issue should have been closed since the dataset in question is available on the Hub here.

@AlanBlanchet

AlanBlanchet commented Dec 9, 2024

I didn't know about this issue until now, but I added my version of the dataset to the Hub, with the bboxes:
https://huggingface.co/datasets/Alanox/stanford-dogs

I could have made it cleaner, though, by building the splits from the .txt files and putting it into the COCO format.
There is a stanford-dogs.py file if you want to help add the missing metadata.
Hope this helps.
