
Split definition of DTD, EuroSAT and SUN397 #1

Closed
gortizji opened this issue Feb 6, 2023 · 13 comments

gortizji commented Feb 6, 2023

Hi, awesome work!

I'm trying to reproduce your results but I cannot find the split definitions you use for DTD, EuroSAT and SUN397. Would you mind pointing me to the right resources to download the versions of these datasets compatible with your code?

Thanks a lot!


gabrielilharco commented Feb 6, 2023

Hi @gortizji. Thanks for the interest in our work and for the kind words!

Please note that in this codebase we use the suffix "Val" to indicate that we want to use the validation set instead of the test set (e.g., evaluating on DTDVal uses the validation set, while DTD uses the test set; you should also use the Val suffix when training).
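
For illustration, a minimal sketch of that naming convention (a hypothetical helper, not the codebase's actual loader):

def resolve_split(dataset_name: str):
    """Hypothetical helper: map 'DTDVal' -> ('DTD', 'val') and 'DTD' -> ('DTD', 'test')."""
    if dataset_name.endswith("Val"):
        return dataset_name[:-len("Val")], "val"
    return dataset_name, "test"

assert resolve_split("DTDVal") == ("DTD", "val")
assert resolve_split("DTD") == ("DTD", "test")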

Hope this helps, and let me know if you have any other questions!


gortizji commented Feb 7, 2023

Thanks @gabrielilharco for your quick answer. Some follow-up questions:

  • DTD: There are 10 different splits defined on the original webpage. I assume you use the first split, then?
  • EuroSAT: As far as I can tell, EuroSAT has only 27,000 images. Could the split be 12,000/5,000/10,000?
  • SUN397: Again, there are 10 different balanced splits defined on the original webpage. Do you use one in particular?

Thanks!

@gabrielilharco

For DTD and SUN397, yes, we use the first split (train1.txt + val1.txt / test1.txt for DTD, and Training_01.txt / Testing_01.txt for SUN397, as in https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip). For EuroSAT, it indeed has 27,000 images, and we use a 21,600/2,700/2,700 train/val/test split (I also updated the previous message).
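
As a quick sanity check on those EuroSAT numbers (the 270 images held out per class come from the scripts posted later in this thread):

total_images, num_classes = 27_000, 10
val_total = test_total = 270 * num_classes           # 2,700 images each
train_total = total_images - val_total - test_total  # 21,600 images
assert (train_total, val_total, test_total) == (21_600, 2_700, 2_700)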


gortizji commented Feb 8, 2023

That makes sense 😄. Thanks a lot @gabrielilharco!

@gortizji gortizji closed this as completed Feb 8, 2023
@gortizji

Hi again,

Could you comment on the expected folder structure for SUN397? It seems to determine the class names of the dataset, but I am not sure how to handle the nested structure of labels such as volleyball_court/indoor and volleyball_court/outdoor.

Thanks in advance 😄

@gortizji gortizji reopened this Mar 24, 2023
@gabrielilharco

Hi @gortizji,

We expect the data to be stored without nested folders; it should look like this:

a_abbey
     sun_aaalbzqrimafwbiv.jpg
     sun_aasgdbvvfthiibcm.jpg
     ...
a_airplane_cabin
a_airport_terminal
a_alley
a_amphitheater
...
v_volleyball_court_indoor
v_volleyball_court_outdoor
...
y_youth_hostel
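
In other words, each nested SUN397 label path is flattened by joining its components with underscores. A minimal sketch of the mapping that the scripts later in this thread implement:

def flatten_label(path: str) -> str:
    """Flatten a SUN397 label path, e.g. '/v/volleyball_court/indoor' -> 'v_volleyball_court_indoor'."""
    return "_".join(path.strip("/").split("/"))

assert flatten_label("/a/abbey") == "a_abbey"
assert flatten_label("/v/volleyball_court/indoor") == "v_volleyball_court_indoor"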

@gortizji

Perfect! Thanks a lot.

@prateeky2806

Hi @gabrielilharco and @gortizji, I am facing a similar issue. I have downloaded the datasets from the provided links, but I am not sure how to structure the downloaded files so that they can be loaded correctly. Do either of you have a script that can be used to structure these downloaded datasets correctly?

Thank you in advance!
Prateek Yadav

prateeky2806 commented Apr 19, 2023

I figured this out myself, but for anyone else in the same situation, here are the scripts I used. Four datasets require manual downloading: DTD, EuroSAT, RESISC45, and SUN397. The links for downloading the datasets and the split files are mentioned above in the thread; download the datasets from those links.
No folder restructuring is required for RESISC45. I have provided the code for SUN397 and EuroSAT below. I forgot to save the script for DTD, but it's very similar to the ones provided below.

## PROCESS SUN397 DATASET

import os
import shutil


def process_dataset(txt_file, downloaded_data_path, output_folder):
    """Copy images listed in a SUN397 partition file (e.g. Training_01.txt) into flat class folders."""
    with open(txt_file, 'r') as file:
        lines = file.readlines()

    for i, line in enumerate(lines):
        input_path = line.strip()  # e.g. "/v/volleyball_court/indoor/sun_xxx.jpg"
        # Flatten the nested label into a single folder name, e.g. "v_volleyball_court_indoor".
        final_folder_name = "_".join(input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)

        os.makedirs(output_class_folder, exist_ok=True)

        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")

downloaded_data_path = "path/to/downloaded/SUN/data"
process_dataset('Training_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "train"))
process_dataset('Testing_01.txt', downloaded_data_path, os.path.join(downloaded_data_path, "val"))

### PROCESS EuroSAT_RGB DATASET

import os
import shutil
import random

def create_directory_structure(base_dir, classes):
    """Create train/val/test folders, each with one subfolder per class."""
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(base_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)

def split_dataset(base_dir, source_dir, classes, val_size=270, test_size=270):
    """Randomly hold out val_size + test_size images per class; the rest go to train."""
    for cls in classes:
        class_path = os.path.join(source_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)  # note: call random.seed(...) first for a reproducible split

        val_images = images[:val_size]
        test_images = images[val_size:val_size + test_size]
        train_images = images[val_size + test_size:]

        for split, split_images in [('train', train_images), ('val', val_images), ('test', test_images)]:
            for img in split_images:
                src_path = os.path.join(class_path, img)
                dst_path = os.path.join(base_dir, split, cls, img)
                shutil.copy(src_path, dst_path)

source_dir = '/nas-hdd/prateek/data/EuroSAT_RGB'  # replace with the path to your dataset
base_dir = '/nas-hdd/prateek/data/EuroSAT_Splitted'  # replace with the path to the output directory

classes = [d for d in os.listdir(source_dir) if os.path.isdir(os.path.join(source_dir, d))]

create_directory_structure(base_dir, classes)
split_dataset(base_dir, source_dir, classes)
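
An optional sanity check to run afterwards (my addition, assuming the layout the script above produces); the totals should come out to 21,600/2,700/2,700:

def count_images(split_dir):
    # Count all files under a split directory (one subfolder per class).
    return sum(len(files) for _, _, files in os.walk(split_dir))

for split in ['train', 'val', 'test']:
    print(split, count_images(os.path.join(base_dir, split)))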

Cheers,
Prateek

@gabrielilharco

Thanks a lot @prateeky2806!

@fredzzhang

Hi @gabrielilharco and @prateeky2806,

I might have missed something, but why doesn't RESISC45 need a folder structure? The dataset class inherits from torchvision's ImageFolder class, which assumes the images are arranged in folders named after the classes. So when I tried to run the code on this dataset, I got the following error.

...
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 309, in __init__
    super().__init__(
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 144, in __init__
    classes, class_to_idx = self.find_classes(self.root)
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 218, in find_classes
    return find_classes(directory)
  File "/home/frederic/miniconda3/envs/ws/lib/python3.10/site-packages/torchvision/datasets/folder.py", line 42, in find_classes
    raise FileNotFoundError(f"Couldn't find any class folder in {directory}.")
FileNotFoundError: Couldn't find any class folder in /home/frederic/data/resisc45/NWPU-RESISC45.

Is there something I missed?

Thanks,
Fred.

@fredzzhang

It seems that creating a folder for each class is indeed necessary. Either way, I'll attach the scripts to set up the RESISC45 dataset for future reference.

mkdir resisc45 && cd resisc45
# Download the dataset and splits
FILE=NWPU-RESISC45.rar
ID=1DnPSU5nVSN7xv95bpZ3XQ0JhKXZOKgIv
wget --load-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&confirm=$(wget --quiet --save-cookies /tmp/cookies.txt "https://docs.google.com/uc?export=download&id=$ID" -O- | sed -rn 's/.*confirm=([0-9A-Za-z_]+).*/\1/p')&id=$ID" -O $FILE && rm -rf /tmp/cookies.txt
unrar x $FILE
wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"
# Partition the dataset into different classes

import os
import shutil

def create_directory_structure(data_root, split):
    """Move each image listed in a split file into a folder named after its class."""
    split_file = f'resisc45-{split}.txt'
    with open(os.path.join(data_root, split_file), 'r') as f:
        lines = f.readlines()
    for l in lines:
        l = l.strip()  # e.g. "airplane_001.jpg"
        # The class name is everything before the trailing image index, e.g. "airplane".
        class_name = '_'.join(l.split('_')[:-1])
        class_dir = os.path.join(data_root, 'NWPU-RESISC45', class_name)
        os.makedirs(class_dir, exist_ok=True)
        src_path = os.path.join(data_root, 'NWPU-RESISC45', l)
        dst_path = os.path.join(class_dir, l)
        shutil.move(src_path, dst_path)

data_root = '/home/frederic/data/resisc45'
for split in ['train', 'val', 'test']:
    create_directory_structure(data_root, split)
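
An optional check of the result (my addition): NWPU-RESISC45 has 45 classes with 700 images each, so after the move every class folder should contain 700 files:

root = os.path.join(data_root, 'NWPU-RESISC45')
class_dirs = [d for d in os.listdir(root) if os.path.isdir(os.path.join(root, d))]
assert len(class_dirs) == 45
assert all(len(os.listdir(os.path.join(root, d))) == 700 for d in class_dirs)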

Cheers,
Fred.

enkeejunior1 commented Jul 18, 2024

For anyone who wants fully automatic code for dataset preparation, I'll attach my code below.

Before running it, please manually download the RESISC45 dataset to your ~/ path (the OneDrive link is referenced in the resisc45 section of the script).

download.sh

pip install kaggle  # the kaggle CLI is distributed via pip, not apt
mkdir <your base dir>
cd <your base dir>
export KAGGLE_USERNAME=<your kaggle username>
export KAGGLE_KEY=<your kaggle key>

# stanford cars dataset (ref: https://github.com/pytorch/vision/issues/7545#issuecomment-1631441616)
mkdir stanford_cars && cd stanford_cars
kaggle datasets download -d jessicali9530/stanford-cars-dataset
kaggle datasets download -d abdelrahmant11/standford-cars-dataset-meta
unzip standford-cars-dataset-meta.zip
unzip stanford-cars-dataset.zip
tar -xvzf car_devkit.tgz
mv cars_test a
mv a/cars_test/ cars_test
rm -rf a
mv cars_train a
mv a/cars_train/ cars_train
rm -rf a
mv 'cars_test_annos_withlabels (1).mat' cars_test_annos_withlabels.mat
rm -rf 'cars_annos (2).mat' *.zip
cd ..

# resisc45
mkdir resisc45 && cd resisc45
# (manual download) https://onedrive.live.com/?authkey=%21AHHNaHIlzp%5FIXjs&id=5C5E061130630A68%21107&cid=5C5E061130630A68&parId=root&parQt=sharedby&o=OneUp
mv ~/NWPU-RESISC45.rar ./
sudo apt -y install unar
unar NWPU-RESISC45.rar
wget -O resisc45-train.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-train.txt"
wget -O resisc45-val.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-val.txt"
wget -O resisc45-test.txt "https://storage.googleapis.com/remote_sensing_representations/resisc45-test.txt"
rm -rf NWPU-RESISC45.rar
cd ..

# dtd
mkdir dtd && cd dtd
wget https://www.robots.ox.ac.uk/~vgg/data/dtd/download/dtd-r1.0.1.tar.gz
tar -xvzf dtd-r1.0.1.tar.gz
rm -rf dtd-r1.0.1.tar.gz
mv dtd/images images
mv dtd/imdb/ imdb
mv dtd/labels labels
cat labels/train1.txt labels/val1.txt > labels/train.txt
cat labels/test1.txt > labels/test.txt
cd ..

# euro_sat
mkdir euro_sat && cd euro_sat
wget --no-check-certificate https://madm.dfki.de/files/sentinel/EuroSAT.zip
unzip EuroSAT.zip
rm -rf EuroSAT.zip
cd ..

# sun397
mkdir sun397 && cd sun397
wget http://vision.princeton.edu/projects/2010/SUN/SUN397.tar.gz
wget https://vision.princeton.edu/projects/2010/SUN/download/Partitions.zip
unzip Partitions.zip
tar -xvzf SUN397.tar.gz
rm -rf SUN397.tar.gz

split_dataset.py

base_dir = '<your base dir>'  # same base dir as used in download.sh

### PROCESS SUN397 DATASET
import os
import shutil

downloaded_data_path = f"{base_dir}/sun397"
output_path = f"{base_dir}/sun397"

def process_dataset(txt_file, downloaded_data_path, output_folder):
    """Copy images listed in a SUN397 partition file into flat class folders."""
    with open(txt_file, 'r') as file:
        lines = file.readlines()

    for i, line in enumerate(lines):
        input_path = line.strip()  # e.g. "/v/volleyball_court/indoor/sun_xxx.jpg"
        # Flatten the nested label into a single folder name, e.g. "v_volleyball_court_indoor".
        final_folder_name = "_".join(input_path.split('/')[:-1])[1:]
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)

        os.makedirs(output_class_folder, exist_ok=True)

        full_input_path = os.path.join(downloaded_data_path, input_path[1:])
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")

process_dataset(
    os.path.join(downloaded_data_path, 'Training_01.txt'), 
    os.path.join(downloaded_data_path, 'SUN397'), 
    os.path.join(output_path, "train")
)
process_dataset(
    os.path.join(downloaded_data_path, 'Testing_01.txt'), 
    os.path.join(downloaded_data_path, 'SUN397'), 
    os.path.join(output_path, "val")
)


### PROCESS EuroSAT_RGB DATASET
src_dir = f'{base_dir}/euro_sat/2750'   # extracted EuroSAT images
dst_dir = f'{base_dir}/EuroSAT_splits'  # output directory for the split

import os
import shutil
import random

def create_directory_structure(dst_dir, classes):
    """Create train/val/test folders, each with one subfolder per class."""
    for dataset in ['train', 'val', 'test']:
        path = os.path.join(dst_dir, dataset)
        os.makedirs(path, exist_ok=True)
        for cls in classes:
            os.makedirs(os.path.join(path, cls), exist_ok=True)

def split_dataset(dst_dir, src_dir, classes, val_size=270, test_size=270):
    """Randomly hold out val_size + test_size images per class; the rest go to train."""
    for cls in classes:
        class_path = os.path.join(src_dir, cls)
        images = os.listdir(class_path)
        random.shuffle(images)  # note: call random.seed(...) first for a reproducible split

        val_images = images[:val_size]
        test_images = images[val_size:val_size + test_size]
        train_images = images[val_size + test_size:]

        for split, split_images in [('train', train_images), ('val', val_images), ('test', test_images)]:
            for img in split_images:
                src_path = os.path.join(class_path, img)
                dst_path = os.path.join(dst_dir, split, cls, img)
                shutil.copy(src_path, dst_path)

classes = [d for d in os.listdir(src_dir) if os.path.isdir(os.path.join(src_dir, d))]
create_directory_structure(dst_dir, classes)
split_dataset(dst_dir, src_dir, classes)

### PROCESS DTD DATASET
import os
import shutil

downloaded_data_path = f"{base_dir}/dtd/images"
output_path = f"{base_dir}/dtd"

def process_dataset(txt_file, downloaded_data_path, output_folder):
    """Copy images listed in a DTD label file into per-class folders."""
    with open(txt_file, 'r') as file:
        lines = file.readlines()

    for i, line in enumerate(lines):
        input_path = line.strip()  # e.g. "banded/banded_0002.jpg"
        final_folder_name = input_path.split('/')[0]  # class name, e.g. "banded"
        filename = input_path.split('/')[-1]
        output_class_folder = os.path.join(output_folder, final_folder_name)

        os.makedirs(output_class_folder, exist_ok=True)

        full_input_path = os.path.join(downloaded_data_path, input_path)
        output_file_path = os.path.join(output_class_folder, filename)
        shutil.copy(full_input_path, output_file_path)
        if i % 100 == 0:
            print(f"Processed {i}/{len(lines)} images")

process_dataset(
    f'{base_dir}/dtd/labels/train.txt', downloaded_data_path, os.path.join(output_path, "train")
)
process_dataset(
    f'{base_dir}/dtd/labels/test.txt', downloaded_data_path, os.path.join(output_path, "val")
)
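
An optional sanity check (my addition): DTD has 47 classes with 120 images each, split 40/40/40 across train1/val1/test1, so the merged label files should have the following line counts:

# train.txt = train1 + val1 -> 47 * 80 = 3,760 lines; test.txt -> 47 * 40 = 1,880 lines.
with open(f'{base_dir}/dtd/labels/train.txt') as f:
    assert len(f.readlines()) == 47 * 80
with open(f'{base_dir}/dtd/labels/test.txt') as f:
    assert len(f.readlines()) == 47 * 40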
