Stanford cars download url is broken - HTTP 404 #7545

Closed
IamMohitM opened this issue Apr 29, 2023 · 13 comments

IamMohitM commented Apr 29, 2023

🐛 Describe the bug

The Stanford Cars dataset is no longer available at the URL used in the source code:

https://pytorch.org/vision/main/_modules/torchvision/datasets/stanford_cars.html

Reproduce error:

import torch
import torchvision

train = torchvision.datasets.StanfordCars(root=".", download=True)

I get the following error

HTTPError                                 Traceback (most recent call last)
Cell In[18], line 4
      1 import torch
      2 import torchvision
----> 4 train = torchvision.datasets.StanfordCars(root=".", download=True)

File ~/Projects/diffusion/env/lib/python3.10/site-packages/torchvision/datasets/stanford_cars.py:60, in StanfordCars.__init__(self, root, split, transform, target_transform, download)
     57     self._images_base_path = self._base_folder / "cars_test"
     59 if download:
---> 60     self.download()
     62 if not self._check_exists():
     63     raise RuntimeError("Dataset not found. You can use download=True to download it")

File ~/Projects/diffusion/env/lib/python3.10/site-packages/torchvision/datasets/stanford_cars.py:94, in StanfordCars.download(self)
     91 if self._check_exists():
     92     return
---> 94 download_and_extract_archive(
     95     url="https://ai.stanford.edu/~jkrause/cars/car_devkit.tgz",
     96     download_root=str(self._base_folder),
     97     md5="c3b158d763b6e2245038c8ad08e45376",
     98 )
     99 if self._split == "train":
    100     download_and_extract_archive(
    101         url="https://ai.stanford.edu/~jkrause/car196/cars_train.tgz",
...
File /usr/local/Cellar/python@3.10/3.10.10_1/Frameworks/Python.framework/Versions/3.10/lib/python3.10/urllib/request.py:643, in HTTPDefaultErrorHandler.http_error_default(self, req, fp, code, msg, hdrs)
    642 def http_error_default(self, req, fp, code, msg, hdrs):
--> 643     raise HTTPError(req.full_url, code, msg, hdrs, fp)

HTTPError: HTTP Error 404: Not Found

Versions

PyTorch version: 2.0.0
Is debug build: False
CUDA used to build PyTorch: None
ROCM used to build PyTorch: N/A

OS: macOS 13.2.1 (x86_64)
GCC version: Could not collect
Clang version: 14.0.0 (clang-1400.0.29.202)
CMake version: version 3.19.8
Libc version: N/A

Python version: 3.10.10 (main, Feb 16 2023, 02:58:25) [Clang 14.0.0 (clang-1400.0.29.202)] (64-bit runtime)
Python platform: macOS-13.2.1-x86_64-i386-64bit
Is CUDA available: False
CUDA runtime version: No CUDA
CUDA_MODULE_LOADING set to: N/A
GPU models and configuration: No CUDA
Nvidia driver version: No CUDA
cuDNN version: No CUDA
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz

Versions of relevant libraries:
[pip3] mypy-extensions==1.0.0
[pip3] numpy==1.24.3
[pip3] torch==2.0.0
[pip3] torchaudio==2.0.1
[pip3] torchvision==0.15.1
[conda] blas 1.0 mkl
[conda] mkl 2019.4 233
[conda] mkl-service 2.3.0 py38h9ed2024_0
[conda] mkl_fft 1.2.1 py38ha059aab_0
[conda] mkl_random 1.1.1 py38h959d312_0
[conda] numpy 1.20.1 pypi_0 pypi
[conda] numpy-base 1.19.2 py38hcfb5961_0
[conda] numpydoc 1.1.0 pyhd3eb1b0_1
[conda] pytorch3d 0.4.0 pypi_0 pypi
[conda] torch 1.9.0 pypi_0 pypi

cc @pmeier

@pmeier
Collaborator

pmeier commented May 1, 2023

It's not just the download; the whole website seems to be down: https://ai.stanford.edu/~jkrause/cars/car_dataset.html. Googling for this dataset shows that this is still the "current" address, so the outage may be temporary and the site may come back up. We should monitor this and reach out to the authors if it persists.

Wondering why our download tests didn't catch this 🤔
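A dead link like this can be spotted without downloading anything by sending a HEAD request. The following is only a minimal sketch using the standard library; `url_status` and its `opener` parameter are hypothetical names introduced here for illustration and are not part of torchvision or its test suite.

```python
import urllib.error
import urllib.request

# URL copied from the traceback earlier in this issue.
DEVKIT_URL = "https://ai.stanford.edu/~jkrause/cars/car_devkit.tgz"


def url_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status of a HEAD request, or None if the host is unreachable.

    `opener` is injectable so the function can be exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # HTTPError must be caught before URLError because it is a subclass of it.
        return err.code  # e.g. 404 when the file is gone
    except urllib.error.URLError:
        return None  # DNS failure, refused connection, and similar errors
```

Calling `url_status(DEVKIT_URL)` from a scheduled job and alerting on anything other than 200 would be one way to notice such an outage early.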

@ysmintor

I ran into the same problem, and I have not found a mirror for the download link. I hope they can fix it as soon as possible.

@IamMohitM
Author

For people looking for a quick solution:

You can download the dataset from Kaggle and use the following class to create a torch Dataset. Note that it yields images only, without labels.

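import os

import torch
from PIL import Image


class StanfordCars(torch.utils.data.Dataset):
    """Image-only dataset: yields every .jpg in root_path, without labels."""

    def __init__(self, root_path, transform=None):
        # Sort for a deterministic order and skip non-image files.
        self.images = sorted(
            os.path.join(root_path, file)
            for file in os.listdir(root_path)
            if file.lower().endswith(".jpg")
        )
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, index):
        image = Image.open(self.images[index]).convert("RGB")
        if self.transform:
            image = self.transform(image)
        # Return the image itself; a DataLoader adds the batch dimension.
        return image
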

@ricefryegg

ricefryegg commented Jun 4, 2023

Thank you @IamMohitM. We can keep using data = torchvision.datasets.StanfordCars(root="./", download=True) without any changes to the code by following these steps:

  • Download the dataset bundle from Kaggle, extract it, and remove the nested directory level (e.g. stanford_cars/cars_test/cars_test)
  • Download car_devkit.tgz and extract it in stanford_cars

Confirm that the directory structure is as follows:

└── stanford_cars
    ├── cars_train
    │   └── *.jpg
    ├── cars_test
    │   └── *.jpg
    └── devkit
        ├── cars_meta.mat
        ├── cars_test_annos.mat
        ├── cars_train_annos.mat
        ├── eval_train.m
        ├── README.txt
        └── train_perfect_preds.txt
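The "remove recursive directory structure" step can be scripted. This is only a sketch, assuming the Kaggle bundle extracts to a folder that contains a second folder of the same name (as in stanford_cars/cars_test/cars_test); `flatten_nested_dir` is a hypothetical helper name, not something provided by torchvision or Kaggle.

```python
import shutil
from pathlib import Path


def flatten_nested_dir(folder):
    """Move everything from folder/<folder.name>/ up into folder/.

    Returns the number of entries moved (0 if there is no nested folder).
    """
    folder = Path(folder)
    nested = folder / folder.name
    if not nested.is_dir():
        return 0  # nothing to flatten
    moved = 0
    for item in list(nested.iterdir()):
        shutil.move(str(item), str(folder / item.name))
        moved += 1
    nested.rmdir()  # succeeds only because the nested folder is now empty
    return moved
```

For example, `flatten_nested_dir("stanford_cars/cars_test")` and `flatten_nested_dir("stanford_cars/cars_train")` would be run once each after extracting the bundle.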

@pgsld23333

Thank you @IamMohitM. We can keep using data = torchvision.datasets.StanfordCars(root="./", download=True) without any changes to the code by following these steps:

* Download the dataset bundle from [kaggle](https://www.kaggle.com/datasets/jessicali9530/stanford-cars-dataset), extract it, and remove the nested directory level (e.g. `stanford_cars/cars_test/cars_test`)

* Download [car_devkit.tgz](https://github.com/pytorch/vision/files/11644847/car_devkit.tgz) and extract it in `stanford_cars`

Confirm that the directory structure is as follows:

└── stanford_cars
    ├── cars_train
    │   └── *.jpg
    ├── cars_test
    │   └── *.jpg
    └── devkit
        ├── cars_meta.mat
        ├── cars_test_annos.mat
        ├── cars_train_annos.mat
        ├── eval_train.m
        ├── README.txt
        └── train_perfect_preds.txt

But the annotations for the test set are still not available.
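Because the devkit now comes from a mirror rather than the original host, it may be worth comparing its checksum with the MD5 that torchvision passes for car_devkit.tgz in the traceback at the top of this issue. The sketch below assumes the mirrored archive is byte-identical to the original, which is not confirmed anywhere in this thread, so a mismatch could simply mean the file was repackaged. `file_md5` and `matches_md5` are hypothetical helper names.

```python
import hashlib

# MD5 shown for car_devkit.tgz in the traceback at the top of this issue.
DEVKIT_MD5 = "c3b158d763b6e2245038c8ad08e45376"


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path, expected=DEVKIT_MD5):
    """Return True if the file's MD5 equals the expected hex digest."""
    return file_md5(path) == expected.lower()
```

Usage would be `matches_md5("car_devkit.tgz")` on the downloaded archive before extracting it.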

@iamchenxin

A cars_test_annos_withlabels.mat file should be placed in the base folder (something like "stanford_cars\cars_test_annos_withlabels.mat").

@thefirebanks

thefirebanks commented Jul 11, 2023

@iamchenxin Just to clarify what you mean:

The original file cars_test_annos.mat in the devkit/ folder you mentioned does NOT contain the annotated labels, so it's not enough to download the dataset from Kaggle and the devkit you sent. I found the cars_test_annos_withlabels.mat file in one of the examples from the Kaggle dataset:

https://www.kaggle.com/code/subhangaupadhaya/pytorch-stanfordcars-classification/input?select=cars_test_annos_withlabels+%281%29.mat

and I'm sure that other code examples also load this file as part of their input. So to summarize, we need the images from Kaggle, the devkit, and the cars_test_annos_withlabels.mat file.

The directory structure you provided earlier works well once we add the missing file!

└── stanford_cars
    ├── cars_test_annos_withlabels.mat
    ├── cars_train
    │   └── *.jpg
    ├── cars_test
    │   └── *.jpg
    └── devkit
        ├── cars_meta.mat
        ├── cars_test_annos.mat
        ├── cars_train_annos.mat
        ├── eval_train.m
        ├── README.txt
        └── train_perfect_preds.txt

If the script/notebook we're writing the code in is at the same directory level as the stanford_cars/ folder, we can write:

data = torchvision.datasets.StanfordCars(root="./", download=True)

Hope that this helps! @pgsld23333 @IamMohitM let me know if I missed something.
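Before constructing the dataset, the layout above can be checked with a few lines of pathlib. This is a sketch based only on the tree in this comment: the `EXPECTED` mapping of splits to required entries and the helper name `missing_entries` are assumptions made for illustration, not torchvision's own validation logic.

```python
from pathlib import Path

# Entries relative to the `root` passed to StanfordCars, taken from the tree above.
# Which split needs which entry is an assumption made for this sketch.
EXPECTED = {
    "train": [
        "stanford_cars/devkit",
        "stanford_cars/devkit/cars_train_annos.mat",
        "stanford_cars/cars_train",
    ],
    "test": [
        "stanford_cars/devkit",
        "stanford_cars/cars_test_annos_withlabels.mat",
        "stanford_cars/cars_test",
    ],
}


def missing_entries(root, split="train"):
    """Return the expected files or folders that are absent under root."""
    root = Path(root)
    return [rel for rel in EXPECTED[split] if not (root / rel).exists()]
```

An empty list from `missing_entries("./", "test")` suggests the files are in place; otherwise the returned paths show what still has to be downloaded or moved.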

@jzhangCSER01

thanks a lot! it's really helpful

@Coderx7

Coderx7 commented Mar 6, 2024

@pmeier It's been almost a year and this hasn't been remedied yet. What's the plan going forward? Removing it, or keeping it broken like this?

@NicolasHug
Member

NicolasHug commented Mar 12, 2024

Thanks all for the reports. Unfortunately the URL has been consistently broken for a while now, so we decided to disable the (broken) download functionality. Passing download=True will now raise an error that points users to @thefirebanks's #7545 (comment), which suggests downloading the dataset manually from Kaggle. Thank you @thefirebanks for the very helpful instructions.

These changes will take effect in torchvision 0.18, which is planned for release in April 2024. I'll close this issue, but #7545 (comment) is still relevant until further notice.

@rygx
Contributor

rygx commented Aug 11, 2024

It looks like the existing datasets are either missing some files or do not fully conform to the directory structure torchvision requires, so a lot of manual work is needed up front every time.

To tackle this issue, a new dataset has been created on Kaggle that is compatible with the latest version of torchvision: https://www.kaggle.com/datasets/rickyyyyyyy/torchvision-stanford-cars

The dataset can be set up with:

import kaggle
# you need to configure API key through https://www.kaggle.com/docs/api
kaggle.api.dataset_download_files('rickyyyyyyy/torchvision-stanford-cars', path=YOUR_DATA_PATH, unzip=True)

It can then be used with torchvision:

import torchvision
data = torchvision.datasets.StanfordCars(root=YOUR_DATA_PATH, download=False)

@o-laurent

o-laurent commented Nov 5, 2024

Hi, and thanks, @rygx, for your help!

I could put this version of the dataset on Zenodo (giving credits to the original owners) to make it accessible directly from a URL without authentication or additional dependency so as to fix this problem completely. Let me know if that would be helpful.

Have a great day.

@BKJackson

Loading the dataset was not straightforward for me, but I eventually got something to work.

I downloaded this version with kagglehub and saved the files in my Google Drive:

import kagglehub
path = kagglehub.dataset_download("emanuelriquelmem/stanford-cars-pytorch")

This will plot the images:

import os
import matplotlib.pyplot as plt
import matplotlib.image as mpimg

def show_images(dataset_path, num_samples=20, cols=4):
    """ Plots some samples from the dataset """

    # Get a list of all image files in the folder
    image_files = [f for f in os.listdir(dataset_path) if f.endswith('.jpg')]

    plt.figure(figsize=(15,15))

    for i in range(min(num_samples, len(image_files))):
        # Load the image using matplotlib.image
        img = mpimg.imread(os.path.join(dataset_path, image_files[i]))

        plt.subplot(int(num_samples/cols) + 1, cols, i + 1)
        plt.imshow(img)
        plt.axis('off')  # Turn off axis ticks and labels

    plt.show()

# Call the function to display the images
dataset_path = 'drive/MyDrive/stanford_cars/cars_test'
show_images(dataset_path)

This will load the train and test sets:

import os
import scipy.io as sio
import torch  # needed for torch.utils.data.ConcatDataset below
from PIL import Image
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms

IMG_SIZE = 64
BATCH_SIZE = 128

class StanfordCarsDataset(Dataset):
    def __init__(self, root_dir, annotations_file=None, transform=None, test=False):
        self.root_dir = root_dir
        self.transform = transform
        self.image_paths = [os.path.join(root_dir, filename) for filename in os.listdir(root_dir) if filename.endswith('.jpg')]

        if annotations_file: 
            self.annotations = sio.loadmat(annotations_file)['annotations'][0]  # Load annotations
            if test:
                self.filename_to_label = {ann[4][0]: -1 for ann in self.annotations} #Assign -1 to all test images
            else:
                self.filename_to_label = {ann[5][0]: int(ann[4][0][0]) for ann in self.annotations}  # Create mapping
        else:
            self.filename_to_label = {}  # Empty dictionary if no annotations file is provided

    def __len__(self):
        return len(self.image_paths)

    def __getitem__(self, idx):
        image_path = self.image_paths[idx]
        image = Image.open(image_path).convert('RGB')  # Ensure RGB format

        # Get label if available
        filename = os.path.basename(image_path)
        label = self.filename_to_label.get(filename, -1)  # Use -1 as default label if not in the annotations

        if self.transform:
            image = self.transform(image)

        # Always return a label even if -1    
        return image, label

def load_transformed_dataset():
    data_transforms = [
        transforms.Resize((IMG_SIZE, IMG_SIZE)),
        transforms.RandomHorizontalFlip(),
        transforms.ToTensor(), # Scales data into [0,1]
        transforms.Lambda(lambda t: (t * 2) - 1) # Scale between [-1, 1]
    ]
    data_transforms = transforms.Compose(data_transforms)



    train = StanfordCarsDataset(root_dir='drive/MyDrive/stanford_cars/cars_train', 
                               annotations_file='drive/MyDrive/stanford_cars/devkit/cars_train_annos.mat', 
                               transform=data_transforms, test=False)
    
    test = StanfordCarsDataset(root_dir='drive/MyDrive/stanford_cars/cars_test', 
                               annotations_file='drive/MyDrive/stanford_cars/devkit/cars_test_annos.mat', 
                               transform=data_transforms, test=True)

    return torch.utils.data.ConcatDataset([train, test])

data = load_transformed_dataset()
dataloader = DataLoader(data, batch_size=BATCH_SIZE, shuffle=True)
