downsampled_imagenet broken #4662

Open
marikgoldstein opened this issue Jan 18, 2023 · 3 comments
Labels
bug Something isn't working


marikgoldstein commented Jan 18, 2023

Hi TFDS,

Loading downsampled_imagenet (32x32) fails with an HTTP 404 (stack trace at the end of this issue). This is because the ImageNet link stored by tfds (https://image-net.org/small/download.php) is broken. The same broken link also appears in some papers, such as Pixel Recurrent Neural Networks.

There is a different, currently working link for 32x32 ImageNet (https://image-net.org/download-images.php; after logging in, a 32x32 option is shown).

Let us refer to them as OLD (what TFDS used to host) and NEW (currently on imagenet website).

An anonymous ICLR reviewer (see "weaknesses" under reviewer AKwV) mentioned that NEW is "too easy" and cannot be used for comparison with earlier results obtained on OLD. The reviewer also mentioned that OLD circulates in the community via a torrent.

TFDS's link to OLD likely broke within the last 9 months: another Google repo shares code that uses tfds to load downsampled_imagenet (I left an issue there, google-research/vdm#8), and its datasets.py file was pushed about 9 months ago.

Neither OLD nor NEW is the same as imagenet_resized.

Purpose:

  • for the tfds team to consider what to do with the broken link, in light of the above considerations. Fixing it helps the library regardless of any research community issues.
  • (possibly beyond tfds) clarify the difference for researchers and make both versions available

Possible solution:

  • if several people reach consensus that they have OLD, it could be posted on tfds as an "old_downsampled_imagenet" to help reproduce existing research that used the data.

Examples of research using OLD

Some ICLR publications from this year already use NEW.

Thanks!
Mark

Environment information

  • Operating System: Ubuntu VERSION="18.04.6 LTS (Bionic Beaver)"

  • Python version: 3.9.12

  • tensorflow-datasets/tfds-nightly version: tfds '4.7.0' and tfds '4.8.2+nightly'

  • tensorflow/tf-nightly version: tf '2.10.0'

  • Does the issue still exist with the latest tfds-nightly package (pip install --upgrade tfds-nightly)?

Yes

Reproduction instructions

import tensorflow_datasets as tfds
ds = tfds.load('downsampled_imagenet', split='validation', as_supervised=True, batch_size=128)

Link to logs

2023-01-18 12:03:50.178320: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2 AVX AVX2 FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-18 12:03:51.793197: W tensorflow/core/platform/cloud/google_auth_provider.cc:184] All attempts to get a Google authentication bearer token failed, returning an empty token. Retrieving token from files failed with "NOT_FOUND: Could not locate the credentials file.". Retrieving token from GCE failed with "FAILED_PRECONDITION: Error executing an HTTP request: libcurl code 6 meaning 'Couldn't resolve host name', error details: Could not resolve host: metadata".
Downloading and preparing dataset Unknown size (download: Unknown size, generated: Unknown size, total: Unknown size) to /home/marik/tensorflow_datasets/downsampled_imagenet/32x32/2.0.0...
Dl Size...: 0 MiB [00:00, ? MiB/s] | 0/2 [00:00<?, ? url/s]
Dl Completed...:   0%| | 0/2 [00:00<?, ? url/s]
Traceback (most recent call last):
  File "/home/marik/imnet2.py", line 2, in <module>
    ds = tfds.load('downsampled_imagenet', split='validation', as_supervised=True, batch_size=128)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/logging/__init__.py", line 250, in decorator
    return function(*args, **kwargs)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/load.py", line 575, in load
    dbuilder.download_and_prepare(**download_and_prepare_kwargs)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 523, in download_and_prepare
    self._download_and_prepare(
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1244, in _download_and_prepare
    split_generators = self._split_generators(  # pylint: disable=unexpected-keyword-arg
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/image/downsampled_imagenet.py", line 102, in _split_generators
    train_path, valid_path = dl_manager.download([
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 552, in download
    return _map_promise(self._download, url_or_urls)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 770, in _map_promise
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 917, in map_structure
    structure[0], [func(*x) for x in entries],
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow/python/util/nest.py", line 917, in <listcomp>
    structure[0], [func(*x) for x in entries],
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/download_manager.py", line 770, in <lambda>
    res = tf.nest.map_structure(lambda p: p.get(), all_promises)  # Wait promises
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 512, in get
    return self._target_settled_value(_raise=True)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 516, in _target_settled_value
    return self._target()._settled_value(_raise)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 226, in _settled_value
    reraise(type(raise_val), raise_val, self._traceback)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/six.py", line 719, in reraise
    raise value
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/promise/promise.py", line 844, in handle_future_result
    resolve(future.result())
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/_base.py", line 391, in __get_result
    raise self._exception
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/concurrent/futures/thread.py", line 58, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 217, in _sync_download
    with _open_url(url, verify=verify) as (response, iter_content):
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 279, in _open_with_requests
    _assert_status(response)
  File "/home/marik/anaconda2/envs/myenv/lib/python3.9/site-packages/tensorflow_datasets/core/download/downloader.py", line 306, in _assert_status
    raise DownloadError('Failed to get url {}. HTTP code: {}.'.format(
tensorflow_datasets.core.download.downloader.DownloadError: Failed to get url https://image-net.org/small/train_32x32.tar. HTTP code: 404.
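
The 404 can also be checked independently of tfds. The following is a minimal sketch using only the Python standard library; the helper name `http_status` and its `opener` parameter are hypothetical, and the only URL used is the one shown in the traceback above (the validation archive URL does not appear in the logs, so it is not included).

```python
import urllib.error
import urllib.request

# URL copied from the DownloadError message in the traceback above.
TRAIN_URL = "https://image-net.org/small/train_32x32.tar"


def http_status(url, opener=urllib.request.urlopen, timeout=10):
    """Return the HTTP status code of a HEAD request to `url`.

    `opener` is injectable (hypothetical parameter) so the function can be
    exercised without network access.
    """
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx responses; the code is on the exception.
        return err.code
```

Calling `http_status(TRAIN_URL)` on a machine with network access should return 404 for as long as the link remains broken, assuming the server answers HEAD requests the same way it answers the GET issued by tfds.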
marikgoldstein added the bug label Jan 18, 2023
marikgoldstein changed the title from "downsampled_imagenet broken (important for reproducing research)" to "downsampled_imagenet broken" Jan 18, 2023

marikgoldstein commented

I also reached out to the ImageNet moderators for their input and will post any response here.

marikgoldstein commented

@Kim-Dongjun provided a good explanation and shared the location of the torrent that people use to get the original Pixel RNN data. Here is Dongjun's explanation of the discrepancy (which also matches what I've heard from some authors at talks and conferences):

  • There is a downsampled ImageNet dataset, which I call "small".
  • The small dataset was widely used in the generative modeling community for a long time,
  • but StyleGAN-XL, Efficient-VDVAE, and other large-scale papers tend to use the ILSVRC12 dataset when reporting results on ImageNet 32x32 or ImageNet 64x64.
  • The small dataset, however, can no longer be obtained officially. It is available at this torrent link.
  • It is strange that we have to use a torrent for research, but as far as I know, there is no other website from which the downsampled "small" ImageNet dataset can be downloaded.
  • The telltale sign is that the downsampled dataset has 49999 validation images, whereas the original ILSVRC12 dataset has 50000.
  • The downsampled dataset is from the Pixel RNN paper.
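
The validation-count signal above can be turned into a quick sanity check on a local copy. This is only a sketch: the function name and the returned labels are hypothetical, and the two counts are taken from Dongjun's explanation rather than verified independently.

```python
# Counts as stated in the explanation above (not independently verified).
SMALL_VALIDATION_COUNT = 49_999     # downsampled "small" dataset (Pixel RNN)
ILSVRC12_VALIDATION_COUNT = 50_000  # original ILSVRC12 validation set


def guess_variant(num_validation_images):
    """Guess which 32x32/64x64 ImageNet variant a validation split comes from.

    Returns a hypothetical label; a match is a hint, not proof of provenance.
    """
    if num_validation_images == SMALL_VALIDATION_COUNT:
        return "small (Pixel RNN)"
    if num_validation_images == ILSVRC12_VALIDATION_COUNT:
        return "ILSVRC12-derived"
    return "unknown"
```

For example, after counting the images in a local validation split, `guess_variant(count)` gives a first indication of which version a set of files corresponds to.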


marikgoldstein commented Jan 19, 2023

Here is a summary.

For imagenet 32x32, some papers use an "old" version and some use a "new" version. My understanding is:

  • the "new" one is the one currently available here
  • the "old" one was previously available here
  • the "old" one is still unofficially available at this torrent link. I downloaded it and can share it more directly in case the torrent is too slow.
  • TFDS has two 32x32 ImageNet datasets: "downsampled_imagenet" and "imagenet_resized"
  • tfds "resized" is a different dataset unrelated to this discussion, and the tfds docs already do a good job of warning that it differs
  • tfds "downsampled" currently gives a 404 error because it points to the "old" link
  • unfortunately, papers do not always make clear whether they used "old", "new", or "resized", which affects reported likelihoods and the ability to reproduce research

My proposals are

  • tfds chooses a default to fix the 404 (maybe "new", since it is officially available)
  • consider whether it also makes sense to host the "old" one to help reproduce earlier research. If so, which source would be the ground truth: the torrent, or the original Pixel RNN authors?
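
Whichever source ends up being treated as ground truth, people holding copies of the "old" archives could compare checksums to confirm they have identical files. A minimal sketch, assuming the archives are available as local files (the function name is hypothetical):

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of the file at `path`, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large archives need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Running `sha256_of_file("train_32x32.tar")` on each person's local copy and comparing the resulting strings would show whether the torrent copy and other copies are byte-identical.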

Thanks. I'm curious about others' take on this issue and would appreciate it if others could confirm the summary above.
