A subset of YFCC100M. Tools, checking scripts, and web drive links for downloading the dataset (uncompressed).

AdamRain/YFCC15M_downloader

YFCC15M_downloader

A subset of YFCC100M. Tools, checking scripts, and web drive links for downloading the dataset.


We follow DeCLIP's dataset preparation process.

  1. First, download DeCLIP's YFCC15M label file, 'yfcc15m_clean_open_data.json', from Google Drive.

  2. Extract the URLs from the JSON file and split them into several URL list files for download using split_download_task.py.

  3. Crawl the images directly from the URLs using auto_download.bat (this uses Wget, which you may need to install). The .bat file is for Windows; on Linux you will need to rewrite it as a shell script. Alternatively, simply download from the links below.

    • You can stop the process and start it again later if something goes wrong. Wget will skip files that have already been downloaded and clean the log files.
    • Errors are recorded in the log files. Before restarting the download, it is recommended to run clean_err_file_from_logs.py to filter out and delete the faulty files.
  4. Check the downloaded images using check_images.py.
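
For readers who are not on Windows, steps 2 and 3 can be sketched in Python as follows. This is not the code of split_download_task.py or auto_download.bat. It is a minimal illustration under stated assumptions: the `"url"` field name, the `url_list_000.txt` file naming, and all function names are hypothetical, and the Wget options shown (`-nc`, `-i`, `-P`, `-o`) are standard Wget flags that may differ from the ones the .bat file actually passes.

```python
import json
from pathlib import Path


def load_records(json_path):
    """Load the label file; its exact structure is an assumption here."""
    with open(json_path, encoding="utf-8") as f:
        return json.load(f)


def extract_urls(records, url_key="url"):
    """Collect the URL of each record (url_key is an assumed field name)."""
    # Accept either a list of record dicts or a dict mapping ids to record dicts.
    items = records.values() if isinstance(records, dict) else records
    return [item[url_key] for item in items if url_key in item]


def split_into_chunks(urls, num_chunks):
    """Split urls into num_chunks contiguous lists of near-equal size."""
    size, extra = divmod(len(urls), num_chunks)
    chunks, start = [], 0
    for i in range(num_chunks):
        # The first `extra` chunks take one additional URL each.
        end = start + size + (1 if i < extra else 0)
        chunks.append(urls[start:end])
        start = end
    return chunks


def write_url_lists(chunks, out_dir):
    """Write one URL per line to url_list_000.txt, url_list_001.txt, ..."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for i, chunk in enumerate(chunks):
        path = out_dir / f"url_list_{i:03d}.txt"
        path.write_text("".join(url + "\n" for url in chunk), encoding="utf-8")
        paths.append(path)
    return paths


def wget_command(url_list, image_dir, log_file):
    """Build a Wget argument list for one URL list file."""
    return [
        "wget",
        "-nc",                  # no-clobber: skip files that already exist
        "-i", str(url_list),    # read the URLs to fetch from this file
        "-P", str(image_dir),   # directory prefix to save downloads into
        "-o", str(log_file),    # write Wget's messages to this log file
    ]
```

A possible way to use it: call `write_url_lists(split_into_chunks(extract_urls(load_records("yfcc15m_clean_open_data.json")), 16), "url_lists")`, then pass each result of `wget_command(...)` to `subprocess.run`. The chunk count of 16 is arbitrary. Because of `-nc`, rerunning the same command after an interruption does not fetch files that are already on disk.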


Dataset info:

  • The dataset should contain 15,388,848 images.
  • We managed to crawl 15,061,747 of them.
  • Total space occupied: 867.73G.
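
As a quick sanity check on the figures above, the coverage and a rough average file size can be derived with a few lines of Python. The variable names are illustrative, and reading "G" as GiB (1024³ bytes) is an assumption; if it means 10⁹ bytes, the average comes out slightly smaller.

```python
EXPECTED_IMAGES = 15_388_848   # images the dataset should contain
CRAWLED_IMAGES = 15_061_747    # images successfully crawled
TOTAL_SIZE_G = 867.73          # total space occupied, as reported

# Number of images that could not be crawled.
missing = EXPECTED_IMAGES - CRAWLED_IMAGES

# Fraction of the dataset that was crawled.
coverage = CRAWLED_IMAGES / EXPECTED_IMAGES

# Rough average size per image in KiB, assuming "G" means GiB.
avg_kib = TOTAL_SIZE_G * 1024**2 / CRAWLED_IMAGES

print(f"missing images: {missing:,}")     # missing images: 327,101
print(f"coverage: {coverage:.2%}")        # coverage: 97.87%
print(f"average size: {avg_kib:.1f} KiB per image")
```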

Web Drive links:


If a link fails, please leave a message in the issues.

2024-11-13 update: You may use the bypy tool to download the files from Baidu Yun Web Drive.
